@contracthero/textract-to-markdown
v1.0.2
## Textract JSON to Markdown Converter
A TypeScript project that converts AWS Textract JSON output into formatted Markdown. It supports text extraction from lines, pages, and tables, and uses a heuristic approach to keep the logical reading order of multi-column documents intact.
### Getting Started

#### 1. Install Dependencies

    npm install

#### 2. Run Tests

    npm test

#### 3. Run the Converter

To convert a Textract JSON file to Markdown, run:

    node dist/app.js

### Features
- Converts LINE and PAGE blocks to Markdown text.
- Handles TABLE blocks with proper alignment and formatting.
- Comprehensive test coverage with Jest.
- Example files for testing a variety of documents.
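
As a rough illustration of the first feature, the sketch below walks a list of blocks and emits a page heading for each PAGE block and one text line for each LINE block. It is not the package's actual code: the `TextBlock` type and the `linesToMarkdown` function are hypothetical names, and the sketch assumes the blocks arrive in document order.

```typescript
// Hypothetical minimal block shape covering only the fields used here.
interface TextBlock {
  BlockType: string;
  Id: string;
  Text?: string;
}

// Walk the blocks in order: each PAGE starts a "# Page N" heading and
// each LINE contributes one line of text. Other block types are skipped.
// Assumption: blocks are already in document order.
function linesToMarkdown(blocks: TextBlock[]): string {
  const out: string[] = [];
  let pageNumber = 0;
  for (const block of blocks) {
    if (block.BlockType === "PAGE") {
      pageNumber += 1;
      out.push(`# Page ${pageNumber}`);
    } else if (block.BlockType === "LINE" && block.Text) {
      out.push(block.Text);
    }
  }
  return out.join("\n");
}

const pageMarkdown = linesToMarkdown([
  { BlockType: "PAGE", Id: "1" },
  { BlockType: "LINE", Id: "2", Text: "Hello World" },
]);
console.log(pageMarkdown);
// prints:
// # Page 1
// Hello World
```
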
### Known Issues / Limitations
- Headers and footers are not handled well yet.
- A text line that contains a form, and is therefore split into multiple Textract lines, may be interpreted as separate columns.
- A new column is created when a line's x position deviates by more than 25% of the page width from other lines. This is not robust for centered lines of vastly different lengths, or for inline forms that Textract detects as separate lines.
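
The 25% rule from the last item can be sketched as follows. This only illustrates the idea and is not the package's implementation: the `TextLine` and `ColumnGroup` types, the `assignColumns` function, and the choice to compare each line against the first line of a column are all assumptions. The `left` and `top` values stand in for Textract's bounding-box ratios (0 to 1 of the page width and height).

```typescript
// Hypothetical line shape; `left` and `top` are ratios of the page size (0 to 1).
interface TextLine {
  text: string;
  left: number;
  top: number;
}

// Hypothetical column record: `anchor` is the x position of its first line.
interface ColumnGroup {
  anchor: number;
  lines: TextLine[];
}

// Threshold taken from the limitation described above: 25% of the page width.
const COLUMN_THRESHOLD = 0.25;

// Group lines into columns by x position, then order columns left to right
// and the lines inside each column top to bottom.
function assignColumns(lines: TextLine[], threshold = COLUMN_THRESHOLD): TextLine[][] {
  const columns: ColumnGroup[] = [];
  for (const line of lines) {
    // Reuse the first column whose anchor lies within the threshold.
    let match: ColumnGroup | undefined;
    for (const column of columns) {
      if (Math.abs(column.anchor - line.left) <= threshold) {
        match = column;
        break;
      }
    }
    if (match) {
      match.lines.push(line);
    } else {
      // Otherwise the line opens a new column.
      columns.push({ anchor: line.left, lines: [line] });
    }
  }
  return columns
    .sort((a, b) => a.anchor - b.anchor)
    .map((column) => column.lines.slice().sort((a, b) => a.top - b.top));
}

const sample: TextLine[] = [
  { text: "Left A", left: 0.05, top: 0.1 },
  { text: "Right A", left: 0.55, top: 0.1 },
  { text: "Left B", left: 0.06, top: 0.2 },
  { text: "Right B", left: 0.56, top: 0.2 },
];

const readingOrder: string[] = [];
for (const column of assignColumns(sample)) {
  for (const line of column) readingOrder.push(line.text);
}
console.log(readingOrder.join(" / "));
// prints "Left A / Left B / Right A / Right B"
```

The sketch also shows why the rule is fragile: a short centered line whose left edge sits more than 25% of the page width from the other lines would open a column of its own.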
### Example Output

Input (example.json):

    {
      "Blocks": [
        { "BlockType": "PAGE", "Id": "1" },
        { "BlockType": "LINE", "Id": "2", "Text": "Hello World" },
        {
          "BlockType": "TABLE",
          "Id": "3",
          "Relationships": [{ "Type": "CHILD", "Ids": ["4", "5"] }],
          "Cells": [
            { "Id": "4", "Text": "Header 1", "RowIndex": 1, "ColumnIndex": 1 },
            { "Id": "5", "Text": "Header 2", "RowIndex": 1, "ColumnIndex": 2 }
          ]
        }
      ]
    }

Output (Markdown):

    # Page 1
    Hello World
    | Header 1 | Header 2 |
    | -------- | -------- |
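
To make the table part of the example concrete, here is a minimal sketch of how the TABLE block above could be turned into the two Markdown rows shown. It follows the simplified input shape from this example (cells inline under `Cells`); raw Textract responses usually list cells as separate CELL blocks linked through `Relationships`, which this sketch does not handle. The `ExampleCell` and `ExampleTable` types and the `padTo` and `renderTable` functions are hypothetical and say nothing about how the package renders tables internally.

```typescript
// Hypothetical types mirroring the simplified example input above.
interface ExampleCell {
  Id: string;
  Text: string;
  RowIndex: number; // 1-based
  ColumnIndex: number; // 1-based
}

interface ExampleTable {
  BlockType: "TABLE";
  Id: string;
  Cells: ExampleCell[];
}

// Right-pad `text` with `fill` until it is `width` characters long.
function padTo(text: string, width: number, fill = " "): string {
  let out = text;
  while (out.length < width) out += fill;
  return out;
}

function renderTable(table: ExampleTable): string {
  // Place each cell's text into a row-by-column grid.
  const grid: string[][] = [];
  let columnCount = 0;
  for (const cell of table.Cells) {
    while (grid.length < cell.RowIndex) grid.push([]);
    grid[cell.RowIndex - 1][cell.ColumnIndex - 1] = cell.Text;
    columnCount = Math.max(columnCount, cell.ColumnIndex);
  }
  if (grid.length === 0) return "";

  // Each column is as wide as its longest cell, with a minimum of 3 dashes.
  const widths: number[] = [];
  for (let c = 0; c < columnCount; c++) {
    let width = 3;
    for (const row of grid) width = Math.max(width, (row[c] || "").length);
    widths.push(width);
  }

  const renderRow = (cells: string[]): string =>
    "| " + widths.map((w, c) => padTo(cells[c] || "", w)).join(" | ") + " |";
  const separator = "| " + widths.map((w) => padTo("", w, "-")).join(" | ") + " |";

  // The first row is treated as the header; missing cells render as blanks.
  const [header, ...body] = grid;
  return [renderRow(header), separator].concat(body.map((row) => renderRow(row))).join("\n");
}

const exampleTable: ExampleTable = {
  BlockType: "TABLE",
  Id: "3",
  Cells: [
    { Id: "4", Text: "Header 1", RowIndex: 1, ColumnIndex: 1 },
    { Id: "5", Text: "Header 2", RowIndex: 1, ColumnIndex: 2 },
  ],
};
console.log(renderTable(exampleTable));
// prints:
// | Header 1 | Header 2 |
// | -------- | -------- |
```
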