@briancullen/aws-textract-parser

v0.0.2

Library for converting AWS Textract responses into a more usable structure.

Downloads

4,551

AWS Textract Parser

Textract is an AWS service that lets you extract text from pictures or PDF documents. This library was created to process the response from that service and transform it into something a little more manageable.

NOTE: Currently this library is only set up to deal with responses from DetectDocumentText calls, either synchronous or asynchronous. Parsing the responses from calls that analyse documents may be added at a later date.

Rationale

Textract returns JSON representing the pages, lines and words it has detected in the input. Below is a simplified example of the data you could expect for a single line of text consisting of two words. As you can see, the data describes a tree where the line is identified as a child of the page and the words as children of the line.

{
  "DocumentMetadata": {
    "Pages": 1
  },
  "Blocks": [
    {
      "Id": "1",
      "BlockType": "PAGE",
      "Relationships": [{
        "Type": "CHILD",
        "Ids": [ "2" ]
      }]
    },
    {
      "Id": "2",
      "BlockType": "LINE",
      "Relationships": [{
        "Type": "CHILD",
        "Ids": [ "3", "4" ]
      }]
    },
    { "Id": "3", "BlockType": "WORD" },
    { "Id": "4", "BlockType": "WORD" }
  ]
}

Unfortunately, this tree structure is flattened into an array, which makes navigating it more awkward than it should be. The purpose of this library is to process this flattened JSON and rebuild the tree structure it describes.
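As a rough illustration of what rebuilding the tree involves, the flat array can be indexed by Id and each CHILD relationship resolved recursively. This is a minimal sketch and not this library's implementation: the `buildTree` helper and the `id`, `type` and `children` property names are assumptions made for the example.

```javascript
// Hypothetical helper: rebuilds a nested tree from Textract's flat Blocks array.
// Illustration only; this is not the code used by this library.
function buildTree(response) {
  // Index every block by its Id so that child references can be resolved.
  const blocksById = new Map(response.Blocks.map(block => [block.Id, block]))

  const resolve = block => {
    // Gather the Ids listed in all CHILD relationships of this block.
    const childIds = (block.Relationships || [])
      .filter(relationship => relationship.Type === 'CHILD')
      .flatMap(relationship => relationship.Ids)

    return {
      id: block.Id,
      type: block.BlockType,
      // Skip any Id that does not appear in this response.
      children: childIds
        .filter(id => blocksById.has(id))
        .map(id => resolve(blocksById.get(id)))
    }
  }

  // PAGE blocks form the roots of the tree.
  return response.Blocks
    .filter(block => block.BlockType === 'PAGE')
    .map(resolve)
}

const response = {
  Blocks: [
    { Id: '1', BlockType: 'PAGE', Relationships: [{ Type: 'CHILD', Ids: ['2'] }] },
    { Id: '2', BlockType: 'LINE', Relationships: [{ Type: 'CHILD', Ids: ['3', '4'] }] },
    { Id: '3', BlockType: 'WORD' },
    { Id: '4', BlockType: 'WORD' }
  ]
}

const pages = buildTree(response)
console.log(JSON.stringify(pages, null, 2))
```

With a tree like this, finding the words of a line becomes a matter of following `children` rather than searching the array by Id.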

In some tests the order of the words related to a line did not match the order of the text, which is not what you would expect from processing a document. To address this, the library sorts the words into left-to-right order (based on their position on the page).
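Sorting by position could look something like the sketch below. It assumes each WORD block carries the `Geometry.BoundingBox.Left` value that Textract reports (a ratio of the page width). The `sortWordsLeftToRight` name is made up for this example, and the library's actual ordering logic may differ.

```javascript
// Hypothetical helper: orders the words of a single line from left to right.
// Assumes every word has Geometry.BoundingBox.Left, as in Textract responses.
function sortWordsLeftToRight(words) {
  // Copy first so the caller's array is not mutated.
  return [...words].sort(
    (a, b) => a.Geometry.BoundingBox.Left - b.Geometry.BoundingBox.Left
  )
}

// Two words returned out of reading order.
const words = [
  { Text: 'world', Geometry: { BoundingBox: { Left: 0.42 } } },
  { Text: 'Hello', Geometry: { BoundingBox: { Left: 0.10 } } }
]

const sorted = sortWordsLeftToRight(words)
console.log(sorted.map(word => word.Text).join(' ')) // Hello world
```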

Usage

The default export from the module is a parser instance that supports three different methods: handleDetectTextCallback, parseDetectTextResponse, and parseGetTextDetection.

handleDetectTextCallback is a helper method that can be passed in as the standard callback to the Textract method. In turn it will call another callback with the processed tree. An example of this type of usage is shown below.

import { Textract } from 'aws-sdk'
import textractParser from '<TBD>'

const textract = new Textract()
// Called by the parser with either an error or the processed tree.
const myCallback = (err, data) => {
  if (err) {
    console.error(err)
  } else {
    console.log(data)
  }
}

const request = {
  Document: {
    S3Object: {
      Bucket: "your-s3-bucket",
      Name: "your-object-key"
    }
  }
}

textract.detectDocumentText(request,
  textractParser.handleDetectTextCallback(myCallback))
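The callback-wrapping pattern described here can be sketched as a higher-order function: it takes your callback and returns a new `(err, data)` function to hand to the SDK. This is only an illustration of the pattern, not the library's internals. `wrapCallback`, its `parse` parameter and the `countBlocks` stand-in parser are all hypothetical names.

```javascript
// Hypothetical sketch of the wrapping pattern; not this library's internals.
// `parse` stands in for whatever turns a raw response into a tree.
function wrapCallback(parse, callback) {
  // The returned function has the (err, data) shape that SDK callbacks use.
  return (err, data) => {
    if (err) {
      // Pass request errors straight through without attempting to parse.
      callback(err)
      return
    }

    let parsed
    try {
      parsed = parse(data)
    } catch (parseError) {
      // Report parsing failures through the same callback.
      callback(parseError)
      return
    }
    callback(null, parsed)
  }
}

// Stand-in parser that just counts the blocks in a response.
const countBlocks = data => ({ blockCount: data.Blocks.length })

const results = []
const handler = wrapCallback(countBlocks, (err, parsed) => results.push({ err, parsed }))

handler(null, { Blocks: [{ Id: '1' }, { Id: '2' }] }) // successful response
handler(new Error('request failed'))                  // failed request
console.log(results)
```

Keeping the user's callback outside the `try` block means an exception thrown by the callback itself is not mistaken for a parsing failure.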

parseDetectTextResponse takes a value of type Textract.DetectDocumentTextResponse and processes it synchronously. This can be used with the promises provided by the AWS SDK. An example of how to use it in this manner is shown below.

textract.detectDocumentText(request).promise()
  .then(data => textractParser.parseDetectTextResponse(data))
  .then(parsedData => console.log(parsedData))
  .catch(err => console.log(err))

parseGetTextDetection is a helper method to be used with the GetDocumentTextDetection operation. This operation can return the processed information over multiple requests, which causes a problem when trying to construct the complete tree. If all the results are returned in a single response, then parseDetectTextResponse can be used as shown above.

However, if that is not the case, then this call can be used to retrieve all the data and construct the tree as shown below. To allow the SDK to be configured differently in different environments, an instantiated Textract client must be provided to this method.

const jobId = 'your-job-id'
const client = new Textract()

// Retrieves every page of results for the job and builds the complete tree.
textractParser.parseGetTextDetection(client, jobId)
  .then(parsedData => console.log(parsedData))
  .catch(err => console.log(err))

NOTE: This method will load the entire set of results into memory, which may cause issues for really large documents. For context, the results returned from Textract for a 10-page document of text were in the region of 7MB.
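To show why the whole result set ends up in memory, here is a minimal sketch of following `NextToken` across GetDocumentTextDetection responses until none remains. It is not the library's code: `collectAllBlocks` is a hypothetical helper, it assumes a client whose `getDocumentTextDetection(params).promise()` behaves like the AWS SDK v2 call, and it ignores job status checks. A fake client stands in for the real service so the example can run on its own.

```javascript
// Hypothetical helper: collects the blocks from every page of a job's results.
// `client` is assumed to behave like an AWS SDK v2 Textract client.
async function collectAllBlocks(client, jobId) {
  const blocks = []
  let nextToken

  do {
    // Only send NextToken once a previous response has supplied one.
    const params = nextToken
      ? { JobId: jobId, NextToken: nextToken }
      : { JobId: jobId }
    const page = await client.getDocumentTextDetection(params).promise()

    // Every page is appended, so the full result set is held in memory.
    for (const block of page.Blocks || []) {
      blocks.push(block)
    }
    nextToken = page.NextToken
  } while (nextToken)

  return blocks
}

// Fake client that serves two pages of results, for demonstration only.
const fakePages = {
  start: { Blocks: [{ Id: '1' }, { Id: '2' }], NextToken: 'page-2' },
  'page-2': { Blocks: [{ Id: '3' }] }
}
const fakeClient = {
  getDocumentTextDetection: params => ({
    promise: async () => fakePages[params.NextToken || 'start']
  })
}

collectAllBlocks(fakeClient, 'your-job-id')
  .then(blocks => console.log(blocks.map(block => block.Id)))
```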

API

See the API Docs for more information.

In particular, refer to the API for the Document class, as this forms the root of the tree that is returned.