cloud-vision-lines-phrases-parser

v1.0.1

Uses customizable parsers to find text located within OCR output. In this case, the OCR output is the line list object generated by the cloud-vision-lines-phrases package, which itself is a modified version of the output from Google Cloud Vision's small batch file annotation online.

Check out a UI demonstration of the module here as well as in CodeSandbox.

Installation

npm install cloud-vision-lines-phrases-parser

Then import the function:

const { getParsersTarget } = require('cloud-vision-lines-phrases-parser')

Use

// Refer to the cloud-vision-lines-phrases package for details on generating the lineList object
const getAnnotationFormats = require('cloud-vision-lines-phrases')

const batchFileAnnotation = ['...']
const bucketFileBasename = 'filename'
const annotationFormats = getAnnotationFormats(batchFileAnnotation, bucketFileBasename)
const lineList = annotationFormats.lineList
const parsers = [

  // parser #1
  {
    count: 2, // any number
    method: 'after', // 'after' or 'below'
    target: {
      pattern: '.+', // any pattern you would pass to the RegExp constructor
      unit: 'phrase', // 'phrase' or 'word'
    },
  },

  // parser #2
  {
    count: 2,
    method: 'below',
    target: {
      pattern: '\\S{8}',
      unit: 'word',
    },
  },

  // parser #3, etc.
]

The parsers array in the above example performs the following steps:

  1. The first parser finds the <2nd> <phrase> <after> the <beginning of the document> that matches the pattern: <.+>

  2. The second parser finds the <2nd> <word> <below> the <previously parsed value> that matches the pattern: <\S{8}>
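Since each pattern is a string handed to the RegExp constructor, backslashes have to be doubled inside a JavaScript string literal. The snippet below only illustrates that escaping rule with standard JavaScript. Whether the package adds flags or anchors to the pattern is not stated here, so the sketch assumes it does neither.

```javascript
// Illustration only: how a pattern string behaves once handed to RegExp.
// Assumption: no flags and no extra anchoring are added to the pattern.
const wordPattern = '\\S{8}' // the string itself contains \S{8}
const wordRegex = new RegExp(wordPattern)

const matchesDate = wordRegex.test('2/1/2023') // 8 non-whitespace characters
const matchesShort = wordRegex.test('2/1/23') // only 6 characters

console.log(wordPattern) // \S{8}
console.log(matchesDate) // true
console.log(matchesShort) // false
```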

//Now we can input the lineList and parsers array into the package's function
const targetTextObject = getParsersTarget(lineList, parsers)

The returned object contains the text value, its location within the line list, its bounding box coordinates, and its unit type:

{
  value: '2/1/2023',
  normalizedVertices: ['...'],
  indices: {
    pageEnd: 0, lineEnd: 3, phraseEnd: 1, wordEnd: 0
  },
  unitType: 'word', 
}
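If the bounding box is needed in pixels, the normalized vertices can be scaled by the page dimensions. The helper below is a hypothetical sketch, not part of the package. It assumes each entry of normalizedVertices has the shape { x, y } with values between 0 and 1, as in Google Cloud Vision's output. The sample vertices and page size are made up.

```javascript
// Hypothetical helper, not part of the package.
// Assumes each vertex looks like { x: 0.25, y: 0.5 } with values in [0, 1].
function toPixelBox(normalizedVertices, pageWidth, pageHeight) {
  // A missing coordinate is treated as 0 (an assumption of this sketch).
  const xs = normalizedVertices.map((v) => (v.x ?? 0) * pageWidth)
  const ys = normalizedVertices.map((v) => (v.y ?? 0) * pageHeight)
  return {
    left: Math.min(...xs),
    top: Math.min(...ys),
    right: Math.max(...xs),
    bottom: Math.max(...ys),
  }
}

// Made-up sample vertices for an 800 x 1000 pixel page
const sampleVertices = [
  { x: 0.25, y: 0.5 },
  { x: 0.5, y: 0.5 },
  { x: 0.5, y: 0.75 },
  { x: 0.25, y: 0.75 },
]
const box = toPixelBox(sampleVertices, 800, 1000)
console.log(box) // { left: 200, top: 500, right: 400, bottom: 750 }
```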

Looking at the image file (assume this snippet is the entire document), the first parser captures the yellow-highlighted value, and the second parser (starting its parse from this value) captures the blue-highlighted value which is ultimately the value that is returned:

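[Image: document snippet with the first parser's match highlighted in yellow and the second parser's match highlighted in blue]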

More details about how the units are defined (phrases and words) can be found in the cloud-vision-lines-phrases package.

When selecting the after method, think of this as parsing left to right across the page, and then down to the next line, just as you would read a page of text. More technically, the parser will iterate through one of the two unit types (either phrase or word per the target.unit) until it finds a unit's text that matches the target.pattern. The count determines how many matching units it must find, and the last matching unit will be the returned value and the point where the parse terminates. If the count has not been met by the time the parser reaches the end of the line list, an empty value will be returned. Each parser will start at the unit where the previous parser ended, and the first parser will always start at the beginning of the line list (this applies to both the after and below methods, so it doesn't matter which of the two methods you select for the first parser).
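The after method described above can be sketched roughly as follows. This is a simplified illustration, not the package's implementation. It assumes the units have already been flattened into an array of strings in reading order, that the scan includes the starting unit, and that an empty value is an empty string. The names findAfter, units, and startIndex are hypothetical.

```javascript
// Simplified sketch of the 'after' method, not the package's implementation.
// Assumption: `units` is a flat array of unit texts in reading order
// (the real line list is nested by page, line, phrase, and word).
function findAfter(units, startIndex, pattern, count) {
  const regex = new RegExp(pattern)
  let found = 0
  for (let i = startIndex; i < units.length; i++) {
    if (regex.test(units[i])) {
      found += 1
      if (found === count) {
        return { value: units[i], index: i } // the parse terminates here
      }
    }
  }
  return { value: '', index: -1 } // count not met before the end of the list
}

// Made-up sample data
const phrases = ['Invoice', 'Date:', '2/1/2023', 'Total']
const first = findAfter(phrases, 0, '.+', 2) // 2nd phrase from the start
const second = findAfter(phrases, first.index, '^\\d', 1) // resumes where the first ended
const none = findAfter(phrases, 0, '.+', 9) // count can never be met

console.log(first.value) // Date:
console.log(second.value) // 2/1/2023
console.log(none.value === '') // true
```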

CAVEAT: If a parser captures a word (yellow) that is embedded in the beginning or middle of a phrase, the next parser (if its target.unit is 'phrase') will include the remainder of that phrase (blue) in its count:

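[Image: a captured word highlighted in yellow inside a phrase, with the remainder of that phrase highlighted in blue]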

When the below method is selected, the parser will go line by line directly beneath the previous parser's stopping point until it finds a unit (either a phrase or word depending on the target.unit) with text matching the target.pattern. Directly beneath, in this context, means that a unit is in a line beneath, and horizontally overlaps, the previous parser's stopping point. When a matching unit is found, the count is incremented and the parse continues to go down to the next line directly beneath the previous parser's stopping point and looks for another matching unit. This continues until either the count is met, which will return the last count's value, or the end of the line list is reached, which will return an empty value. The positions of the units stay relative to each other across multiple pages, i.e. text on subsequent pages is considered to be directly beneath text on prior pages, as long as their x-coordinates overlap.
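A rough sketch of the below method follows. It is a simplified illustration, not the package's implementation. It assumes each line is an array of units shaped { text, xMin, xMax }, that lines from all pages are concatenated into one array, and that an empty value is an empty string. The names findBelow and overlapsHorizontally are hypothetical.

```javascript
// Simplified sketch of the 'below' method, not the package's implementation.
// Assumption: each line is an array of units shaped { text, xMin, xMax }.
function overlapsHorizontally(a, b) {
  return a.xMin < b.xMax && b.xMin < a.xMax
}

function findBelow(lines, anchor, anchorLine, pattern, count) {
  const regex = new RegExp(pattern)
  let found = 0
  // Walk down line by line, starting just beneath the anchor's line.
  for (let i = anchorLine + 1; i < lines.length; i++) {
    const hit = lines[i].find(
      (unit) => overlapsHorizontally(unit, anchor) && regex.test(unit.text)
    )
    if (hit) {
      found += 1
      if (found === count) return { value: hit.text, lineIndex: i }
    }
  }
  return { value: '', lineIndex: -1 } // end of the list reached first
}

// Made-up sample data: two columns under 'Date' and 'Amount' headers
const lines = [
  [{ text: 'Date', xMin: 0.1, xMax: 0.2 }, { text: 'Amount', xMin: 0.6, xMax: 0.7 }],
  [{ text: '1/1/2023', xMin: 0.1, xMax: 0.25 }, { text: '10.00', xMin: 0.6, xMax: 0.7 }],
  [{ text: '2/1/2023', xMin: 0.1, xMax: 0.25 }, { text: '25.50', xMin: 0.6, xMax: 0.7 }],
]
const anchor = lines[0][0] // the previous parser's stopping point
const result = findBelow(lines, anchor, 0, '\\S{8}', 2)
console.log(result.value) // 2/1/2023
```

Because the sketch keeps all lines in one array, a unit on a later page counts as beneath the anchor whenever their x-ranges overlap, which mirrors the cross-page behavior described above.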