content-extractor
v0.1.0
Takes a text file and splits it into 2 files based on a filter
Content Extractor
A Node module that takes a text file and splits it into two files based on a filter. The split is made line by line: each line of the input goes to one output file or the other.
Usage
yarn add content-extractor
Example
Filtering a text file with an alpha filter. The filter is fed the text file one line at a time and decides whether each line matches the criteria.
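const languageExtractor = require('content-extractor');

// Built-in filter: matches lines made up only of the letters a-z (not case sensitive).
const filter = languageExtractor.containsOnlyAlphaCharacters;

languageExtractor.extractContent(
  'input.txt',          // file to read
  'alpha.txt',          // receives the lines that match the filter
  'everythingElse.txt', // receives all other lines
  filter
);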
Expected Input and Output
Input:
input.txt
ABC
!!!
Apply the containsOnlyAlphaCharacters filter.
Output:
alpha.txt
ABC
everythingElse.txt
!!!
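The line-level split shown above can be sketched in memory. This is an illustration only, not the module's implementation: `partitionLines` and `LineFilter` are hypothetical names, the filter is a stand-in, and reading and writing the files is left out.

```typescript
// Hypothetical helper, not part of content-extractor.
// A filter takes one line and returns true if the line matches.
type LineFilter = (line: string) => boolean;

function partitionLines(
  text: string,
  filter: LineFilter
): { matched: string[]; rest: string[] } {
  const matched: string[] = [];
  const rest: string[] = [];
  // Split on Unix or Windows line endings and route each line to one bucket.
  for (const line of text.split(/\r?\n/)) {
    if (filter(line)) {
      matched.push(line);
    } else {
      rest.push(line);
    }
  }
  return { matched, rest };
}

// Stand-in filter for the example: letters only.
const lettersOnly: LineFilter = (line) => /^[A-Za-z]+$/.test(line);

const split = partitionLines("ABC\n!!!", lettersOnly);
// split.matched is ["ABC"] and split.rest is ["!!!"]
```

Each line lands in exactly one of the two arrays, which mirrors how alpha.txt and everythingElse.txt together account for every line of input.txt.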
Providing your own filter
A filter is a function that takes a string (one line of the input file) and returns a boolean. TypeScript example:
const numberRegEx: RegExp = new RegExp(/[0-9]/, 'g');
const numberFilter = (s: string): boolean => {
  const chars = s.match(numberRegEx) || [];
  return chars.length === s.length;
};
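It can help to run a custom filter against a few sample lines before using it on a whole file. The sketch below repeats the digit-counting approach under the hypothetical name `digitFilter` and compares it with an anchored-regex variant, `nonEmptyDigitFilter`, which is an assumption of this example and not something the module provides. The two differ only on empty lines.

```typescript
// Same counting approach as the example above: every character must be a digit.
const digitRegEx = /[0-9]/g;
const digitFilter = (s: string): boolean => {
  const chars = s.match(digitRegEx) || [];
  return chars.length === s.length;
};

// Hypothetical variant: also rejects empty lines, because + needs at least one digit.
const nonEmptyDigitFilter = (s: string): boolean => /^[0-9]+$/.test(s);

const samples = ["123", "12a", ""];
const outcomes = samples.map((s) => [digitFilter(s), nonEmptyDigitFilter(s)]);
// outcomes is [[true, true], [false, false], [true, false]]
```

An empty line has zero digits and zero characters, so the counting version accepts it. Whether empty lines should count as a match is a choice to make when writing your own filter.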
Provided Filters
Persian Filter
The Persian language filter is based on the Unicode range for Arabic script. It cannot distinguish between languages that both use Arabic characters. It makes the following assumptions:
- A single line is not expected to contain a mix of Persian and non-Persian text, so the presence of Persian characters indicates that the line is in Persian.
- A line that contains only numbers (Hindu-Arabic numerals) cannot be assumed to be Persian.
- Latin punctuation used in Persian, e.g. full stops, is not treated as an indicator of Persian.
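The points above could be sketched as a presence check. This is a guess at the approach, not the module's code: `looksPersian` is a hypothetical name, and the range used is the basic Arabic Unicode block (U+0600 to U+06FF), which may be narrower or wider than what the module actually checks.

```typescript
// Hypothetical sketch: a line counts as Persian if it contains at least one
// character from the basic Arabic Unicode block (U+0600 to U+06FF).
const arabicBlockChar = /[\u0600-\u06FF]/;
const looksPersian = (line: string): boolean => arabicBlockChar.test(line);

// "\u0633\u0644\u0627\u0645" is the word "salaam" written in Arabic script.
const persianLine = looksPersian("\u0633\u0644\u0627\u0645."); // true
const digitsOnly = looksPersian("12345"); // false: digits 0-9 are outside the block
const latinLine = looksPersian("Hello."); // false: a full stop alone is no indicator
```

Because the check looks only at the script, an Arabic or Urdu line would pass as well, which matches the limitation described above.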
Alpha Filter
- Extracts text that contains only the characters a-z. It is not case sensitive and rejects text that contains whitespace.
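A filter with the behaviour described above could look like the following. It is a sketch written from the description, not the source of containsOnlyAlphaCharacters, and `alphaOnly` is a hypothetical name. How the real filter treats an empty line is not stated. This version rejects it.

```typescript
// Hypothetical sketch: one or more letters a-z, in either case, and nothing else.
const alphaOnly = (line: string): boolean => /^[a-z]+$/i.test(line);

const checks = [
  alphaOnly("ABC"), // true: upper case is accepted
  alphaOnly("abc"), // true
  alphaOnly("a b"), // false: whitespace is rejected
  alphaOnly("abc1"), // false: digits are rejected
];
```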