@gmr-fms/word-extractor

v0.3.0

Published

4 years ago

Node.js package to read Word .doc files

0High
0Medium
0Low

Forked from morungos/node-word-extractor

What's different?
I needed buffer support but didn't want to deal with coffeescript so I modified the repo a bit. The main public change is the module is now an object with two methods fromFile and fromBuffer. I also removed 'bluebird' so the returned promises are native.

word-extractor

Read data from a Word document using node.js

Why use this module?

There are a fair number of npm components which can extract text from Word .doc files, but they all appear to require some external helper program, and involve either spawning a process or communicating with a persistent one. That raises the installation and deployment burden as well as the runtime one.

This module is intended to provide a much faster way of reading the text from a Word file, without leaving the node.js environment.

How do I install this module?

yarn add @gmr-fms/word-extractor

# Or using npm...
npm install @gmr-fms/word-extractor

How do I use this module?

const extract = require('word-extractor');
extract.fromFile('file.doc').then(doc => {
  console.log(doc.getBody());
});

The object returned from the extract() method is a promise that resolves to a document object, which then provides several views onto different parts of the document contents.

Methods

extract#fromFile(filePath) => Promise<Document>
extract#fromBuffer(buf) => Promise<Document>

Document#getBody()

Retrieves the content text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getFootnotes()

Retrieves the footnote text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getHeaders()

Retrieves the header and footer text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getAnnotations()

Retrieves the comment bubble text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

License

Licensed under the MIT License.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme