doc-textify
v1.0.2
A Node.js library to extract text from office documents (docx, pptx, xlsx, odt, odp, ods, pdf, text, html ...)
Doc-Textify
Doc-Textify is a TypeScript library and command-line tool that extracts and cleans text from various document formats.
🚀 Features
Multi-format support:
- Microsoft Word (.docx)
- PowerPoint (.pptx)
- Excel (.xlsx)
- OpenOffice/LibreOffice (.odt, .odp, .ods)
- PDF (.pdf)
- Plain text (.txt)
- HTML (.html, .htm)
Content cleaning: removes extra whitespace, handles custom line delimiters.
Configurable options: set newline delimiter, minimum characters to extract, and toggle error logging.
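To make the cleaning and option features above more concrete, here is a minimal sketch of what whitespace cleanup with a custom line delimiter and a minimum-length check could look like. The helpers `cleanText` and `meetsMinimum` are hypothetical names invented for this illustration. They are not part of doc-textify's API, and the library's actual cleaning rules may differ.

```typescript
// Hypothetical helpers for illustration only; not doc-textify's implementation.

// Collapse extra whitespace and join the remaining lines with a custom delimiter.
function cleanText(raw: string, newlineDelimiter: string = '\n'): string {
  return raw
    .split(/\r\n|\r|\n/) // accept Windows, old Mac, and Unix line endings
    .map(line => line.replace(/[ \t]+/g, ' ').trim()) // collapse runs of spaces and tabs
    .filter(line => line.length > 0) // drop lines that are empty after trimming
    .join(newlineDelimiter)
}

// Mirror the idea behind `minCharsToExtract`: 0 (or less) disables the check.
function meetsMinimum(text: string, minChars: number = 0): boolean {
  return minChars <= 0 || text.length >= minChars
}

const sample = '  Hello   world \r\n\r\n  second\tline  '
console.log(JSON.stringify(cleanText(sample, ' | '))) // prints "Hello world | second line"
```

Keeping the delimiter as a parameter means the same cleanup can produce either multi-line output or a single line, depending on what the caller passes.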
📦 Library Usage
Install the package and import it in your project:
npm install doc-textify --save

import { docTextify } from 'doc-textify'
// async/await version
try {
  const text = await docTextify('path/to/file.pdf')
} catch (err) {
  console.error(err)
}
// or promise-chain version
docTextify('path/to/file.pdf')
  .then(text => console.log(text))
  .catch(err => console.error(err))
Default options:
try {
  const text = await docTextify('path/to/file.pdf', {
    newlineDelimiter: '\n', // output content delimiter
    minCharsToExtract: 0, // minimum number of characters required to output the content; 0 (default) disables the check
    outputErrorToConsole: true // log errors to the console
  })
} catch (err) {
  console.error(err)
}
🚀 CLI Usage (Optional)
If you prefer a ready-made command, the doc-textify CLI wraps the same functionality:
Installation
Global install to use the doc-textify command anywhere:
npm install -g doc-textify
Or install locally:
npm install doc-textify --save
Command
doc-textify <path/to/document> [options]
Options
| Option | Description | Default |
| --- | --- | --- |
| -n, --newlineDelimiter | Line delimiter to insert | "\n" |
| -m, --minCharsToExtract | Minimum number of characters to extract | 0 (disabled) |
| -h, --help | Display help message | - |
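As a rough illustration of how these flags could map onto the library options, the sketch below parses an argument list by hand. `CliOptions`, `ParsedArgs`, and `parseCliArgs` are hypothetical names made up for this example, and the real doc-textify CLI may parse its arguments differently. The sketch also assumes the delimiter is stored exactly as the shell passes it, with no unescaping.

```typescript
// Hypothetical sketch; not the actual doc-textify CLI implementation.
interface CliOptions {
  newlineDelimiter: string
  minCharsToExtract: number
}

interface ParsedArgs {
  file?: string
  help: boolean
  options: CliOptions
}

function parseCliArgs(argv: string[]): ParsedArgs {
  const parsed: ParsedArgs = {
    help: false,
    options: { newlineDelimiter: '\n', minCharsToExtract: 0 } // defaults from the table
  }
  const queue = [...argv] // copy so the caller's array is left untouched
  let arg: string | undefined
  while ((arg = queue.shift()) !== undefined) {
    if (arg === '-h' || arg === '--help') {
      parsed.help = true
    } else if (arg === '-n' || arg === '--newlineDelimiter') {
      const value = queue.shift() // the flag's value is the next argument
      if (value === undefined) throw new Error(`Missing value for ${arg}`)
      parsed.options.newlineDelimiter = value
    } else if (arg === '-m' || arg === '--minCharsToExtract') {
      const value = Number(queue.shift())
      if (!Number.isInteger(value) || value < 0) throw new Error(`Invalid value for ${arg}`)
      parsed.options.minCharsToExtract = value
    } else if (parsed.file === undefined) {
      parsed.file = arg // first positional argument is the document path
    } else {
      throw new Error(`Unexpected argument: ${arg}`)
    }
  }
  return parsed
}
```

Calling `parseCliArgs(['document.docx', '-m', '20'])` would yield the file path plus an options object that could be passed as the second argument to `docTextify`.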
Example
doc-textify document.docx -n "\r\n" -m 20 > output.txt
📥 Installation from Source
git clone https://github.com/johaven/doc-textify.git
cd doc-textify
npm install
npm run build # outputs compiled files into /dist
npm run test # test parsing
🤝 Contributing
- Fork the repository
- Create a branch: git checkout -b feature/my-feature
- Commit your changes: git commit -m "Add my feature"
- Push to your branch: git push origin feature/my-feature
- Open a Pull Request
📄 License
This project is licensed under the MIT License.
