docling-book-translator
v0.2.2
Local-first CLI to convert English PDF books to Portuguese (Brazil) EPUB files using Docling and Hugging Face.
Docling Book Translator
Local-first CLI to convert PDF books in English to Portuguese (Brazil) EPUB files, using Docling and an open-source Neural Machine Translation model from Hugging Face.
The goal is to provide an end-to-end pipeline that runs on a typical desktop machine (CPU only), without paid APIs, and with a simple interactive workflow:
PDF in the current folder → answer a few questions → get a translated EPUB ready for Kindle.
Features
- PDF → structured text using Docling (layout-aware parsing).
- English → Brazilian Portuguese translation via a Marian NMT model.
- EPUB generation from translated Markdown.
- Interactive CLI:
  - Asks for the input PDF (or auto-detects a PDF in the current folder).
  - Asks where to save the output (default: same folder as the PDF).
  - Asks for EPUB title and author (with sensible defaults).
  - Shows a progress bar during translation.
- CPU-only by default; no paid APIs or external cloud services.
Requirements
- Python 3.10+ (tested with 3.13).
- A machine with at least:
  - Quad-core CPU
  - 16 GB RAM recommended (32 GB preferred) for large PDFs.
- Internet only for the first run to download:
  - Docling models,
  - Hugging Face translation model: Helsinki-NLP/opus-mt-tc-big-en-pt
All models are cached locally (Hugging Face cache and Docling artifacts), so subsequent runs can be offline.
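As a rough illustration of the translation step, the sketch below loads the model named above through the Marian classes in transformers and translates text in small batches on CPU. The helper names (`batched`, `translate`), the batch size, and the `prefix` parameter are assumptions for illustration, not the project's actual code. The first call downloads the model into the Hugging Face cache; later calls reuse it.

```python
from typing import Iterator, List, Sequence

MODEL_NAME = "Helsinki-NLP/opus-mt-tc-big-en-pt"  # model named in this README


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate(texts: Sequence[str], batch_size: int = 8, prefix: str = "") -> List[str]:
    """Translate English strings on CPU (hypothetical helper)."""
    # Imported lazily so `batched` works even without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
    model = MarianMTModel.from_pretrained(MODEL_NAME)  # loads on CPU by default

    results: List[str] = []
    for batch in batched(texts, batch_size):
        # `prefix` is an assumption: some multi-target OPUS-MT models expect a
        # target-language token; check the model card before setting it.
        inputs = tokenizer([prefix + text for text in batch],
                           return_tensors="pt", padding=True, truncation=True)
        output_ids = model.generate(**inputs)
        results.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return results
```

Small batches keep peak memory low on a CPU-only machine, at the cost of more calls to `generate`.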
Installation
Clone the repository:
git clone https://github.com/EduardoXavier16/docling-book-translator.git
cd docling-book-translator
Create and activate a virtual environment (recommended):
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # Linux/macOS
Install dependencies:
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
Install via npm / npx
After publishing the package to npm, you (or other users) can run the CLI directly with npx (without cloning):
npx docling-book-translator
Or install it globally:
npm install -g docling-book-translator
docling-book-translator
Both commands invoke the Node wrapper, which:
- Verifies that a Python 3 executable is available (python or python3, or the value of the DBT_PYTHON environment variable).
- Delegates the interactive workflow to the underlying cli.py script.
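The lookup the wrapper performs can be sketched as follows. The real wrapper is written in Node; this sketch uses Python only to stay consistent with the rest of the project. The function name `find_python` and the lookup order (DBT_PYTHON first, then python, then python3) are assumptions, and the sketch does not check that the executable it finds is actually Python 3.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def find_python(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a Python executable path, or None if none is found.

    Assumed order: DBT_PYTHON override, then `python`, then `python3`.
    """
    override = env.get("DBT_PYTHON")
    if override:
        return override
    for name in ("python", "python3"):
        path = which(name)
        if path:
            return path
    return None
```

Passing `env` and `which` as parameters makes the lookup easy to exercise without touching the real environment.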
Usage (Python CLI)
Open a terminal in a folder that contains at least one PDF, for example:
cd C:\books\qa-testing
Then run the CLI from the project:
cd C:\projects\docling-translater
python cli.py
The CLI will:
- Ask for the input PDF:
  - If you press ENTER and there is a single .pdf in the current folder, it will use that file.
- Ask for the output folder:
  - ENTER = same folder as the PDF.
- Ask for title and author:
  - ENTER = <PDF name> (PT-BR) and Unknown.
- Show a short summary and ask for confirmation.
- Run the full pipeline:
  - PDF → Docling → document.md
  - Translation (streaming) → document_translated.md
  - Export → book_translated.epub
The final EPUB will be in:
<output-folder>/<book-id>/book_translated.epub
Where <book-id> defaults to the PDF file name without extension.
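The defaults described in this section can be sketched with the standard library alone. The function names below are hypothetical, and the sketch assumes that "<PDF name>" in the default title means the file name without its extension; cli.py may behave differently in the details.

```python
from pathlib import Path
from typing import Optional, Tuple


def detect_pdf(folder: Path) -> Optional[Path]:
    """Return the only .pdf in `folder`, or None if there are zero or several."""
    pdfs = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return pdfs[0] if len(pdfs) == 1 else None


def default_metadata(pdf: Path) -> Tuple[str, str]:
    """Default (title, author) used when the user just presses ENTER."""
    # Assumption: the title uses the file name without the .pdf extension.
    return f"{pdf.stem} (PT-BR)", "Unknown"


def epub_path(pdf: Path, output_folder: Optional[Path] = None) -> Path:
    """Build <output-folder>/<book-id>/book_translated.epub."""
    base = output_folder if output_folder is not None else pdf.parent
    return base / pdf.stem / "book_translated.epub"
```

For a file at /books/qa.pdf with no output folder given, `epub_path` returns /books/qa/book_translated.epub.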
CLI design (v2 – streaming markdown)
The entry point is cli.py, which orchestrates the pipeline in three steps:
- PDF preparation – prepare_pdf.py
  - Uses Docling to convert the PDF into document.md (plus HTML/JSON artifacts).
- Translation (streaming) – translation.py
  - Reads document.md block by block and writes the translated text to document_translated.md as it goes (no giant in‑memory buffer).
- EPUB export – export_epub.py
  - Converts document_translated.md into book_translated.epub.
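The streaming idea in the translation step can be sketched as a generator that yields one block at a time, so only the current block is held in memory. This is a minimal sketch, assuming that a "block" means a run of lines separated by blank lines; the names `iter_blocks` and `translate_file` are hypothetical, and the translator is passed in as a plain function.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator


def iter_blocks(lines: Iterable[str]) -> Iterator[str]:
    """Yield blank-line-separated blocks one at a time."""
    block: list[str] = []
    for line in lines:
        if line.strip():
            block.append(line.rstrip("\n"))
        elif block:
            yield "\n".join(block)
            block = []
    if block:  # flush the last block if the file does not end with a blank line
        yield "\n".join(block)


def translate_file(src: Path, dst: Path, translate: Callable[[str], str]) -> int:
    """Translate `src` block by block into `dst`; return the number of blocks."""
    count = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for block in iter_blocks(fin):
            # Each translated block is written immediately, never accumulated.
            fout.write(translate(block) + "\n\n")
            count += 1
    return count
```

Because output is written as each block finishes, an interrupted run still leaves a partially translated file on disk.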
Internally, some legacy scripts (like export_translated.py) may still exist
for experimentation, but the official CLI v2 flow is:
document.md → document_translated.md → book_translated.epub
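The last arrow in this flow, from translated Markdown to EPUB, could look roughly like the sketch below. It uses the markdown and ebooklib packages, and it assumes a single-chapter book. The function names, the file name book.xhtml, and the "pt-BR" language tag are assumptions for illustration; export_epub.py may be structured differently.

```python
from pathlib import Path


def first_heading(markdown_text: str, fallback: str) -> str:
    """Return the text of the first '#' heading, or `fallback` if none exists."""
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            title = stripped.lstrip("#").strip()
            if title:
                return title
    return fallback


def markdown_to_epub(md_path: Path, epub_path: Path, title: str, author: str) -> None:
    """Write a single-chapter EPUB from a Markdown file (hypothetical helper)."""
    # Imported lazily so `first_heading` works without these packages installed.
    import markdown
    from ebooklib import epub

    text = md_path.read_text(encoding="utf-8")

    book = epub.EpubBook()
    book.set_identifier(md_path.stem)
    book.set_title(title)
    book.set_language("pt-BR")
    book.add_author(author)

    chapter = epub.EpubHtml(
        title=first_heading(text, title), file_name="book.xhtml", lang="pt-BR"
    )
    chapter.content = markdown.markdown(text)  # Markdown -> HTML fragment
    book.add_item(chapter)

    # Navigation files and reading order.
    book.toc = (chapter,)
    book.add_item(epub.EpubNcx())
    book.add_item(epub.EpubNav())
    book.spine = ["nav", chapter]

    epub.write_epub(str(epub_path), book)
```

A real exporter would likely split the text into one chapter per top-level heading; a single chapter keeps the sketch short.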
Planned npm package
The long-term goal is to publish a Node.js wrapper on npm, named:
- Package: docling-book-translator
- CLI command: npx docling-book-translator
The npm CLI would:
- Ask the same questions as cli.py (input PDF, output folder, title, author).
- Internally call the Python CLI, ensuring that:
  - Python is installed,
  - This project (and its dependencies) are available.
This repository contains the Python core. A separate TypeScript/Node wrapper
can import or shell out to cli.py to expose the same UX on npm.
Licensing and third‑party components
- This project is released under the MIT License (see LICENSE).
- It uses third‑party components:
  - Docling – MIT License.
  - Hugging Face models (e.g. Helsinki-NLP/opus-mt-tc-big-en-pt) – see each model’s page on Hugging Face for specific license terms.
  - transformers, sentencepiece, ebooklib, markdown, tqdm, etc.
When publishing to npm or deploying in production, make sure that your usage of these components complies with their respective licenses.
Limitations
- Currently focuses on English → Brazilian Portuguese translation.
- Images and figures from the original PDF are not preserved in the EPUB. The output is text‑only.
- Some very complex PDFs (heavy graphics, scanned pages, etc.) may produce warnings from Docling about memory allocation or failing stages. In most cases, text extraction still succeeds, but portions of some pages might be incomplete.
These limitations are acceptable for a first version focused on reading technical/QA books in Portuguese on Kindle. Future versions can extend support for additional languages and image handling.
Contributing
Contributions are welcome. Suggested areas for improvement:
- Better handling of images and figures in the EPUB output.
- Additional language pairs, configurable via CLI flags.
- A robust Node.js wrapper and npm packaging.
- More detailed progress reporting (per chapter or per page).
Before opening a pull request, please:
- Run the existing scripts locally on at least one sample book.
- Keep Python code readable and small, with single‑responsibility modules.
- Avoid introducing breaking changes to the cli.py UX without discussion.
