n8n-nodes-tika
v0.1.2
n8n community node for Apache Tika — extract text, metadata, detect MIME types and languages from 1000+ document formats.
Apache Tika detects and extracts metadata and text from over a thousand different file types (PDF, DOCX, PPTX, XLSX, images, and more). This node lets you use Tika directly in your n8n workflows.
Operations
| Operation | Description |
|-----------|-------------|
| Extract Text | Extract text content from a document (plain text, HTML, or XML output) |
| Extract Metadata | Extract metadata (author, title, creation date, etc.) |
| Extract All (Recursive) | Extract text and metadata from archives/containers recursively |
| Detect MIME Type | Detect the MIME type of a file |
| Detect Language | Detect the language of text content |
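For orientation, each operation above has a counterpart among Tika Server's documented REST endpoints. The sketch below assumes the node calls those endpoints; its source is not shown here, so treat the mapping as an illustration. The function names `tika_endpoint` and `tika_call` and the `TIKA_URL` variable are hypothetical.

```shell
# Map an operation name from the table to a Tika Server endpoint.
# Assumption: the node uses these standard endpoints internally.
tika_endpoint() {
  case "$1" in
    "Extract Text")            echo "/tika" ;;
    "Extract Metadata")        echo "/meta" ;;
    "Extract All (Recursive)") echo "/rmeta" ;;
    "Detect MIME Type")        echo "/detect/stream" ;;
    "Detect Language")         echo "/language/stream" ;;
    *) echo "unknown operation: $1" >&2; return 1 ;;
  esac
}

# Upload a file (HTTP PUT via curl -T) to the endpoint for an operation.
# Usage: tika_call OPERATION FILE
tika_call() {
  endpoint="$(tika_endpoint "$1")" || return 1
  curl -fsS -T "$2" "${TIKA_URL:-http://localhost:9998}$endpoint"
}
```

With a server running, `tika_call "Detect MIME Type" sample.pdf` would upload `sample.pdf` (a placeholder file name) and print whatever the server returns for that endpoint.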
Prerequisites
You need a running Apache Tika server. The easiest way is Docker:
docker run -p 9998:9998 apache/tika:latest
Or add it to your docker-compose.yml:
tika:
  image: apache/tika:latest
  ports:
    - "9998:9998"
Installation
Community Node (recommended)
- Go to Settings (gear icon) > Community Nodes in your n8n instance
- Click Install a community node
- Enter n8n-nodes-tika
- Check the risk acknowledgment box and click Install
Note: If the Community Nodes option is not visible, add N8N_COMMUNITY_PACKAGES_ENABLED=true to your n8n environment variables. It is enabled by default in n8n 0.214+.
Manual Installation
cd ~/.n8n
npm install n8n-nodes-tika
Then restart n8n.
Configuration
- Create a new Tika API credential in n8n
- Set the Server URL to your Tika server:
  - Docker Compose (both services in the same network): http://tika:9998
  - Standalone Docker / local: http://localhost:9998
- Optionally adjust the Timeout for large documents (default: 30s)
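Before saving the credential, it can help to confirm that the Server URL is reachable from the machine running n8n. A minimal sketch, assuming `curl` is installed and using Tika Server's `/version` endpoint; the function name `check_tika` is hypothetical.

```shell
# Ask the Tika server for its version; any reply means the URL is reachable.
# Usage: check_tika [SERVER_URL] [TIMEOUT_SECONDS]
check_tika() {
  url="${1:-http://localhost:9998}"
  timeout="${2:-30}"   # mirrors the credential's default timeout of 30s
  # ${url%/} drops a single trailing slash so the path is not doubled.
  curl -fsS --max-time "$timeout" "${url%/}/version"
}
```

Run `check_tika http://localhost:9998` for a standalone container, or `check_tika http://tika:9998` from inside the Compose network. A non-zero exit status means the server did not answer within the timeout.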
Usage Examples
Extract text from a PDF
- Use an HTTP Request or Read Binary File node to get a PDF
- Connect it to the Tika node
- Set Operation to "Extract Text"
- The extracted text is available in {{ $json.text }}
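To check what Tika itself returns for a file, independent of n8n, the same extraction can be tried against the server directly. This is a sketch under the assumption that the node's output formats correspond to the `Accept` headers that Tika Server's `/tika` endpoint understands; `tika_extract_text` and `TIKA_URL` are hypothetical names.

```shell
# Extract text from a file by uploading it to Tika Server's /tika endpoint.
# Usage: tika_extract_text FILE [plain|html|xml]
tika_extract_text() {
  file="$1"
  case "${2:-plain}" in
    plain) accept="text/plain" ;;
    html)  accept="text/html" ;;
    xml)   accept="text/xml" ;;
    *) echo "unsupported format: $2" >&2; return 1 ;;
  esac
  # curl -T sends the file as the body of an HTTP PUT request.
  curl -fsS -T "$file" -H "Accept: $accept" \
    "${TIKA_URL:-http://localhost:9998}/tika"
}
```

For example, `tika_extract_text report.pdf` prints the plain text of `report.pdf` (a placeholder file name), which is useful for comparing against the node's output when debugging a workflow.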
Extract metadata from a document
- Get a document file via any binary-producing node
- Set Operation to "Extract Metadata"
- Access metadata fields like {{ $json["Content-Type"] }}, {{ $json["dc:title"] }}, etc.
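The available metadata keys vary by document type, so it can be useful to list them for a sample file first. The sketch below queries Tika Server's `/meta` endpoint directly; it assumes the node exposes the same keys that Tika returns. The function names and `TIKA_URL` are hypothetical.

```shell
# Print all metadata for a file as a JSON object.
# Usage: tika_metadata FILE
tika_metadata() {
  curl -fsS -T "$1" -H "Accept: application/json" \
    "${TIKA_URL:-http://localhost:9998}/meta"
}

# Print a single metadata field, e.g. Content-Type or dc:title.
# Usage: tika_metadata_field FILE FIELD
tika_metadata_field() {
  curl -fsS -T "$1" -H "Accept: application/json" \
    "${TIKA_URL:-http://localhost:9998}/meta/$2"
}
```

Running `tika_metadata report.docx` against a sample document shows which keys are present for that file type before you reference them in an n8n expression.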
OCR an image
- Get an image file
- Set Operation to "Extract Text"
- Under Options, set OCR Language to the appropriate language code
- Tika will use Tesseract OCR to extract text from the image
Options
| Option | Description |
|--------|-------------|
| Output Format | Plain text, HTML, or XML (for Extract Text only) |
| OCR Language | Tesseract language code for OCR (default: eng) |
| OCR Strategy | auto, no_ocr, ocr_only, or ocr_and_text |
| Content Type Override | Manually set the document MIME type |
| Password | Password for encrypted documents |
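Several of these options resemble request headers that Tika Server accepts. The following sketch assumes the node forwards them as `X-Tika-OCRLanguage`, `X-Tika-PDFOcrStrategy`, `Content-Type`, and `Password`; that mapping is a guess, not taken from the node's source. Tika's own name for the combined OCR strategy is `ocr_and_text_extraction`, so the node's labels may not match the header values one-to-one. `tika_extract_with_options` and `TIKA_URL` are hypothetical names.

```shell
# Send a document to /tika with option values passed as request headers.
# Usage: tika_extract_with_options FILE [OCR_LANG] [OCR_STRATEGY] [CONTENT_TYPE] [PASSWORD]
tika_extract_with_options() {
  file="$1"
  ocr_lang="${2:-eng}"       # Tesseract language code
  ocr_strategy="${3:-auto}"  # applies to PDF parsing
  content_type="${4:-}"
  password="${5:-}"

  # Reuse the positional parameters as the list of curl header arguments.
  set -- -H "Accept: text/plain" \
         -H "X-Tika-OCRLanguage: $ocr_lang" \
         -H "X-Tika-PDFOcrStrategy: $ocr_strategy"
  if [ -n "$content_type" ]; then
    set -- "$@" -H "Content-Type: $content_type"
  fi
  if [ -n "$password" ]; then
    set -- "$@" -H "Password: $password"
  fi

  curl -fsS -T "$file" "$@" "${TIKA_URL:-http://localhost:9998}/tika"
}
```

For instance, `tika_extract_with_options scan.pdf deu ocr_only` would request OCR-only extraction with the German Tesseract model, provided that language pack is installed in the Tika image.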
Compatibility
- n8n version: 0.5+
- Node.js: 18+
