@agentxin-ai/plugin-unstructured
v0.0.3
AgentXin Plugin: Unstructured
`@agentxin-ai/plugin-unstructured` brings the [Unstructured](https://www.unstructured.io/) document partitioning API into the [AgentXin AI](https://github.com/agentxin-ai/agentxin) platform. It lets ingestion workflows call Unstructured’s `/general/v0/general` endpoint, convert PDFs and mixed-format documents into markdown chunks, persist image/table assets to the AgentXin file system, and return `Document<ChunkMetadata>` objects that are ready for downstream RAG pipelines.
Installation
pnpm add @agentxin-ai/plugin-unstructured
# or
npm install @agentxin-ai/plugin-unstructured
Peer dependencies: `@agentxin-ai/plugin-sdk`, `@nestjs/common@^11`, `@nestjs/config@^4`, `@metad/contracts`, `@langchain/core@^0.3.72`, `chalk@4`, `lodash-es`, and `zod`. Install them in the host service if they are not already available.
Quick Start
Collect Unstructured credentials
- Create an API key from the Unstructured dashboard.
- Default base URL: `https://api.unstructuredapp.io`.
Configure the integration
- Via AgentXin AI System: Add an integration of type `unstructured` and fill in the fields below.
- Via environment variables (fallback when no integration is stored):
export UNSTRUCTURED_API_BASE_URL=https://api.unstructuredapp.io
export UNSTRUCTURED_API_TOKEN=your-unstructured-api-key
Example integration JSON:
{ "provider": "unstructured", "options": { "apiUrl": "https://api.unstructuredapp.io", "apiKey": "your-unstructured-api-key" } }
Register the plugin with the AgentXin runtime
PLUGINS=@agentxin-ai/plugin-unstructured
`register()` wires up the `UnstructuredPlugin` NestJS module globally and logs lifecycle events during `onStart`/`onStop`.
Trigger a document transformation
When a knowledge ingestion job selects the “Unstructured” transformer, the plugin reads the source file via `XpFileSystem`, sends it to Unstructured, converts the response to markdown chunks, and writes referenced images/tables under the same folder.
To smoke-test credentials you can call the built-in controller:
curl -X POST http://localhost:3000/unstructured/test \
  -H 'Content-Type: application/json' \
  -d '{"options":{"apiUrl":"https://api.unstructuredapp.io","apiKey":"YOUR_KEY"}}'
A `401` indicates malformed credentials (returned as `BadRequestException`), while `200` confirms partition access.
Integration Options
| Field | Type | Description | Required | Default |
| ------ | ------ | ---------------------------------------- | -------- | ------------------------------- |
| apiUrl | string | Base URL of your Unstructured deployment | No | https://api.unstructuredapp.io |
| apiKey | string | API key used for apiKeyAuth headers | Yes | — |
Environment variables
`UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_API_TOKEN` are used when no integration config is injected; explicit integration values take precedence.
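The precedence described above can be sketched as a small resolver. This is a minimal illustration, not the plugin's actual code: `resolveUnstructuredConfig`, `UnstructuredOptions`, and `ResolvedConfig` are hypothetical names, and the per-field fallback (rather than all-or-nothing) is an assumption.

```typescript
// Hypothetical shapes for illustration; the plugin's real types may differ.
interface UnstructuredOptions {
  apiUrl?: string;
  apiKey?: string;
}

interface ResolvedConfig {
  apiUrl: string;
  apiKey: string;
}

// Default base URL taken from the Integration Options table.
const DEFAULT_API_URL = 'https://api.unstructuredapp.io';

// Explicit integration values win, environment variables are the fallback,
// and the documented default base URL is used last.
// Assumption: fallback is applied per field.
function resolveUnstructuredConfig(
  integration: UnstructuredOptions | undefined,
  env: Record<string, string | undefined>,
): ResolvedConfig {
  const apiUrl =
    integration?.apiUrl || env['UNSTRUCTURED_API_BASE_URL'] || DEFAULT_API_URL;
  const apiKey = integration?.apiKey || env['UNSTRUCTURED_API_TOKEN'];
  if (!apiKey) {
    // apiKey is marked as required in the options table.
    throw new Error('Unstructured API key is required');
  }
  return { apiUrl, apiKey };
}
```

In a host service, the second argument would typically be `process.env`; passing it in explicitly keeps the function easy to test.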
Transformer Parameters
`UnstructuredTransformerStrategy` forwards the following options to `partitionParameters`:
| Option | Type | Default | Notes |
| -------------------------- | ------------------- | ---------------- | ----- |
| chunkingStrategy | 'basic' \| 'by_title' \| 'by_page' \| 'by_similarity' | undefined | Matches Unstructured chunking presets. |
| maxCharacters | number | 1000 | Hard cap per chunk when chunking is enabled. |
| overlap | number | 0 | Characters from the tail of each chunk that are prefixed to the next chunk for continuity. |
| strategy | 'auto' \| 'fast' \| 'hi_res' \| 'ocr_only' \| 'vlm' | 'auto' | Parsing pipeline selection. |
| languages | string[] | ['chi_sim','eng'] | Accepts any Tesseract language code (see Unstructured docs). |
| splitPdfPage | boolean | false | Client-side page splitting hint; ignored by backend. |
| splitPdfConcurrencyLevel | number | 2 | Max concurrent requests when splitPdfPage is enabled. |
Every request automatically sets `extractImageBlockTypes` to `['Image','Table']` so that generated figures/tables can be written back as assets.
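As a rough sketch of how the defaults in the table and the forced `extractImageBlockTypes` value might combine, consider the helper below. `buildPartitionParameters` and the two interfaces are hypothetical names introduced for illustration; only the option names and default values come from the table above.

```typescript
type ChunkingStrategy = 'basic' | 'by_title' | 'by_page' | 'by_similarity';
type ParsingStrategy = 'auto' | 'fast' | 'hi_res' | 'ocr_only' | 'vlm';

// Hypothetical input shape mirroring the Transformer Parameters table.
interface TransformerOptions {
  chunkingStrategy?: ChunkingStrategy;
  maxCharacters?: number;
  overlap?: number;
  strategy?: ParsingStrategy;
  languages?: string[];
  splitPdfPage?: boolean;
  splitPdfConcurrencyLevel?: number;
}

// Hypothetical output shape: every option resolved, plus the forced field.
interface PartitionParameters {
  chunkingStrategy: ChunkingStrategy | undefined;
  maxCharacters: number;
  overlap: number;
  strategy: ParsingStrategy;
  languages: string[];
  splitPdfPage: boolean;
  splitPdfConcurrencyLevel: number;
  extractImageBlockTypes: string[];
}

function buildPartitionParameters(
  options: TransformerOptions = {},
): PartitionParameters {
  return {
    // No default: chunking stays disabled unless a strategy is chosen.
    chunkingStrategy: options.chunkingStrategy,
    maxCharacters: options.maxCharacters ?? 1000,
    overlap: options.overlap ?? 0,
    strategy: options.strategy ?? 'auto',
    languages: options.languages ?? ['chi_sim', 'eng'],
    splitPdfPage: options.splitPdfPage ?? false,
    splitPdfConcurrencyLevel: options.splitPdfConcurrencyLevel ?? 2,
    // Always set, regardless of caller input, so images and tables
    // come back with data that can be written out as assets.
    extractImageBlockTypes: ['Image', 'Table'],
  };
}
```

Using `??` rather than `||` keeps explicit falsy values such as `overlap: 0` or `splitPdfPage: false` intact.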
Permissions
- Integration: Requires `integration:unstructured` permission to read API credentials during ingestion.
- File System: Needs `read`/`write`/`list` on `XpFileSystem` to fetch source binaries and persist derived image/table resources.
Ensure your policy grants both permissions to avoid runtime failures.
Output Content
- Markdown chunks: Each Unstructured element is converted into markdown (headings, paragraphs, bullet lists), then merged into a `Document` instance.
- Asset manifest: When `Image` or `Table` elements include `image_base64`, the plugin stores the decoded PNG under `<source-folder>/images/<element-id>.png`, returns the public URL, and exposes it as `metadata.assets`.
- Metadata: Every chunk contains `{ parser: 'unstructured', source: <filePath> }`, enabling traceability and downstream splitting if desired.
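The output shape described in the list above could be approximated as follows. This is a simplified sketch under several assumptions: `UnstructuredElement`, `Asset`, `elementToMarkdown`, and `toDocument` are hypothetical names, the element shape is reduced to a few fields, the markdown mapping covers only three element types, and `metadata.assets` holds storage paths here rather than the public URLs the plugin returns.

```typescript
// Hypothetical, simplified element shape; real responses carry more fields.
interface UnstructuredElement {
  type: string;
  element_id: string;
  text: string;
  metadata?: { image_base64?: string };
}

interface Asset {
  path: string;
  bytes: Uint8Array;
}

// Assumed mapping: titles become headings, list items become bullets,
// everything else is emitted as a plain paragraph.
function elementToMarkdown(el: UnstructuredElement): string {
  switch (el.type) {
    case 'Title':
      return `# ${el.text}`;
    case 'ListItem':
      return `- ${el.text}`;
    default:
      return el.text;
  }
}

function decodeBase64(b64: string): Uint8Array {
  return Uint8Array.from(atob(b64), (ch) => ch.charCodeAt(0));
}

function toDocument(elements: UnstructuredElement[], filePath: string) {
  const slash = filePath.lastIndexOf('/');
  const sourceFolder = slash >= 0 ? filePath.slice(0, slash) : '.';
  const assets: Asset[] = [];
  const parts: string[] = [];

  for (const el of elements) {
    const b64 = el.metadata?.image_base64;
    if ((el.type === 'Image' || el.type === 'Table') && b64) {
      // Path layout from the README: <source-folder>/images/<element-id>.png
      assets.push({
        path: `${sourceFolder}/images/${el.element_id}.png`,
        bytes: decodeBase64(b64),
      });
    }
    const md = elementToMarkdown(el);
    if (md) parts.push(md);
  }

  const document = {
    pageContent: parts.join('\n\n'),
    metadata: {
      parser: 'unstructured',
      source: filePath,
      assets: assets.map((a) => a.path), // simplification: paths, not URLs
    },
  };
  return { document, assets };
}
```

The caller would then write each `Asset` through the file system abstraction and swap the paths for whatever public URLs it returns.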
Development & Debugging
npm install
npx nx build @agentxin-ai/plugin-unstructured
npx nx test @agentxin-ai/plugin-unstructured
Build artifacts land in `packages/unstructured/dist`. Ensure the published package includes `package.json`, compiled JS, and type declarations.
License
This plugin follows the repository’s AGPL-3.0 License.
