@asi-ai/article-format-extractor

v0.1.0

Published

13 days ago

Deterministic DOCX format profile extractor for Node.js and CLI workflows.

zongzack

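A deterministic DOCX format profile extractor for Node.js and command-line workflows.

This package reads DOCX OpenXML files and outputs a stable DocumentFormatProfile JSON that describes format facts such as page setup, heading styles, body paragraph styles, and numbering definitions.

It is intended as a low-level building block for format libraries, format comparison, contract and document format checks, and downstream automated revision workflows.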

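Features

DOCX only; parsing is based on OpenXML package parts.
Provides both a Node API and a CLI.
Outputs a stable schemaVersion: "1.0.0".
Ships a JSON Schema for validating extraction results.
Provides a structured warning and error contract.
Retains evidence IDs for deterministic format facts.
Output is stable, which suits snapshot tests, golden tests, and batch processing.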
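
Stable output is what makes golden-file testing practical. The following is a minimal sketch of such a comparison, assuming only that a profile is plain JSON. The helpers canonicalize, toGoldenText, and matchesGolden are hypothetical and not part of the package; sorting keys is an extra precaution so that the comparison does not depend on key insertion order.

```typescript
// Hypothetical helpers for golden-file comparison; not part of the package API.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Recursively sort object keys so serialization does not depend on insertion order.
function canonicalize(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map(canonicalize);
  }
  if (value !== null && typeof value === 'object') {
    const sorted: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = canonicalize(value[key]);
    }
    return sorted;
  }
  return value;
}

// Serialize a profile into the text form stored in a golden file.
function toGoldenText(profile: Json): string {
  return JSON.stringify(canonicalize(profile), null, 2) + '\n';
}

// Compare a freshly extracted profile against previously stored golden text.
function matchesGolden(profile: Json, goldenText: string): boolean {
  return toGoldenText(profile) === goldenText;
}
```

In a test, the golden text would typically be read from a checked-in file and regenerated deliberately whenever the expected profile changes.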

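Scope

The current 1.0 version focuses on format facts that can be read deterministically from DOCX:

word/document.xml
word/styles.xml
word/numbering.xml
section properties

The current version does not parse PDF, Markdown, HTML, images, or URLs, does not perform OCR or semantic section detection, and does not attempt to fully reproduce Word's rendering.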

Installation

npm install @asi-ai/article-format-extractor

Requires Node.js 20 or later.

CLI usage

article-format-extractor input.docx

Pretty-print the JSON output:

article-format-extractor input.docx --pretty

Write the result to a specified file:

article-format-extractor input.docx --output profile.json --pretty

Include raw OpenXML fragments for debugging:

article-format-extractor input.docx --include-raw

Disable body-text examples:

article-format-extractor input.docx --include-examples false

CLI output rules

On success without --output, the DocumentFormatProfile JSON is written to stdout.

On success with --output, stdout is empty and the JSON is written to the specified file.

Warnings and errors are written to stderr as JSON Lines:

{ "type": "warning", "warning": { "code": "MULTIPLE_SECTIONS_NOT_FULLY_SUPPORTED" } }
{ "type": "error", "error": { "code": "UNSUPPORTED_FILE_TYPE" } }
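
A caller that captures stderr could split these records back into warnings and errors. The sketch below relies only on the two sample lines shown here: each record has a type field and a matching warning or error object carrying a code. Any other field is an assumption, and parseDiagnostics is a hypothetical helper, not something the package exports.

```typescript
// Shapes inferred from the two sample lines; fields other than `code` are assumptions.
interface Diagnostic {
  code: string;
  [extra: string]: unknown;
}

interface ParsedDiagnostics {
  warnings: Diagnostic[];
  errors: Diagnostic[];
  unparsed: string[]; // lines that were not recognizable diagnostic records
}

function isDiagnostic(value: unknown): value is Diagnostic {
  return typeof value === 'object' && value !== null &&
    typeof (value as { code?: unknown }).code === 'string';
}

// Hypothetical helper: split captured stderr text into warnings and errors.
function parseDiagnostics(stderrText: string): ParsedDiagnostics {
  const result: ParsedDiagnostics = { warnings: [], errors: [], unparsed: [] };
  for (const line of stderrText.split(/\r?\n/)) {
    if (line.trim() === '') continue; // skip blank lines between records
    let record: unknown;
    try {
      record = JSON.parse(line);
    } catch {
      result.unparsed.push(line);
      continue;
    }
    if (typeof record !== 'object' || record === null) {
      result.unparsed.push(line);
      continue;
    }
    const entry = record as { type?: unknown; warning?: unknown; error?: unknown };
    if (entry.type === 'warning' && isDiagnostic(entry.warning)) {
      result.warnings.push(entry.warning);
    } else if (entry.type === 'error' && isDiagnostic(entry.error)) {
      result.errors.push(entry.error);
    } else {
      result.unparsed.push(line);
    }
  }
  return result;
}
```

Keeping unrecognized lines in unparsed, instead of throwing, means unexpected stderr output does not hide the diagnostics that did parse.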

Exit codes:

| Exit code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | Internal error or schema validation failure |
| 2 | Invalid user input or arguments |
| 3 | Document read or parse failure |
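
A wrapper script might translate these exit codes into categories before deciding what to log or retry. This is a small sketch based only on the table above; the category names and the classifyExitCode function are made up for illustration.

```typescript
// Category names are illustrative; only the numeric codes come from the exit code table.
type ExitCategory = 'success' | 'internal' | 'usage' | 'document' | 'unknown';

// Hypothetical helper: map a child process exit code to a category.
function classifyExitCode(code: number | null): ExitCategory {
  switch (code) {
    case 0:
      return 'success';
    case 1:
      return 'internal'; // internal error or schema validation failure
    case 2:
      return 'usage'; // invalid user input or arguments
    case 3:
      return 'document'; // document read or parse failure
    default:
      return 'unknown'; // null (for example, terminated by a signal) or an undocumented code
  }
}
```

The null case is included because Node.js reports a null exit code when a child process is terminated by a signal.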

Node API

import { extractDocumentFormat } from '@asi-ai/article-format-extractor';

const result = await extractDocumentFormat({
  type: 'path',
  path: '/path/to/input.docx',
});

if (result.ok) {
  console.log(result.profile);
  console.log(result.warnings);
} else {
  console.error(result.error);
}

Using Buffer input:

import { readFile } from 'node:fs/promises';
import { extractDocumentFormat } from '@asi-ai/article-format-extractor';

const data = await readFile('/path/to/input.docx');
const result = await extractDocumentFormat({
  type: 'buffer',
  data,
  fileName: 'input.docx',
});

Optional parameters:

const result = await extractDocumentFormat(input, {
  parser: 'docx-openxml',
  includeRaw: false,
  includeExamples: true,
  timeoutMs: 30_000,
});

Schema

Import the built-in schema:

import { documentFormatProfileSchema } from '@asi-ai/article-format-extractor/schema';

The JSON file can also be referenced directly through its subpath:

import schema from '@asi-ai/article-format-extractor/schema.json' with { type: 'json' };

The package also exports a convenience validation function:

import { validateDocumentFormatProfile } from '@asi-ai/article-format-extractor';

const validation = validateDocumentFormatProfile(profile);
if (!validation.ok) {
  console.error(validation.errors);
}

Output structure

Extraction returns one of two shapes, distinguished by ok:

type ExtractResult =
  | {
      ok: true;
      profile: DocumentFormatProfile;
      warnings: FormatWarning[];
    }
  | {
      ok: false;
      error: FormatError;
      warnings: FormatWarning[];
      partialProfile?: DocumentFormatProfile;
    };
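
Because ExtractResult is a union discriminated by ok, TypeScript narrows each branch automatically. The sketch below shows one way a caller might flatten both branches into a single summary. The three interfaces are minimal stand-ins that declare only a code or schemaVersion field (an assumption; the real types carry more), and summarizeResult is a hypothetical helper.

```typescript
// Minimal stand-ins for the package types; only the fields used here are declared.
interface DocumentFormatProfile { schemaVersion: string }
interface FormatWarning { code: string }
interface FormatError { code: string }

type ExtractResult =
  | { ok: true; profile: DocumentFormatProfile; warnings: FormatWarning[] }
  | { ok: false; error: FormatError; warnings: FormatWarning[]; partialProfile?: DocumentFormatProfile };

interface Outcome {
  profile: DocumentFormatProfile | undefined; // full profile, partial profile, or nothing
  complete: boolean; // true only when extraction succeeded
  codes: string[]; // error code first (if any), then warning codes
}

// Hypothetical helper: collapse both branches of the union into one summary.
function summarizeResult(result: ExtractResult): Outcome {
  const warningCodes = result.warnings.map((warning) => warning.code);
  if (result.ok) {
    // Narrowed to the success branch: `profile` is available, `error` is not.
    return { profile: result.profile, complete: true, codes: warningCodes };
  }
  // Narrowed to the failure branch: `partialProfile` may or may not be present.
  return {
    profile: result.partialProfile,
    complete: false,
    codes: [result.error.code, ...warningCodes],
  };
}
```

Callers that must never act on incomplete data can check complete before touching profile.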

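profile contains the following core fields:

| Field | Description |
| --- | --- |
| schemaVersion | Fixed at "1.0.0" |
| source | Input source information, such as input type, file name, MIME type, and file size |
| parser | Parser name and deterministic mode |
| confidence | Aggregate of the RuleValue confidence values |
| page | Paper size, orientation, and margins, output when readable |
| headingStyles | Deterministic heading style facts |
| paragraphStyles | Body paragraph style facts |
| listStyles | Numbering definitions from word/numbering.xml |
| evidence | Stable evidence records referenced by format facts |
| raw | Optional debugging data, output when includeRaw is enabled |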

Example output snippet

{
  "schemaVersion": "1.0.0",
  "parser": {
    "name": "docx-openxml",
    "version": "0.1.0",
    "mode": "deterministic"
  },
  "paragraphStyles": [
    {
      "role": "body",
      "styleId": "Normal",
      "evidenceIds": ["ev:word_document:paragraph_0:body_role"],
      "examples": []
    }
  ],
  "listStyles": [],
  "evidence": []
}
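
Style facts point at evidence records through evidenceIds, so a consumer may want to confirm that every referenced ID resolves. The sketch below assumes each evidence record exposes its identifier in an id field, which the snippet above does not show, and findDanglingEvidenceIds is a hypothetical helper.

```typescript
// `id` is an assumed field name for evidence records; the snippet does not show one.
interface EvidenceRecord { id: string }
// Any style fact that carries evidence references, such as a paragraphStyles entry.
interface StyleFact { evidenceIds: string[] }

// Hypothetical helper: list referenced evidence IDs that have no matching record.
function findDanglingEvidenceIds(facts: StyleFact[], evidence: EvidenceRecord[]): string[] {
  const known = new Set(evidence.map((record) => record.id));
  const dangling = new Set<string>();
  for (const fact of facts) {
    for (const id of fact.evidenceIds) {
      if (!known.has(id)) {
        dangling.add(id);
      }
    }
  }
  return Array.from(dangling).sort(); // sorted so the report is stable across runs
}
```

Applied to the snippet above as printed, the check would report the one referenced ID, because the evidence array there is shown empty.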

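Warning

A warning indicates a recoverable degradation and does not cause result.ok to become false.

Common warning codes:

| Code | Meaning |
| --- | --- |
| STYLES_PART_MISSING | word/styles.xml is missing |
| STYLES_PART_UNREADABLE | word/styles.xml cannot be parsed |
| NUMBERING_PART_UNREADABLE | word/numbering.xml cannot be parsed |
| SECTION_PROPERTIES_MISSING | No readable section properties were found |
| MULTIPLE_SECTIONS_NOT_FULLY_SUPPORTED | The document has multiple sections; only the first readable section is currently output |
| UNSUPPORTED_STYLE_FEATURE | A style feature outside the 1.0 deterministic subset was encountered |
| RAW_OUTPUT_CONTAINS_SOURCE_CONTENT | The raw output may contain source document content |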
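
Since warnings never flip result.ok, a stricter pipeline has to decide for itself which codes it tolerates. The following sketch counts warnings by code and filters out the ones a caller has chosen to treat as blocking. The FormatWarning stand-in declares only code (an assumption), and both helpers are hypothetical.

```typescript
// Minimal stand-in for the package's warning type; only `code` is assumed.
interface FormatWarning { code: string }

// Hypothetical helper: count how often each warning code occurs, e.g. across a batch.
function countWarningsByCode(warnings: FormatWarning[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const warning of warnings) {
    counts.set(warning.code, (counts.get(warning.code) ?? 0) + 1);
  }
  return counts;
}

// Hypothetical helper: keep only the warnings the caller treats as blocking.
function findBlockingWarnings(
  warnings: FormatWarning[],
  blockingCodes: ReadonlySet<string>,
): FormatWarning[] {
  return warnings.filter((warning) => blockingCodes.has(warning.code));
}
```

A contract-checking job, for example, might treat MULTIPLE_SECTIONS_NOT_FULLY_SUPPORTED as blocking, since only the first readable section is reported.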

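Error

Fatal failures return ok: false together with a stable error code.

Common error codes:

| Code | Meaning |
| --- | --- |
| UNSUPPORTED_INPUT_TYPE | The input type is not path or buffer |
| UNSUPPORTED_FILE_TYPE | The input is not a DOCX document |
| UNSUPPORTED_PARSER | The specified parser is not supported |
| FILE_NOT_FOUND | The input path does not exist |
| INPUT_TOO_LARGE | The input exceeds the configured limit |
| DOCX_CORRUPTED | The DOCX archive is corrupted or unreadable |
| DOCX_ENCRYPTED | The DOCX appears to be encrypted |
| XML_PARSE_FAILED | A required XML part cannot be parsed |
| REQUIRED_PART_MISSING | A required OpenXML part is missing |
| EMPTY_DOCUMENT | No parsable body paragraphs were found |
| PARSER_TIMEOUT | Parsing exceeded the configured timeout |
| ZIP_BOMB_SUSPECTED | The archive expands abnormally and may be a zip bomb |
| SCHEMA_VALIDATION_FAILED | The internal output failed schema validation |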
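
Stable error codes lend themselves to a literal union type, so that a switch over them can be checked by the compiler. The sketch below copies the codes from the table above. Since the table lists common codes and may not be exhaustive, the type guard lets unknown codes fall through safely. KNOWN_ERROR_CODES, KnownErrorCode, and isKnownErrorCode are hypothetical names, not package exports.

```typescript
// Codes copied from the error table; the package may define others.
const KNOWN_ERROR_CODES = [
  'UNSUPPORTED_INPUT_TYPE',
  'UNSUPPORTED_FILE_TYPE',
  'UNSUPPORTED_PARSER',
  'FILE_NOT_FOUND',
  'INPUT_TOO_LARGE',
  'DOCX_CORRUPTED',
  'DOCX_ENCRYPTED',
  'XML_PARSE_FAILED',
  'REQUIRED_PART_MISSING',
  'EMPTY_DOCUMENT',
  'PARSER_TIMEOUT',
  'ZIP_BOMB_SUSPECTED',
  'SCHEMA_VALIDATION_FAILED',
] as const;

// Union of the literal strings above.
type KnownErrorCode = (typeof KNOWN_ERROR_CODES)[number];

// Hypothetical type guard: narrows an arbitrary string to a code from the table.
function isKnownErrorCode(code: string): code is KnownErrorCode {
  return (KNOWN_ERROR_CODES as readonly string[]).indexOf(code) !== -1;
}
```

After narrowing with the guard, a switch on the code can use a never-typed default branch to flag any table entry that has not been handled.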

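Local development

Install dependencies:

npm install

Run the tests:

npm test

Type-check:

npm run typecheck

Build:

npm run build

Preview the npm package contents:

npm pack --dry-run

Publishing

The package is configured as a public scoped package:

npm publish --access public

The prepack script automatically runs the tests, type checking, and the build before packing or publishing.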

License

MIT
