@asi-ai/article-format-extractor

v0.1.0

Deterministic DOCX format profile extractor for Node.js and CLI workflows.


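This package reads DOCX OpenXML files and outputs a stable DocumentFormatProfile JSON that describes format facts such as page setup, heading styles, body paragraph styles, and numbering definitions.

It is intended as a low-level building block for format libraries, format comparison, contract and document format checks, and downstream automated revision workflows.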

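Features

  • DOCX only, parsed from the OpenXML package parts.
  • Provides both a Node API and a CLI.
  • Emits a stable schemaVersion: "1.0.0".
  • Ships a JSON Schema for validating extraction results.
  • Provides a structured warning / error contract.
  • Preserves evidence IDs for deterministic format facts.
  • Output is stable, which suits snapshots, golden tests, and batch processing.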

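Supported scope

The current 1.0 version focuses on format facts that can be read deterministically from a DOCX file:

  • word/document.xml
  • word/styles.xml
  • word/numbering.xml
  • section properties

The current version does not handle PDF, Markdown, HTML, images, URLs, OCR, or semantic sections, and it does not attempt to fully reproduce Word's rendering.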

Installation

npm install @asi-ai/article-format-extractor

Requires Node.js 20 or later.

CLI usage

article-format-extractor input.docx

Pretty-print the JSON output:

article-format-extractor input.docx --pretty

Write to a specific file:

article-format-extractor input.docx --output profile.json --pretty

Include raw OpenXML fragments for debugging:

article-format-extractor input.docx --include-raw

Disable body examples:

article-format-extractor input.docx --include-examples false
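
When the CLI is driven from a script, it can help to build the argument list in one place. The following is a minimal sketch that uses only the flags shown above; the CliOptions interface and the buildCliArgs helper are hypothetical names introduced for illustration and are not part of the package.

```typescript
// Hypothetical helper, not part of the package: builds an argument list
// for the article-format-extractor CLI from the flags documented above.
interface CliOptions {
  output?: string; // maps to --output <file>
  pretty?: boolean; // maps to --pretty
  includeRaw?: boolean; // maps to --include-raw
  includeExamples?: boolean; // false maps to --include-examples false
}

function buildCliArgs(inputPath: string, options: CliOptions = {}): string[] {
  const args: string[] = [inputPath];
  if (options.output !== undefined) args.push("--output", options.output);
  if (options.pretty) args.push("--pretty");
  if (options.includeRaw) args.push("--include-raw");
  // Only the "false" form appears in the README, so nothing is emitted for true.
  if (options.includeExamples === false) args.push("--include-examples", "false");
  return args;
}

const cliArgs = buildCliArgs("input.docx", { output: "profile.json", pretty: true });
console.log(cliArgs.join(" "));
// prints: input.docx --output profile.json --pretty
```

The resulting array could then be passed to execFile from node:child_process, assuming the article-format-extractor binary is on the PATH.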

CLI output rules

On success without --output, the DocumentFormatProfile JSON is written to stdout.

On success with --output, stdout is empty and the JSON is written to the specified file.

Warnings and errors are written to stderr as JSON Lines:

{ "type": "warning", "warning": { "code": "MULTIPLE_SECTIONS_NOT_FULLY_SUPPORTED" } }
{ "type": "error", "error": { "code": "UNSUPPORTED_FILE_TYPE" } }
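
A caller that captures stderr can split it into lines and parse each one. The sketch below assumes only the two record shapes shown in the sample lines above; StderrRecord and parseStderrLines are hypothetical names, and any additional fields the CLI may emit are ignored.

```typescript
// Record shapes inferred from the two sample lines above (an assumption).
type StderrRecord =
  | { type: "warning"; warning: { code: string } }
  | { type: "error"; error: { code: string } };

// True when the value is an object carrying a string "code" field.
function hasCode(value: unknown): value is { code: string } {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { code?: unknown }).code === "string"
  );
}

function isStderrRecord(value: unknown): value is StderrRecord {
  if (typeof value !== "object" || value === null) return false;
  const record = value as { type?: unknown; warning?: unknown; error?: unknown };
  if (record.type === "warning") return hasCode(record.warning);
  if (record.type === "error") return hasCode(record.error);
  return false;
}

// Parses JSON Lines text, skipping blank lines and lines that are not
// valid JSON or do not match one of the two known shapes.
function parseStderrLines(stderrText: string): StderrRecord[] {
  const records: StderrRecord[] = [];
  for (const line of stderrText.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      continue;
    }
    if (isStderrRecord(parsed)) records.push(parsed);
  }
  return records;
}
```

Skipping unrecognized lines is a design choice made here for robustness; a stricter caller might prefer to fail on them.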

Exit codes:

| Exit code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | Internal error or schema validation failure |
| 2 | Invalid user input or arguments |
| 3 | Document read or parse failure |
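
A wrapper script can turn these exit codes into log messages. This is a small sketch based only on the table above; describeExitCode is a hypothetical helper, and the null case reflects how Node.js reports a child process that was terminated by a signal.

```typescript
// Hypothetical helper: maps the documented exit codes to short descriptions.
function describeExitCode(code: number | null): string {
  if (code === null) return "process terminated by a signal";
  switch (code) {
    case 0:
      return "success";
    case 1:
      return "internal error or schema validation failure";
    case 2:
      return "invalid user input or arguments";
    case 3:
      return "document read or parse failure";
    default:
      // Codes outside the documented table are reported as unknown.
      return `unknown exit code ${code}`;
  }
}

console.log(describeExitCode(3));
// prints: document read or parse failure
```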

Node API

import { extractDocumentFormat } from '@asi-ai/article-format-extractor';

const result = await extractDocumentFormat({
  type: 'path',
  path: '/path/to/input.docx',
});

if (result.ok) {
  console.log(result.profile);
  console.log(result.warnings);
} else {
  console.error(result.error);
}

Using Buffer input:

import { readFile } from 'node:fs/promises';
import { extractDocumentFormat } from '@asi-ai/article-format-extractor';

const data = await readFile('/path/to/input.docx');
const result = await extractDocumentFormat({
  type: 'buffer',
  data,
  fileName: 'input.docx',
});

Optional parameters:

const result = await extractDocumentFormat(input, {
  parser: 'docx-openxml',
  includeRaw: false,
  includeExamples: true,
  timeoutMs: 30_000,
});

Schema

Import the built-in schema:

import { documentFormatProfileSchema } from '@asi-ai/article-format-extractor/schema';

You can also import the JSON file subpath directly:

import schema from '@asi-ai/article-format-extractor/schema.json' with { type: 'json' };

The package also exports a convenience validation function:

import { validateDocumentFormatProfile } from '@asi-ai/article-format-extractor';

const validation = validateDocumentFormatProfile(profile);
if (!validation.ok) {
  console.error(validation.errors);
}

Output structure

Extraction returns an ExtractResult, which is discriminated by ok:

type ExtractResult =
  | {
      ok: true;
      profile: DocumentFormatProfile;
      warnings: FormatWarning[];
    }
  | {
      ok: false;
      error: FormatError;
      warnings: FormatWarning[];
      partialProfile?: DocumentFormatProfile;
    };
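
Because the result is a union discriminated by ok, TypeScript can narrow it in each branch. The sketch below re-declares minimal stand-in types so that it runs on its own; the code field on FormatWarning and FormatError is an assumption based on the stderr samples, and summarizeResult is a hypothetical helper.

```typescript
// Minimal stand-ins for the package types. Only fields that appear in
// this README are declared; the real types likely carry more.
interface FormatWarning { code: string }
interface FormatError { code: string }
interface DocumentFormatProfile { schemaVersion: string }

type ExtractResult =
  | { ok: true; profile: DocumentFormatProfile; warnings: FormatWarning[] }
  | {
      ok: false;
      error: FormatError;
      warnings: FormatWarning[];
      partialProfile?: DocumentFormatProfile;
    };

// Hypothetical helper: produces a one-line summary for either branch.
function summarizeResult(result: ExtractResult): string {
  // warnings is present on both branches, so it can be read before narrowing.
  const warningCodes = result.warnings.map((w) => w.code).join(", ") || "none";
  if (result.ok) {
    return `ok (schema ${result.profile.schemaVersion}), warnings: ${warningCodes}`;
  }
  const partial = result.partialProfile ? "with partial profile" : "no partial profile";
  return `failed: ${result.error.code} (${partial}), warnings: ${warningCodes}`;
}

console.log(summarizeResult({ ok: true, profile: { schemaVersion: "1.0.0" }, warnings: [] }));
// prints: ok (schema 1.0.0), warnings: none
```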

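profile contains the following core fields:

| Field | Description |
| --- | --- |
| schemaVersion | Fixed at "1.0.0" |
| source | Input source information, such as input type, file name, MIME type, and file size |
| parser | Parser name and deterministic mode |
| confidence | Aggregate of the RuleValue confidence values |
| page | Paper size, orientation, and margins, when readable |
| headingStyles | Deterministic heading style facts |
| paragraphStyles | Body paragraph style facts |
| listStyles | Numbering definitions from word/numbering.xml |
| evidence | Stable evidence records referenced by format facts |
| raw | Optional debug data, emitted when includeRaw is enabled |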

Example output fragment

{
  "schemaVersion": "1.0.0",
  "parser": {
    "name": "docx-openxml",
    "version": "0.1.0",
    "mode": "deterministic"
  },
  "paragraphStyles": [
    {
      "role": "body",
      "styleId": "Normal",
      "evidenceIds": ["ev:word_document:paragraph_0:body_role"],
      "examples": []
    }
  ],
  "listStyles": [],
  "evidence": []
}
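
As an illustration of consuming such a fragment, the sketch below groups paragraph style IDs by role. It declares only the fields visible in the example above; ParagraphStyleFact, ProfileFragment, and styleIdsByRole are hypothetical names, and the real profile type may differ.

```typescript
// Only the fields visible in the example fragment are declared here.
interface ParagraphStyleFact {
  role: string;
  styleId: string;
  evidenceIds: string[];
}

interface ProfileFragment {
  schemaVersion: string;
  paragraphStyles: ParagraphStyleFact[];
}

// Groups style IDs by their role, preserving the order of appearance.
function styleIdsByRole(profile: ProfileFragment): Record<string, string[]> {
  // A null-prototype object avoids clashes with inherited keys.
  const byRole: Record<string, string[]> = Object.create(null);
  for (const style of profile.paragraphStyles) {
    const ids = byRole[style.role] || [];
    ids.push(style.styleId);
    byRole[style.role] = ids;
  }
  return byRole;
}

const sampleFragment: ProfileFragment = {
  schemaVersion: "1.0.0",
  paragraphStyles: [
    {
      role: "body",
      styleId: "Normal",
      evidenceIds: ["ev:word_document:paragraph_0:body_role"],
    },
  ],
};

console.log(JSON.stringify(styleIdsByRole(sampleFragment)));
// prints: {"body":["Normal"]}
```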

Warning

A warning indicates a recoverable degradation and does not cause result.ok to become false.

Common warning codes:

| Code | Meaning |
| --- | --- |
| STYLES_PART_MISSING | word/styles.xml is missing |
| STYLES_PART_UNREADABLE | word/styles.xml could not be parsed |
| NUMBERING_PART_UNREADABLE | word/numbering.xml could not be parsed |
| SECTION_PROPERTIES_MISSING | No readable section properties were found |
| MULTIPLE_SECTIONS_NOT_FULLY_SUPPORTED | The document has multiple sections; only the first readable section is output |
| UNSUPPORTED_STYLE_FEATURE | A style feature outside the 1.0 deterministic subset was encountered |
| RAW_OUTPUT_CONTAINS_SOURCE_CONTENT | raw output may contain source document content |
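
Since warnings never flip result.ok, a pipeline that wants stricter behavior has to apply its own policy. The following is a sketch of one such policy; the choice of which codes count as blocking is purely hypothetical, and findBlockingWarnings is not part of the package.

```typescript
// Codes are copied from the table above. Treating these three as
// "blocking" is a hypothetical policy chosen only for illustration.
const BLOCKING_WARNING_CODES: string[] = [
  "STYLES_PART_UNREADABLE",
  "NUMBERING_PART_UNREADABLE",
  "MULTIPLE_SECTIONS_NOT_FULLY_SUPPORTED",
];

// Returns the codes of all warnings that the policy treats as blocking.
function findBlockingWarnings(warnings: { code: string }[]): string[] {
  return warnings
    .map((w) => w.code)
    .filter((code) => BLOCKING_WARNING_CODES.indexOf(code) !== -1);
}

const blocking = findBlockingWarnings([
  { code: "STYLES_PART_MISSING" },
  { code: "NUMBERING_PART_UNREADABLE" },
]);
console.log(blocking.join(","));
// prints: NUMBERING_PART_UNREADABLE
```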

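Error

A fatal failure returns ok: false and a stable error code.

Common error codes:

| Code | Meaning |
| --- | --- |
| UNSUPPORTED_INPUT_TYPE | Input type is not path or buffer |
| UNSUPPORTED_FILE_TYPE | Input is not a DOCX document |
| UNSUPPORTED_PARSER | The specified parser is not supported |
| FILE_NOT_FOUND | The input path does not exist |
| INPUT_TOO_LARGE | Input exceeds the configured limit |
| DOCX_CORRUPTED | The DOCX archive is corrupted or unreadable |
| DOCX_ENCRYPTED | The DOCX appears to be encrypted |
| XML_PARSE_FAILED | A required XML part could not be parsed |
| REQUIRED_PART_MISSING | A required OpenXML part is missing |
| EMPTY_DOCUMENT | No parsable body paragraphs were found |
| PARSER_TIMEOUT | Parsing exceeded the configured timeout |
| ZIP_BOMB_SUSPECTED | The archive expands abnormally; suspected zip bomb |
| SCHEMA_VALIDATION_FAILED | Internal output failed schema validation |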
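
The table lists common codes, so a caller may want to handle codes outside it defensively. The sketch below builds a union type from the listed codes and a type guard around it; KNOWN_ERROR_CODES, isKnownErrorCode, and describeErrorCode are hypothetical names, and the package may export its own types for this.

```typescript
// Codes copied from the table above; the list may not be exhaustive.
const KNOWN_ERROR_CODES = [
  "UNSUPPORTED_INPUT_TYPE",
  "UNSUPPORTED_FILE_TYPE",
  "UNSUPPORTED_PARSER",
  "FILE_NOT_FOUND",
  "INPUT_TOO_LARGE",
  "DOCX_CORRUPTED",
  "DOCX_ENCRYPTED",
  "XML_PARSE_FAILED",
  "REQUIRED_PART_MISSING",
  "EMPTY_DOCUMENT",
  "PARSER_TIMEOUT",
  "ZIP_BOMB_SUSPECTED",
  "SCHEMA_VALIDATION_FAILED",
] as const;

type KnownErrorCode = (typeof KNOWN_ERROR_CODES)[number];

// Type guard: narrows an arbitrary string to one of the listed codes.
function isKnownErrorCode(code: string): code is KnownErrorCode {
  return (KNOWN_ERROR_CODES as readonly string[]).indexOf(code) !== -1;
}

// Labels a code as documented or not, without failing on unknown values.
function describeErrorCode(code: string): string {
  return isKnownErrorCode(code)
    ? `documented error: ${code}`
    : `undocumented error: ${code}`;
}

console.log(describeErrorCode("DOCX_CORRUPTED"));
// prints: documented error: DOCX_CORRUPTED
```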

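Local development

Install dependencies:

npm install

Run tests:

npm test

Type check:

npm run typecheck

Build:

npm run build

Preview the npm package contents:

npm pack --dry-run

Publishing

The package is currently configured as a public scoped package:

npm publish --access public

The prepack script automatically runs tests, type checking, and the build before packing or publishing.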

License

MIT