information-atomizer

v0.2.0

Published

2 months ago

Node-first package for atomizing source text into reviewable atomic statements.

dkim693

information-atomizer

A Node.js library for decomposing source text into discrete, relationship-aware atomic statements.

information-atomizer ships with a built-in heuristic provider, hosted-provider subpaths for OpenAI, Azure OpenAI, Claude, Amazon Bedrock, Cohere, Mistral, MiniMax, and Vertex AI, and a single high-level atomize() API.

The Problem

Documents accumulate context. A five-page article may contain three distinct claims, each buried in paragraphs that exist only to provide background. When a person or AI system wants to reference one of those claims, they must pull in the surrounding noise: irrelevant history, implicit assumptions, transitional prose. The result is:

Knowledge that is hard to cite precisely
Statements that cannot be compared across sources
No explicit structure for when sources agree, disagree, or partially overlap

Markdown does not solve this. It organizes presentation, not meaning.

The Atom Model

An atom is the minimal unit of knowledge that satisfies three properties:

Self-contained — it can be read and evaluated without external context
Singular — it expresses exactly one claim
Addressable — it has a stable ID that can be referenced from other systems

type Atom = {
  id: string;        // stable unique identifier
  statement: string; // the claim itself
  tags: string[];    // domain classification
};

Atoms carry no prose context. They are designed to be linked, not read sequentially.
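As a rough illustration of the "addressable" property, the sketch below indexes atoms by ID so a consumer can dereference a single claim directly. It assumes only the Atom shape shown above; the indexAtoms helper and the sample IDs are hypothetical and not part of the library.

```typescript
type Atom = {
  id: string;        // stable unique identifier
  statement: string; // the claim itself
  tags: string[];    // domain classification
};

// Hypothetical sample atoms; the IDs are illustrative placeholders.
const atoms: Atom[] = [
  { id: "atom-1", statement: "Water boils at 100°C at sea level.", tags: ["physics"] },
  {
    id: "atom-2",
    statement: "The boiling point of water decreases at higher altitudes.",
    tags: ["physics"],
  },
];

// Index atoms by ID so other systems can look up one claim without its source document.
function indexAtoms(list: Atom[]): Map<string, Atom> {
  const byId = new Map<string, Atom>();
  for (const atom of list) {
    byId.set(atom.id, atom);
  }
  return byId;
}

const atomIndex = indexAtoms(atoms);
console.log(atomIndex.get("atom-2")?.statement);
```

Note that the second statement names its subject ("of water") explicitly, so it remains readable without the first atom.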

Atom Relationships

Atoms alone are not enough. Knowledge becomes navigable when atoms carry explicit relationships to each other:

| Relationship | Meaning |
|---|---|
| Supports | Atom B provides evidence or reasoning that strengthens Atom A |
| Conflicts | Atom B contradicts or challenges the claim in Atom A |
| Correlates | Atom B is related to Atom A without direct logical dependency |

When includeRelationships: true is passed to atomize(), the library generates relationship candidates between atoms within the same request. Cross-request and persistent relationship management belongs to the consumer layer (see meeting-chain and provenatlas). This library's responsibility is producing valid atoms from raw text.

Ecosystem

information-atomizer       ← this library: text → atoms
       ↑ used by
meeting-chain              ← live meeting capture → decision graph of atomized ideas
provenatlas                ← public provenance atlas of historical knowledge

meeting-chain uses this package to atomize discussion from live meetings, track which proposals support or conflict with the meeting goal, and surface a decision graph for post-meeting review.
provenatlas uses this package to atomize academic and historical sources, then visualizes how ideas have been formulated, referenced, supported, and challenged across time.

Install

pnpm add information-atomizer

Install only the optional SDKs for the hosted providers you plan to use:

pnpm add openai @anthropic-ai/sdk @aws-sdk/client-bedrock-runtime cohere-ai @mistralai/mistralai @google/genai

Quick Start

import { atomize, heuristicProvider, type Atom } from "information-atomizer";

const existingAtoms: Atom[] = [];

const result = await atomize(
  {
    text: "Water boils at 100°C at sea level. The boiling point decreases at higher altitudes.",
    existingTags: ["physics", "thermodynamics"],
    existingAtoms,
  },
  {
    provider: heuristicProvider,
    removeDuplicates: false,
  },
);

// result.candidates: Array of AtomizedCandidate
// Each candidate has: id, statement, tags, rationale, duplicateMatches

API

`atomize(input, options): Promise<AtomizeResult>`

Input fields:

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Source text to decompose |
| existingAtoms | Atom[] | No | Previously stored atoms for duplicate detection |
| existingTags | string[] | No | Tag vocabulary to guide classification |
| language | string | No | Language for output atoms (e.g. "en", "ko"). Defaults to "auto", which preserves the source text language |

Options:

| Field | Type | Required | Description |
|---|---|---|---|
| provider | AtomizerProvider | Yes | Primary provider |
| fallbackProvider | AtomizerProvider | No | Used if primary throws |
| removeDuplicates | boolean | No | Filter high-confidence duplicates from output |
| includeRelationships | boolean | No | Generate relationship candidates between atoms |
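The fallbackProvider option is documented only as "used if primary throws". The sketch below shows one way that control flow could look. It is not the library's implementation: the Provider and ProviderOutput types are local stand-ins based on the field names in this README, and runWithFallback and both demo providers are hypothetical.

```typescript
// Local stand-ins for the library's types; exact definitions are assumptions.
type CandidateDraft = { statement: string; tags: string[]; rationale: string };
type ProviderOutput = { message: string; candidates: CandidateDraft[] };
type Provider = {
  name: string;
  atomize: (input: { text: string }) => Promise<ProviderOutput>;
};

// Try the primary provider; if it throws, retry once with the fallback (when given).
async function runWithFallback(
  input: { text: string },
  primary: Provider,
  fallback?: Provider,
): Promise<ProviderOutput & { provider: string }> {
  try {
    return { provider: primary.name, ...(await primary.atomize(input)) };
  } catch (error) {
    if (!fallback) throw error;
    return { provider: fallback.name, ...(await fallback.atomize(input)) };
  }
}

// Hypothetical providers for demonstration only.
const failing: Provider = {
  name: "always-fails",
  atomize: async () => {
    throw new Error("upstream unavailable");
  },
};
const echo: Provider = {
  name: "echo",
  atomize: async (input) => ({
    message: "Returned the input as a single candidate",
    candidates: [{ statement: input.text, tags: [], rationale: "demo" }],
  }),
};

const fallbackDemo = runWithFallback(
  { text: "Water boils at 100°C at sea level." },
  failing,
  echo,
);
fallbackDemo.then((result) => console.log(result.provider)); // prints "echo"
```

Reporting the name of the provider that actually ran mirrors the provider field in the result table.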

Result:

| Field | Type | Description |
|---|---|---|
| provider | string | Name of the provider that ran |
| message | string | Summary from the provider |
| candidates | AtomizedCandidate[] | Generated atoms |
| relationships | AtomRelationship[] | Relationship candidates (only when includeRelationships: true) |

Each AtomizedCandidate includes:

id — generated UUID
statement — the atomic claim
tags — domain tags
rationale — why this was extracted as a standalone atom
duplicateMatches — potential matches in existingAtoms

Duplicate handling:

When removeDuplicates is true, only high-confidence matches are removed, and removed duplicates are omitted from the response entirely.
Medium-confidence matches remain in candidates for review.
The function never mutates existing atoms.
If provider-specific duplicate checking is unavailable or fails, the package falls back to the built-in heuristic matcher.

Relationship generation

When includeRelationships: true is passed, the result includes a relationships array with candidate relationships between the generated atoms:

const result = await atomize(
  { text: "Solar panels generate electricity. However solar panels require direct sunlight." },
  { provider: heuristicProvider, includeRelationships: true },
);

// result.relationships: AtomRelationship[]
// Each relationship has: from, to, type, reason

Each AtomRelationship includes:

from — ID of the source atom
to — ID of the target atom
type — "supports", "conflicts", or "correlates"
reason — explanation of why the relationship was detected

Relationships are scoped to atoms generated within the same request. When includeRelationships is omitted or false, the relationships field is not present on the result.
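A consumer will often want to review conflicts before other relationship types. The sketch below groups relationship candidates by type. It assumes the AtomRelationship fields listed above; the groupByType helper and the sample data are hypothetical.

```typescript
type RelationshipType = "supports" | "conflicts" | "correlates";
type AtomRelationship = {
  from: string;   // ID of the source atom
  to: string;     // ID of the target atom
  type: RelationshipType;
  reason: string; // why the relationship was detected
};

// Group relationship candidates by type so each kind can be reviewed separately.
function groupByType(
  relationships: AtomRelationship[],
): Record<RelationshipType, AtomRelationship[]> {
  const groups: Record<RelationshipType, AtomRelationship[]> = {
    supports: [],
    conflicts: [],
    correlates: [],
  };
  for (const relationship of relationships) {
    groups[relationship.type].push(relationship);
  }
  return groups;
}

// Hypothetical relationship candidates; IDs and reasons are placeholders.
const sampleRelationships: AtomRelationship[] = [
  { from: "a1", to: "a2", type: "conflicts", reason: "Second statement limits the first" },
  { from: "a3", to: "a1", type: "supports", reason: "Provides supporting evidence" },
];

const grouped = groupByType(sampleRelationships);
console.log(grouped.conflicts.length); // prints 1
```

Because relationships is absent when includeRelationships is omitted or false, a caller would pass something like result.relationships ?? [] to a helper of this kind.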

Providers

Built-in: `heuristicProvider`

A deterministic, dependency-free provider. Splits text by sentence boundaries and applies tag heuristics. No API calls. Suitable for development, testing, and offline use.
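To make "sentence splitting plus tag heuristics" concrete, here is a minimal sketch of the general technique. It is not the built-in provider's code: the keyword table, the splitting rule, and the function names are all assumptions chosen for illustration.

```typescript
// Hypothetical keyword table; the real provider's tag rules are not documented here.
const TAG_KEYWORDS: Record<string, string[]> = {
  physics: ["boil", "temperature", "altitude"],
  energy: ["solar", "electricity"],
};

// Split on sentence-ending punctuation. Naive on purpose: a decimal such as "3.5"
// would also be split, which a production splitter would need to handle.
function splitSentences(text: string): string[] {
  const parts = text.match(/[^.!?]+[.!?]*/g) ?? [];
  return parts.map((part) => part.trim()).filter((part) => part.length > 0);
}

// Assign every tag whose keyword list has at least one case-insensitive substring hit.
function inferTags(sentence: string): string[] {
  const lower = sentence.toLowerCase();
  return Object.entries(TAG_KEYWORDS)
    .filter(([, keywords]) => keywords.some((keyword) => lower.includes(keyword)))
    .map(([tag]) => tag);
}

const drafts = splitSentences(
  "Water boils at 100°C at sea level. The boiling point decreases at higher altitudes.",
).map((statement) => ({ statement, tags: inferTags(statement) }));

console.log(drafts.length); // prints 2
```

The approach is deterministic and needs no network access, which is what makes a provider of this kind suitable for tests and offline use.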

import { heuristicProvider } from "information-atomizer";

Hosted providers

The root package exports:

atomize
heuristicProvider
shared input, output, and provider types

Hosted providers are exported from dedicated subpaths:

information-atomizer/openai
information-atomizer/azure
information-atomizer/claude
information-atomizer/bedrock
information-atomizer/cohere
information-atomizer/mistral
information-atomizer/minimax
information-atomizer/vertex

Each hosted factory accepts explicit config and never reads environment variables on its own.

Example:

import { atomize } from "information-atomizer";
import { createClaudeProvider } from "information-atomizer/claude";

const provider = createClaudeProvider({
  apiKey: process.env.ANTHROPIC_API_KEY!,
});

const result = await atomize({ text: "..." }, { provider, removeDuplicates: true });

Configuration matrix:

createOpenAIProvider({ apiKey, model? }): defaults model to gpt-5.4-mini.
createAzureProvider({ baseUrl, apiKey, deployment }): uses the Azure OpenAI deployment as the request model.
createClaudeProvider({ apiKey, model? }): defaults model to claude-haiku-4-5.
createBedrockProvider({ region, modelId, credentials }): requires explicit AWS credentials and model selection.
createCohereProvider({ apiKey, model? }): defaults model to command-a-03-2025.
createMistralProvider({ apiKey, model? }): defaults model to mistral-medium-latest.
createMinimaxProvider({ apiKey, model?, baseURL? }): defaults model to MiniMax-M2.5-highspeed.
createVertexProvider({ project, location?, model?, serviceAccount? }): defaults location to global and model to gemini-2.5-flash.

All hosted providers return the same candidate shape as the built-in heuristic provider. If provider-level duplicate review is unavailable or fails, the package falls back to the built-in duplicate matcher.

Custom Providers

Implement the AtomizerProvider interface to integrate any model or service:

import type { AtomizerProvider } from "information-atomizer";

const myProvider: AtomizerProvider = {
  name: "my-provider",
  atomize: async (input) => ({
    message: "Atomized via custom provider",
    candidates: [
      {
        statement: "...",
        tags: ["example"],
        rationale: "Single distinct claim extracted from input",
      },
    ],
  }),
};

Language Handling

By default (language: "auto"), atoms are produced in the same language as the source text. The library does not translate.

Single-language input — all atoms use the source language.
Mixed-language input — each atom preserves the language of its source segment.
Explicit language — set language to a language code (e.g. "en", "ko") to instruct hosted providers to output atoms in that language.

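import { atomize, heuristicProvider } from "information-atomizer";
import { createOpenAIProvider } from "information-atomizer/openai";

// Korean input produces Korean atoms
const koreanResult = await atomize(
  { text: "물은 해수면에서 100°C에 끓습니다.", language: "auto" },
  { provider: heuristicProvider },
);

// Force English output regardless of source language.
// OPENAI_API_KEY is an assumed variable name; factories never read the environment themselves.
const englishResult = await atomize(
  { text: "물은 해수면에서 100°C에 끓습니다.", language: "en" },
  { provider: createOpenAIProvider({ apiKey: process.env.OPENAI_API_KEY! }) },
);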

The heuristic provider always preserves source language since it performs no translation. The language setting primarily affects hosted (LLM) providers via their prompt instructions.

Duplicate Detection

When existingAtoms is provided, each candidate is checked for overlap against the existing atom set.

high confidence matches indicate near-identical statements. When removeDuplicates is true, these candidates are silently dropped from the output.
medium confidence matches are always retained in the output for human review.

The library never mutates the existing atom set.
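The README does not specify how the built-in matcher scores overlap. As one plausible approach, the sketch below classifies matches by Jaccard similarity over word tokens. The thresholds, the DuplicateMatch shape, and all function names are assumptions for illustration, not the library's matcher.

```typescript
type Atom = { id: string; statement: string; tags: string[] };
type Confidence = "high" | "medium";
// Hypothetical match shape; the library's duplicateMatches entries may differ.
type DuplicateMatch = { atomId: string; confidence: Confidence; score: number };

// Lowercase, strip common punctuation, and collect the distinct word tokens.
function tokenize(statement: string): Set<string> {
  const cleaned = statement.toLowerCase().replace(/[.,;:!?"'()]/g, " ");
  return new Set(cleaned.split(/\s+/).filter((token) => token.length > 0));
}

// Jaccard similarity: size of the intersection divided by size of the union.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

// Assumed thresholds, chosen only for this example.
const HIGH_THRESHOLD = 0.9;
const MEDIUM_THRESHOLD = 0.6;

// Compare one candidate statement against every existing atom without mutating them.
function findDuplicateMatches(statement: string, existingAtoms: Atom[]): DuplicateMatch[] {
  const tokens = tokenize(statement);
  const matches: DuplicateMatch[] = [];
  for (const atom of existingAtoms) {
    const score = jaccard(tokens, tokenize(atom.statement));
    if (score >= HIGH_THRESHOLD) {
      matches.push({ atomId: atom.id, confidence: "high", score });
    } else if (score >= MEDIUM_THRESHOLD) {
      matches.push({ atomId: atom.id, confidence: "medium", score });
    }
  }
  return matches;
}

const existing: Atom[] = [
  { id: "atom-1", statement: "Water boils at 100°C at sea level.", tags: ["physics"] },
];
const exactMatches = findDuplicateMatches("water boils at 100°C at sea level", existing);
const nearMatches = findDuplicateMatches("Water boils at 100°C at sea level on Earth", existing);

console.log(exactMatches[0]?.confidence); // prints "high"
console.log(nearMatches[0]?.confidence);  // prints "medium"
```

With removeDuplicates enabled, a pipeline built on this sketch would drop the first candidate and keep the second for human review.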

How It Works

flowchart LR
    A["Raw source text"] --> B["Normalize input<br/>trim text, tags, existing atoms"]
    B --> C{"Choose provider"}
    C --> D["Heuristic provider<br/>sentence splitting + tag heuristics"]
    C --> E["Hosted/custom provider<br/>LLM extracts candidate claims"]
    D --> F["Candidate drafts<br/>statement + tags + rationale"]
    E --> F
    F --> G["Normalize candidates<br/>stable ids, cleaned statements, normalized tags"]
    G --> H["Deduplicate repeated candidate drafts"]
    H --> I{"Existing atoms provided?"}
    I -->|"No"| J["Return reviewable atoms"]
    I -->|"Yes"| K["Check duplicate matches<br/>provider checker or heuristic fallback"]
    K --> L{"removeDuplicates = true?"}
    L -->|"No"| J
    L -->|"Yes"| M["Drop only high-confidence duplicates"]
    M --> J
    J --> N["Output<br/>provider + message + candidates"]

This is the package's main job: turn messy source text into small, self-contained claims that downstream tools can review, store, compare, and connect.
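The "normalize candidates" and "deduplicate repeated candidate drafts" steps in the diagram can be sketched as follows. This is an illustration under assumptions, not the package's code: the README says candidate IDs are generated UUIDs, whereas this sketch derives a deterministic ID from a content hash so the example needs no imports, and normalizeCandidates and hashId are hypothetical names.

```typescript
type CandidateDraft = { statement: string; tags: string[]; rationale: string };
type NormalizedCandidate = CandidateDraft & { id: string };

// FNV-1a string hash, used here only to derive a deterministic illustrative ID.
function hashId(value: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < value.length; i += 1) {
    hash ^= value.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return "atom-" + hash.toString(16).padStart(8, "0");
}

// Clean statements and tags, then drop drafts that repeat an earlier statement.
function normalizeCandidates(drafts: CandidateDraft[]): NormalizedCandidate[] {
  const seen = new Set<string>();
  const result: NormalizedCandidate[] = [];
  for (const draft of drafts) {
    const statement = draft.statement.trim().replace(/\s+/g, " ");
    if (statement.length === 0) continue;
    const key = statement.toLowerCase();
    if (seen.has(key)) continue; // repeated draft within the same request
    seen.add(key);
    const tags = draft.tags
      .map((tag) => tag.trim().toLowerCase())
      .filter((tag, index, all) => tag.length > 0 && all.indexOf(tag) === index);
    result.push({ id: hashId(key), statement, tags, rationale: draft.rationale });
  }
  return result;
}

const normalized = normalizeCandidates([
  { statement: "  Water boils at 100°C at sea level. ", tags: ["Physics", "physics "], rationale: "claim" },
  { statement: "water boils at 100°C  at sea level.", tags: ["physics"], rationale: "repeat" },
  { statement: "The boiling point decreases at higher altitudes.", tags: [], rationale: "claim" },
]);

console.log(normalized.length); // prints 2
```

The second draft differs from the first only in casing and spacing, so it is dropped before any comparison against existingAtoms takes place.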

Example App

A working Next.js consumer is included in examples/web. It provides a UI for pasting text, selecting a provider, toggling duplicate removal, and inspecting candidate atoms with their rationale.

nvm use 24
pnpm install
pnpm --dir examples/web dev

To use Vertex in the example app, copy the example environment file:

cp examples/web/.env.example examples/web/.env.local

Environment variables used by the example app:

| Variable | Required | Description |
|---|---|---|
| GOOGLE_CLOUD_PROJECT | No | Google Cloud project ID |
| GOOGLE_CLOUD_LOCATION | No | Vertex region |
| VERTEX_MODEL | No | Vertex model name |
| FIREBASE_SERVICE_ACCOUNT_KEY | No | Service-account JSON or base64 JSON |

With no Vertex configuration, the example app falls back to the built-in heuristic provider.

Development

Requires Node 24 (via nvm) and pnpm.

nvm use 24
pnpm install
pnpm lint      # TypeScript type check
pnpm test      # Vitest unit tests
pnpm build     # tsup bundle

Default development scope is the publishable package under src/ with verification in tests/. The example app is a reference consumer, not part of the default package change workflow.

The default verification gate is:

pnpm test for regression coverage across atomization, duplicate review, exports, tags, and provider behavior
pnpm lint for strict TypeScript validation
pnpm build for ESM, CJS, and type declaration output

If you need to validate the example app explicitly, run its commands separately:

pnpm --dir examples/web lint
pnpm --dir examples/web build

See CONTRIBUTING.md for branch conventions and release process.

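Korean summary

What is information-atomizer?

information-atomizer is a Node.js library that decomposes source text into reviewable atomic statements.

The problem it solves

Conventional documents (Markdown in particular) mix multiple concepts with unnecessary context. Referencing a single claim means pulling in the whole document, and claims from different sources cannot be compared or explicitly linked.

Information atomization addresses this:

Each statement is a minimal unit that can be understood on its own
Each statement has a unique ID that other systems can reference
Relationships between statements (supports/conflicts/correlates) are expressed explicitly in a knowledge graph

The Atom model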

type Atom = {
  id: string;        // unique identifier
  statement: string; // atomic statement
  tags: string[];    // domain classification tags
};

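Atom relationships

| Relationship | Meaning |
|---|---|
| Supports | Atom B provides evidence or reasoning that strengthens the claim in Atom A |
| Conflicts | Atom B contradicts or challenges the claim in Atom A |
| Correlates | Atom B is related to Atom A without direct logical dependency |

Ecosystem

information-atomizer       ← this library: text → atoms
       ↑ used by
meeting-chain              ← live meeting capture → decision graph
provenatlas                ← public provenance atlas of historical knowledge

meeting-chain: atomizes statements made during meetings to track whether agenda goals are met and to generate a decision graph.
provenatlas: atomizes academic and historical sources and visualizes how ideas have been formed, cited, and challenged over time.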

Install

pnpm add information-atomizer

Basic usage

import { atomize, heuristicProvider } from "information-atomizer";

const result = await atomize(
  { text: "물은 해수면에서 100°C에 끓습니다." },
  { provider: heuristicProvider },
);

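Language handling

By default (language: "auto"), atoms are produced in the same language as the source text. No translation is performed.

Single-language input: all atoms use the source language.
Mixed-language input: each atom keeps the language of its source segment.
Explicit language: set language to a language code (e.g. "en", "ko") and hosted providers output atoms in that language.

How it works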

flowchart LR
    A["Raw source text"] --> B["Normalize input<br/>clean up text, tags, existing atoms"]
    B --> C{"Choose provider"}
    C --> D["Heuristic provider<br/>sentence splitting + tag inference"]
    C --> E["Hosted/custom provider<br/>LLM extracts candidate claims"]
    D --> F["Generate candidate drafts<br/>statement + tags + rationale"]
    E --> F
    F --> G["Normalize candidates<br/>stable IDs, cleaned statements, normalized tags"]
    G --> H["Remove duplicate candidate drafts"]
    H --> I{"Existing atoms provided?"}
    I -->|"No"| J["Return reviewable atoms"]
    I -->|"Yes"| K["Duplicate check<br/>provider checker or heuristic fallback"]
    K --> L{"removeDuplicates = true?"}
    L -->|"No"| J
    L -->|"Yes"| M["Remove only high-confidence duplicates"]
    M --> J
    J --> N["Output<br/>provider + message + candidates"]

The library's core role is to turn context-laden text into small, independent claims that downstream systems can store, compare, and connect.

Development setup

nvm use 24
pnpm install
pnpm lint
pnpm test
pnpm build

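The default development scope is the publishable package under src/, with verification in tests/. The example app is a reference consumer and is not part of the default package change workflow.

The default verification gate is:

pnpm test verifies regressions across atomization, duplicate review, exports, tags, and provider behavior.
pnpm lint runs strict TypeScript checks.
pnpm build checks the ESM, CJS, and type declaration output.

To validate the example app as well, additionally run: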

pnpm --dir examples/web lint
pnpm --dir examples/web build

See CONTRIBUTING.md for details on how to contribute.
