information-atomizer
v0.2.0
Node-first package for atomizing source text into reviewable atomic statements.
information-atomizer
A Node.js library for decomposing source text into discrete, relationship-aware atomic statements.
information-atomizer ships with a built-in heuristic provider, hosted-provider subpaths for OpenAI, Azure OpenAI, Claude, Amazon Bedrock, Cohere, Mistral, MiniMax, and Vertex AI, and a single high-level atomize() API.
The Problem
Documents accumulate context. A five-page article may contain three distinct claims, each buried in paragraphs that exist only to provide background. When a person or AI system wants to reference one of those claims, they must pull in the surrounding noise: irrelevant history, implicit assumptions, transitional prose. The result is:
- Knowledge that is hard to cite precisely
- Statements that cannot be compared across sources
- No explicit structure for when sources agree, disagree, or partially overlap
Markdown does not solve this. It organizes presentation, not meaning.
The Atom Model
An atom is the minimal unit of knowledge that satisfies three properties:
- Self-contained — it can be read and evaluated without external context
- Singular — it expresses exactly one claim
- Addressable — it has a stable ID that can be referenced from other systems
type Atom = {
id: string; // stable unique identifier
statement: string; // the claim itself
tags: string[]; // domain classification
};
Atoms carry no prose context. They are designed to be linked, not read sequentially.
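Since atoms are meant to be linked rather than read in order, a consumer will typically index them by ID. The sketch below is illustrative only: the sample IDs, the sample statements, and the atomsWithTag helper are assumptions, not part of the package.

```typescript
// Local copy of the Atom shape shown above.
type Atom = {
  id: string;
  statement: string;
  tags: string[];
};

// Hypothetical sample data; real IDs come from the library.
const sampleAtoms: Atom[] = [
  { id: "atom-1", statement: "Water boils at 100°C at sea level.", tags: ["physics"] },
  {
    id: "atom-2",
    statement: "The boiling point of water decreases at higher altitudes.",
    tags: ["physics", "thermodynamics"],
  },
];

// Index by ID so references from other systems resolve in one lookup.
const atomsById = new Map<string, Atom>(
  sampleAtoms.map((atom): [string, Atom] => [atom.id, atom]),
);

// Hypothetical helper: select atoms carrying a given tag, in input order.
function atomsWithTag(atoms: Atom[], tag: string): Atom[] {
  return atoms.filter((atom) => atom.tags.includes(tag));
}

console.log(atomsById.get("atom-2")?.statement);
console.log(atomsWithTag(sampleAtoms, "thermodynamics").map((atom) => atom.id));
```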
Atom Relationships
Atoms alone are not enough. Knowledge becomes navigable when atoms carry explicit relationships to each other:
| Relationship | Meaning |
|---|---|
| Supports | Atom B provides evidence or reasoning that strengthens Atom A |
| Conflicts | Atom B contradicts or challenges the claim in Atom A |
| Correlates | Atom B is related to Atom A without direct logical dependency |
When includeRelationships: true is passed to atomize(), the library generates relationship candidates between atoms within the same request. Cross-request and persistent relationship management belongs to the consumer layer (see meeting-chain and provenatlas). This library is responsible for producing valid atoms from raw text.
Ecosystem
information-atomizer ← this library: text → atoms
↑ used by
meeting-chain ← live meeting capture → decision graph of atomized ideas
provenatlas ← public provenance atlas of historical knowledge
- meeting-chain uses this package to atomize discussion from live meetings, track which proposals support or conflict with the meeting goal, and surface a decision graph for post-meeting review.
- provenatlas uses this package to atomize academic and historical sources, then visualizes how ideas have been formulated, referenced, supported, and challenged across time.
Install
pnpm add information-atomizer
Install only the optional SDKs for the hosted providers you plan to use:
pnpm add openai @anthropic-ai/sdk @aws-sdk/client-bedrock-runtime cohere-ai @mistralai/mistralai @google/genai
Quick Start
import { atomize, heuristicProvider, type Atom } from "information-atomizer";
const existingAtoms: Atom[] = [];
const result = await atomize(
{
text: "Water boils at 100°C at sea level. The boiling point decreases at higher altitudes.",
existingTags: ["physics", "thermodynamics"],
existingAtoms,
},
{
provider: heuristicProvider,
removeDuplicates: false,
},
);
// result.candidates: Array of AtomizedCandidate
// Each candidate has: id, statement, tags, rationale, duplicateMatches
API
atomize(input, options): Promise<AtomizeResult>
Input fields:
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Source text to decompose |
| existingAtoms | Atom[] | No | Previously stored atoms for duplicate detection |
| existingTags | string[] | No | Tag vocabulary to guide classification |
| language | string | No | Language for output atoms (e.g. "en", "ko"). Defaults to "auto" which preserves the source text language |
Options:
| Field | Type | Required | Description |
|---|---|---|---|
| provider | AtomizerProvider | Yes | Primary provider |
| fallbackProvider | AtomizerProvider | No | Used if primary throws |
| removeDuplicates | boolean | No | Filter high-confidence duplicates from output |
| includeRelationships | boolean | No | Generate relationship candidates between atoms |
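To make the fallbackProvider row concrete, the sketch below shows the try-primary-then-fallback pattern that the description implies. It is not the library's implementation: MiniProvider and runWithFallback are hypothetical names, and the real AtomizerProvider interface may carry more fields.

```typescript
// Hypothetical, trimmed-down provider shape for illustration only.
type MiniOutput = { message: string; candidates: { statement: string }[] };
type MiniProvider = {
  name: string;
  atomize: (input: { text: string }) => Promise<MiniOutput>;
};

// Try the primary provider; if it throws and a fallback exists, use the fallback.
async function runWithFallback(
  input: { text: string },
  primary: MiniProvider,
  fallback?: MiniProvider,
): Promise<MiniOutput & { provider: string }> {
  try {
    const output = await primary.atomize(input);
    return { provider: primary.name, ...output };
  } catch (error) {
    if (!fallback) throw error; // nothing to fall back to: surface the original error
    const output = await fallback.atomize(input);
    return { provider: fallback.name, ...output };
  }
}

// A provider that always fails, standing in for a hosted API outage.
const failingProvider: MiniProvider = {
  name: "always-fails",
  atomize: async () => {
    throw new Error("simulated outage");
  },
};

// A trivial provider that returns the input text as a single candidate.
const echoProvider: MiniProvider = {
  name: "echo",
  atomize: async (input) => ({
    message: "Echoed input as one candidate",
    candidates: [{ statement: input.text }],
  }),
};

runWithFallback({ text: "Water boils at 100°C at sea level." }, failingProvider, echoProvider)
  .then((result) => console.log(result.provider, result.candidates.length));
```

The provider field of the returned object names whichever provider actually produced the candidates, which matches the Result table's description of provider.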
Result:
| Field | Type | Description |
|---|---|---|
| provider | string | Name of the provider that ran |
| message | string | Summary from the provider |
| candidates | AtomizedCandidate[] | Generated atoms |
| relationships | AtomRelationship[] | Relationship candidates (only when includeRelationships: true) |
Each AtomizedCandidate includes:
- id: generated UUID
- statement: the atomic claim
- tags: domain tags
- rationale: why this was extracted as a standalone atom
- duplicateMatches: potential matches in existingAtoms
Only high-confidence matches are removed. Medium-confidence matches remain in candidates for review.
The function never mutates existing atoms.
Removed duplicates are omitted from the response entirely.
If provider-specific duplicate checking is unavailable or fails, the package falls back to the built-in heuristic matcher.
Relationship generation
When includeRelationships: true is passed, the result includes a relationships array with candidate relationships between the generated atoms:
const result = await atomize(
{ text: "Solar panels generate electricity. However solar panels require direct sunlight." },
{ provider: heuristicProvider, includeRelationships: true },
);
// result.relationships: AtomRelationship[]
// Each relationship has: from, to, type, reason
Each AtomRelationship includes:
- from: ID of the source atom
- to: ID of the target atom
- type: "supports", "conflicts", or "correlates"
- reason: explanation of why the relationship was detected
Relationships are scoped to atoms generated within the same request. When includeRelationships is omitted or false, the relationships field is not present on the result.
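Because relationships is absent unless requested, consumers should treat the field as optional. The sketch below groups relationship candidates by their source atom. The local types mirror the fields listed above, while ResultLike, groupBySource, and the sample IDs and reason are hypothetical.

```typescript
// Local mirror of the documented relationship fields.
type AtomRelationship = {
  from: string;
  to: string;
  type: "supports" | "conflicts" | "correlates";
  reason: string;
};

// Only the part of the result this sketch needs.
type ResultLike = { relationships?: AtomRelationship[] };

// Group relationships by the ID of their source atom.
function groupBySource(result: ResultLike): Map<string, AtomRelationship[]> {
  const grouped = new Map<string, AtomRelationship[]>();
  // The field is missing when includeRelationships is omitted or false.
  for (const relationship of result.relationships ?? []) {
    const bucket = grouped.get(relationship.from) ?? [];
    bucket.push(relationship);
    grouped.set(relationship.from, bucket);
  }
  return grouped;
}

// Hypothetical result; real IDs and reasons come from the provider.
const sampleResult: ResultLike = {
  relationships: [
    { from: "atom-2", to: "atom-1", type: "correlates", reason: "Both statements concern solar panels" },
  ],
};

console.log(groupBySource(sampleResult).get("atom-2")?.length);
console.log(groupBySource({}).size);
```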
Providers
Built-in: heuristicProvider
A deterministic, dependency-free provider. Splits text by sentence boundaries and applies tag heuristics. No API calls. Suitable for development, testing, and offline use.
import { heuristicProvider } from "information-atomizer";
Hosted providers
The root package exports:
- atomize
- heuristicProvider
- shared input, output, and provider types
Hosted providers are exported from dedicated subpaths:
- information-atomizer/openai
- information-atomizer/azure
- information-atomizer/claude
- information-atomizer/bedrock
- information-atomizer/cohere
- information-atomizer/mistral
- information-atomizer/minimax
- information-atomizer/vertex
Each hosted factory accepts explicit config and never reads environment variables on its own.
Example:
import { atomize } from "information-atomizer";
import { createClaudeProvider } from "information-atomizer/claude";
const provider = createClaudeProvider({
apiKey: process.env.ANTHROPIC_API_KEY!,
});
const result = await atomize({ text: "..." }, { provider, removeDuplicates: true });
Configuration matrix:
- createOpenAIProvider({ apiKey, model? }): defaults model to gpt-5.4-mini.
- createAzureProvider({ baseUrl, apiKey, deployment }): uses the Azure OpenAI deployment as the request model.
- createClaudeProvider({ apiKey, model? }): defaults model to claude-haiku-4-5.
- createBedrockProvider({ region, modelId, credentials }): requires explicit AWS credentials and model selection.
- createCohereProvider({ apiKey, model? }): defaults model to command-a-03-2025.
- createMistralProvider({ apiKey, model? }): defaults model to mistral-medium-latest.
- createMinimaxProvider({ apiKey, model?, baseURL? }): defaults model to MiniMax-M2.5-highspeed.
- createVertexProvider({ project, location?, model?, serviceAccount? }): defaults location to global and model to gemini-2.5-flash.
All hosted providers return the same candidate shape as the built-in heuristic provider. If provider-level duplicate review is unavailable or fails, the package falls back to the built-in duplicate matcher.
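Since hosted factories never read environment variables themselves, the caller is responsible for loading and validating credentials before constructing a provider. A minimal sketch of a fail-fast lookup follows. requireEnv is a hypothetical helper, not a package export, and it takes the environment as a plain object so it can be exercised without touching process.env.

```typescript
// Hypothetical helper: return a non-empty variable or fail with a clear message.
function requireEnv(env: Record<string, string | undefined>, key: string): string {
  const value = env[key]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${key}`);
  }
  return value;
}

// In an application this object would be process.env.
const fakeEnv: Record<string, string | undefined> = {
  ANTHROPIC_API_KEY: "example-key",
};

const claudeApiKey = requireEnv(fakeEnv, "ANTHROPIC_API_KEY");
console.log(claudeApiKey.length > 0);
```

The returned string can then be passed as apiKey to a factory such as createClaudeProvider, in place of a non-null assertion on process.env.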
Custom Providers
Implement the AtomizerProvider interface to integrate any model or service:
import type { AtomizerProvider } from "information-atomizer";
const myProvider: AtomizerProvider = {
name: "my-provider",
atomize: async (input) => ({
message: "Atomized via custom provider",
candidates: [
{
statement: "...",
tags: ["example"],
rationale: "Single distinct claim extracted from input",
},
],
}),
};
Language Handling
By default (language: "auto"), atoms are produced in the same language as the source text. The library does not translate.
- Single-language input: all atoms use the source language.
- Mixed-language input: each atom preserves the language of its source segment.
- Explicit language: set language to a language code (e.g. "en", "ko") to instruct hosted providers to output atoms in that language.
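import { atomize, heuristicProvider } from "information-atomizer";
import { createOpenAIProvider } from "information-atomizer/openai";
// Korean input produces Korean atoms
const koreanResult = await atomize(
  { text: "물은 해수면에서 100°C에 끓습니다.", language: "auto" },
  { provider: heuristicProvider },
);
// Request English output regardless of source language (hosted providers only).
// The OPENAI_API_KEY variable name is an assumption; pass the key however your app loads it.
const englishResult = await atomize(
  { text: "물은 해수면에서 100°C에 끓습니다.", language: "en" },
  { provider: createOpenAIProvider({ apiKey: process.env.OPENAI_API_KEY! }) },
);
The heuristic provider always preserves source language since it performs no translation. The language setting primarily affects hosted (LLM) providers via their prompt instructions.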
Duplicate Detection
When existingAtoms is provided, each candidate is checked for overlap against the existing atom set.
- High-confidence matches indicate near-identical statements. When removeDuplicates: true, these candidates are silently dropped.
- Medium-confidence matches are retained in the output for human review.
The library never mutates the existing atom set.
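The filtering rule above can be sketched as follows. The candidate fields come from the API section, but the shape of each duplicateMatches entry (atomId, confidence) is an assumption made for illustration, as is the filterDuplicates helper. Check the package's exported types for the real shape.

```typescript
// Assumed shape of a duplicate match; the package's real type may differ.
type DuplicateMatch = { atomId: string; confidence: "high" | "medium" };

type Candidate = {
  id: string;
  statement: string;
  duplicateMatches: DuplicateMatch[];
};

// Drop candidates with a high-confidence match when removeDuplicates is true.
// Medium-confidence matches never cause removal; they stay for human review.
// filter() returns a new array, so the input list is left untouched.
function filterDuplicates(candidates: Candidate[], removeDuplicates: boolean): Candidate[] {
  if (!removeDuplicates) return candidates;
  return candidates.filter(
    (candidate) => !candidate.duplicateMatches.some((match) => match.confidence === "high"),
  );
}

// Hypothetical candidates covering the three cases: high, medium, and no match.
const drafts: Candidate[] = [
  {
    id: "c1",
    statement: "Water boils at 100°C at sea level.",
    duplicateMatches: [{ atomId: "a1", confidence: "high" }],
  },
  {
    id: "c2",
    statement: "The boiling point decreases at higher altitudes.",
    duplicateMatches: [{ atomId: "a2", confidence: "medium" }],
  },
  { id: "c3", statement: "Steam is water vapor.", duplicateMatches: [] },
];

console.log(filterDuplicates(drafts, true).map((candidate) => candidate.id));
console.log(filterDuplicates(drafts, false).length);
```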
How It Works
flowchart LR
A["Raw source text"] --> B["Normalize input<br/>trim text, tags, existing atoms"]
B --> C{"Choose provider"}
C --> D["Heuristic provider<br/>sentence splitting + tag heuristics"]
C --> E["Hosted/custom provider<br/>LLM extracts candidate claims"]
D --> F["Candidate drafts<br/>statement + tags + rationale"]
E --> F
F --> G["Normalize candidates<br/>stable ids, cleaned statements, normalized tags"]
G --> H["Deduplicate repeated candidate drafts"]
H --> I{"Existing atoms provided?"}
I -->|"No"| J["Return reviewable atoms"]
I -->|"Yes"| K["Check duplicate matches<br/>provider checker or heuristic fallback"]
K --> L{"removeDuplicates = true?"}
L -->|"No"| J
L -->|"Yes"| M["Drop only high-confidence duplicates"]
M --> J
J --> N["Output<br/>provider + message + candidates"]
This is the package's main job: turn messy source text into small, self-contained claims that downstream tools can review, store, compare, and connect.
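The normalize and deduplicate steps (G and H in the flow above) might look roughly like the sketch below. The exact rules the package applies are not documented here, so the whitespace collapsing, lowercase tag normalization, and case-insensitive statement comparison are all assumptions, and ID assignment is left out.

```typescript
type Draft = { statement: string; tags: string[] };

// Assumed normalization: collapse whitespace, then lowercase and de-duplicate tags.
function normalizeDraft(draft: Draft): Draft {
  const statement = draft.statement.replace(/\s+/g, " ").trim();
  const tags = Array.from(
    new Set(draft.tags.map((tag) => tag.trim().toLowerCase()).filter((tag) => tag.length > 0)),
  );
  return { statement, tags };
}

// Assumed rule: two drafts repeat each other when their statements match ignoring case.
function dedupeDrafts(drafts: Draft[]): Draft[] {
  const seen = new Set<string>();
  const unique: Draft[] = [];
  for (const draft of drafts.map((item) => normalizeDraft(item))) {
    const key = draft.statement.toLowerCase();
    if (draft.statement.length === 0 || seen.has(key)) continue; // skip empty and repeated drafts
    seen.add(key);
    unique.push(draft);
  }
  return unique;
}

const cleaned = dedupeDrafts([
  { statement: "  Water boils at 100°C   at sea level. ", tags: ["Physics", "physics "] },
  { statement: "water boils at 100°C at sea level.", tags: ["thermodynamics"] },
  { statement: "   ", tags: [] },
]);

console.log(cleaned);
```

The first occurrence wins in this sketch, so tags from a later repeated draft are discarded rather than merged.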
Example App
A working Next.js consumer is included in examples/web. It provides a UI for pasting text, selecting a provider, toggling duplicate removal, and inspecting candidate atoms with their rationale.
nvm use 24
pnpm install
pnpm --dir examples/web dev
Copy environment variables for Vertex:
cp examples/web/.env.example examples/web/.env.local
Environment variables used by the example app:
| Variable | Required | Description |
|---|---|---|
| GOOGLE_CLOUD_PROJECT | No | Google Cloud project ID |
| GOOGLE_CLOUD_LOCATION | No | Vertex region |
| VERTEX_MODEL | No | Vertex model name |
| FIREBASE_SERVICE_ACCOUNT_KEY | No | Service-account JSON or base64 JSON |
With no Vertex configuration, the example app falls back to the built-in heuristic provider.
Development
Requires Node 24 (via nvm) and pnpm.
nvm use 24
pnpm install
pnpm lint # TypeScript type check
pnpm test # Vitest unit tests
pnpm build # tsup bundle
Default development scope is the publishable package under src/ with verification in tests/. The example app is a reference consumer, not part of the default package change workflow.
The default verification gate is:
- pnpm test for regression coverage across atomization, duplicate review, exports, tags, and provider behavior
- pnpm lint for strict TypeScript validation
- pnpm build for ESM, CJS, and type declaration output
If you need to validate the example app explicitly, run its commands separately:
pnpm --dir examples/web lint
pnpm --dir examples/web build
See CONTRIBUTING.md for branch conventions and release process.
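Korean (한국어)
What is information-atomizer?
information-atomizer is a Node.js library that decomposes source text into reviewable atomic statements.
The problem it solves
Conventional documents (Markdown in particular) mix multiple concepts with unnecessary context. Referencing a single claim means pulling in the whole document, and claims from different sources cannot be compared or explicitly linked.
Information atomization addresses this:
- Each statement is a minimal unit that can be understood on its own
- Each statement has a unique ID that other systems can reference
- Relationships between statements (supports/conflicts/correlates) are expressed explicitly in a knowledge graph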
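The Atom model
type Atom = {
  id: string; // unique identifier
  statement: string; // atomic statement
  tags: string[]; // domain classification tags
};
Relationships between atoms
| Relationship | Meaning |
|---|---|
| Supports | Atom B provides evidence or reasoning that strengthens the claim in Atom A |
| Conflicts | Atom B contradicts or challenges the claim in Atom A |
| Correlates | Atom B is related to Atom A without direct logical dependency |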
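Ecosystem
information-atomizer ← this library: text → atoms
↑ used by
meeting-chain ← live meeting capture → decision graph
provenatlas ← public provenance atlas of historical knowledge
- meeting-chain: atomizes what is said during meetings to track whether agenda goals are met and to generate a decision graph.
- provenatlas: atomizes academic and historical sources to visualize how ideas have been formed, cited, and challenged over time.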
Install
pnpm add information-atomizer
Basic usage
import { atomize, heuristicProvider } from "information-atomizer";
const result = await atomize(
  { text: "물은 해수면에서 100°C에 끓습니다." },
  { provider: heuristicProvider },
);
Language handling
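By default (language: "auto"), atoms are produced in the same language as the source text. No translation is performed.
- Single-language input: all atoms use the source language.
- Mixed-language input: each atom keeps the language of its source segment.
- Explicit language: set language to a language code (e.g. "en", "ko") and hosted providers output atoms in that language.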
How it works
flowchart LR
A["Source text"] --> B["Normalize input<br/>clean up text, tags, existing atoms"]
B --> C{"Choose provider"}
C --> D["Heuristic provider<br/>sentence splitting + tag inference"]
C --> E["Hosted/custom provider<br/>LLM extracts candidate claims"]
D --> F["Create candidate drafts<br/>statement + tags + rationale"]
E --> F
F --> G["Normalize candidates<br/>stable IDs, cleaned statements, normalized tags"]
G --> H["Remove repeated candidate drafts"]
H --> I{"Existing atoms present?"}
I -->|"No"| J["Return reviewable atoms"]
I -->|"Yes"| K["Duplicate check<br/>provider checker or heuristic fallback"]
K --> L{"removeDuplicates = true?"}
L -->|"No"| J
L -->|"Yes"| M["Remove only high-confidence duplicates"]
M --> J
J --> N["Output<br/>provider + message + candidates"]
This library's core job is to turn context-laden text into small, independent claims that downstream systems can store, compare, and connect.
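Development setup
nvm use 24
pnpm install
pnpm lint
pnpm test
pnpm build
The default development scope is the publishable package under src/, with verification in tests/. The example app is a reference consumer and is not part of the default package change workflow.
The default verification gate is:
- pnpm test verifies regressions across atomization, duplicate review, exports, tags, and provider behavior.
- pnpm lint runs strict TypeScript checks.
- pnpm build checks the ESM, CJS, and type declaration outputs.
If you also need to validate the example app, run these commands in addition:
pnpm --dir examples/web lint
pnpm --dir examples/web build
See CONTRIBUTING.md for details on how to contribute.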
