knowlifijs

v0.2.1

Published

8 days ago

Turn arbitrary JSON into structured, explainable knowledge: schema discovery, domain detection, event classification, entity extraction, validation and confidence scoring.


nayft-org

json knowledge-extraction nlp entity-extraction domain-detection event-classification schema-discovery data-enrichment

KnowlifiJS

Turn arbitrary JSON into structured, explainable knowledge.

KnowlifiJS inspects JSON of any shape — a news feed, a CRM export, a sensor log — and produces a canonical "knowledge" representation: it finds the title/description, extracts entities, infers the domain (crypto, healthcare, legal, sports, ...), classifies the event/topic, checks for internal contradictions, validates the result, and scores its own confidence. Every decision comes with a human-readable reasoning trail.

Overview

KnowlifiJS is a domain-agnostic knowledge extraction pipeline. It does not hardcode a single schema — instead it discovers structure on the fly:

Schema Discovery — flattens any JSON object into typed paths (text, date, numeric, boolean).
Dynamic Field Detection — scores fields to find the most likely title and description, plus candidate entities and dates.
Entity Extraction — types and scores entities (organizations, people, locations, instruments, facilities, ...).
Domain Inference — infers the topical domain (crypto, healthcare, legal, sports, etc.) from a weighted vocabulary lexicon.
Event Classification — classifies the underlying event using a hierarchical taxonomy (Investment, Corporate, Legal, Security, Market...).
Consistency Checking — flags contradictions between headline, summary, and detected event sentiment.
Validation — sanity-checks that extracted facts are grounded in the source record.
Confidence Scoring — combines every signal above into a single, explainable confidence score (always < 1).

Every result includes a reasoning trail explaining why a domain or event was chosen.
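
To make the Schema Discovery step more concrete, here is a minimal, self-contained sketch of flattening a record into typed paths. It is an illustration only, not the library's scanSchema(): the function name flattenToTypedPaths, the path syntax, and the date heuristic are all assumptions, and the safety limits described later are left out for brevity.

```typescript
// Hypothetical sketch of "flatten JSON into typed paths"; not KnowlifiJS source code.
type LeafType = "text" | "date" | "numeric" | "boolean";

interface TypedPath {
  path: string;
  type: LeafType;
}

// Assumed heuristic: strings starting with YYYY-MM-DD are treated as dates.
const ISO_DATE_PREFIX = /^\d{4}-\d{2}-\d{2}/;

function flattenToTypedPaths(value: unknown, prefix = "", out: TypedPath[] = []): TypedPath[] {
  if (typeof value === "string") {
    out.push({ path: prefix, type: ISO_DATE_PREFIX.test(value) ? "date" : "text" });
  } else if (typeof value === "number") {
    out.push({ path: prefix, type: "numeric" });
  } else if (typeof value === "boolean") {
    out.push({ path: prefix, type: "boolean" });
  } else if (Array.isArray(value)) {
    // Array items get an index suffix, e.g. "tags[0]".
    value.forEach((item, i) => flattenToTypedPaths(item, `${prefix}[${i}]`, out));
  } else if (value !== null && typeof value === "object") {
    // Nested objects extend the path with a dot, e.g. "meta.amount".
    for (const key of Object.keys(value)) {
      const child = (value as Record<string, unknown>)[key];
      flattenToTypedPaths(child, prefix ? `${prefix}.${key}` : key, out);
    }
  }
  return out; // null and undefined contribute no typed leaf
}

const typedPaths = flattenToTypedPaths({
  headline: "Acme Robotics raises $50M",
  createdAt: "2025-12-28",
  meta: { amount: 50000000, verified: true, tags: ["funding"] },
});
console.log(typedPaths);
```

The later phases (field detection, entity extraction) would then work from a flat list like this rather than from the original nested shape.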

Installation

npm install knowlifijs

Works in Node.js (>= 18) and modern browsers. Ships ESM + CommonJS builds with full TypeScript declarations.

Quick Start

import { parse } from 'knowlifijs';

const result = await parse({
  headline: 'Acme Robotics raises $50M in Series B funding',
  body: 'Acme Robotics, a leading robotics company, announced today that it raised $50M in a Series B funding round led by Vertex Holdings.',
  company: 'Acme Robotics',
  createdAt: '2025-12-28'
});

console.log(result.knowledge.primarySubject);
// "Acme Robotics raises $50M in Series B funding"

console.log(result.domain);
// { name: 'venture_capital', confidence: 0.74 }

console.log(result.event);
// { category: 'Investment', type: 'Funding Round', confidence: 0.9 }

console.log(result.confidence);
// 0.85

parse() also accepts an array of records and returns an array of results:

const results = await parse([recordA, recordB, recordC]);

Advanced Usage

Use the Parser class for repeated parsing with shared options, and to register plugins:

import { Parser } from 'knowlifijs';

const parser = new Parser({
  detectDomain: true,
  detectEvents: true,
  extractEntities: true,
  checkConsistency: true,
  validate: true
});

const result = await parser.parse(record);
const results = await parser.parseAll([recordA, recordB]);

Plugins

Plugins can post-process every parsed record without modifying core code:

import { Parser } from 'knowlifijs';
import type { KnowlifiPlugin } from 'knowlifijs';

const tagHighConfidence: KnowlifiPlugin = {
  name: 'tag-high-confidence',
  afterParse(result) {
    if (result.confidence > 0.8) {
      return {
        ...result,
        knowledge: {
          ...result.knowledge,
          keywords: [...result.knowledge.keywords, 'high-confidence']
        }
      };
    }
    // Returning nothing (void) leaves the result unchanged.
  }
};

const parser = new Parser({ plugins: [tagHighConfidence] });
// or: parser.use(tagHighConfidence);

See Extensibility below for the plugin contract.

Configuration

ParserOptions:

| Option | Type | Default | Description |
| ------------------ | ----------------------- | ---------- | ---------------------------------------------------- |
| detectDomain | boolean | true | Run domain inference (result.domain). |
| detectEvents | boolean | true | Run event/topic classification (result.event). |
| extractEntities | boolean | true | Run entity extraction (result.knowledge.entities). |
| checkConsistency | boolean | true | Run the headline/summary contradiction check. |
| validate | boolean | true | Run the integrity validation layer. |
| includeSentiment | boolean | true | Compute and expose result.sentiment. |
| outputMode | 'full' \| 'slim' | 'slim' | Whether to echo source.originalSchema. See Output Modes. |
| limits | Partial<ParserLimits> | see below | Recursion/payload safety limits. See Security Limits. |
| plugins | KnowlifiPlugin[] | [] | Plugins to register at construction time. |

Disabled phases return neutral placeholder values (e.g. { name: 'unknown', confidence: 0 }) so the result shape is always stable.

Output Modes

By default (outputMode: 'slim'), the result omits source.originalSchema to keep memory usage down — only source.detectedSchema (the flattened paths/field types KnowlifiJS discovered) is included. Use outputMode: 'full' to also echo back the original record, e.g. for debugging:

const slim = await parse(record); // slim.source.originalSchema === undefined
const full = await parse(record, { outputMode: 'full' }); // full.source.originalSchema === record

Security Limits

Schema scanning enforces safety limits so that deeply nested, circular, or oversized input can never cause unbounded work or crash the process. Limits are never enforced by throwing — traversal stops early and a warning is appended to result.validation.warnings (which also sets result.validation.passed = false).

interface ParserLimits {
  maxDepth: number;           // default 20
  maxLeaves: number;          // default 500
  maxArrayItems: number;      // default 100
  maxPayloadSizeBytes: number; // default 2 * 1024 * 1024 (2 MiB)
}

Possible warnings: max_depth_exceeded, max_leaves_exceeded, max_array_items_exceeded, circular_reference_detected, payload_size_exceeded.

const result = await parse(record, {
  limits: { maxDepth: 10, maxArrayItems: 50 }
});
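
The "warn, don't throw" behaviour can be sketched as a bounded traversal that records a warning and stops descending whenever a limit is hit. This is a hedged illustration rather than the library's scanner: boundedScan and its types are hypothetical, only two of the limits are modelled, and the warning strings are simply copied from the list above.

```typescript
// Hypothetical sketch of limit enforcement without exceptions; not KnowlifiJS source code.
interface LimitsSketch {
  maxDepth: number;
  maxLeaves: number;
}

interface ScanOutcome {
  leaves: string[];
  warnings: string[];
}

function boundedScan(root: unknown, limits: LimitsSketch): ScanOutcome {
  const leaves: string[] = [];
  const warnings = new Set<string>();
  const ancestors = new WeakSet<object>(); // objects on the current traversal path

  function visit(value: unknown, path: string, depth: number): void {
    if (leaves.length >= limits.maxLeaves) {
      warnings.add("max_leaves_exceeded");
      return; // stop early instead of throwing
    }
    if (value === null || typeof value !== "object") {
      leaves.push(path);
      return;
    }
    if (ancestors.has(value)) {
      warnings.add("circular_reference_detected");
      return;
    }
    if (depth >= limits.maxDepth) {
      warnings.add("max_depth_exceeded");
      return;
    }
    ancestors.add(value);
    for (const key of Object.keys(value)) {
      const child = (value as Record<string, unknown>)[key];
      visit(child, path ? `${path}.${key}` : key, depth + 1);
    }
    ancestors.delete(value); // siblings may legitimately reference the same object
  }

  visit(root, "", 0);
  return { leaves, warnings: Array.from(warnings) };
}

// A record that references itself is handled without infinite recursion.
const cyclic: Record<string, unknown> = { title: "loop" };
cyclic.self = cyclic;
const outcome = boundedScan(cyclic, { maxDepth: 20, maxLeaves: 500 });
console.log(outcome);
```

Tracking only the objects on the current path (rather than every object ever seen) means shared, non-circular references are not misreported as cycles.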

Adapters

Lightweight, dependency-free adapters convert common feed formats into records ready for parse()/parseAll():

import { parseJsonFeed, parseRssFeed, parse } from 'knowlifijs';

// JSON Feed (https://jsonfeed.org)
const records = parseJsonFeed(jsonFeedDocument);
const results = await parse(records);

// RSS 2.0 XML
const rssRecords = parseRssFeed(rssXmlString);
const rssResults = await parse(rssRecords);

parseRssFeed extracts title, description, link, pubDate, guid, author and category from each <item>, decodes basic XML entities, and strips CDATA wrappers. Unrecognized tags are ignored; malformed XML never throws.
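
For readers curious how a dependency-free RSS adapter of this kind might work, the following sketch extracts the same fields with regular expressions, decodes basic entities, and strips CDATA wrappers. It is an assumption-laden illustration, not the actual parseRssFeed: the helper names (decodeBasicEntities, readTag, sketchRssItems) are hypothetical, and real feeds have edge cases (repeated category tags, namespaces, attributes) that this ignores.

```typescript
// Hypothetical regex-based RSS item extraction; not KnowlifiJS source code.
function decodeBasicEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&"); // decoded last so "&amp;lt;" becomes "&lt;", not "<"
}

function readTag(itemXml: string, tag: string): string | undefined {
  const pattern = new RegExp(`<${tag}(?:\\s[^>]*)?>([\\s\\S]*?)</${tag}>`, "i");
  const match = pattern.exec(itemXml);
  if (!match) return undefined;
  const raw = match[1].trim();
  const cdata = /^<!\[CDATA\[([\s\S]*?)\]\]>$/.exec(raw);
  // CDATA content is literal text, so entities inside it are left untouched.
  return cdata ? cdata[1] : decodeBasicEntities(raw);
}

function sketchRssItems(xml: string): Array<Record<string, string>> {
  const fields = ["title", "description", "link", "pubDate", "guid", "author", "category"];
  const items: Array<Record<string, string>> = [];
  const itemPattern = /<item[\s>][\s\S]*?<\/item>/gi;
  let found: RegExpExecArray | null;
  while ((found = itemPattern.exec(xml)) !== null) {
    const record: Record<string, string> = {};
    for (const field of fields) {
      const value = readTag(found[0], field);
      if (value !== undefined) record[field] = value;
    }
    items.push(record);
  }
  return items; // malformed input yields fewer (or zero) items rather than an exception
}
```

Because nothing here parses XML strictly, unknown tags are skipped for free and broken markup degrades to an empty result, which matches the "never throws" behaviour described above.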

Event Signatures

Every result includes a deterministic eventSignature string, intended for future semantic deduplication of equivalent records (e.g. the same funding round reported by multiple sources):

result.eventSignature; // "funding_round:openai:microsoft"

It's built from the detected event type plus the top entities (or the primary subject if no entities were found), slugified and joined with :.
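
As a rough sketch of that construction, the snippet below slugifies the event type and the leading entities and joins them with a colon. The helper names, the underscore word separator (inferred from the "funding_round" example), and the choice of keeping two entities are assumptions for illustration; the library's actual rules for ranking and limiting entities are not documented here.

```typescript
// Hypothetical sketch of building a deterministic signature; not KnowlifiJS source code.
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_") // collapse anything non-alphanumeric into one underscore
    .replace(/^_+|_+$/g, ""); // trim leading/trailing underscores
}

function sketchEventSignature(
  eventType: string,
  entities: string[],
  primarySubject: string,
  maxEntities = 2 // assumed cutoff for "top entities"
): string {
  // Fall back to the primary subject when no entities were extracted.
  const parts = entities.length > 0 ? entities.slice(0, maxEntities) : [primarySubject];
  return [eventType, ...parts].map(slugify).join(":");
}

const signature = sketchEventSignature("Funding Round", ["OpenAI", "Microsoft"], "OpenAI raises funding");
console.log(signature); // "funding_round:openai:microsoft"
```

Because the output depends only on the inputs, two records that resolve to the same event type and entities produce the same key, which is what makes it usable for deduplication.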

Sentiment Analysis

When includeSentiment is enabled (default), each result includes a sentiment field:

result.sentiment;
// {
//   polarity: 'positive',
//   score: 0.6,
//   positiveMatches: ['surged', 'growth'],
//   negativeMatches: [],
//   confidence: 0.6,
//   reasoning: ['Positive words: surged, growth']
// }

Set includeSentiment: false to omit it.
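
The shape above suggests a lexicon-based approach. Here is a minimal sketch of that idea, assuming tiny hypothetical word lists and a simple "more positive than negative matches" rule. The score and confidence fields are deliberately omitted because their formulas are not described in this README.

```typescript
// Hypothetical lexicon-based sentiment sketch; word lists and rules are assumptions.
const POSITIVE_WORDS = ["surged", "growth", "record", "wins"];
const NEGATIVE_WORDS = ["plunged", "lawsuit", "breach", "losses"];

interface SentimentSketch {
  polarity: "positive" | "negative" | "neutral";
  positiveMatches: string[];
  negativeMatches: string[];
  reasoning: string[];
}

function sketchSentiment(text: string): SentimentSketch {
  // Tokenize on anything that is not a lowercase letter.
  const tokens = text.toLowerCase().split(/[^a-z]+/).filter((t) => t.length > 0);
  const positiveMatches = POSITIVE_WORDS.filter((w) => tokens.indexOf(w) !== -1);
  const negativeMatches = NEGATIVE_WORDS.filter((w) => tokens.indexOf(w) !== -1);

  const reasoning: string[] = [];
  if (positiveMatches.length > 0) reasoning.push(`Positive words: ${positiveMatches.join(", ")}`);
  if (negativeMatches.length > 0) reasoning.push(`Negative words: ${negativeMatches.join(", ")}`);

  let polarity: SentimentSketch["polarity"] = "neutral";
  if (positiveMatches.length > negativeMatches.length) polarity = "positive";
  else if (negativeMatches.length > positiveMatches.length) polarity = "negative";

  return { polarity, positiveMatches, negativeMatches, reasoning };
}

const sentiment = sketchSentiment("Shares surged on strong growth");
console.log(sentiment);
```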

Performance Notes

Domain lexicon regexes are compiled once at module load, not per record.
Entity occurrence counting avoids per-call regex compilation.
The integrity validation layer skips JSON.stringify(record) entirely when there are no extracted entities to verify.
Default outputMode: 'slim' avoids retaining the original record in every result.

See benchmarks/REPORT.md for before/after numbers. Run npm run bench to reproduce locally.
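
The first note, compiling lexicon regexes once at module load, can be illustrated with a small weighted-vocabulary scorer. Everything in this sketch is hypothetical: the real DOMAIN_LEXICON's shape, terms, and weights are not shown in this README, and the scoring rule (sum of weights for matching terms) is an assumption.

```typescript
// Hypothetical weighted lexicon; not the library's DOMAIN_LEXICON.
const LEXICON_SKETCH: Record<string, Record<string, number>> = {
  crypto: { bitcoin: 3, blockchain: 2, token: 1 },
  healthcare: { hospital: 3, patient: 2, clinical: 2 },
};

interface CompiledTerm {
  domain: string;
  pattern: RegExp;
  weight: number;
}

// Built once when the module is loaded, then reused for every record.
const COMPILED_TERMS: CompiledTerm[] = [];
for (const domain of Object.keys(LEXICON_SKETCH)) {
  for (const term of Object.keys(LEXICON_SKETCH[domain])) {
    COMPILED_TERMS.push({
      domain,
      pattern: new RegExp(`\\b${term}\\b`, "i"), // no "g" flag, so test() keeps no state
      weight: LEXICON_SKETCH[domain][term],
    });
  }
}

function sketchInferDomain(text: string): { name: string; score: number } {
  const scores: Record<string, number> = {};
  for (const { domain, pattern, weight } of COMPILED_TERMS) {
    if (pattern.test(text)) scores[domain] = (scores[domain] || 0) + weight;
  }
  let best = { name: "unknown", score: 0 };
  for (const name of Object.keys(scores)) {
    if (scores[name] > best.score) best = { name, score: scores[name] };
  }
  return best;
}

console.log(sketchInferDomain("Bitcoin and blockchain adoption grows"));
```

The per-record cost is then just a series of test() calls; no RegExp objects are allocated inside the hot path.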

Socialyx Integration Example

import { Parser, parseRssFeed } from 'knowlifijs';

const parser = new Parser({
  outputMode: 'slim',
  limits: { maxDepth: 15, maxPayloadSizeBytes: 1 * 1024 * 1024 },
  includeSentiment: true
});

const records = parseRssFeed(rssXmlFromUpstream);
const results = await parser.parseAll(records);

for (const result of results) {
  if (!result.validation.passed) continue; // skip records that failed validation (e.g. hit a safety limit)
  // `socialyx` is assumed to be a client object defined elsewhere in your application.
  socialyx.ingest({
    subject: result.knowledge.primarySubject,
    domain: result.domain.name,
    event: result.eventSignature,
    sentiment: result.sentiment?.polarity,
    confidence: result.confidence
  });
}

Architecture

JSON record
   │
   ▼
payload size guard   — reject-gracefully if > maxPayloadSizeBytes
   │
   ▼
scanSchema()         — flatten into typed paths/leaves
                        (depth/leaves/array limits, circular-ref detection)
   │
   ▼
detectFields()       — find title, description, entity candidates, dates
   │
   ▼
extractEntities()    — type + score entities
   │
   ▼
buildKnowledge()      — primarySubject, keywords, textSummary
   │
   ├─▶ inferDomain()         — domain + confidence + reasoning
   ├─▶ detectEvent()         — event category/type + confidence + reasoning
   ├─▶ buildEventSignature() — deterministic dedup key
   ├─▶ buildSentimentSummary() — sentiment polarity + confidence (optional)
   ├─▶ checkConsistency()    — headline/summary/event contradiction check
   └─▶ validate()            — integrity + safety-limit warnings
   │
   ▼
computeConfidenceBreakdown() — combine all signals + explainable breakdown
   │
   ▼
KnowledgeResult (outputMode: 'slim' | 'full')

Each phase lives in its own module under src/ and can be imported and used independently for advanced/extension scenarios:

import { scanSchema, detectFields, inferDomain } from 'knowlifijs';

Output Format

interface KnowledgeResult {
  source: {
    originalSchema?: JsonRecord; // only present when outputMode: 'full'
    detectedSchema: {
      paths: string[];
      textFields: string[];
      dateFields: string[];
      numericFields: string[];
      booleanFields: string[];
    };
  };
  knowledge: {
    primarySubject: string;
    entities: EntityResult[];
    keywords: string[];
    textSummary: string;
    sourceDomain: string;
  };
  domain: { name: string; confidence: number };
  event: { category: string; type: string; confidence: number };
  eventSignature: string;
  sentiment?: SentimentSummary; // present unless includeSentiment: false
  consistency: ConsistencyResult;
  validation: ValidationResult;
  confidence: number;
  confidenceBreakdown: ConfidenceBreakdown;
  reasoning: { domain: string[]; event: string[] };
}

Examples

Runnable examples live in examples/:

node examples/crypto.js
node examples/healthcare.js
node examples/legal.js
node examples/sports.js
node examples/startup-funding.js

Extensibility

KnowlifiJS supports a lightweight plugin architecture via KnowlifiPlugin:

interface KnowlifiPlugin {
  name: string;
  setup?: (parser: ParserLike) => void;
  afterParse?: (result: KnowledgeResult, record: JsonRecord) => KnowledgeResult | void;
}

setup runs once when the plugin is registered.
afterParse runs after every record is parsed and can return an augmented result (or void to leave it unchanged).

This is enough to build domain-specific plugins (e.g. a CyberSecurityPlugin or FinancePlugin) entirely outside of core — core never needs to change to support them.
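
To show how the "return an augmented result, or void to leave it unchanged" rule might play out when several plugins are registered, here is a simplified sketch. It is not the Parser's real implementation: the types are trimmed-down stand-ins, afterParse takes only the result (the real contract also passes the record), and running plugins in registration order is an assumption.

```typescript
// Hypothetical plugin runner illustrating the afterParse contract; not KnowlifiJS source code.
interface ResultLike {
  confidence: number;
  keywords: string[];
}

interface PluginLike {
  name: string;
  afterParse?: (result: ResultLike) => ResultLike | void;
}

function applyPlugins(result: ResultLike, plugins: PluginLike[]): ResultLike {
  let current = result;
  for (const plugin of plugins) {
    const next = plugin.afterParse ? plugin.afterParse(current) : undefined;
    if (next) current = next; // a void/undefined return keeps the previous result
  }
  return current;
}

const auditOnly: PluginLike = {
  name: "audit-only",
  afterParse(result) {
    console.log(`parsed with confidence ${result.confidence}`); // side effect only, returns nothing
  },
};

const tagHigh: PluginLike = {
  name: "tag-high",
  afterParse(result) {
    if (result.confidence > 0.8) {
      return { ...result, keywords: [...result.keywords, "high-confidence"] };
    }
  },
};

const tagged = applyPlugins({ confidence: 0.85, keywords: ["funding"] }, [auditOnly, tagHigh]);
console.log(tagged.keywords);
```

Returning a new object instead of mutating the input keeps each plugin independent of the ones that ran before it.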

FAQ

Does KnowlifiJS call any external APIs / LLMs? No. Everything runs locally using deterministic heuristics and lexicons.

Can I use this in the browser? Yes — the package has no Node-only dependencies and ships an ESM build.

Why is confidence always below 1? Confidence is intentionally capped (at 0.97) — the heuristics are probabilistic, and a perfect score would overstate certainty.
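
A cap like this is typically a one-line clamp applied after the signals are combined. The sketch below uses the 0.97 ceiling stated above; the plain average used to combine signals is purely an assumption, since the real weighting lives in computeConfidenceBreakdown() and is not described here.

```typescript
// Hypothetical sketch of a capped confidence score; the averaging rule is an assumption.
const CONFIDENCE_CAP = 0.97; // ceiling stated in this README

function capConfidence(signals: number[]): number {
  if (signals.length === 0) return 0;
  const mean = signals.reduce((sum, s) => sum + s, 0) / signals.length;
  // Clamp into [0, CONFIDENCE_CAP] so even unanimous signals never report certainty.
  return Math.min(CONFIDENCE_CAP, Math.max(0, mean));
}

console.log(capConfidence([1, 1, 1])); // 0.97
```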

How do I add a new domain or event type? Extend DOMAIN_LEXICON (in src/domains/domainInference.ts) or EVENT_TAXONOMY (in src/events/eventDetector.ts), or — for purely additive behavior — write a plugin.

Contributing

See CONTRIBUTING.md.

License

MIT