unified-ner

v1.0.4

A lightweight TypeScript package for extracting top named entities from text using Compromise and Wink-NLP.

Features

  • Two Operation Modes:
    • Detect: Count entities by type (e.g., "5 people, 3 organizations")
    • Extract: Get actual entities with limits per type (e.g., top 5 people, top 5 orgs)
  • Extract named entities: People, Organizations, Locations, and Email Addresses
  • Rank entities by frequency
  • Learning functionality to filter out known entities
  • Minimal dependencies (two NLP libraries, an English language model, and the Azure Storage client)
  • TypeScript support with full type definitions
  • MIT licensed dependencies
  • Flexible API with customization options
  • Regex-based email extraction for high accuracy
  • Multi-tenant support with Azure Blob Storage

Installation

npm install unified-ner

Usage

Basic Usage

Detect Mode - Count Entities by Type

import { detectEntities } from "unified-ner";

const text = `
  Apple Inc. announced that Tim Cook will meet with President Biden in Washington.
  The meeting will discuss technology policy with Microsoft and Google.
  Contact us at [email protected] or [email protected] for more details.
`;

// Get counts of each entity type
const counts = detectEntities(text);
console.log(counts);
// {
//   PERSON: 2,        // Tim Cook, Biden
//   ORGANIZATION: 3,   // Apple Inc, Microsoft, Google
//   LOCATION: 1,       // Washington
//   EMAIL: 2          // [email protected], [email protected]
// }

Extract Mode - Get Actual Entities with Limits

import { extractEntities } from "unified-ner";

const text = `
  Apple Inc. announced that Tim Cook will meet with President Biden in Washington.
  The meeting will discuss technology policy with Microsoft and Google.
  Contact us at [email protected] or [email protected] for more details.
`;

// Get top 2 entities of each type
const entities = extractEntities(text, {
  maxPeople: 2,
  maxOrganizations: 2,
  maxLocations: 2,
  maxEmails: 2,
});
console.log(entities);
// [
//   { text: 'tim cook', type: 'PERSON', count: 1 },
//   { text: 'biden', type: 'PERSON', count: 1 },
//   { text: 'apple inc.', type: 'ORGANIZATION', count: 1 },
//   { text: 'microsoft', type: 'ORGANIZATION', count: 1 },
//   { text: 'washington', type: 'LOCATION', count: 1 },
//   { text: '[email protected]', type: 'EMAIL', count: 1 },
//   { text: '[email protected]', type: 'EMAIL', count: 1 }
// ]

Legacy Mode - Extract Top N Entities

import { extractNamedEntities } from "unified-ner";

// Get top 5 named entities (legacy mode)
const entities = extractNamedEntities(text, 5);
console.log(entities);
// [
//   { text: 'apple inc.', type: 'ORGANIZATION', count: 1 },
//   { text: 'tim cook', type: 'PERSON', count: 1 },
//   { text: 'biden', type: 'PERSON', count: 1 },
//   { text: 'washington', type: 'LOCATION', count: 1 },
//   { text: 'microsoft', type: 'ORGANIZATION', count: 1 }
// ]

Extract Specific Entity Types

import {
  extractPeople,
  extractOrganizations,
  extractLocations,
  extractEmails,
} from "unified-ner";

// Extract only people
const people = extractPeople(text, 5);

// Extract only organizations
const orgs = extractOrganizations(text, 5);

// Extract only locations
const locations = extractLocations(text, 5);

// Extract only email addresses
const emails = extractEmails(text, 5);

Advanced Options

import { extractNamedEntities } from "unified-ner";

const entities = extractNamedEntities(text, 10, {
  people: true, // Include people (default: true)
  organizations: true, // Include organizations (default: true)
  locations: false, // Exclude locations
  emails: true, // Include email addresses (default: true)
  normalize: true, // Normalize text for comparison (default: true)
});

Learning Functionality

The package includes intelligent learning capabilities that help filter out known entities, returning only NEW entities from your extractions. This is particularly useful for processing emails, documents, or any content where you want to focus on previously unknown entities.

Key Use Case: Email Processing

When processing emails, you often want to extract entities but exclude the sender's name (which you already know). The learning functionality helps you:

  • Learn email senders: Use learnUsers() with the sender's name to avoid extracting them repeatedly
  • Avoid generic emails: Learn organization names to filter out generic email addresses like "[email protected]" and focus on the actual organization name "ACME Invoices"

Setup

  1. Configure Azure Storage (required for learning):

    export UNIFIED_NER_AZURE_CONNECTION_STRING="your_azure_connection_string"
    export UNIFIED_NER_AZURE_CONTAINER_NAME="your_container_name"

    Or create a .env file in your project (auto-loaded):

    UNIFIED_NER_AZURE_CONNECTION_STRING=your_azure_connection_string
    UNIFIED_NER_AZURE_CONTAINER_NAME=your_container_name
  2. Configure batch size and refresh interval in config.json:

    {
      "saveBatchSize": 10,
      "cacheRefreshIntervalHours": 24
    }

Learning Functions

learnUsers(name)

Learn user names to filter them out from future extractions.

import { learnUsers } from "unified-ner";

// Learn a single user
learnUsers("John Doe");

// Learn multiple users
learnUsers(["Jane Smith", "Bob Johnson"]);

Recommended Usage: Call learnUsers() with the sender name from each email you process. This helps avoid extracting the same sender repeatedly.

learnEmployees(name) (Deprecated)

Deprecated: Use learnUsers() instead. This function is kept for backward compatibility.

import { learnEmployees } from "unified-ner";

// This will show a deprecation warning
learnEmployees("John Doe");

learnOrganizations(name)

Learn organization names to filter them out from future extractions.

import { learnOrganizations } from "unified-ner";

// Learn a single organization
learnOrganizations("ACME Corporation");

// Learn multiple organizations
learnOrganizations(["Microsoft", "Google", "Apple"]);

learnEntities(entityType, name)

Generic function to learn entities of any supported type.

import { learnEntities } from "unified-ner";

learnEntities("PERSON", "John Doe");
learnEntities("ORGANIZATION", "ACME Corp");

Data Management

importLearnedData(jsonData)

Import learned entities from external JSON data.

import { importLearnedData } from "unified-ner";

const existingData = {
  PERSON: ["John Doe", "Jane Smith"],
  ORGANIZATION: ["ACME Corp", "Tech Inc"],
};

importLearnedData(existingData);

exportLearnedData()

Export current learned entities as JSON.

import { exportLearnedData } from "unified-ner";

const learnedData = exportLearnedData();
console.log(learnedData);
// { PERSON: ["john doe", "jane smith"], ORGANIZATION: ["acme corp"] }

flushLearnedData(tenantId?)

Force immediate save to Azure Blob Storage (bypasses batch size).

import { flushLearnedData } from "unified-ner";

// Save to default blob
await flushLearnedData();

// Save to tenant-specific blob (creates or updates "{tenantId}.json")
await flushLearnedData("tenant1");

Example Workflow

import {
  extractNamedEntities,
  learnUsers,
  learnOrganizations,
} from "unified-ner";

// Process an email
const emailContent = `
  From: John Doe <[email protected]>
  Subject: Meeting with ACME Corporation
  
  Hi team,
  
  I met with Sarah Johnson from ACME Corporation yesterday.
  We discussed the project timeline with Microsoft.
`;

// Learn the sender to avoid extracting them
learnUsers("John Doe");

// Learn known organizations
learnOrganizations(["ACME Corporation", "Microsoft"]);

// Extract entities - will exclude learned names
const newEntities = extractNamedEntities(emailContent, 10);
console.log(newEntities);
// Only returns: Sarah Johnson (new person), other unknown entities
// Excludes: John Doe, ACME Corporation, Microsoft

Important Notes

  • Case Insensitive: All learning is case-insensitive ("John Doe" and "john doe" are treated as the same)
  • Persistence: Learned data is automatically saved to Azure Blob Storage
  • Batching: Data is saved in batches (configurable via config.json) for performance
  • Fallback: If Azure Storage is not configured, learning functions throw errors, while extraction functions continue to work normally
  • Filtering: Only PERSON and ORGANIZATION entities are filtered; LOCATION and EMAIL entities are always returned
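The filtering rules above can be illustrated with a small sketch. The `NamedEntity` shape matches the Types section below, but `filterLearned` itself is a hypothetical helper written for this README, not a package export.

```typescript
interface NamedEntity {
  text: string;
  type: string; // 'PERSON', 'ORGANIZATION', 'LOCATION', 'EMAIL', or 'OTHER'
  count: number;
}

// Hypothetical helper mirroring the documented rules: comparisons are
// case-insensitive, and only PERSON and ORGANIZATION entities are filtered.
function filterLearned(
  entities: NamedEntity[],
  learned: Record<string, string[]>
): NamedEntity[] {
  const lowercased = (type: string) =>
    new Set((learned[type] ?? []).map((n) => n.toLowerCase()));
  const people = lowercased("PERSON");
  const orgs = lowercased("ORGANIZATION");

  return entities.filter((e) => {
    if (e.type === "PERSON") return !people.has(e.text.toLowerCase());
    if (e.type === "ORGANIZATION") return !orgs.has(e.text.toLowerCase());
    return true; // LOCATION and EMAIL entities are always returned
  });
}
```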

Configuration Check

import { isLearningConfigured } from "unified-ner";

if (isLearningConfigured()) {
  console.log("Learning functionality is available");
} else {
  console.log("Azure Storage not configured - learning disabled");
}

Implementation Recommendations

Daily Cache Refresh

For production applications, it's recommended to refresh the learned data cache periodically to ensure data consistency across multiple instances:

// Pick ONE of the following, depending on your module system:

// CommonJS (require)
const { learnedCache } = require("unified-ner");

// ES Modules (import)
import pkg from "unified-ner";
const { learnedCache } = pkg;

// Load configuration
const config = require("./config.json");
const refreshIntervalMs = config.cacheRefreshIntervalHours * 60 * 60 * 1000;

// Set up periodic cache refresh (default: every 24 hours)
setInterval(async () => {
  try {
    await learnedCache.loadCache();
    console.log("Learned data cache refreshed from Azure Storage");
  } catch (error) {
    console.error("Failed to refresh learned data cache:", error);
  }
}, refreshIntervalMs);

// Initial load on application startup
(async () => {
  try {
    await learnedCache.loadCache();
    console.log("Initial cache loaded successfully");
  } catch (error) {
    console.error("Failed to load initial cache:", error);
  }
})();

Configuration in config.json:

  • cacheRefreshIntervalHours: 24 (recommended for production)
  • Adjust based on how frequently learned data changes
  • Lower values = more API calls to Azure Storage but fresher data

Multi-Tenant Best Practices

For multi-tenant deployments, use a single container with tenant-specific blob names:

import {
  learnUsers,
  learnOrganizations,
  flushLearnedData,
  learnedCache,
} from "unified-ner";

// Single container for all tenants (e.g., "named-entities")
// Each tenant gets their own blob named "{tenantId}.json" in this container

const tenantId = getCurrentTenantId(); // Your tenant identification logic

// Workflow: Load tenant's cache -> Learn entities -> Flush to tenant's blob
await learnedCache.loadCache(tenantId); // Loads from "{tenantId}.json" blob
learnUsers("John Doe");
learnOrganizations("ACME Corp");
await flushLearnedData(tenantId); // Saves to "{tenantId}.json" blob

// Switch to another tenant
const anotherTenantId = "tenant2";
await learnedCache.loadCache(anotherTenantId); // Loads from "tenant2.json"
learnUsers("Jane Smith");
await flushLearnedData(anotherTenantId); // Saves to "tenant2.json"

Recommendations:

  • Use a single container for all tenants (e.g., "named-entities")
  • Each tenant's data is stored as a separate blob: {tenantId}.json
  • Validate tenant ID before using it
  • Implement proper error handling for tenant switching
  • The blob will be created if it doesn't exist, updated if it does
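The tenant-ID validation recommended above might look like the following sketch. `tenantBlobName` is a hypothetical helper, not a package export; the allowed character set is an assumption chosen to keep blob names safe.

```typescript
// Hypothetical guard for tenant IDs before using them as blob names.
// Restricting to a conservative character set (an assumption, not the
// package's rule) avoids empty or path-like blob names.
function tenantBlobName(tenantId: string): string {
  if (!/^[A-Za-z0-9][A-Za-z0-9_-]*$/.test(tenantId)) {
    throw new Error(`Invalid tenant ID: ${tenantId}`);
  }
  return `${tenantId}.json`; // matches the "{tenantId}.json" naming described above
}
```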

API

extractNamedEntities(content, topN, options)

Extracts top N named entities from the given content.

Parameters:

  • content (string): The text content to analyze
  • topN (number): Number of top entities to return (default: 10)
  • options (ExtractionOptions): Optional configuration object
    • people (boolean): Include people names (default: true)
    • organizations (boolean): Include organizations (default: true)
    • locations (boolean): Include locations (default: true)
    • emails (boolean): Include email addresses (default: true)
    • normalize (boolean): Normalize entities for comparison (default: true)

Returns: NamedEntity[] - Array of entities sorted by frequency

extractPeople(content, topN)

Extract only person names from the content.

Parameters:

  • content (string): The text content to analyze
  • topN (number): Number of top people to return (default: 10)

Returns: NamedEntity[]

extractOrganizations(content, topN)

Extract only organization names from the content.

Parameters:

  • content (string): The text content to analyze
  • topN (number): Number of top organizations to return (default: 10)

Returns: NamedEntity[]

extractLocations(content, topN)

Extract only location names from the content.

Parameters:

  • content (string): The text content to analyze
  • topN (number): Number of top locations to return (default: 10)

Returns: NamedEntity[]

extractEmails(content, topN)

Extract only email addresses from the content.

Parameters:

  • content (string): The text content to analyze
  • topN (number): Number of top emails to return (default: 10)

Returns: NamedEntity[]

New API Functions

detectEntities(content)

Detect entity counts in content without extracting actual entities.

Parameters:

  • content (string): The text content to analyze

Returns: EntityCounts - Object with counts of each entity type

extractEntities(content, options)

Extract entities with limits per type.

Parameters:

  • content (string): The text content to analyze
  • options (ExtractOptions): Extraction options with limits per type
    • maxPeople (number, default: 10): Maximum people to extract
    • maxOrganizations (number, default: 10): Maximum organizations to extract
    • maxLocations (number, default: 10): Maximum locations to extract
    • maxEmails (number, default: 10): Maximum emails to extract
    • normalize (boolean, default: true): Normalize text for comparison

Returns: NamedEntity[] - Array of entities with limits per type applied
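As a sketch of how per-type limits might be applied, consider the hypothetical helper below. It is not the package's implementation; it only mirrors the documented `ExtractOptions` defaults (10 per type) and assumes the input array is already frequency-sorted.

```typescript
interface NamedEntity {
  text: string;
  type: string;
  count: number;
}

// Hypothetical helper: keep at most `limits[type]` entities of each type,
// preserving input order (assumed to be frequency-sorted). Missing limits
// fall back to the documented default of 10.
function applyPerTypeLimits(
  entities: NamedEntity[],
  limits: Record<string, number>
): NamedEntity[] {
  const taken = new Map<string, number>();
  return entities.filter((e) => {
    const used = taken.get(e.type) ?? 0;
    if (used >= (limits[e.type] ?? 10)) return false;
    taken.set(e.type, used + 1);
    return true;
  });
}
```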

Learning Functions

learnUsers(name)

Learn user names to filter them out from future extractions.

Parameters:

  • name (string | string[]): User name(s) to learn

Throws: Error if Azure Storage is not configured

learnEmployees(name) (Deprecated)

Deprecated: Use learnUsers() instead.

Parameters:

  • name (string | string[]): Employee name(s) to learn

Throws: Error if Azure Storage is not configured

learnOrganizations(name)

Learn organization names to filter them out from future extractions.

Parameters:

  • name (string | string[]): Organization name(s) to learn

Throws: Error if Azure Storage is not configured

learnEntities(entityType, name)

Generic function to learn entities of any supported type.

Parameters:

  • entityType (EntityType): Type of entity ('PERSON' or 'ORGANIZATION')
  • name (string | string[]): Entity name(s) to learn

Throws: Error if Azure Storage is not configured

importLearnedData(jsonData)

Import learned entities from external JSON data.

Parameters:

  • jsonData (LearnedData): Learned data object with PERSON and/or ORGANIZATION arrays

exportLearnedData()

Export current learned entities as JSON.

Returns: LearnedData - Learned data object

flushLearnedData(tenantId?)

Force immediate save of learned data to Azure Blob Storage.

Parameters:

  • tenantId (string, optional): Tenant ID to use as blob name. If provided, blob will be named "{tenantId}.json" in the container. If omitted, uses default blob name or tenantId set via learnedCache.setTenantId().

Returns: Promise<void>

Throws: Error if Azure Storage is not configured

isLearningConfigured()

Check if Azure Storage is configured.

Returns: boolean - True if Azure Storage is properly configured

Types

interface NamedEntity {
  text: string; // The entity text
  type: string; // Entity type: 'PERSON', 'ORGANIZATION', 'LOCATION', 'EMAIL', or 'OTHER'
  count: number; // Frequency count in the text
}

interface ExtractionOptions {
  people?: boolean;
  organizations?: boolean;
  locations?: boolean;
  emails?: boolean;
  normalize?: boolean;
}

type EntityType = "PERSON" | "ORGANIZATION" | "LOCATION" | "EMAIL";

interface LearnedData {
  PERSON?: string[];
  ORGANIZATION?: string[];
}

interface EntityCounts {
  PERSON: number;
  ORGANIZATION: number;
  LOCATION: number;
  EMAIL: number;
}

interface ExtractOptions {
  maxPeople?: number;
  maxOrganizations?: number;
  maxLocations?: number;
  maxEmails?: number;
  normalize?: boolean;
}

How It Works

This package uses Natural Language Processing (NLP), not predefined lists, to identify named entities in text.

Conceptual Approach

The extraction combines two complementary NLP libraries:

  1. Compromise - A rule-based NLP library that uses:

    • Linguistic patterns and heuristics
    • Part-of-speech tagging
    • Context analysis to identify entity types
    • Pattern matching for names, organizations, and places
  2. Wink-NLP - An NLP library with pre-trained models that provides:

    • Statistical models trained on English text
    • Entity recognition based on language patterns
    • Additional coverage for entities that Compromise might miss

Key Point: This is NOT a static list-based approach. The libraries dynamically analyze text structure, grammar, and context to identify entities on the fly.

How Entities Are Identified

  • People: Identified by capitalization patterns, titles (Mr., Dr.), positional context, and name structures
  • Organizations: Recognized through suffixes (Inc., Corp.), capitalization, and contextual clues
  • Locations: Detected via geographical keywords, proper nouns, and known place patterns
  • Email Addresses: Extracted using RFC 5322 compliant regex patterns for high accuracy

The dual-library approach increases coverage by combining results from both engines, with frequency counting for ranking. Email extraction uses regex pattern matching for maximum accuracy.
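The regex step and the frequency ranking described above can be sketched as follows. The pattern shown here is a deliberately simplified one for illustration, not the package's actual RFC 5322 style expression, and `topEmails` is a hypothetical function, not a package export.

```typescript
// Simplified sketch of regex-based email extraction with frequency ranking.
// The package uses a stricter RFC 5322 style pattern; this one is illustrative.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function topEmails(
  text: string,
  topN: number
): { text: string; type: string; count: number }[] {
  const counts = new Map<string, number>();
  for (const match of text.match(EMAIL_RE) ?? []) {
    const email = match.toLowerCase(); // normalize for case-insensitive counting
    counts.set(email, (counts.get(email) ?? 0) + 1);
  }
  // Rank by frequency, then truncate to the requested number of entities
  return [...counts.entries()]
    .map(([text, count]) => ({ text, type: "EMAIL", count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, topN);
}
```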

Limitations

Language Support

  • English only - Both libraries are specifically designed for English text
  • Not suitable for other languages - The models and patterns are English-specific
  • ⚠️ Multi-language names in English text - Can extract names from any country/culture (e.g., "Xi Jinping", "Angela Merkel") as long as the surrounding text is in English

Country/Region Suitability

Q: Is it good for any country?

  • ✅ Works for English-language content worldwide
  • ✅ Can extract international names mentioned in English text
  • ❌ Not suitable for non-English content (Chinese, Arabic, Spanish, etc.)
  • ⚠️ Best accuracy with Western naming conventions

Technical Limitations

  • Rule-based approach: May miss unconventional or creative entity names
  • Context-dependent: Accuracy varies with text quality and structure
  • No deep learning: Uses pattern matching rather than neural networks
  • Lite model: Uses a lightweight model for speed, trading some accuracy
  • Ambiguity: May misclassify entities in ambiguous contexts (e.g., "Apple" as fruit vs. company)
  • Informal text: Lower accuracy on social media, slang, or poorly formatted text
  • New/emerging entities: May not recognize very recent organizations or people until patterns emerge

Performance Considerations

  • Fast and lightweight (good for real-time applications)
  • Lower memory footprint compared to transformer-based models
  • Trade-off: Speed and size vs. accuracy of large language models

When to Use This Package

Good for:

  • English news articles, blog posts, formal documents
  • Quick entity extraction without heavy dependencies
  • Real-time applications requiring low latency
  • Applications where approximate results are acceptable
  • Processing large volumes of text efficiently

Not ideal for:

  • Non-English content
  • Highly specialized domains (medical, legal) requiring precision
  • Applications requiring 99%+ accuracy
  • Understanding context or relationships between entities

Dependencies

  • compromise (^14.10.0) - Fast, lightweight NLP library for person-name extraction
  • wink-nlp (^1.14.2) - Broader NER coverage (people/orgs/locations) with good performance
  • wink-eng-lite-web-model (^1.5.0) - English language model for Wink-NLP
  • @azure/storage-blob (^12.17.0) - Azure Blob Storage client for persistent learned data

License

Copyright (c) 2025 Unified-NER. All rights reserved.

This is proprietary software for private use only.
