unified-ner

v1.0.4

A lightweight TypeScript package for extracting top named entities from text using Compromise and Wink-NLP.

Features

  • Two Operation Modes:
    • Detect: Count entities by type (e.g., "5 people, 3 organizations")
    • Extract: Get actual entities with limits per type (e.g., top 5 people, top 5 orgs)
  • Extract named entities: People, Organizations, Locations, and Email Addresses
  • Rank entities by frequency
  • Learning functionality to filter out known entities
  • Minimal dependencies (two NLP libraries, an English language model, and the Azure Storage client)
  • TypeScript support with full type definitions
  • MIT licensed dependencies
  • Flexible API with customization options
  • Regex-based email extraction for high accuracy
  • Multi-tenant support with Azure Blob Storage

Installation

npm install unified-ner

Usage

Basic Usage

Detect Mode - Count Entities by Type

import { detectEntities } from "unified-ner";

const text = `
  Apple Inc. announced that Tim Cook will meet with President Biden in Washington.
  The meeting will discuss technology policy with Microsoft and Google.
  Contact us at [email protected] or [email protected] for more details.
`;

// Get counts of each entity type
const counts = detectEntities(text);
console.log(counts);
// {
//   PERSON: 2,        // Tim Cook, Biden
//   ORGANIZATION: 3,   // Apple Inc, Microsoft, Google
//   LOCATION: 1,       // Washington
//   EMAIL: 2          // [email protected], [email protected]
// }

Extract Mode - Get Actual Entities with Limits

import { extractEntities } from "unified-ner";

const text = `
  Apple Inc. announced that Tim Cook will meet with President Biden in Washington.
  The meeting will discuss technology policy with Microsoft and Google.
  Contact us at [email protected] or [email protected] for more details.
`;

// Get top 2 entities of each type
const entities = extractEntities(text, {
  maxPeople: 2,
  maxOrganizations: 2,
  maxLocations: 2,
  maxEmails: 2,
});
console.log(entities);
// [
//   { text: 'tim cook', type: 'PERSON', count: 1 },
//   { text: 'biden', type: 'PERSON', count: 1 },
//   { text: 'apple inc.', type: 'ORGANIZATION', count: 1 },
//   { text: 'microsoft', type: 'ORGANIZATION', count: 1 },
//   { text: 'washington', type: 'LOCATION', count: 1 },
//   { text: '[email protected]', type: 'EMAIL', count: 1 },
//   { text: '[email protected]', type: 'EMAIL', count: 1 }
// ]

Legacy Mode - Extract Top N Entities

import { extractNamedEntities } from "unified-ner";

// Get top 5 named entities (legacy mode)
const entities = extractNamedEntities(text, 5);
console.log(entities);
// [
//   { text: 'apple inc.', type: 'ORGANIZATION', count: 1 },
//   { text: 'tim cook', type: 'PERSON', count: 1 },
//   { text: 'biden', type: 'PERSON', count: 1 },
//   { text: 'washington', type: 'LOCATION', count: 1 },
//   { text: 'microsoft', type: 'ORGANIZATION', count: 1 }
// ]

Extract Specific Entity Types

import {
  extractPeople,
  extractOrganizations,
  extractLocations,
  extractEmails,
} from "unified-ner";

// Extract only people
const people = extractPeople(text, 5);

// Extract only organizations
const orgs = extractOrganizations(text, 5);

// Extract only locations
const locations = extractLocations(text, 5);

// Extract only email addresses
const emails = extractEmails(text, 5);

Advanced Options

import { extractNamedEntities } from "unified-ner";

const entities = extractNamedEntities(text, 10, {
  people: true, // Include people (default: true)
  organizations: true, // Include organizations (default: true)
  locations: false, // Exclude locations
  emails: true, // Include email addresses (default: true)
  normalize: true, // Normalize text for comparison (default: true)
});

Learning Functionality

The package includes intelligent learning capabilities that help filter out known entities, returning only NEW entities from your extractions. This is particularly useful for processing emails, documents, or any content where you want to focus on previously unknown entities.

Key Use Case: Email Processing

When processing emails, you often want to extract entities but exclude the sender's name (which you already know). The learning functionality helps you:

  • Learn email senders: Use learnUsers() with the sender's name to avoid extracting them repeatedly
  • Avoid generic emails: Learn organization names to filter out generic email addresses like "[email protected]" and focus on the actual organization name "ACME Invoices"

Setup

  1. Configure Azure Storage (required for learning):

    export UNIFIED_NER_AZURE_CONNECTION_STRING="your_azure_connection_string"
    export UNIFIED_NER_AZURE_CONTAINER_NAME="your_container_name"

    Or create a .env file in your project (auto-loaded):

    UNIFIED_NER_AZURE_CONNECTION_STRING=your_azure_connection_string
    UNIFIED_NER_AZURE_CONTAINER_NAME=your_container_name
  2. Configure batch size and refresh interval in config.json:

    {
      "saveBatchSize": 10,
      "cacheRefreshIntervalHours": 24
    }

Learning Functions

learnUsers(name)

Learn user names to filter them out from future extractions.

import { learnUsers } from "unified-ner";

// Learn a single user
learnUsers("John Doe");

// Learn multiple users
learnUsers(["Jane Smith", "Bob Johnson"]);

Recommended Usage: Call learnUsers() with the sender name from each email you process. This helps avoid extracting the same sender repeatedly.

learnEmployees(name) (Deprecated)

Deprecated: Use learnUsers() instead. This function is kept for backward compatibility.

import { learnEmployees } from "unified-ner";

// This will show a deprecation warning
learnEmployees("John Doe");

learnOrganizations(name)

Learn organization names to filter them out from future extractions.

import { learnOrganizations } from "unified-ner";

// Learn a single organization
learnOrganizations("ACME Corporation");

// Learn multiple organizations
learnOrganizations(["Microsoft", "Google", "Apple"]);

learnEntities(entityType, name)

Generic function to learn entities of any supported type.

import { learnEntities } from "unified-ner";

learnEntities("PERSON", "John Doe");
learnEntities("ORGANIZATION", "ACME Corp");

Data Management

importLearnedData(jsonData)

Import learned entities from external JSON data.

import { importLearnedData } from "unified-ner";

const existingData = {
  PERSON: ["John Doe", "Jane Smith"],
  ORGANIZATION: ["ACME Corp", "Tech Inc"],
};

importLearnedData(existingData);

exportLearnedData()

Export current learned entities as JSON.

import { exportLearnedData } from "unified-ner";

const learnedData = exportLearnedData();
console.log(learnedData);
// { PERSON: ["john doe", "jane smith"], ORGANIZATION: ["acme corp"] }

flushLearnedData(tenantId?)

Force immediate save to Azure Blob Storage (bypasses batch size).

import { flushLearnedData } from "unified-ner";

// Save to default blob
await flushLearnedData();

// Save to tenant-specific blob (creates or updates "{tenantId}.json")
await flushLearnedData("tenant1");

Example Workflow

import {
  extractNamedEntities,
  learnUsers,
  learnOrganizations,
} from "unified-ner";

// Process an email
const emailContent = `
  From: John Doe <[email protected]>
  Subject: Meeting with ACME Corporation
  
  Hi team,
  
  I met with Sarah Johnson from ACME Corporation yesterday.
  We discussed the project timeline with Microsoft.
`;

// Learn the sender to avoid extracting them
learnUsers("John Doe");

// Learn known organizations
learnOrganizations(["ACME Corporation", "Microsoft"]);

// Extract entities - will exclude learned names
const newEntities = extractNamedEntities(emailContent, 10);
console.log(newEntities);
// Only returns: Sarah Johnson (new person), other unknown entities
// Excludes: John Doe, ACME Corporation, Microsoft

Important Notes

  • Case Insensitive: All learning is case-insensitive ("John Doe" and "john doe" are treated as the same)
  • Persistence: Learned data is automatically saved to Azure Blob Storage
  • Batching: Data is saved in batches (configurable via config.json) for performance
  • Fallback: If Azure Storage is not configured, learning functions throw errors, while extraction functions continue to work normally
  • Filtering: Only PERSON and ORGANIZATION entities are filtered; LOCATION and EMAIL entities are always returned
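The filtering rules above can be illustrated with a small sketch. The `NamedEntity` shape matches the Types section below, but `filterLearned` itself is a hypothetical helper written for this README, not a package export.

```typescript
interface NamedEntity {
  text: string;
  type: string; // 'PERSON', 'ORGANIZATION', 'LOCATION', 'EMAIL', or 'OTHER'
  count: number;
}

// Hypothetical helper mirroring the documented rules: comparisons are
// case-insensitive, and only PERSON and ORGANIZATION entities are filtered.
function filterLearned(
  entities: NamedEntity[],
  learned: Record<string, string[]>
): NamedEntity[] {
  const lowercased = (type: string) =>
    new Set((learned[type] ?? []).map((n) => n.toLowerCase()));
  const people = lowercased("PERSON");
  const orgs = lowercased("ORGANIZATION");

  return entities.filter((e) => {
    if (e.type === "PERSON") return !people.has(e.text.toLowerCase());
    if (e.type === "ORGANIZATION") return !orgs.has(e.text.toLowerCase());
    return true; // LOCATION and EMAIL entities are always returned
  });
}
```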

Configuration Check

import { isLearningConfigured } from "unified-ner";

if (isLearningConfigured()) {
  console.log("Learning functionality is available");
} else {
  console.log("Azure Storage not configured - learning disabled");
}

Implementation Recommendations

Daily Cache Refresh

For production applications, it's recommended to refresh the learned data cache periodically to ensure data consistency across multiple instances:

// Pick ONE of the following, depending on your module system:

// CommonJS (require)
const { learnedCache } = require("unified-ner");

// ES Modules (import)
import pkg from "unified-ner";
const { learnedCache } = pkg;

// Load configuration
const config = require("./config.json");
const refreshIntervalMs = config.cacheRefreshIntervalHours * 60 * 60 * 1000;

// Set up periodic cache refresh (default: every 24 hours)
setInterval(async () => {
  try {
    await learnedCache.loadCache();
    console.log("Learned data cache refreshed from Azure Storage");
  } catch (error) {
    console.error("Failed to refresh learned data cache:", error);
  }
}, refreshIntervalMs);

// Initial load on application startup
(async () => {
  try {
    await learnedCache.loadCache();
    console.log("Initial cache loaded successfully");
  } catch (error) {
    console.error("Failed to load initial cache:", error);
  }
})();

Configuration in config.json:

  • cacheRefreshIntervalHours: 24 (recommended for production)
  • Adjust based on how frequently learned data changes
  • Lower values = more API calls to Azure Storage but fresher data

Multi-Tenant Best Practices

For multi-tenant deployments, use a single container with tenant-specific blob names:

import {
  learnUsers,
  learnOrganizations,
  flushLearnedData,
  learnedCache,
} from "unified-ner";

// Single container for all tenants (e.g., "named-entities")
// Each tenant gets their own blob named "{tenantId}.json" in this container

const tenantId = getCurrentTenantId(); // Your tenant identification logic

// Workflow: Load tenant's cache -> Learn entities -> Flush to tenant's blob
await learnedCache.loadCache(tenantId); // Loads from "{tenantId}.json" blob
learnUsers("John Doe");
learnOrganizations("ACME Corp");
await flushLearnedData(tenantId); // Saves to "{tenantId}.json" blob

// Switch to another tenant
const anotherTenantId = "tenant2";
await learnedCache.loadCache(anotherTenantId); // Loads from "tenant2.json"
learnUsers("Jane Smith");
await flushLearnedData(anotherTenantId); // Saves to "tenant2.json"

Recommendations:

  • Use a single container for all tenants (e.g., "named-entities")
  • Each tenant's data is stored as a separate blob: {tenantId}.json
  • Validate tenant ID before using it
  • Implement proper error handling for tenant switching
  • The blob will be created if it doesn't exist, updated if it does
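The tenant-ID validation recommended above might look like the following sketch. `tenantBlobName` is a hypothetical helper, not a package export; the allowed character set is an assumption chosen to keep blob names safe.

```typescript
// Hypothetical guard for tenant IDs before using them as blob names.
// Restricting to a conservative character set (an assumption, not the
// package's rule) avoids empty or path-like blob names.
function tenantBlobName(tenantId: string): string {
  if (!/^[A-Za-z0-9][A-Za-z0-9_-]*$/.test(tenantId)) {
    throw new Error(`Invalid tenant ID: ${tenantId}`);
  }
  return `${tenantId}.json`; // matches the "{tenantId}.json" naming described above
}
```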

API

extractNamedEntities(content, topN, options)

Extracts top N named entities from the given content.

Parameters:

  • content (string): The text content to analyze
  • topN (number): Number of top entities to return (default: 10)
  • options (ExtractionOptions): Optional configuration object
    • people (boolean): Include people names (default: true)
    • organizations (boolean): Include organizations (default: true)
    • locations (boolean): Include locations (default: true)
    • emails (boolean): Include email addresses (default: true)
    • normalize (boolean): Normalize entities for comparison (default: true)

Returns: NamedEntity[] - Array of entities sorted by frequency

extractPeople(content, topN)

Extract only person names from the content.

Parameters:

  • content (string): The text content to analyze
  • topN (number): Number of top people to return (default: 10)

Returns: NamedEntity[]

extractOrganizations(content, topN)

Extract only organization names from the content.

Parameters:

  • content (string): The text content to analyze
  • topN (number): Number of top organizations to return (default: 10)

Returns: NamedEntity[]

extractLocations(content, topN)

Extract only location names from the content.

Parameters:

  • content (string): The text content to analyze
  • topN (number): Number of top locations to return (default: 10)

Returns: NamedEntity[]

extractEmails(content, topN)

Extract only email addresses from the content.

Parameters:

  • content (string): The text content to analyze
  • topN (number): Number of top emails to return (default: 10)

Returns: NamedEntity[]

New API Functions

detectEntities(content)

Detect entity counts in content without extracting actual entities.

Parameters:

  • content (string): The text content to analyze

Returns: EntityCounts - Object with counts of each entity type

extractEntities(content, options)

Extract entities with limits per type.

Parameters:

  • content (string): The text content to analyze
  • options (ExtractOptions): Extraction options with limits per type
    • maxPeople (number, default: 10): Maximum people to extract
    • maxOrganizations (number, default: 10): Maximum organizations to extract
    • maxLocations (number, default: 10): Maximum locations to extract
    • maxEmails (number, default: 10): Maximum emails to extract
    • normalize (boolean, default: true): Normalize text for comparison

Returns: NamedEntity[] - Array of entities with limits per type applied
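As a sketch of how per-type limits might be applied, consider the hypothetical helper below. It is not the package's implementation; it only mirrors the documented `ExtractOptions` defaults (10 per type) and assumes the input array is already frequency-sorted.

```typescript
interface NamedEntity {
  text: string;
  type: string;
  count: number;
}

// Hypothetical helper: keep at most `limits[type]` entities of each type,
// preserving input order (assumed to be frequency-sorted). Missing limits
// fall back to the documented default of 10.
function applyPerTypeLimits(
  entities: NamedEntity[],
  limits: Record<string, number>
): NamedEntity[] {
  const taken = new Map<string, number>();
  return entities.filter((e) => {
    const used = taken.get(e.type) ?? 0;
    if (used >= (limits[e.type] ?? 10)) return false;
    taken.set(e.type, used + 1);
    return true;
  });
}
```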

Learning Functions

learnUsers(name)

Learn user names to filter them out from future extractions.

Parameters:

  • name (string | string[]): User name(s) to learn

Throws: Error if Azure Storage is not configured

learnEmployees(name) (Deprecated)

Deprecated: Use learnUsers() instead.

Parameters:

  • name (string | string[]): Employee name(s) to learn

Throws: Error if Azure Storage is not configured

learnOrganizations(name)

Learn organization names to filter them out from future extractions.

Parameters:

  • name (string | string[]): Organization name(s) to learn

Throws: Error if Azure Storage is not configured

learnEntities(entityType, name)

Generic function to learn entities of any supported type.

Parameters:

  • entityType (EntityType): Type of entity ('PERSON' or 'ORGANIZATION')
  • name (string | string[]): Entity name(s) to learn

Throws: Error if Azure Storage is not configured

importLearnedData(jsonData)

Import learned entities from external JSON data.

Parameters:

  • jsonData (LearnedData): Learned data object with PERSON and/or ORGANIZATION arrays

exportLearnedData()

Export current learned entities as JSON.

Returns: LearnedData - Learned data object

flushLearnedData(tenantId?)

Force immediate save of learned data to Azure Blob Storage.

Parameters:

  • tenantId (string, optional): Tenant ID to use as blob name. If provided, blob will be named "{tenantId}.json" in the container. If omitted, uses default blob name or tenantId set via learnedCache.setTenantId().

Returns: Promise<void>

Throws: Error if Azure Storage is not configured

isLearningConfigured()

Check if Azure Storage is configured.

Returns: boolean - True if Azure Storage is properly configured

Types

interface NamedEntity {
  text: string; // The entity text
  type: string; // Entity type: 'PERSON', 'ORGANIZATION', 'LOCATION', 'EMAIL', or 'OTHER'
  count: number; // Frequency count in the text
}

interface ExtractionOptions {
  people?: boolean;
  organizations?: boolean;
  locations?: boolean;
  emails?: boolean;
  normalize?: boolean;
}

type EntityType = "PERSON" | "ORGANIZATION" | "LOCATION" | "EMAIL";

interface LearnedData {
  PERSON?: string[];
  ORGANIZATION?: string[];
}

interface EntityCounts {
  PERSON: number;
  ORGANIZATION: number;
  LOCATION: number;
  EMAIL: number;
}

interface ExtractOptions {
  maxPeople?: number;
  maxOrganizations?: number;
  maxLocations?: number;
  maxEmails?: number;
  normalize?: boolean;
}

How It Works

This package uses Natural Language Processing (NLP), not predefined lists, to identify named entities in text.

Conceptual Approach

The extraction combines two complementary NLP libraries:

  1. Compromise - A rule-based NLP library that uses:

    • Linguistic patterns and heuristics
    • Part-of-speech tagging
    • Context analysis to identify entity types
    • Pattern matching for names, organizations, and places
  2. Wink-NLP - An NLP library with pre-trained models that provides:

    • Statistical models trained on English text
    • Entity recognition based on language patterns
    • Additional coverage for entities that Compromise might miss

Key Point: This is NOT a static list-based approach. The libraries dynamically analyze text structure, grammar, and context to identify entities on the fly.

How Entities Are Identified

  • People: Identified by capitalization patterns, titles (Mr., Dr.), positional context, and name structures
  • Organizations: Recognized through suffixes (Inc., Corp.), capitalization, and contextual clues
  • Locations: Detected via geographical keywords, proper nouns, and known place patterns
  • Email Addresses: Extracted using RFC 5322 compliant regex patterns for high accuracy

The dual-library approach increases coverage by combining results from both engines, with frequency counting for ranking. Email extraction uses regex pattern matching for maximum accuracy.
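The regex step and the frequency ranking described above can be sketched as follows. The pattern shown here is a deliberately simplified one for illustration, not the package's actual RFC 5322 style expression, and `topEmails` is a hypothetical function, not a package export.

```typescript
// Simplified sketch of regex-based email extraction with frequency ranking.
// The package uses a stricter RFC 5322 style pattern; this one is illustrative.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function topEmails(
  text: string,
  topN: number
): { text: string; type: string; count: number }[] {
  const counts = new Map<string, number>();
  for (const match of text.match(EMAIL_RE) ?? []) {
    const email = match.toLowerCase(); // normalize for case-insensitive counting
    counts.set(email, (counts.get(email) ?? 0) + 1);
  }
  // Rank by frequency, then truncate to the requested number of entities
  return [...counts.entries()]
    .map(([text, count]) => ({ text, type: "EMAIL", count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, topN);
}
```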

Limitations

Language Support

  • English only - Both libraries are specifically designed for English text
  • Not suitable for other languages - The models and patterns are English-specific
  • ⚠️ Multi-language names in English text - Can extract names from any country/culture (e.g., "Xi Jinping", "Angela Merkel") as long as the surrounding text is in English

Country/Region Suitability

Q: Is it good for any country?

  • ✅ Works for English-language content worldwide
  • ✅ Can extract international names mentioned in English text
  • ❌ Not suitable for non-English content (Chinese, Arabic, Spanish, etc.)
  • ⚠️ Best accuracy with Western naming conventions

Technical Limitations

  • Rule-based approach: May miss unconventional or creative entity names
  • Context-dependent: Accuracy varies with text quality and structure
  • No deep learning: Uses pattern matching rather than neural networks
  • Lite model: Uses a lightweight model for speed, trading some accuracy
  • Ambiguity: May misclassify entities in ambiguous contexts (e.g., "Apple" as fruit vs. company)
  • Informal text: Lower accuracy on social media, slang, or poorly formatted text
  • New/emerging entities: May not recognize very recent organizations or people until patterns emerge

Performance Considerations

  • Fast and lightweight (good for real-time applications)
  • Lower memory footprint compared to transformer-based models
  • Trade-off: Speed and size vs. accuracy of large language models

When to Use This Package

Good for:

  • English news articles, blog posts, formal documents
  • Quick entity extraction without heavy dependencies
  • Real-time applications requiring low latency
  • Applications where approximate results are acceptable
  • Processing large volumes of text efficiently

Not ideal for:

  • Non-English content
  • Highly specialized domains (medical, legal) requiring precision
  • Applications requiring 99%+ accuracy
  • Understanding context or relationships between entities

Dependencies

  • compromise (^14.10.0) - Fast, lightweight NLP library for person-name extraction
  • wink-nlp (^1.14.2) - Broader NER coverage (people/orgs/locations) with good performance
  • wink-eng-lite-web-model (^1.5.0) - English language model for Wink-NLP
  • @azure/storage-blob (^12.17.0) - Azure Blob Storage client for persistent learned data

License

Copyright (c) 2025 Unified-NER. All rights reserved.

This is proprietary software for private use only.
