@nlci/core

v0.1.0

Published

5 months ago

Core NLCI engine with LSH-based code similarity detection


iamthegreatdestroyer

Neural-LSH Code Intelligence - Sub-linear code similarity detection engine

Features

- O(1) Query Time: Near-constant-time lookup of similar code with respect to index size, using Locality-Sensitive Hashing (LSH)
- O(n) Indexing: Indexing cost grows linearly with the number of code blocks; each block is embedded once with a neural model
- Multi-Probe LSH: Probes neighboring buckets to improve recall while keeping lookups fast
- Clone Detection: Identifies Type-1 through Type-4 code clones
- Language Agnostic: Supports 20+ programming languages

Installation

npm install @nlci/core
# or
pnpm add @nlci/core
# or
yarn add @nlci/core

For neural embedding support, also install the ONNX runtime:

npm install onnxruntime-node

Quick Start

import { NLCIEngine } from '@nlci/core';

// Create engine with default config
const engine = new NLCIEngine();

// Index some code
await engine.indexCode(
  `
function add(a: number, b: number): number {
  return a + b;
}
`,
  'math.ts'
);

await engine.indexCode(
  `
function sum(x: number, y: number): number {
  return x + y;
}
`,
  'utils.ts'
);

// Query for similar code
const results = await engine.query(`
function addition(n1: number, n2: number): number {
  return n1 + n2;
}
`);

console.log(results.clones); // Found similar functions!

API Reference

NLCIEngine

The main entry point for code similarity detection.

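// Signature: new NLCIEngine(config?: Partial<NLCIConfig>)
// Every top-level section is optional; omitted sections use the defaults listed below
const engine = new NLCIEngine({
  lsh: { numTables: 20, numBits: 12, dimension: 384, multiProbe: true },
});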

Configuration

interface NLCIConfig {
  lsh: {
    numTables: number; // Default: 20 (L parameter)
    numBits: number; // Default: 12 (K parameter)
    dimension: number; // Default: 384 (embedding dimension)
    multiProbe: boolean; // Default: true
  };
  embedding: {
    modelPath: string; // Path to ONNX model
    batchSize: number; // Default: 32
  };
  parser: {
    minBlockSize: number; // Default: 10 tokens
    maxBlockSize: number; // Default: 1000 tokens
  };
}
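To get a feel for what the default LSH settings imply, the arithmetic below derives a few quantities from them. This is a back-of-the-envelope sketch: `lshFootprint` and `LshParams` are hypothetical names, not part of `@nlci/core`, and the byte count assumes each hyperplane component is stored as a 32-bit float.

```typescript
// Hypothetical helper for illustration; not part of @nlci/core.
interface LshParams {
  numTables: number; // L
  numBits: number; // K
  dimension: number; // embedding dimension
}

function lshFootprint({ numTables, numBits, dimension }: LshParams) {
  const bucketsPerTable = 2 ** numBits; // a K-bit hash addresses 2^K buckets
  const totalHyperplanes = numTables * numBits; // K hyperplanes per table
  // Assumption: one 32-bit float (4 bytes) per hyperplane component.
  const hyperplaneBytes = totalHyperplanes * dimension * 4;
  return { bucketsPerTable, totalHyperplanes, hyperplaneBytes };
}

console.log(lshFootprint({ numTables: 20, numBits: 12, dimension: 384 }));
// { bucketsPerTable: 4096, totalHyperplanes: 240, hyperplaneBytes: 368640 }
```

Raising `numBits` makes buckets smaller and more selective, while raising `numTables` adds more chances for similar blocks to collide.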

Methods

`indexCode(code, filePath, language?)`

Parses and indexes code into the LSH index.

const blocks = await engine.indexCode(sourceCode, 'file.ts', 'typescript');

`query(code, options?)`

Finds similar code blocks.

const results = await engine.query(code, {
  maxResults: 10,
  minSimilarity: 0.8,
  cloneTypes: ['type-2', 'type-3'],
});

`findSimilar(blockId, options?)`

Finds blocks similar to an already-indexed block.

const results = await engine.findSimilar('block-id');

`findAllClones(options?)`

Finds all clone clusters in the index.

const clusters = await engine.findAllClones();
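The README does not say how clusters are formed. One common approach, sketched here as an assumption rather than a description of the library, is to treat each pair of similar blocks as an edge and merge connected blocks with union-find. The `ClonePair` shape and `clusterPairs` function are hypothetical.

```typescript
// Hypothetical input shape: two block ids that were judged similar.
type ClonePair = [string, string];

function clusterPairs(pairs: ClonePair[]): string[][] {
  const parent = new Map<string, string>();

  // Find the representative of `id`, compressing the path on the way back.
  const find = (id: string): string => {
    if (!parent.has(id)) parent.set(id, id);
    let root = id;
    while (parent.get(root) !== root) root = parent.get(root)!;
    let cur = id;
    while (cur !== root) {
      const next = parent.get(cur)!;
      parent.set(cur, root);
      cur = next;
    }
    return root;
  };

  // Union the two sides of every pair.
  for (const [a, b] of pairs) {
    const ra = find(a);
    const rb = find(b);
    if (ra !== rb) parent.set(ra, rb);
  }

  // Group block ids by their representative.
  const clusters = new Map<string, string[]>();
  for (const id of Array.from(parent.keys())) {
    const root = find(id);
    const members = clusters.get(root) ?? [];
    members.push(id);
    clusters.set(root, members);
  }
  return Array.from(clusters.values()).map((members) => members.sort());
}

console.log(clusterPairs([['a', 'b'], ['b', 'c'], ['d', 'e']]));
// [ [ 'a', 'b', 'c' ], [ 'd', 'e' ] ]
```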

`getStats()`

Returns index statistics.

const stats = engine.getStats();
// { totalBlocks, totalQueries, avgQueryTime, tableDistribution, ... }

Clone Types

| Type   | Description                                        | Similarity |
| ------ | -------------------------------------------------- | ---------- |
| Type-1 | Exact clones (whitespace/comment differences only) | ≥99%       |
| Type-2 | Parameterized clones (renamed identifiers)         | 95-99%     |
| Type-3 | Near-miss clones (statements added/removed)        | 85-95%     |
| Type-4 | Semantic clones (same logic, different syntax)     | 70-85%     |
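The similarity bands above can be read as a simple threshold ladder. The sketch below is illustrative only: `classifyClone` is a hypothetical helper, and because the bands share their endpoints, it assumes a boundary value belongs to the stricter type.

```typescript
type CloneType = 'type-1' | 'type-2' | 'type-3' | 'type-4';

// Hypothetical helper mirroring the table; `similarity` is a score in [0, 1].
function classifyClone(similarity: number): CloneType | null {
  if (similarity >= 0.99) return 'type-1';
  if (similarity >= 0.95) return 'type-2';
  if (similarity >= 0.85) return 'type-3';
  if (similarity >= 0.7) return 'type-4';
  return null; // below the Type-4 band: not treated as a clone
}

console.log(classifyClone(0.97)); // 'type-2'
console.log(classifyClone(0.5)); // null
```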

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        NLCIEngine                            │
├──────────────────┬──────────────────┬───────────────────────┤
│   Code Parser    │ Embedding Model  │    Query Engine       │
│   (Tree-sitter)  │   (ONNX/Mock)    │                       │
├──────────────────┴──────────────────┴───────────────────────┤
│                       LSH Index                              │
│  ┌─────────┬─────────┬─────────┬─────────┬─────────────────┐ │
│  │ Table 0 │ Table 1 │ Table 2 │   ...   │ Table L-1       │ │
│  └─────────┴─────────┴─────────┴─────────┴─────────────────┘ │
│                    Random Hyperplanes                        │
└─────────────────────────────────────────────────────────────┘

LSH Algorithm

The LSH index uses random hyperplane projection:

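1. Hash Generation: Each of the L tables has its own K random hyperplanes
2. Hash Computation: Each hyperplane contributes one bit, sign(hyperplane · v); concatenating the K bits gives the table's K-bit hash h(v)
3. Multi-Probe: A query also probes neighboring buckets whose hash is within Hamming distance ≤ 2 of its own
4. Candidate Retrieval: The candidate set is the union of candidates from all L tables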
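The hashing step can be sketched in a few lines. This is a minimal, self-contained illustration of random hyperplane hashing, not the library's implementation: the seeded generator, Gaussian sampling, and all function names are assumptions made for the example.

```typescript
// Small seeded PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal sample via the Box-Muller transform.
function gaussian(rand: () => number): number {
  const u = 1 - rand(); // in (0, 1], avoids log(0)
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * rand());
}

// K random hyperplanes (normal vectors) for one table.
function makeHyperplanes(k: number, dim: number, rand: () => number): Float32Array[] {
  const planes: Float32Array[] = [];
  for (let p = 0; p < k; p++) {
    const plane = new Float32Array(dim);
    for (let i = 0; i < dim; i++) plane[i] = gaussian(rand);
    planes.push(plane);
  }
  return planes;
}

// One bit per hyperplane: 1 if v lies on its non-negative side, else 0.
function hashVector(v: ArrayLike<number>, planes: Float32Array[]): number {
  let hash = 0;
  for (const plane of planes) {
    let dot = 0;
    for (let i = 0; i < plane.length; i++) dot += plane[i] * v[i];
    hash = (hash << 1) | (dot >= 0 ? 1 : 0);
  }
  return hash >>> 0; // valid for K up to 31 bits
}

const rand = mulberry32(42);
const planes = makeHyperplanes(12, 384, rand);
const vector = new Float32Array(384);
for (let i = 0; i < vector.length; i++) vector[i] = gaussian(rand);
console.log(hashVector(vector, planes).toString(2).padStart(12, '0'));
```

Only the sign of each dot product matters, so scaling a vector leaves its hash unchanged, and vectors separated by a small angle agree on most bits.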

Complexity:

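Index: O(L) bucket insertions per block (one K-bit hash per table)
Query: O(L × probe_count) bucket lookups, where probe_count = 1 + K + K(K−1)/2 for Hamming distance ≤ 2 (79 when K = 12). Apart from scoring the retrieved candidates, this does not grow with the number of indexed blocks, so it is ≈ O(1) with typical parameters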
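Multi-probe within Hamming distance ≤ 2 touches a fixed number of buckets per table, whatever the index size. The sketch below enumerates them. `probeKeys` is a hypothetical helper, and the library's actual probing order is not documented here.

```typescript
// Enumerate every K-bit bucket key within Hamming distance <= 2 of `hash`.
function probeKeys(hash: number, numBits: number): number[] {
  const keys = [hash]; // distance 0: the query's own bucket
  for (let i = 0; i < numBits; i++) {
    keys.push(hash ^ (1 << i)); // distance 1: flip one bit
    for (let j = i + 1; j < numBits; j++) {
      keys.push(hash ^ (1 << i) ^ (1 << j)); // distance 2: flip two bits
    }
  }
  return keys;
}

const probes = probeKeys(0b101100111000, 12);
console.log(probes.length); // 79 = 1 + 12 + 66
console.log(probes.length * 20); // 1580 bucket lookups across L = 20 tables
```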

Advanced Usage

Custom Parser

import { NLCIEngine, type CodeParser } from '@nlci/core';

class TreeSitterParser implements CodeParser {
  supportedLanguages = ['typescript', 'javascript'] as const;

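  // Parameter types are assumed here; follow the CodeParser interface exported by the package
  parse(source: string, filePath: string, language: string) {
    // Use tree-sitter to split `source` into blocks and collect them in `blocks`
    return { blocks: [], errors: [], duration: 0 };
  }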
}

const engine = new NLCIEngine({}, {
  parser: new TreeSitterParser(),
});

Custom Embedding Model

import { NLCIEngine, type EmbeddingModel } from '@nlci/core';

class ONNXEmbedding implements EmbeddingModel {
  dimension = 384;

  async embed(code: string) {
    // Placeholder: run the ONNX session here; an all-zero vector carries no similarity signal
    return new Float32Array(this.dimension);
  }

  async embedBatch(codes: string[]) {
    return Promise.all(codes.map((c) => this.embed(c)));
  }
}

const engine = new NLCIEngine(
  {},
  {
    embeddingModel: new ONNXEmbedding(),
  }
);
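For tests that should not depend on an ONNX model, a deterministic stand-in can fill the same role. The sketch below is an assumption-heavy illustration: `EmbeddingModelLike` is a local copy of the shape used in the example above, and `HashingEmbedding` is a lexical bag-of-tokens hash, not a neural embedding, so renamed identifiers will change its output.

```typescript
// Local stand-in for the EmbeddingModel shape shown above (assumed, not imported).
interface EmbeddingModelLike {
  dimension: number;
  embed(code: string): Promise<Float32Array>;
  embedBatch(codes: string[]): Promise<Float32Array[]>;
}

// FNV-1a string hash, used to map each token to a vector slot.
function fnv1a(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Count tokens into hashed slots, then L2-normalise the vector.
function hashEmbed(code: string, dimension: number): Float32Array {
  const vec = new Float32Array(dimension);
  const tokens = code.match(/[A-Za-z_]\w*|\d+|[^\s\w]/g) ?? [];
  for (const token of tokens) vec[fnv1a(token) % dimension] += 1;
  let norm = 0;
  for (let i = 0; i < vec.length; i++) norm += vec[i] * vec[i];
  norm = Math.sqrt(norm);
  if (norm > 0) for (let i = 0; i < vec.length; i++) vec[i] /= norm;
  return vec;
}

class HashingEmbedding implements EmbeddingModelLike {
  dimension = 384;

  async embed(code: string): Promise<Float32Array> {
    return hashEmbed(code, this.dimension);
  }

  async embedBatch(codes: string[]): Promise<Float32Array[]> {
    return Promise.all(codes.map((c) => this.embed(c)));
  }
}
```

An instance could then be passed wherever the example above passes `new ONNXEmbedding()`, provided the package's real `EmbeddingModel` interface matches this assumed shape.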

Persistence

// Save index to storage
await engine.save();

// Load index from storage
const loaded = await engine.load();

Performance

Benchmarks on MacBook Pro M1:

| Operation        | Time   | Complexity |
| ---------------- | ------ | ---------- |
| Index 1 block    | ~5ms   | O(L)       |
| Query            | ~0.5ms | O(1)       |
| Index 10K blocks | ~50s   | O(n)       |
| Query 10K blocks | ~0.5ms | O(1)       |

Memory usage: ~100 bytes per indexed block (excluding embeddings)
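To turn the per-block figure into a rough total, the sketch below adds the embedding storage that the figure excludes. `estimateIndexBytes` is a hypothetical helper. It assumes embeddings are kept in memory as `Float32Array` (4 bytes per component), which the README does not confirm.

```typescript
// Hypothetical estimator; the 100-byte default comes from the figure above.
function estimateIndexBytes(blocks: number, dimension = 384, bytesPerBlock = 100) {
  const indexBytes = blocks * bytesPerBlock;
  // Assumption: each embedding is retained as a Float32Array.
  const embeddingBytes = blocks * dimension * 4;
  return { indexBytes, embeddingBytes, totalBytes: indexBytes + embeddingBytes };
}

console.log(estimateIndexBytes(10_000));
// { indexBytes: 1000000, embeddingBytes: 15360000, totalBytes: 16360000 }
```

Under these assumptions the embeddings account for most of the memory, so the 384-dimension default matters more than the index overhead.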

Contributing

See CONTRIBUTING.md for guidelines.

License

AGPL-3.0-or-later - See LICENSE
