qtools-graph-forge-core

v1.5.0

Published

a month ago

Shared infrastructure for building queryable Neo4j graph databases from source data — container management, loading, embedding, search, bridges, and askMilo tool generation


tqwhite

neo4j graph database embedding voyage docker graphForge

graph-forge-core

The engine that turns source data files into queryable Neo4j graph databases with semantic search.

What It Does

Input: A source data file in any format (JSON, RDF/XML, TSV, text documents, CRM exports) plus a parser that converts it to a standard node/edge format.

Transformation: Loads nodes and edges into a Neo4j container, generates Voyage AI vector embeddings for semantic search, creates BM25 fulltext indexes for keyword search, builds cross-standard bridge edges between data sources, and generates askMilo provider.json so the graph is immediately queryable through the web UI.

Infrastructure: Each graph runs in a Docker container with Neo4j Community Edition. The containerManager creates containers on demand (docker run with --restart unless-stopped), scans for available ports starting at 7700 (using docker ps to avoid collisions), and manages the full lifecycle (start, stop, status, destroy). Docker Desktop must be running. Each container uses ~300-700MB RAM.

Standalone containers (for development/testing) are prefixed gf_ (e.g., gf_edmatrix). Production data goes into the shared rag_DataModelExplorer container via the --target=dme export path.
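The port scan and container naming described above can be sketched roughly as follows. This is an illustration under stated assumptions, not containerManager's actual code: `findFreePort` and `buildRunArgs` are hypothetical names, and the image tag, the `NEO4J_AUTH` value, and the mapping of the host port to Bolt (7687) are guesses. The port list is taken as a string so the logic can be tried without Docker; in practice it might come from `docker ps --format "{{.Ports}}"`.

```javascript
'use strict';

// Hypothetical helper: pick the first host port >= startPort that does not
// appear in the output of `docker ps --format "{{.Ports}}"`.
// Port ranges (e.g. 7700-7701->...) are not handled in this sketch.
const findFreePort = (dockerPsPorts, startPort = 7700) => {
    const used = new Set();
    // Host-side mappings look like "0.0.0.0:7700->7687/tcp" or ":::7700->7687/tcp".
    const mapping = /:(\d+)->/g;
    let match;
    while ((match = mapping.exec(dockerPsPorts)) !== null) {
        used.add(Number(match[1]));
    }
    let port = startPort;
    while (used.has(port)) {
        port += 1;
    }
    return port;
};

// Hypothetical helper: argument list for `docker run`. The image tag, the
// NEO4J_AUTH value, and the Bolt port mapping are assumptions.
const buildRunArgs = (graphName, hostPort) => [
    'run', '-d',
    '--name', `gf_${graphName.toLowerCase()}`,
    '--restart', 'unless-stopped',
    '-p', `${hostPort}:7687`,
    '-e', 'NEO4J_AUTH=neo4j/change-me',
    'neo4j:community'
];

const sampleOutput = [
    '0.0.0.0:7700->7687/tcp, :::7700->7687/tcp',
    '0.0.0.0:7701->7687/tcp'
].join('\n');

const port = findFreePort(sampleOutput);
console.log(port, buildRunArgs('EdMatrix', port).join(' '));
```

Keeping the scan a pure function of the `docker ps` output makes it easy to test; the argument array could then be handed to `child_process.execFile('docker', args, callback)`.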

Output: A running Neo4j graph database with:

Labeled nodes with properties, relationships, and vector embeddings
Per-source fulltext and vector indexes for scoped search
Cross-standard bridge edges (similarity-based and rule-based)
askMilo tool definitions (search, explore, stats, raw Cypher, history)
A timestamped audit trail of every operation

The only custom code per data source is a parser that reads the source format and produces standard node objects (under 100 lines for a simple JSON source, several hundred for complex formats; see Real Parser Examples). Everything else (container management, loading, embedding, indexing, searching, bridge building, provider generation) is handled by this library.

This is NOT a CLI module — it lives at cli/graph-forge-core/ (peer of cli/lib.d/, not inside it) and is require()'d by forge CLIs via relative path.

Modules

| Module | Purpose | Used By |
|--------|---------|---------|
| forgeRunner.js | Shared CLI engine: config-driven dispatcher for all forge actions | Every forge CLI (30-line config → full CLI) |
| containerManager.js | Docker Neo4j lifecycle: create, start, stop, destroy, port scanning | forgeRunner, rebuildDme |
| graphLoader.js | Parser output → Neo4j nodes/edges with dual labels + GraphSource root | forgeRunner |
| graphEmbedder.js | Voyage-4 vector embeddings + per-source fulltext/vector index creation | forgeRunner |
| graphSearchTool.js | Generic hybrid BM25 + vector search, stats, explore, rawCypher, history | forgeRunner, generated search CLIs |
| bridgeBuilder.js | Execute bridge spec JSON files to create cross-standard edges | forgeRunner |
| providerGenerator.js | Auto-generate askMilo provider.json + search CLI from graph schema | forgeRunner |
| graphHistory.js | Timestamped audit trail: records every load, embed, bridge, wipe | graphLoader, graphEmbedder, bridgeBuilder |
| voyageClient.js | Single Voyage AI embedding client (voyage-4, 1024d), THE source of truth | graphEmbedder, bridgeBuilder, graphSearchTool |

forgeRunner — The Shared Engine

Every forge CLI is ~30 lines of config that calls forgeRunner(config). forgeRunner handles:

Bootstrap (config loading, process.global, targets.ini)
Help text generation (dynamic from config)
Command dispatch (all standard actions)
Connection resolution (standalone container or external target)
Super-label application post-load

// Example: forge-edmatrix/forgeEdMatrix.js (entire file)
const forgeRunner = require('../../graph-forge-core/lib/forgeRunner');

forgeRunner({
    graphName: 'EdMatrix',
    superLabel: 'EdMatrix',
    toolPrefix: 'edmatrix',
    cliName: 'forgeEdMatrix',
    displayName: 'EdMatrix education standards graph',
    description: '35 standards across categories, layers, organizations',
    parser: require('./lib/parser'),
    sourceFileName: 'edmatrix-data.json',
    sourceConfigKey: 'forge-edmatrix',
    hasBridges: true,
    hasExport: true,
    forgeDir: __dirname
});

forgeRunner Config Options

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| graphName | string | yes | Internal name, used as _source and GraphSource name |
| superLabel | string | yes | Neo4j label applied to all nodes, used for per-source indexes |
| toolPrefix | string | yes | Prefix for generated askMilo tools (must be globally unique) |
| cliName | string | no | Display name in help text (default: forge{graphName}) |
| displayName | string | no | One-line description for help header |
| description | string | no | Full description for provider.json |
| parser | function | yes | Parser module: (sourcePath, options, callback) => {} |
| sourceFileName | string | no | Default source data filename |
| sourceConfigKey | string | no | Key in [sourceData] config section |
| sourceParamName | string | no | CLI param name override (default: 'source') |
| sourceRequired | boolean | no | If true, source path must be provided (no default) |
| hasBridges | boolean | no | Enable -bridge command |
| hasExport | boolean | no | Enable -export command (default: true) |
| forgeDir | string | yes | __dirname of the forge CLI (for resolving bridge specs, assets) |
| postLoadHook | function | no | Called after graphLoader: (parseResult, connInfo, callback) => {} |
| additionalLoadOptions | function | no | Returns extra parser options: (getVal, sourceDir) => ({}) |
| additionalSourceFiles | object | no | Extra source files: { resolutionMapPath: 'file.tsv' } |
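A quick way to catch a misconfigured forge CLI before calling forgeRunner is to check the required options and fill in the documented defaults. The sketch below is a hypothetical helper (`checkForgeConfig` is not part of graph-forge-core); it only encodes the required flags, types, and defaults listed in the table above.

```javascript
'use strict';

// Required options and their expected typeof, taken from the options table.
const REQUIRED = {
    graphName: 'string',
    superLabel: 'string',
    toolPrefix: 'string',
    parser: 'function',
    forgeDir: 'string'
};

// Hypothetical helper, not part of graph-forge-core: returns the config with
// documented defaults filled in, plus a list of problems (empty when valid).
const checkForgeConfig = (config) => {
    const problems = Object.keys(REQUIRED)
        .filter((key) => typeof config[key] !== REQUIRED[key])
        .map((key) => `${key} must be a ${REQUIRED[key]}`);

    const withDefaults = Object.assign(
        {
            cliName: `forge${config.graphName}`,
            sourceParamName: 'source',
            hasExport: true
        },
        config
    );
    return { problems, config: withDefaults };
};

const checked = checkForgeConfig({
    graphName: 'EdMatrix',
    superLabel: 'EdMatrix',
    toolPrefix: 'edmatrix',
    forgeDir: '/repo/cli/lib.d/forge-edmatrix' // hypothetical path
    // parser intentionally omitted to show a reported problem
});
console.log(checked.problems);
console.log(checked.config.cliName);
```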

voyageClient — Single Embedding Client

All Voyage AI calls go through this module. One model, one dimension, one place to change.

const voyageClient = require('./voyageClient');
// voyageClient.MODEL = 'voyage-4'
// voyageClient.DIMENSION = 1024
// voyageClient.BATCH_SIZE = 20
voyageClient.embed(['text1', 'text2'], apiKey, (err, embeddings) => {});
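The source does not say whether `voyageClient.embed` splits large inputs itself. As a sketch of what batching at `BATCH_SIZE` might look like in callback style, the hypothetical `embedInBatches` below processes texts sequentially in groups of 20 and preserves input order. A stand-in embedder is used so the example runs without an API key.

```javascript
'use strict';

const BATCH_SIZE = 20; // mirrors voyageClient.BATCH_SIZE from the snippet above

// Hypothetical helper: embed `texts` in sequential batches using any
// embedBatch(batch, callback) function, preserving input order.
const embedInBatches = (texts, embedBatch, callback) => {
    const embeddings = [];
    const next = (start) => {
        if (start >= texts.length) {
            return callback('', embeddings);
        }
        embedBatch(texts.slice(start, start + BATCH_SIZE), (err, batchResult) => {
            if (err) {
                return callback(err);
            }
            embeddings.push(...batchResult);
            next(start + BATCH_SIZE);
        });
    };
    next(0);
};

// Stand-in embedder: records batch sizes and returns one fake vector per text.
const batchSizes = [];
const fakeEmbedBatch = (batch, cb) => {
    batchSizes.push(batch.length);
    cb('', batch.map((text) => [text.length]));
};

const texts = Array.from({ length: 45 }, (_, i) => `text ${i}`);
let embedded = [];
embedInBatches(texts, fakeEmbedBatch, (err, result) => {
    if (err) { throw new Error(err); }
    embedded = result;
    console.log(batchSizes, result.length);
});
```

With the real client, the second argument could be `(batch, cb) => voyageClient.embed(batch, apiKey, cb)`.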

Key Conventions

Super-Labels and Indexes

Every source must define a super-label. Vector and fulltext indexes are created ON the super-label, not on :ForgedNode. This ensures bridge similarity queries search only the target source's embeddings.

| Source | Super-Label | Vector Index | Fulltext Index |
|--------|-------------|--------------|----------------|
| CEDS | :CEDS | ceds_vector | ceds_fulltext |
| SIF | :SifModel | sif_vector | sif_fulltext |
| EdMatrix | :EdMatrix | edmatrix_vector | edmatrix_fulltext |
| CareerStories | :CareerStoryModel | careerstories_vector | careerstories_fulltext |
| Himed | :HimedModel | himed_vector | himed_fulltext |
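To make the per-source scoping concrete, here is a sketch of the Cypher that indexes on a super-label might involve, using Neo4j 5 index syntax and the built-in `db.index.vector.queryNodes` / `db.index.fulltext.queryNodes` procedures. The helper names are hypothetical, and the `embedding` property name, cosine similarity, and the choice to index `name` alongside `description` are assumptions; graphEmbedder's actual statements may differ.

```javascript
'use strict';

// Hypothetical helper: Cypher for the per-source indexes. The index prefix is
// passed in because it is not always derived from the label (:SifModel -> sif_*).
// The `embedding` property name and cosine similarity are assumptions.
const indexStatements = (superLabel, prefix, dimension = 1024) => [
    `CREATE VECTOR INDEX ${prefix}_vector IF NOT EXISTS ` +
        `FOR (n:${superLabel}) ON (n.embedding) ` +
        'OPTIONS { indexConfig: { ' +
        `\`vector.dimensions\`: ${dimension}, ` +
        "`vector.similarity_function`: 'cosine' } }",
    `CREATE FULLTEXT INDEX ${prefix}_fulltext IF NOT EXISTS ` +
        `FOR (n:${superLabel}) ON EACH [n.name, n.description]`
];

// Scoped queries against those indexes, using Neo4j's built-in procedures.
const searchStatements = (prefix) => ({
    vector: `CALL db.index.vector.queryNodes('${prefix}_vector', $k, $embedding) ` +
        'YIELD node, score RETURN node._id AS id, score',
    fulltext: `CALL db.index.fulltext.queryNodes('${prefix}_fulltext', $text) ` +
        'YIELD node, score RETURN node._id AS id, score'
});

const statements = indexStatements('SifModel', 'sif');
statements.forEach((statement) => console.log(statement));
console.log(searchStatements('sif'));
```

Because each index is bound to one super-label, a query against `sif_vector` can only return :SifModel nodes, which is the scoping property the bridge similarity queries rely on.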

Node Labeling

Every node gets three labels:

:ForgedNode — universal marker
Source-specific (:CedsProperty, :SifField, :EdStandard)
Super-label (:CEDS, :SifModel, :EdMatrix)

Required Node Properties

Every node: _id (unique within source), _source (matches GraphSource name), name, description.

Parser Contract — Complete Reference

The parser is the ONLY custom code per data source. It reads a source file and passes a result object, containing a standard array of node objects plus metadata, to its callback. The graphLoader handles everything else.

Signature

module.exports = (sourcePath, options, callback) => {
    // sourcePath: absolute path to the source data file
    // options:    object with parser-specific settings (chunkStrategy, resolutionMapPath, etc.)
    // callback:   (error, result) — error is a string (truthy = failure), result is { nodes, metadata }

    callback('', {
        nodes: [ /* array of node objects */ ],
        metadata: { version: '1.0', sourceFormat: 'json' }
    });
};
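As an illustration of the signature, a minimal parser for a JSON array source might look like the sketch below. The record fields (`name`, `description`) and the `standard-` id prefix are assumptions about a hypothetical source file; the transformation is kept in a separate pure function (`recordsToNodes`, a made-up name) so it can be exercised without touching the filesystem.

```javascript
'use strict';

const fs = require('fs');

const slugify = (str) => str.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/(^-|-$)/g, '');

// Pure transformation: records -> node objects. The record fields
// (name, description) are assumptions about a hypothetical JSON source.
const recordsToNodes = (records) => records.map((record) => ({
    id: `standard-${slugify(record.name)}`,
    label: 'EdStandard',
    properties: { name: record.name, description: record.description },
    edges: []
}));

// Parser with the documented signature: (sourcePath, options, callback).
// Errors are reported as non-empty strings, success as an empty string.
const parser = (sourcePath, options, callback) => {
    fs.readFile(sourcePath, 'utf8', (readErr, text) => {
        if (readErr) {
            return callback(`Cannot read ${sourcePath}: ${readErr.message}`);
        }
        let records;
        try {
            records = JSON.parse(text);
        } catch (parseErr) {
            return callback(`Invalid JSON in ${sourcePath}: ${parseErr.message}`);
        }
        callback('', {
            nodes: recordsToNodes(records),
            metadata: { version: '1.0', sourceFormat: 'json' }
        });
    });
};

module.exports = parser;

const sampleNodes = recordsToNodes([{ name: 'CEDS', description: 'Example description text' }]);
console.log(sampleNodes[0].id);
```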

The Node Object

Each element in the nodes array must have this shape:

{
    id: 'unique-within-source',       // REQUIRED. String. Becomes _id in Neo4j.
                                       // Must be unique within THIS source (not globally).
                                       // Convention: 'type-slugified-name' (e.g., 'standard-ceds', 'type-organizational')

    label: 'EdStandard',              // REQUIRED. String. Becomes a Neo4j node label.
                                       // Use PascalCase. This is the source-specific label
                                       // (the super-label and :ForgedNode are added automatically).

    properties: {
        name: 'CEDS',                 // REQUIRED. String. Human-readable display name.
                                       // Used by -explore, displayed in search results.

        description: 'Common vocab…', // REQUIRED. String. Searchable text.
                                       // Indexed for BM25 fulltext search.
                                       // Embedded as a vector for semantic search.
                                       // Make this as descriptive as possible — it drives search quality.

        // Any additional properties are preserved as-is on the Neo4j node:
        url: 'http://ceds.ed.gov/',
        org: 'US Ed',
        types: 'Organizational, Personal, Event',
        // Numbers, booleans, arrays of strings — all valid Neo4j property types.
    },

    edges: [                           // OPTIONAL. Array of outgoing relationships.
        {
            type: 'HAS_TYPE',         // REQUIRED. String. Neo4j relationship type.
                                       // Convention: UPPER_SNAKE_CASE.

            targetId: 'type-organizational',  // REQUIRED. String. The `id` of another node in this
                                               // same source. If no top-level node has it, see targetLabel.

            targetLabel: 'DataCategory',      // OPTIONAL. String. If the target node doesn't exist
                                               // as a top-level node, the loader auto-creates a
                                               // minimal node with this label. Useful for creating
                                               // category/type nodes from edge references.

            properties: {}             // OPTIONAL. Object. Properties set on the relationship.
        }
    ]
}

What the Loader Does With Each Node

Creates the node with labels :ForgedNode:YourLabel (super-label added post-load)
Sets _id from node.id
Sets _source from the forge's graphName config
Copies all node.properties as Neo4j properties
For each edge, matches source and target by _id + _source and creates the relationship
If an edge's targetId doesn't match any top-level node, creates a minimal target node with the targetLabel
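The steps above can be modeled in memory to see what a given parser result would turn into. This is an illustration only: `planLoad` is a hypothetical function, the real loader writes to Neo4j, and the behaviour for a missing target without a `targetLabel` is not documented, so the sketch simply records the relationship in that case.

```javascript
'use strict';

// In-memory illustration only; the real loader writes to Neo4j.
// Returns the nodes and relationships the documented steps would produce.
const planLoad = (parsedNodes, graphName) => {
    const planned = new Map();
    parsedNodes.forEach((node) => {
        planned.set(node.id, {
            labels: ['ForgedNode', node.label], // super-label is added post-load
            properties: Object.assign({}, node.properties, { _id: node.id, _source: graphName })
        });
    });

    const relationships = [];
    parsedNodes.forEach((node) => {
        (node.edges || []).forEach((edge) => {
            if (!planned.has(edge.targetId) && edge.targetLabel) {
                // Auto-created minimal target: the id doubles as the name.
                planned.set(edge.targetId, {
                    labels: ['ForgedNode', edge.targetLabel],
                    properties: { _id: edge.targetId, _source: graphName, name: edge.targetId }
                });
            }
            relationships.push({
                from: node.id,
                type: edge.type,
                to: edge.targetId,
                properties: edge.properties || {}
            });
        });
    });
    return { nodes: Array.from(planned.values()), relationships };
};

const plan = planLoad([
    {
        id: 'standard-ceds',
        label: 'EdStandard',
        properties: { name: 'CEDS', description: '...' },
        edges: [{ type: 'HAS_TYPE', targetId: 'type-organizational', targetLabel: 'DataCategory' }]
    }
], 'EdMatrix');
console.log(JSON.stringify(plan, null, 2));
```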

Edge Auto-Creation

You don't need to create both ends of a relationship as top-level nodes. If your data has standards with types, you can define type nodes inline via edges:

// This standard references a DataCategory that may not exist yet
{
    id: 'standard-ceds',
    label: 'EdStandard',
    properties: { name: 'CEDS', description: '...' },
    edges: [
        { type: 'HAS_TYPE', targetId: 'type-organizational', targetLabel: 'DataCategory' }
    ]
}

If type-organizational doesn't appear as a top-level node, the loader creates:

(:ForgedNode:DataCategory { _id: 'type-organizational', _source: 'EdMatrix', name: 'type-organizational' })

But it's better to create explicit top-level nodes with proper names and descriptions:

// Explicit node — better for search and display
{ id: 'type-organizational', label: 'DataCategory', properties: { name: 'Organizational', description: 'Data category: Organizational' }, edges: [] },
// Then reference it from the standard
{ id: 'standard-ceds', label: 'EdStandard', properties: { name: 'CEDS', description: '...' }, edges: [
    { type: 'HAS_TYPE', targetId: 'type-organizational', targetLabel: 'DataCategory' }
] }

The Metadata Object

metadata: {
    version: '1.0',           // Version of the source data (shown in GraphSource node)
    sourceFormat: 'json'       // Format identifier (json, rdf, tsv, txt, etc.)
}

Rules

Parser NEVER touches Neo4j. No require('neo4j-driver'). No database connections. Pure data transformation.
Parser NEVER imports graph-forge-core modules. It's standalone — testable without Docker, without a running database.
Use callback pattern. callback(errorString, result). Error is a truthy string on failure, empty string on success.
No async/await. Use callbacks per TQ coding standards.
Every node must have id, label, properties.name, properties.description. The loader and embedder depend on these.
IDs must be unique within the source. Use slugified compound names: 'standard-ceds', 'type-organizational', 'layer-data-dictionary'.
Edge targetIds must match an id in the same source. Cross-source edges are built by bridge specs, not parsers.
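Several of these rules can be checked mechanically before loading. The following is a hypothetical pre-load check (`validateNodes` is not part of graph-forge-core) that covers the required fields, id uniqueness, and edge targets. It treats an unknown target as acceptable when a `targetLabel` is present, on the assumption that the loader will auto-create it.

```javascript
'use strict';

// Hypothetical pre-load check, not part of graph-forge-core. Returns a list
// of human-readable problems; an empty list means the checked rules hold.
const validateNodes = (nodes) => {
    const problems = [];
    const ids = new Set();

    nodes.forEach((node, index) => {
        const where = `node[${index}] (${node.id || 'no id'})`;
        const props = node.properties || {};
        if (typeof node.id !== 'string' || node.id === '') { problems.push(`${where}: missing id`); }
        if (typeof node.label !== 'string' || node.label === '') { problems.push(`${where}: missing label`); }
        if (!props.name) { problems.push(`${where}: missing properties.name`); }
        if (!props.description) { problems.push(`${where}: missing properties.description`); }
        if (ids.has(node.id)) { problems.push(`${where}: duplicate id`); }
        ids.add(node.id);
    });

    nodes.forEach((node) => {
        (node.edges || []).forEach((edge) => {
            // An unknown target is acceptable only when targetLabel allows auto-creation.
            if (!ids.has(edge.targetId) && !edge.targetLabel) {
                problems.push(`${node.id}: edge ${edge.type} targets unknown id ${edge.targetId}`);
            }
        });
    });
    return problems;
};

const report = validateNodes([
    { id: 'standard-ceds', label: 'EdStandard', properties: { name: 'CEDS', description: 'Example description' }, edges: [
        { type: 'HAS_TYPE', targetId: 'type-missing' }
    ] },
    { id: 'standard-ceds', label: 'EdStandard', properties: { name: 'CEDS' }, edges: [] }
]);
console.log(report);
```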

Common Patterns

Deduplication: When multiple source records reference the same entity (e.g., many standards published by "1EdTech"), use a Set to track what you've already emitted:

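// Assumes rawData entries look like { name, description, org } and that
// slugify is defined as shown under "Slugification".
const nodes = [];
const seenOrgs = new Set();
rawData.forEach((entry) => {
    const standardNode = { id: `standard-${slugify(entry.name)}`, label: 'EdStandard', properties: { name: entry.name, description: entry.description }, edges: [] };
    const orgId = `org-${slugify(entry.org)}`;
    if (!seenOrgs.has(orgId)) {
        // First time this org is seen: emit its node exactly once
        seenOrgs.add(orgId);
        nodes.push({ id: orgId, label: 'Organization', properties: { name: entry.org, description: `Publisher: ${entry.org}` }, edges: [] });
    }
    // Edge from standard to org
    standardNode.edges.push({ type: 'PUBLISHED_BY', targetId: orgId, targetLabel: 'Organization' });
    nodes.push(standardNode);
});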

Slugification: Convert names to URL-safe lowercase IDs:

const slugify = (str) => str.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/(^-|-$)/g, '');

Testing: Run the parser standalone to verify counts before loading:

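const parser = require('./lib/parser');
parser('/path/to/data.json', {}, (err, result) => {
    if (err) {
        // err is a non-empty string on failure; do not touch result in that case
        console.error(`Parser failed: ${err}`);
        return;
    }
    console.log(`Nodes: ${result.nodes.length}`);
    // Count nodes per label to sanity-check the parse
    const labels = {};
    result.nodes.forEach(n => { labels[n.label] = (labels[n.label] || 0) + 1; });
    console.log('Labels:', labels);
});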

Real Parser Examples

| Parser | Lines | Source Format | Nodes Produced | Complexity |
|--------|-------|---------------|----------------|------------|
| forge-edmatrix | 87 | JSON array of standards | 83 (standards + categories + layers + orgs + formats) | Simple |
| forge-ceds-rdf | 200 | RDF/XML ontology | 23,238 (classes + properties + option sets + values) | Complex (XML parsing) |
| forge-sif-tsv | 614 | TSV specification + resolution map | 23,005 (objects + fields + types + codesets + XML elements) | Complex (RefId resolution) |
| forge-career-stories | 223 | Text documents (glob) | Variable (documents + chunks) | Medium (text chunking) |
| forge-himed | 150 | Tab-delimited CRM export | ~26,000 (call reports + customers + contacts) | Medium (deduplication) |

Directory Structure

cli/
  graph-forge-core/           ← THIS LIBRARY (peer of lib.d/, not inside it)
    lib/
      forgeRunner.js          ← shared CLI engine
      containerManager.js     ← Docker Neo4j lifecycle
      graphLoader.js          ← nodes → Neo4j
      graphEmbedder.js        ← Voyage embeddings + indexes
      graphSearchTool.js      ← search, stats, explore, rawCypher, history
      bridgeBuilder.js        ← cross-standard edge creation
      providerGenerator.js    ← askMilo provider.json generation
      graphHistory.js         ← audit trail
      voyageClient.js         ← Voyage AI client (single source of truth)
    package.json              ← dependencies (neo4j-driver, qtools-*)
    node_modules/
  lib.d/
    forge-edmatrix/           ← 30 lines + parser
    forge-ceds-rdf/           ← 30 lines + parser (will be refactored)
    forge-sif-tsv/            ← 30 lines + parser (will be refactored)
    forge-career-stories/     ← 30 lines + parser (will be refactored)
    forge-himed/              ← 30 lines + parser (will be refactored)
    rebuild-dme/              ← orchestration script
    rebuild-himed/            ← orchestration script

How Forge CLIs Require This Library

From cli/lib.d/forge-xxx/forgeXxx.js:

const forgeRunner = require('../../graph-forge-core/lib/forgeRunner');

The path ../../graph-forge-core/ goes up from lib.d/forge-xxx/ to cli/, then into graph-forge-core/.
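The resolution can be checked with Node's `path` module. The checkout location `/repo` below is a made-up example; only the relative layout matters.

```javascript
'use strict';

const path = require('path');

// Hypothetical checkout location; only the relative layout matters.
const forgeCliDir = '/repo/cli/lib.d/forge-xxx';

// require() resolves relative paths against the requiring file's directory.
const resolved = path.posix.resolve(forgeCliDir, '../../graph-forge-core/lib/forgeRunner');
console.log(resolved);
```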