qtools-graph-forge-core
v1.4.6
Shared infrastructure for building queryable Neo4j graph databases from source data — container management, loading, embedding, search, bridges, and askMilo tool generation
graph-forge-core
The engine that turns source data files into queryable Neo4j graph databases with semantic search.
What It Does
Input: A source data file in any format (JSON, RDF/XML, TSV, text documents, CRM exports) plus a parser that converts it to a standard node/edge format.
Transformation: Loads nodes and edges into a Neo4j container, generates Voyage AI vector embeddings for semantic search, creates BM25 fulltext indexes for keyword search, builds cross-standard bridge edges between data sources, and generates askMilo provider.json so the graph is immediately queryable through the web UI.
Infrastructure: Each graph runs in a Docker container with Neo4j Community Edition. The containerManager creates containers on demand (docker run with --restart unless-stopped), scans for available ports starting at 7700 (using docker ps to avoid collisions), and manages the full lifecycle (start, stop, status, destroy). Docker Desktop must be running. Each container uses ~300-700MB RAM.
Standalone containers (for development/testing) are prefixed gf_ (e.g., gf_edmatrix). Production data goes into the shared rag_DataModelExplorer container via the --target=dme export path.
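The port scan described above can be sketched roughly as follows. This is an illustration only, not the containerManager implementation: `parseHostPorts`, `findFreePort`, and the scan span of 100 are hypothetical, and the input is assumed to look like the output of `docker ps --format '{{.Ports}}'`.

```javascript
// Hypothetical helpers; names and the scan span are assumptions for illustration.

// Extract published host ports from `docker ps --format '{{.Ports}}'` style output,
// e.g. "0.0.0.0:7700->7687/tcp, :::7700->7687/tcp".
const parseHostPorts = (psOutput) => {
  const ports = new Set();
  const pattern = /:(\d+)->/g;
  let match = pattern.exec(psOutput);
  while (match !== null) {
    ports.add(Number(match[1]));
    match = pattern.exec(psOutput);
  }
  return ports;
};

// Return the first port at or above `start` that is not in `usedPorts`,
// or null if none is free within `span` candidates.
const findFreePort = (usedPorts, start = 7700, span = 100) => {
  for (let port = start; port < start + span; port += 1) {
    if (!usedPorts.has(port)) {
      return port;
    }
  }
  return null;
};

const sampleOutput = [
  '0.0.0.0:7700->7687/tcp, 0.0.0.0:7701->7474/tcp',
  '0.0.0.0:7702->7687/tcp'
].join('\n');

const nextPort = findFreePort(parseHostPorts(sampleOutput));
console.log(nextPort); // 7703
```

Collecting the used ports into a Set first keeps the scan itself a simple membership check, and duplicate IPv4/IPv6 mappings for the same port collapse naturally.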
Output: A running Neo4j graph database with:
- Labeled nodes with properties, relationships, and vector embeddings
- Per-source fulltext and vector indexes for scoped search
- Cross-standard bridge edges (similarity-based and rule-based)
- askMilo tool definitions (search, explore, stats, raw Cypher, history)
- A timestamped audit trail of every operation
The only custom code per data source is a parser (~50-100 lines) that reads the source format and produces standard node objects. Everything else — container management, loading, embedding, indexing, searching, bridge building, provider generation — is handled by this library.
This is NOT a CLI module — it lives at cli/graph-forge-core/ (peer of cli/lib.d/, not inside it) and is require()'d by forge CLIs via relative path.
Modules
| Module | Purpose | Used By |
|--------|---------|---------|
| forgeRunner.js | Shared CLI engine — config-driven dispatcher for all forge actions | Every forge CLI (30-line config → full CLI) |
| containerManager.js | Docker Neo4j lifecycle: create, start, stop, destroy, port scanning | forgeRunner, rebuildDme |
| graphLoader.js | Parser output → Neo4j nodes/edges with dual labels + GraphSource root | forgeRunner |
| graphEmbedder.js | Voyage-4 vector embeddings + per-source fulltext/vector index creation | forgeRunner |
| graphSearchTool.js | Generic hybrid BM25 + vector search, stats, explore, rawCypher, history | forgeRunner, generated search CLIs |
| bridgeBuilder.js | Execute bridge spec JSON files to create cross-standard edges | forgeRunner |
| providerGenerator.js | Auto-generate askMilo provider.json + search CLI from graph schema | forgeRunner |
| graphHistory.js | Timestamped audit trail — records every load, embed, bridge, wipe | graphLoader, graphEmbedder, bridgeBuilder |
| voyageClient.js | Single Voyage AI embedding client (voyage-4, 1024d) — THE source of truth | graphEmbedder, bridgeBuilder, graphSearchTool |
forgeRunner — The Shared Engine
Every forge CLI is ~30 lines of config that calls forgeRunner(config). forgeRunner handles:
- Bootstrap (config loading, process.global, targets.ini)
- Help text generation (dynamic from config)
- Command dispatch (all standard actions)
- Connection resolution (standalone container or external target)
- Super-label application post-load
// Example: forge-edmatrix/forgeEdMatrix.js (entire file)
const forgeRunner = require('../../graph-forge-core/lib/forgeRunner');
forgeRunner({
  graphName: 'EdMatrix',
  superLabel: 'EdMatrix',
  toolPrefix: 'edmatrix',
  cliName: 'forgeEdMatrix',
  displayName: 'EdMatrix education standards graph',
  description: '35 standards across categories, layers, organizations',
  parser: require('./lib/parser'),
  sourceFileName: 'edmatrix-data.json',
  sourceConfigKey: 'forge-edmatrix',
  hasBridges: true,
  hasExport: true,
  forgeDir: __dirname
});
forgeRunner Config Options
| Option | Type | Required | Description |
|--------|------|----------|-------------|
| graphName | string | yes | Internal name, used as _source and GraphSource name |
| superLabel | string | yes | Neo4j label applied to all nodes, used for per-source indexes |
| toolPrefix | string | yes | Prefix for generated askMilo tools (must be globally unique) |
| cliName | string | no | Display name in help text (default: forge{graphName}) |
| displayName | string | no | One-line description for help header |
| description | string | no | Full description for provider.json |
| parser | function | yes | Parser module: (sourcePath, options, callback) => {} |
| sourceFileName | string | no | Default source data filename |
| sourceConfigKey | string | no | Key in [sourceData] config section |
| sourceParamName | string | no | CLI param name override (default: 'source') |
| sourceRequired | boolean | no | If true, source path must be provided (no default) |
| hasBridges | boolean | no | Enable -bridge command |
| hasExport | boolean | no | Enable -export command (default: true) |
| forgeDir | string | yes | __dirname of the forge CLI (for resolving bridge specs, assets) |
| postLoadHook | function | no | Called after graphLoader: (parseResult, connInfo, callback) => {} |
| additionalLoadOptions | function | no | Returns extra parser options: (getVal, sourceDir) => ({}) |
| additionalSourceFiles | object | no | Extra source files: { resolutionMapPath: 'file.tsv' } |
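The required flags and defaults in the table can be expressed as a small check. The sketch below is an assumption about how such validation might look, not forgeRunner's own code: `normalizeForgeConfig` and its error wording are hypothetical, while the required keys and the three defaults come from the table.

```javascript
// Hypothetical sketch: required keys and defaults follow the options table;
// the function name and error format are assumptions.
const REQUIRED_OPTIONS = ['graphName', 'superLabel', 'toolPrefix', 'parser', 'forgeDir'];

const normalizeForgeConfig = (config) => {
  const missing = REQUIRED_OPTIONS.filter(
    (key) => config[key] === undefined || config[key] === ''
  );
  if (missing.length > 0) {
    return { error: `Missing required options: ${missing.join(', ')}`, config: null };
  }
  if (typeof config.parser !== 'function') {
    return { error: 'parser must be a function', config: null };
  }
  return {
    error: '',
    config: {
      // Defaults first, so explicit values in `config` override them.
      cliName: `forge${config.graphName}`,
      sourceParamName: 'source',
      hasExport: true,
      ...config
    }
  };
};

const normalized = normalizeForgeConfig({
  graphName: 'EdMatrix',
  superLabel: 'EdMatrix',
  toolPrefix: 'edmatrix',
  parser: (sourcePath, options, callback) => callback('', { nodes: [], metadata: {} }),
  forgeDir: '/path/to/forge-edmatrix'
});
console.log(normalized.config.cliName); // forgeEdMatrix
```

The return shape mirrors the library's convention of an empty error string on success.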
voyageClient — Single Embedding Client
All Voyage AI calls go through this module. One model, one dimension, one place to change.
const voyageClient = require('./voyageClient');
// voyageClient.MODEL = 'voyage-4'
// voyageClient.DIMENSION = 1024
// voyageClient.BATCH_SIZE = 20
voyageClient.embed(['text1', 'text2'], apiKey, (err, embeddings) => {});
Key Conventions
Super-Labels and Indexes
Every source must define a super-label. Vector and fulltext indexes are created ON the super-label, not on :ForgedNode. This ensures bridge similarity queries search only the target source's embeddings.
| Source | Super-Label | Vector Index | Fulltext Index |
|--------|------------|--------------|----------------|
| CEDS | :CEDS | ceds_vector | ceds_fulltext |
| SIF | :SifModel | sif_vector | sif_fulltext |
| EdMatrix | :EdMatrix | edmatrix_vector | edmatrix_fulltext |
| CareerStories | :CareerStoryModel | careerstories_vector | careerstories_fulltext |
| Himed | :HimedModel | himed_vector | himed_fulltext |
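To make the naming pattern in the table concrete, the sketch below builds index-creation statements scoped to a super-label. It only produces Cypher text and is not graphEmbedder's code. The `embedding` property name, cosine similarity, the indexed fulltext properties, and passing the index prefix as a separate argument are all assumptions. The `CREATE VECTOR INDEX` form is the one available in recent Neo4j 5 releases.

```javascript
// Hypothetical sketch: builds Cypher text only, nothing is executed.
// Assumptions: the `embedding` property, cosine similarity, and the
// [name, description] fulltext properties.
const buildIndexStatements = (superLabel, indexPrefix, dimension = 1024) => ({
  vector: [
    `CREATE VECTOR INDEX ${indexPrefix}_vector IF NOT EXISTS`,
    `FOR (n:${superLabel}) ON (n.embedding)`,
    "OPTIONS { indexConfig: { `vector.dimensions`: " + dimension +
      ", `vector.similarity_function`: 'cosine' } }"
  ].join(' '),
  fulltext: [
    `CREATE FULLTEXT INDEX ${indexPrefix}_fulltext IF NOT EXISTS`,
    `FOR (n:${superLabel}) ON EACH [n.name, n.description]`
  ].join(' ')
});

const statements = buildIndexStatements('SifModel', 'sif');
console.log(statements.vector);
console.log(statements.fulltext);
```

Because each index is declared `FOR (n:SifModel)` rather than for `:ForgedNode`, a similarity query against `sif_vector` can only return SIF nodes, which is the scoping behavior the convention is meant to guarantee.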
Node Labeling
Every node gets three labels:
- :ForgedNode: universal marker
- Source-specific (:CedsProperty, :SifField, :EdStandard)
- Super-label (:CEDS, :SifModel, :EdMatrix)
Required Node Properties
Every node: _id (unique within source), _source (matches GraphSource name), name, description.
Parser Contract — Complete Reference
The parser is the ONLY custom code per data source. It reads a source file and returns a standard array of node objects. The graphLoader handles everything else.
Signature
module.exports = (sourcePath, options, callback) => {
  // sourcePath: absolute path to the source data file
  // options: object with parser-specific settings (chunkStrategy, resolutionMapPath, etc.)
  // callback: (error, result), where error is a string (truthy = failure) and result is { nodes, metadata }
  callback('', {
    nodes: [ /* array of node objects */ ],
    metadata: { version: '1.0', sourceFormat: 'json' }
  });
};
The Node Object
Each element in the nodes array must have this shape:
{
id: 'unique-within-source', // REQUIRED. String. Becomes _id in Neo4j.
// Must be unique within THIS source (not globally).
// Convention: 'type-slugified-name' (e.g., 'standard-ceds', 'type-organizational')
label: 'EdStandard', // REQUIRED. String. Becomes a Neo4j node label.
// Use PascalCase. This is the source-specific label
// (the super-label and :ForgedNode are added automatically).
properties: {
name: 'CEDS', // REQUIRED. String. Human-readable display name.
// Used by -explore, displayed in search results.
description: 'Common vocab…', // REQUIRED. String. Searchable text.
// Indexed for BM25 fulltext search.
// Embedded as a vector for semantic search.
// Make this as descriptive as possible — it drives search quality.
// Any additional properties are preserved as-is on the Neo4j node:
url: 'http://ceds.ed.gov/',
org: 'US Ed',
types: 'Organizational, Personal, Event',
// Numbers, booleans, arrays of strings — all valid Neo4j property types.
},
edges: [ // OPTIONAL. Array of outgoing relationships.
{
type: 'HAS_TYPE', // REQUIRED. String. Neo4j relationship type.
// Convention: UPPER_SNAKE_CASE.
targetId: 'type-organizational', // REQUIRED. String. Must match the `id` of another
// node in this same nodes array.
targetLabel: 'DataCategory', // OPTIONAL. String. If the target node doesn't exist
// as a top-level node, the loader auto-creates a
// minimal node with this label. Useful for creating
// category/type nodes from edge references.
properties: {} // OPTIONAL. Object. Properties set on the relationship.
}
]
}
What the Loader Does With Each Node
- Creates the node with labels :ForgedNode:YourLabel (super-label added post-load)
- Sets _id from node.id
- Sets _source from the forge's graphName config
- Copies all node.properties as Neo4j properties
- For each edge, matches source and target by _id + _source and creates the relationship
- If an edge's targetId doesn't match any top-level node, creates a minimal target node with the targetLabel
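The steps above can be sketched as a function that turns parser output into parameterized Cypher statements. This is a hedged illustration, not graphLoader's implementation: `planLoad`, the MERGE-based statement shapes, and the label check are assumptions. Nothing is sent to a database.

```javascript
// Hypothetical sketch of the loader steps; it only builds Cypher text and parameters.
const LABEL_PATTERN = /^[A-Za-z_][A-Za-z0-9_]*$/;

const planLoad = (nodes, graphName) => {
  const knownIds = new Set(nodes.map((node) => node.id));
  const statements = [];

  const addNode = (id, label, properties) => {
    // Labels cannot be passed as Cypher parameters, so validate before interpolating.
    if (!LABEL_PATTERN.test(label)) {
      throw new Error(`Unsafe label: ${label}`);
    }
    statements.push({
      cypher: `MERGE (n:ForgedNode:${label} { _id: $id, _source: $source }) SET n += $props`,
      params: { id, source: graphName, props: properties }
    });
  };

  nodes.forEach((node) => addNode(node.id, node.label, node.properties));

  nodes.forEach((node) => {
    (node.edges || []).forEach((edge) => {
      if (!knownIds.has(edge.targetId) && edge.targetLabel) {
        // Minimal auto-created target: its name falls back to the id.
        knownIds.add(edge.targetId);
        addNode(edge.targetId, edge.targetLabel, { name: edge.targetId });
      }
      if (!LABEL_PATTERN.test(edge.type)) {
        throw new Error(`Unsafe relationship type: ${edge.type}`);
      }
      statements.push({
        cypher: 'MATCH (a:ForgedNode { _id: $from, _source: $source }) ' +
          'MATCH (b:ForgedNode { _id: $to, _source: $source }) ' +
          `MERGE (a)-[r:${edge.type}]->(b) SET r += $props`,
        params: { from: node.id, to: edge.targetId, source: graphName, props: edge.properties || {} }
      });
    });
  });

  return statements;
};

const plan = planLoad([
  {
    id: 'standard-ceds',
    label: 'EdStandard',
    properties: { name: 'CEDS', description: 'Common vocabulary' },
    edges: [{ type: 'HAS_TYPE', targetId: 'type-organizational', targetLabel: 'DataCategory' }]
  }
], 'EdMatrix');
console.log(plan.length); // 3: the standard, the auto-created category, the edge
```

Matching both ends of a relationship on `_id` plus `_source` is what keeps parser-defined edges inside a single source.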
Edge Auto-Creation
You don't need to create both ends of a relationship as top-level nodes. If your data has standards with types, you can define type nodes inline via edges:
// This standard references a DataCategory that may not exist yet
{
id: 'standard-ceds',
label: 'EdStandard',
properties: { name: 'CEDS', description: '...' },
edges: [
{ type: 'HAS_TYPE', targetId: 'type-organizational', targetLabel: 'DataCategory' }
]
}
If type-organizational doesn't appear as a top-level node, the loader creates:
(:ForgedNode:DataCategory { _id: 'type-organizational', _source: 'EdMatrix', name: 'type-organizational' })
But it's better to create explicit top-level nodes with proper names and descriptions:
// Explicit node — better for search and display
{ id: 'type-organizational', label: 'DataCategory', properties: { name: 'Organizational', description: 'Data category: Organizational' }, edges: [] },
// Then reference it from the standard
{ id: 'standard-ceds', label: 'EdStandard', properties: { name: 'CEDS', description: '...' }, edges: [
  { type: 'HAS_TYPE', targetId: 'type-organizational', targetLabel: 'DataCategory' }
] }
The Metadata Object
metadata: {
  version: '1.0',        // Version of the source data (shown in GraphSource node)
  sourceFormat: 'json'   // Format identifier (json, rdf, tsv, txt, etc.)
}
Rules
- Parser NEVER touches Neo4j. No require('neo4j-driver'). No database connections. Pure data transformation.
- Parser NEVER imports graph-forge-core modules. It's standalone: testable without Docker, without a running database.
- Use callback pattern. callback(errorString, result). Error is a truthy string on failure, empty string on success.
- No async/await. Use callbacks per TQ coding standards.
- Every node must have id, label, properties.name, properties.description. The loader and embedder depend on these.
- IDs must be unique within the source. Use slugified compound names: 'standard-ceds', 'type-organizational', 'layer-data-dictionary'.
- Edge targetIds must match an id in the same source. Cross-source edges are built by bridge specs, not parsers.
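Several of these rules can be checked mechanically before loading. The validator below is a sketch under stated assumptions: `validateParserResult` and its message wording are hypothetical and not part of the library. It treats an edge target that is missing from the top-level nodes as acceptable only when a targetLabel allows auto-creation.

```javascript
// Hypothetical validator for the rules above. Returns an array of problems
// (an empty array means the parser output passed these checks).
const validateParserResult = (result) => {
  const nodes = (result && Array.isArray(result.nodes)) ? result.nodes : null;
  if (!nodes) {
    return ['result.nodes must be an array'];
  }

  const problems = [];
  const seenIds = new Set();
  nodes.forEach((node, index) => {
    const where = `node[${index}]`;
    const props = node.properties || {};
    if (typeof node.id !== 'string' || node.id === '') { problems.push(`${where}: missing id`); }
    if (typeof node.label !== 'string' || node.label === '') { problems.push(`${where}: missing label`); }
    if (!props.name) { problems.push(`${where}: missing properties.name`); }
    if (!props.description) { problems.push(`${where}: missing properties.description`); }
    if (seenIds.has(node.id)) { problems.push(`${where}: duplicate id ${node.id}`); }
    seenIds.add(node.id);
  });

  nodes.forEach((node) => {
    (node.edges || []).forEach((edge) => {
      if (!seenIds.has(edge.targetId) && !edge.targetLabel) {
        problems.push(`${node.id}: edge ${edge.type} targets unknown id ${edge.targetId}`);
      }
    });
  });

  return problems;
};

const problems = validateParserResult({
  nodes: [
    {
      id: 'standard-ceds',
      label: 'EdStandard',
      properties: { name: 'CEDS', description: 'Common vocabulary' },
      edges: [{ type: 'HAS_TYPE', targetId: 'type-missing' }]
    },
    { id: 'standard-ceds', label: 'EdStandard', properties: { name: 'Duplicate' } }
  ]
});
console.log(problems.length); // 3
```

Since the parser never touches Neo4j, a check like this can run in a plain Node process as part of the standalone parser test.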
Common Patterns
Deduplication: When multiple source records reference the same entity (e.g., many standards published by "1EdTech"), use a Set to track what you've already emitted:
const nodes = [];
const seenOrgs = new Set();
rawData.forEach((entry) => {
  // standardNode is the node already built for this entry (construction not shown)
  const orgId = `org-${slugify(entry.org)}`;
  if (!seenOrgs.has(orgId)) {
    // First time this organization appears: emit its node exactly once
    seenOrgs.add(orgId);
    nodes.push({ id: orgId, label: 'Organization', properties: { name: entry.org, description: `Publisher: ${entry.org}` }, edges: [] });
  }
  // Edge from standard to org
  standardNode.edges.push({ type: 'PUBLISHED_BY', targetId: orgId, targetLabel: 'Organization' });
});
Slugification: Convert names to URL-safe lowercase IDs:
const slugify = (str) => str.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/(^-|-$)/g, '');
Testing: Run the parser standalone to verify counts before loading:
const parser = require('./lib/parser');
parser('/path/to/data.json', {}, (err, result) => {
  // The parser reports failure as a non-empty error string
  if (err) {
    console.error(`Parser failed: ${err}`);
    return;
  }
  console.log(`Nodes: ${result.nodes.length}`);
  // Count nodes per label
  const labels = {};
  result.nodes.forEach((n) => { labels[n.label] = (labels[n.label] || 0) + 1; });
  console.log('Labels:', labels);
});
Real Parser Examples
| Parser | Lines | Source Format | Nodes Produced | Complexity |
|--------|-------|---------------|----------------|------------|
| forge-edmatrix | 87 | JSON array of standards | 83 (standards + categories + layers + orgs + formats) | Simple |
| forge-ceds-rdf | 200 | RDF/XML ontology | 23,238 (classes + properties + option sets + values) | Complex (XML parsing) |
| forge-sif-tsv | 614 | TSV specification + resolution map | 23,005 (objects + fields + types + codesets + XML elements) | Complex (RefId resolution) |
| forge-career-stories | 223 | Text documents (glob) | Variable (documents + chunks) | Medium (text chunking) |
| forge-himed | 150 | Tab-delimited CRM export | ~26,000 (call reports + customers + contacts) | Medium (deduplication) |
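For comparison with the parsers listed above, here is a much smaller, hypothetical parser that follows the contract end to end. It is a sketch, not one of the real forge parsers: the input record shape (`name`, `description`, `org`) and the sample data are invented for illustration. It reads the file synchronously so the callback fires before the function returns.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const slugify = (str) => str.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/(^-|-$)/g, '');

// Hypothetical parser: assumes the source is a JSON array of
// { name, description, org } records. Field names are illustrative.
const parser = (sourcePath, options, callback) => {
  let rawData;
  try {
    rawData = JSON.parse(fs.readFileSync(sourcePath, 'utf8'));
  } catch (err) {
    callback(`Could not read ${sourcePath}: ${err.message}`);
    return;
  }

  const nodes = [];
  const seenOrgs = new Set();
  rawData.forEach((entry) => {
    const orgId = `org-${slugify(entry.org)}`;
    if (!seenOrgs.has(orgId)) {
      seenOrgs.add(orgId);
      nodes.push({ id: orgId, label: 'Organization', properties: { name: entry.org, description: `Publisher: ${entry.org}` }, edges: [] });
    }
    nodes.push({
      id: `standard-${slugify(entry.name)}`,
      label: 'EdStandard',
      properties: { name: entry.name, description: entry.description },
      edges: [{ type: 'PUBLISHED_BY', targetId: orgId, targetLabel: 'Organization' }]
    });
  });

  callback('', { nodes, metadata: { version: '1.0', sourceFormat: 'json' } });
};

// Exercise the parser against a temporary file, without Docker or Neo4j.
const samplePath = path.join(os.tmpdir(), `parser-sketch-${process.pid}.json`);
fs.writeFileSync(samplePath, JSON.stringify([
  { name: 'Alpha Standard', description: 'First sample record', org: 'Example Org' },
  { name: 'Beta Standard', description: 'Second sample record', org: 'Another Org' },
  { name: 'Gamma Standard', description: 'Third sample record', org: 'Example Org' }
]));

let parsed = null;
parser(samplePath, {}, (err, result) => {
  if (err) { console.error(err); return; }
  parsed = result;
  console.log(`Nodes: ${result.nodes.length}`); // Nodes: 5
});
fs.unlinkSync(samplePath);
```

In a real forge this function would be the module's `module.exports`, and the temp-file harness would live in a separate test script.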
Directory Structure
cli/
  graph-forge-core/              ← THIS LIBRARY (peer of lib.d/, not inside it)
    lib/
      forgeRunner.js             ← shared CLI engine
      containerManager.js        ← Docker Neo4j lifecycle
      graphLoader.js             ← nodes → Neo4j
      graphEmbedder.js           ← Voyage embeddings + indexes
      graphSearchTool.js         ← search, stats, explore, rawCypher, history
      bridgeBuilder.js           ← cross-standard edge creation
      providerGenerator.js       ← askMilo provider.json generation
      graphHistory.js            ← audit trail
      voyageClient.js            ← Voyage AI client (single source of truth)
    package.json                 ← dependencies (neo4j-driver, qtools-*)
    node_modules/
  lib.d/
    forge-edmatrix/              ← 30 lines + parser
    forge-ceds-rdf/              ← 30 lines + parser (will be refactored)
    forge-sif-tsv/               ← 30 lines + parser (will be refactored)
    forge-career-stories/        ← 30 lines + parser (will be refactored)
    forge-himed/                 ← 30 lines + parser (will be refactored)
    rebuild-dme/                 ← orchestration script
    rebuild-himed/               ← orchestration script
How Forge CLIs Require This Library
From cli/lib.d/forge-xxx/forgeXxx.js:
const forgeRunner = require('../../graph-forge-core/lib/forgeRunner');
The path ../../graph-forge-core/ goes up from lib.d/forge-xxx/ to cli/, then into graph-forge-core/.
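The resolution can be checked with Node's `path` module. The `/repo` prefix below is a hypothetical checkout location chosen for the example. Only the relative segment comes from the forge CLIs.

```javascript
const path = require('path');

// Hypothetical absolute location of a forge CLI directory, for illustration only.
const forgeDir = '/repo/cli/lib.d/forge-xxx';

// path.posix keeps the result identical on every platform.
const resolved = path.posix.resolve(forgeDir, '../../graph-forge-core/lib/forgeRunner');
console.log(resolved); // /repo/cli/graph-forge-core/lib/forgeRunner
```

Since `require()` resolves relative paths against the requiring file's directory, the same two-level climb works for any forge CLI placed directly under lib.d/.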
