@blade47/semantic-test

v1.0.5

A composable, pipeline-based testing framework for AI systems and APIs with semantic validation

npm install @blade47/semantic-test

Why SemanticTest?

Testing AI systems is hard. Responses are non-deterministic, you need to validate tool usage, and semantic meaning matters more than exact text matching.

SemanticTest solves this with:

  • Composable blocks for HTTP, parsing, validation, and AI evaluation
  • Pipeline architecture where data flows through named slots
  • LLM Judge to evaluate responses semantically using GPT-4
  • JSON test definitions that are readable and version-controllable

Quick Start

1. Install

npm install @blade47/semantic-test

2. Create a test

{
  "name": "API Test",
  "version": "1.0.0",
  "context": {
    "BASE_URL": "https://api.example.com"
  },
  "tests": [
    {
      "id": "get-user",
      "name": "Get User",
      "pipeline": [
        {
          "id": "request",
          "block": "HttpRequest",
          "input": {
            "url": "${BASE_URL}/users/1",
            "method": "GET"
          },
          "output": "response"
        },
        {
          "id": "parse",
          "block": "JsonParser",
          "input": "${response.body}",
          "output": "user"
        },
        {
          "id": "validate",
          "block": "ValidateContent",
          "input": {
            "from": "user.parsed.name",
            "as": "text"
          },
          "config": {
            "contains": "John"
          },
          "output": "validation"
        }
      ],
      "assertions": {
        "response.status": 200,
        "user.parsed.id": 1,
        "validation.passed": true
      }
    }
  ]
}

3. Run it

npx semtest test.json

Core Concepts

Pipelines

Tests are pipelines of blocks that execute in sequence:

HttpRequest → JsonParser → Validate → Assert

Each block:

  • Reads inputs from named slots
  • Does one thing well
  • Writes outputs to named slots

Data Flow

Data flows through a DataBus with named slots:

{
  "pipeline": [
    {
      "id": "fetch",
      "block": "HttpRequest",
      "output": "response"        // Writes to 'response' slot
    },
    {
      "id": "parse",
      "block": "JsonParser",
      "input": "${response.body}",  // Reads from 'response.body'
      "output": "data"              // Writes to 'data' slot
    }
  ]
}

Three Input Formats

1. String - becomes { body: value }

"input": "${response.body}"

2. From/As - maps slot to parameter

"input": {
  "from": "response.body",
  "as": "text"
}

3. Object - deep resolves all values

"input": {
  "url": "${BASE_URL}/api",
  "method": "POST",
  "headers": {
    "Authorization": "Bearer ${token}"
  }
}

Three Output Formats

1. String - stores entire output

"output": "myResult"

2. Object - maps output fields to slots

"output": {
  "parsed": "data",
  "error": "parseError"
}

3. Default - uses block ID

{
  "id": "parse"
  // Output stored in 'parse' slot
}

Available Blocks

HTTP

HttpRequest - Make HTTP requests

{
  "block": "HttpRequest",
  "input": {
    "url": "https://api.example.com/users",
    "method": "POST",
    "headers": {
      "Authorization": "Bearer ${token}"
    },
    "body": {
      "name": "John Doe"
    },
    "timeout": 5000
  }
}

Parsers

JsonParser - Parse JSON

{
  "block": "JsonParser",
  "input": "${response.body}"
}

StreamParser - Parse streaming responses

{
  "block": "StreamParser",
  "input": "${response.body}",
  "config": {
    "format": "sse-vercel"  // or "sse-openai", "sse"
  }
}

Outputs: text, toolCalls, chunks, metadata
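
These land in the block's output slot, so later blocks can read them by path. A sketch, assuming the parser step is routed with "output": "parsed" as in the chat example further below:

{
  "block": "ValidateTools",
  "input": {
    "from": "parsed.toolCalls",
    "as": "toolCalls"
  },
  "config": {
    "expected": ["search_users"]
  }
}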

Validators

ValidateContent - Validate text

{
  "block": "ValidateContent",
  "input": {
    "from": "data.message",
    "as": "text"
  },
  "config": {
    "contains": ["success", "confirmed"],
    "notContains": ["error", "failed"],
    "minLength": 10,
    "maxLength": 1000,
    "matches": "^[A-Z].*"
  }
}

ValidateTools - Validate AI tool usage

{
  "block": "ValidateTools",
  "input": {
    "from": "parsed.toolCalls",
    "as": "toolCalls"
  },
  "config": {
    "expected": ["search_database", "send_email"],
    "forbidden": ["delete_all"],
    "order": ["search_database", "send_email"],
    "minTools": 1,
    "maxTools": 5,
    "validateArgs": {
      "send_email": {
        "to": "[email protected]"
      }
    }
  }
}

AI Judge

LLMJudge - Semantic evaluation with GPT-4

{
  "block": "LLMJudge",
  "input": {
    "text": "${response.text}",
    "toolCalls": "${response.toolCalls}",
    "expected": {
      "expectedBehavior": "Should greet the user and offer to help with their calendar"
    }
  },
  "config": {
    "model": "gpt-4o-mini",
    "criteria": {
      "accuracy": 0.4,
      "completeness": 0.3,
      "relevance": 0.3
    }
  }
}

Returns: score (0-1), reasoning, shouldContinue, nextPrompt
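
Whatever slot you route the output to can be asserted on directly; for example, with "output": "judgement" as in the chat example further below:

{
  "assertions": {
    "judgement.score": { "gt": 0.7 }
  }
}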

Control Flow

Loop - Loop back to previous blocks

{
  "block": "Loop",
  "config": {
    "target": "retry-request",
    "maxIterations": 3
  }
}

Test Suites

Organize multiple tests with shared setup/teardown:

{
  "name": "User API Tests",
  "version": "1.0.0",
  "context": {
    "BASE_URL": "${env.API_URL}",
    "API_KEY": "${env.API_KEY}"
  },
  "setup": [
    {
      "id": "auth",
      "block": "HttpRequest",
      "input": {
        "url": "${BASE_URL}/auth/login",
        "method": "POST",
        "body": {
          "username": "test",
          "password": "test123"
        }
      },
      "output": "auth"
    }
  ],
  "tests": [
    {
      "id": "create-user",
      "name": "Create User",
      "pipeline": [
        {
          "id": "request",
          "block": "HttpRequest",
          "input": {
            "url": "${BASE_URL}/users",
            "method": "POST",
            "headers": {
              "Authorization": "Bearer ${auth.body.token}"
            },
            "body": {
              "name": "Jane Doe"
            }
          },
          "output": "createResponse"
        }
      ],
      "assertions": {
        "createResponse.status": 201
      }
    },
    {
      "id": "get-user",
      "name": "Get User",
      "pipeline": [
        {
          "id": "request",
          "block": "HttpRequest",
          "input": {
            "url": "${BASE_URL}/users/${createResponse.body.id}",
            "method": "GET",
            "headers": {
              "Authorization": "Bearer ${auth.body.token}"
            }
          },
          "output": "getResponse"
        }
      ],
      "assertions": {
        "getResponse.status": 200,
        "getResponse.body.name": "Jane Doe"
      }
    }
  ],
  "teardown": [
    {
      "id": "cleanup",
      "block": "HttpRequest",
      "input": {
        "url": "${BASE_URL}/users/${createResponse.body.id}",
        "method": "DELETE",
        "headers": {
          "Authorization": "Bearer ${auth.body.token}"
        }
      }
    }
  ]
}

Assertions

Validate final results with operators:

{
  "assertions": {
    "response.status": 200,                     // Equality
    "data.count": { "gt": 10, "lt": 100 },      // Greater than / less than
    "data.message": { "contains": "success" },  // Contains
    "data.email": { "matches": ".*@.*\\.com" }  // Regex
  }
}

Environment Variables

Use .env file:

API_URL=https://api.example.com
API_KEY=secret123
OPENAI_API_KEY=sk-...

Reference in tests:

{
  "context": {
    "BASE_URL": "${env.API_URL}",
    "API_KEY": "${env.API_KEY}"
  }
}

Testing AI Systems

Example: Chat API

{
  "name": "AI Chat Tests",
  "context": {
    "CHAT_URL": "${env.CHAT_API_URL}",
    "API_KEY": "${env.API_KEY}"
  },
  "tests": [
    {
      "id": "chat-test",
      "name": "Chat with Tool Usage",
      "pipeline": [
        {
          "id": "chat",
          "block": "HttpRequest",
          "input": {
            "url": "${CHAT_URL}",
            "method": "POST",
            "headers": {
              "Authorization": "Bearer ${API_KEY}"
            },
            "body": {
              "messages": [
                {
                  "role": "user",
                  "content": "Search for users named John"
                }
              ]
            }
          },
          "output": "chatResponse"
        },
        {
          "id": "parse",
          "block": "StreamParser",
          "input": "${chatResponse.body}",
          "config": {
            "format": "sse-vercel"
          },
          "output": "parsed"
        },
        {
          "id": "validate-tools",
          "block": "ValidateTools",
          "input": {
            "from": "parsed.toolCalls",
            "as": "toolCalls"
          },
          "config": {
            "expected": ["search_users"]
          },
          "output": "toolValidation"
        },
        {
          "id": "judge",
          "block": "LLMJudge",
          "input": {
            "text": "${parsed.text}",
            "toolCalls": "${parsed.toolCalls}",
            "expected": {
              "expectedBehavior": "Should use search_users tool and confirm searching for John"
            }
          },
          "config": {
            "model": "gpt-4o-mini"
          },
          "output": "judgement"
        }
      ],
      "assertions": {
        "chatResponse.status": 200,
        "toolValidation.passed": true,
        "judgement.score": { "gt": 0.7 }
      }
    }
  ]
}

Why LLM Judge?

AI outputs vary. Exact text matching fails. Instead, use another LLM to evaluate semantic meaning:

  • "2:00 PM", "2 PM", "14:00" are all acceptable
  • Focuses on intent and helpfulness
  • Provides reasoning for failures
  • Configurable scoring criteria
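
For illustration, a judgement might come back shaped like this (a sketch using the output fields listed above; actual scores and wording will vary):

{
  "score": 0.85,
  "reasoning": "The response confirms the meeting time and offers next steps.",
  "shouldContinue": false
}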

Custom Blocks

Create a Block

// blocks/custom/MyBlock.js
import { Block } from '@blade47/semantic-test';

export class MyBlock extends Block {
  static get inputs() {
    return {
      required: ['data'],
      optional: ['config']
    };
  }

  static get outputs() {
    return {
      produces: ['result', 'metadata']
    };
  }

  async process(inputs, context) {
    const { data, config } = inputs;

    // Your logic goes here; this placeholder just echoes the input
    const result = { echoed: data, mode: config?.mode };

    return {
      result,
      metadata: { timestamp: Date.now() }
    };
  }
}

Register It

import { blockRegistry } from '@blade47/semantic-test';
import { MyBlock } from './blocks/custom/MyBlock.js';

blockRegistry.register('MyBlock', MyBlock);

Use It

{
  "block": "MyBlock",
  "input": {
    "data": "${previous.output}",
    "config": { "mode": "fast" }
  },
  "output": "myResult"
}

See blocks/examples/ for complete examples.

CLI

# Run single test
npx semtest test.json

# Run multiple tests
npx semtest tests/*.json

# Generate HTML report
npx semtest test.json --html

# Custom output file
npx semtest test.json --html --output report.html

# Debug mode
LOG_LEVEL=DEBUG npx semtest test.json

Programmatic Usage

import { PipelineBuilder } from '@blade47/semantic-test';
import fs from 'fs/promises';

const testDef = JSON.parse(await fs.readFile('test.json', 'utf-8'));
const pipeline = PipelineBuilder.fromJSON(testDef);

const result = await pipeline.execute();

if (result.success) {
  console.log('Test passed!');
} else {
  console.error('Test failed:', result.error);
}

Examples

See test-examples/ directory:

  • simple-api-test.json - Basic REST API testing
  • validation-test.json - HTTP request, JSON parsing, and content validation
  • mock-ai-validation.json - MockData block with AI response validation
  • conditions-example.json - Advanced assertions and conditional loops

Advanced Features

Multi-turn Conversations

{
  "block": "LLMJudge",
  "input": {
    "text": "${response.text}",
    "history": [
      { "role": "user", "content": "Hello" },
      { "role": "assistant", "content": "Hi there!" },
      { "role": "user", "content": "What's the weather?" }
    ]
  },
  "config": {
    "continueConversation": true,
    "maxTurns": 5
  }
}

Custom Stream Parsers

import { StreamParser } from '@blade47/semantic-test';

function myCustomParser(body) {
  // Parse your custom format; this toy version treats each line as a chunk
  const chunks = body.split('\n').filter(Boolean);
  return {
    text: chunks.join(' '),
    toolCalls: [],
    chunks,
    metadata: { format: 'custom' }
  };
}

StreamParser.register('my-format', myCustomParser);

Use it:

{
  "block": "StreamParser",
  "config": {
    "format": "my-format"
  }
}

Loop Control

{
  "pipeline": [
    {
      "id": "attempt",
      "block": "HttpRequest",
      "input": { "url": "${API_URL}" }
    },
    {
      "id": "check",
      "block": "ValidateContent",
      "input": { "from": "attempt.body", "as": "text" },
      "config": { "contains": "success" }
    },
    {
      "id": "retry",
      "block": "Loop",
      "config": {
        "target": "attempt",
        "maxIterations": 3
      }
    }
  ]
}

Best Practices

1. Use Meaningful Slot Names

// Good
"output": "userProfile"
"output": "authToken"

// Bad
"output": "data"
"output": "result"

2. Validate Early

{
  "pipeline": [
    { "block": "HttpRequest", "output": "response" },
    { "block": "JsonParser", "output": "data" },
    { "block": "ValidateContent" },  // Validate before expensive operations
    { "block": "LLMJudge" }          // Expensive: calls GPT-4
  ]
}

3. Use Setup/Teardown

Always clean up test data:

{
  "setup": [
    { "id": "create-test-data", "block": "..." }
  ],
  "tests": [ /* ... */ ],
  "teardown": [
    { "id": "delete-test-data", "block": "..." }
  ]
}

4. Semantic Validation for AI

Don't match exact text:

// Bad - too brittle
{
  "assertions": {
    "response.text": "The meeting is scheduled for 2:00 PM"
  }
}

// Good - semantic validation
{
  "block": "LLMJudge",
  "input": {
    "expected": {
      "expectedBehavior": "Should confirm meeting is scheduled for 2 PM"
    }
  }
}

Contributing

git clone https://github.com/blade47/semantic-test.git
cd semantic-test
npm install
npm test

Adding Blocks

  1. Create block in blocks/[category]/YourBlock.js
  2. Add tests in tests/unit/blocks/YourBlock.test.js (see the sketch after this list)
  3. Register in src/core/BlockRegistry.js
  4. Document in README
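
As a starting point, a unit test might look like this (a sketch assuming a Jest-style runner and a block that needs no constructor arguments; adapt it to the existing tests in tests/unit/blocks/):

// tests/unit/blocks/YourBlock.test.js
import { YourBlock } from '../../../blocks/custom/YourBlock.js';

test('YourBlock produces a result', async () => {
  const block = new YourBlock();
  const output = await block.process({ data: 'hello' }, {});
  expect(output).toHaveProperty('result');
});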

Testing

npm test              # All tests
npm run test:unit     # Unit tests only
npm run test:integration  # Integration tests
npm run test:watch    # Watch mode

License

MIT

Support

  • Documentation: https://docs.semantictest.dev
  • GitHub Issues: https://github.com/blade47/semantic-test/issues

Built for testing AI systems that don't play by traditional rules.