@cascadeflow/langchain

v1.1.0

Published

4 months ago

LangChain integration for cascadeflow - Add intelligent cost optimization to your LangChain models

Downloads

107

0High
0Medium
0Low

saschabuehrle

lemony-ai

langchain cascadeflow ai llm cost-optimization model-routing cascade typescript

@cascadeflow/langchain

LangChain integration for CascadeFlow - Add intelligent cost optimization to your existing LangChain models without reconfiguration.

Features

🔄 Zero Code Changes - Wrap your existing LangChain models, no refactoring needed
💰 Automatic Cost Optimization - Save 40-60% on LLM costs through intelligent cascading
🎯 Quality-Based Routing - Only escalate to expensive models when quality is insufficient
📊 Full Visibility - Track costs, quality scores, and cascade decisions
🔗 Chainable - All LangChain methods (bind(), bindTools(), etc.) work seamlessly
📈 LangSmith Ready - Automatic cost metadata injection for observability
🧭 Domain Policies - Per-domain threshold/routing overrides (qualityThreshold, forceVerifier, directToVerifier)
🔁 CascadeAgent - Built-in closed-loop tool agent for multi-turn execution with max-step protection

Installation

npm install @cascadeflow/langchain @langchain/core
# or
pnpm add @cascadeflow/langchain @langchain/core
# or
yarn add @cascadeflow/langchain @langchain/core

Quick Start

import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';
import { withCascade } from '@cascadeflow/langchain';

// Step 1: Configure your existing models (no changes needed!)
const drafter = new ChatOpenAI({
  model: 'gpt-5-mini',  // Fast, cheap model ($0.25/$2 per 1M tokens)
  temperature: 0.7
});

const verifier = new ChatAnthropic({
  model: 'claude-opus-4-6',  // Accurate, expensive model ($15/$75 per 1M tokens)
  temperature: 0.7
});

// Step 2: Wrap with cascade (just 2 lines!)
const cascadeModel = withCascade({
  drafter,
  verifier,
  qualityThreshold: 0.7,  // Quality bar for accepting drafter responses
});

// Step 3: Use like any LangChain model!
const result = await cascadeModel.invoke("What is TypeScript?");
console.log(result.content);

// Step 4: Check cascade statistics
const stats = cascadeModel.getLastCascadeResult();
console.log(`Model used: ${stats.modelUsed}`);
console.log(`Cost: $${stats.totalCost.toFixed(6)}`);
console.log(`Savings: ${stats.savingsPercentage.toFixed(1)}%`);

// Optional: Enable LangSmith tracing (see traces at https://smith.langchain.com)
// Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true
// Your ChatOpenAI/ChatAnthropic models will appear in LangSmith with cascade metadata

How It Works

CascadeFlow uses speculative execution to optimize costs:

Try Drafter First - Executes the cheap, fast model
Quality Check - Validates the response quality using heuristics or custom validators
Cascade if Needed - Only calls the expensive model if quality is below threshold
Track Everything - Records costs, latency, and cascade decisions

This approach provides:

✅ No Latency Penalty - Drafter responses are instant when quality is high
✅ Quality Guarantee - Verifier ensures high-quality responses for complex queries
✅ Cost Savings - 40-60% reduction in API costs on average

Configuration

Basic Configuration

const cascadeModel = withCascade({
  drafter: new ChatOpenAI({ model: 'gpt-5-mini' }),
  verifier: new ChatAnthropic({ model: 'claude-opus-4-6' }),
  qualityThreshold: 0.7,  // Default: 0.7 (70%)
});

Custom Quality Validator

const cascadeModel = withCascade({
  drafter,
  verifier,
  qualityValidator: async (response) => {
    // Custom logic - return quality score 0-1
    const text = response.generations[0].text;

    // Example: Use length and keywords
    const hasKeywords = ['typescript', 'javascript'].some(kw =>
      text.toLowerCase().includes(kw)
    );

    return text.length > 50 && hasKeywords ? 0.9 : 0.4;
  },
});

Disable Cost Tracking

const cascadeModel = withCascade({
  drafter,
  verifier,
  enableCostTracking: false,  // Disable metadata injection
});

Domain Policies

Use domain-specific routing rules without changing your chain code:

const cascadeModel = withCascade({
  drafter,
  verifier,
  qualityThreshold: 0.7,
  domainPolicies: {
    finance: { qualityThreshold: 0.5 }, // Easier acceptance for finance queries
    medical: { forceVerifier: true }, // Always verify after drafting
    legal: { directToVerifier: true }, // Skip drafter entirely
  },
});

const legalCascade = cascadeModel.bind({
  metadata: { cascadeflow_domain: "legal" },
});

const result = await legalCascade.invoke("Review this contract clause");

Advanced Usage

Streaming Responses

CascadeFlow supports real-time streaming with optimistic drafter execution:

const cascade = withCascade({
  drafter: new ChatOpenAI({ model: 'gpt-4o-mini' }),
  verifier: new ChatOpenAI({ model: 'gpt-4o' }),
});

// Stream responses in real-time
const stream = await cascade.stream('Explain TypeScript');

for await (const chunk of stream) {
  process.stdout.write(chunk.content);
}

How Streaming Works:

Optimistic Streaming (text-only) - Drafter response streams immediately (user sees output in real-time)
Quality Check - After drafter completes, quality is validated
Optional Cascade - If quality is insufficient, verifier output is streamed; switch notices are off by default and can be enabled via metadata (cascadeflow_emit_switch_message)
Tool-safe Streaming - When tools are bound with bindTools(...), output is buffered until final routing so tool-call deltas stay consistent

This provides the best user experience with no perceived latency for queries the drafter can handle.

Chaining with bind()

All LangChain chainable methods work seamlessly:

const cascadeModel = withCascade({ drafter, verifier });

// bind() works
const boundModel = cascadeModel.bind({ temperature: 0.1 });
const result = await boundModel.invoke("Be precise");

// Chain multiple times
const doubleChained = cascadeModel
  .bind({ temperature: 0.5 })
  .bind({ maxTokens: 100 });

Tool Calling

const tools = [
  {
    name: 'calculator',
    description: 'Useful for math calculations',
    func: async (input: string) => {
      // Use a strict parser helper (see examples/nodejs/safe-math.ts).
      return safeCalculateExpression(input).toString();
    },
  },
];

const modelWithTools = cascadeModel.bindTools(tools);
const result = await modelWithTools.invoke("What is 25 * 4?");

Structured Output

const schema = {
  name: 'person',
  schema: {
    type: 'object',
    properties: {
      name: { type: 'string' },
      age: { type: 'number' },
    },
  },
};

const structuredModel = cascadeModel.withStructuredOutput(schema);
const result = await structuredModel.invoke("Extract: John is 30 years old");
// Result is typed according to schema

Agentic Tool Loops (`CascadeAgent`)

CascadeAgent adds a closed agent/tool loop with explicit max-step safety:

import { CascadeAgent, withCascade } from '@cascadeflow/langchain';

const cascadeModel = withCascade({
  drafter,
  verifier,
  domainPolicies: {
    legal: { directToVerifier: true },
    medical: { forceVerifier: true },
  },
});

const agent = new CascadeAgent({
  model: cascadeModel.bindTools(tools),
  maxSteps: 6,
  toolHandlers: {
    // Use the same strict parser helper (see examples/nodejs/safe-math.ts).
    calculator: async ({ expression }) => safeCalculateExpression(expression).toString(),
  },
});

const run = await agent.run(
  [{ role: 'user', content: 'What is (25 * 4) + 10?' }],
  { systemPrompt: 'You are a precise calculator assistant.' }
);

console.log(run.status, run.steps, run.message.content);

Input can be a string, LangChain BaseMessage[], or role/content message list for multi-turn conversations. CascadeAgent also supports multi-tool calls in a single step and keeps system prompts at the front of looped executions.

Accessing Cascade Statistics

const result = await cascadeModel.invoke("Complex question");

const stats = cascadeModel.getLastCascadeResult();
console.log({
  content: stats.content,
  modelUsed: stats.modelUsed,  // 'drafter' or 'verifier'
  accepted: stats.accepted,  // Was drafter response accepted?
  drafterQuality: stats.drafterQuality,  // 0-1 quality score
  drafterCost: stats.drafterCost,  // $ spent on drafter
  verifierCost: stats.verifierCost,  // $ spent on verifier
  totalCost: stats.totalCost,  // Total $ spent
  savingsPercentage: stats.savingsPercentage,  // % saved vs verifier-only
  latencyMs: stats.latencyMs,  // Total latency in ms
});

LangSmith Integration

CascadeFlow works seamlessly with LangSmith for observability and cost tracking.

What You'll See in LangSmith

When you enable LangSmith tracing, you'll see:

Your Actual Chat Models - ChatOpenAI, ChatAnthropic, etc. appear as separate traces
Cascade Metadata - Decision info attached to each response
Token Usage & Costs - Server-side calculation by LangSmith
Nested Traces - Parent CascadeFlow trace with child model traces

Enabling LangSmith

// Set environment variables
process.env.LANGSMITH_API_KEY = 'lsv2_pt_...';
process.env.LANGSMITH_PROJECT = 'your-project';
process.env.LANGSMITH_TRACING = 'true';

// Use CascadeFlow normally - tracing happens automatically
const cascade = withCascade({
  drafter: new ChatOpenAI({ model: 'gpt-5-mini' }),
  verifier: new ChatAnthropic({ model: 'claude-opus-4-6' }),
  costTrackingProvider: 'cascadeflow', // Default (local pricing)
});

const result = await cascade.invoke("Your query");

Viewing Traces

In your LangSmith dashboard (https://smith.langchain.com):

For cascaded queries - You'll see only the drafter model trace (e.g., ChatOpenAI with gpt-5-mini)
For escalated queries - You'll see BOTH drafter AND verifier traces (e.g., ChatOpenAI gpt-5-mini + ChatAnthropic claude-opus-4-6)
Metadata location - Click any trace → Outputs → response_metadata → cascade

Example Metadata

{
  "cascade": {
    "cascade_decision": "cascaded",
    "model_used": "drafter",
    "drafter_quality": 0.85,
    "savings_percentage": 66.7,
    "drafter_cost": 0,      // Calculated by LangSmith
    "verifier_cost": 0,     // Calculated by LangSmith
    "total_cost": 0         // Calculated by LangSmith
  }
}

Note: costTrackingProvider: 'cascadeflow' (default) computes costs locally using CascadeFlow's pricebook. If you use costTrackingProvider: 'langsmith', costs are calculated server-side and shown in the LangSmith UI (local cost values will be $0).

See docs/COST_TRACKING.md for more details on cost tracking options.

Supported Models

Works with any LangChain-compatible chat model:

OpenAI

import { ChatOpenAI } from '@langchain/openai';

const drafter = new ChatOpenAI({ model: 'gpt-5-mini' });
const verifier = new ChatOpenAI({ model: 'gpt-5' });

Anthropic

import { ChatAnthropic } from '@langchain/anthropic';

const drafter = new ChatAnthropic({ model: 'claude-haiku-4-5-20251001' });
const verifier = new ChatAnthropic({ model: 'claude-opus-4-6' });

Mix and Match (Recommended)

// Use different providers for optimal cost/quality balance!
const drafter = new ChatOpenAI({ model: 'gpt-5-mini' });
const verifier = new ChatAnthropic({ model: 'claude-opus-4-6' });

Cost Optimization Tips

Choose Your Drafter Wisely - Use the cheapest model that can handle most queries
- GPT-5-mini: $0.25/$2.00 per 1M tokens (input/output)
- GPT-4o-mini: $0.15/$0.60 per 1M tokens (input/output)
- Claude Haiku 4.5: $0.80/$4.00 per 1M tokens
Tune Quality Threshold - Higher threshold = more cascades = higher cost but better quality
- 0.6 - Aggressive cost savings, may sacrifice some quality
- 0.7 - Balanced (recommended default)
- 0.8 - Conservative, ensures high quality

Use Custom Validators - Domain-specific validation can improve accuracy

qualityValidator: (response) => {
  const text = response.generations[0].text;
  // Check for specific requirements
  return hasRelevantKeywords(text) && meetsLengthRequirement(text) ? 0.9 : 0.5;
}

Performance

Typical cascade behavior:

| Query Type | Drafter Hit Rate | Avg Latency | Cost Savings | |-----------|------------------|-------------|--------------| | Simple Q&A | 85% | 500ms | 55-65% | | Complex reasoning | 40% | 1200ms | 20-30% | | Code generation | 60% | 800ms | 35-45% | | Overall | 70% | 700ms | 40-60% |

TypeScript Support

Full TypeScript support with type inference:

import type { CascadeConfig, CascadeResult } from '@cascadeflow/langchain';

const config: CascadeConfig = {
  drafter,
  verifier,
  qualityThreshold: 0.7,
};

const stats: CascadeResult | undefined = cascadeModel.getLastCascadeResult();

Examples

See the examples directory for complete working examples:

basic-usage.ts - Getting started guide
streaming-cascade.ts - Real-time streaming with optimistic drafter execution
lcel-pipeline.ts - LCEL runnable composition with CascadeFlow
tool-risk-gating.ts - Tool-call acceptance + high-risk verifier gating
langgraph-multi-agent.ts - Optional LangGraph multi-agent pattern

API Reference

`withCascade(config: CascadeConfig): CascadeFlow`

Creates a cascade-wrapped LangChain model.

Parameters:

config.drafter - The cheap, fast model
config.verifier - The accurate, expensive model
config.qualityThreshold? - Minimum quality to accept drafter (default: 0.7)
config.qualityValidator? - Custom function to calculate quality
config.enableCostTracking? - Enable LangSmith metadata injection (default: true)
config.costTrackingProvider? - 'cascadeflow' (default, local pricing) or 'langsmith' (server-side)
config.domainPolicies? - Per-domain overrides: qualityThreshold, forceVerifier, directToVerifier

Returns: CascadeFlow - A LangChain-compatible model with cascade logic

`new CascadeAgent(config: CascadeAgentConfig)`

Creates a closed-loop agent around a LangChain model (or directly from cascade config).

Parameters:

config.model? - Any LangChain chat model (often withCascade(...).bindTools(...))
config.cascade? - Optional CascadeConfig used to create an internal CascadeFlow
config.maxSteps? - Loop safety cap (default: 8)
config.toolHandlers? - Tool name to handler map

`CascadeAgent.run(input, options?): Promise<CascadeAgentRunResult>`

Runs model/tool/model loops until completion or maxSteps is reached.

Returns: CascadeAgentRunResult with:

message - Final AIMessage
messages - Full message history (including tool messages)
steps - Executed model turns
status - 'completed' | 'max_steps_reached'
toolCalls - Collected tool calls across steps

`CascadeFlow.getLastCascadeResult(): CascadeResult | undefined`

Returns statistics from the last cascade execution.

Returns: CascadeResult with:

content - The final response text
modelUsed - Which model provided the response ('drafter' | 'verifier')
accepted - Whether drafter response was accepted
drafterQuality - Quality score of drafter response (0-1)
drafterCost - Cost of drafter call
verifierCost - Cost of verifier call (0 if not used)
totalCost - Total cost
savingsPercentage - Percentage saved vs verifier-only
latencyMs - Total latency in milliseconds

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

License

@cascadeflow/core - Core CascadeFlow Python library
LangChain - Framework for LLM applications
LangSmith - LLM observability platform

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@cascadeflow/langchain

Features

Installation

Quick Start

How It Works

Configuration

Basic Configuration

Custom Quality Validator

Disable Cost Tracking

Domain Policies

Advanced Usage

Streaming Responses

Chaining with bind()

Tool Calling

Structured Output

Agentic Tool Loops (CascadeAgent)

Accessing Cascade Statistics

LangSmith Integration

What You'll See in LangSmith

Enabling LangSmith

Viewing Traces

Example Metadata

Supported Models

OpenAI

Anthropic

Mix and Match (Recommended)

Cost Optimization Tips

Performance

TypeScript Support

Examples

API Reference

withCascade(config: CascadeConfig): CascadeFlow

new CascadeAgent(config: CascadeAgentConfig)

CascadeAgent.run(input, options?): Promise<CascadeAgentRunResult>

CascadeFlow.getLastCascadeResult(): CascadeResult | undefined

Contributing

License

Related

Agentic Tool Loops (`CascadeAgent`)

`withCascade(config: CascadeConfig): CascadeFlow`

`new CascadeAgent(config: CascadeAgentConfig)`

`CascadeAgent.run(input, options?): Promise<CascadeAgentRunResult>`

`CascadeFlow.getLastCascadeResult(): CascadeResult | undefined`