streaming-llm-responses
v0.2.0
React hook factory for streaming LLM responses over SSE — supports Q&A, RAG, and agentic modes.
streaming-llm-responses
A small, reusable React hook factory for streaming LLM responses with multi-turn history.
What you get
- Factory pattern for one-time app config.
- Streaming updates for assistant tokens.
- Completion detection (done signal or clean EOF).
- Error handling and abort control.
Usage
```ts
// llm.ts
import { createUseLLM, createSSEParser } from "./src";

export const useLLM = createUseLLM({
  endpoint: "/api/llm",
  model: "gpt-4o-mini",
  systemPrompt: "You are a helpful assistant.",
  headers: { "x-app": "email-automation" },
  credentials: "include",
  createStreamParser: createSSEParser,
});
```

```tsx
// Component.tsx
import { useLLM } from "./llm";

export function Chat() {
  const { messages, status, error, send, abort, reset, completionReason } = useLLM();
  return (
    <div>
      <button onClick={() => send("Hello")}>Send</button>
      <button onClick={abort}>Abort</button>
      <button onClick={reset}>Reset</button>
      <div>Status: {status}</div>
      <div>Completion: {completionReason ?? "-"}</div>
      {error ? <div>Error: {error}</div> : null}
      <ul>
        {messages.map((m) => (
          <li key={m.id}>
            <b>{m.role}:</b> {m.content}
          </li>
        ))}
      </ul>
    </div>
  );
}
```

Notes
- The default SSE parser expects `text/event-stream` responses.
- If your backend streams raw text or NDJSON, provide a custom `createStreamParser`.
- The hook uses the browser `fetch` API and `ReadableStream`.
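If your backend emits NDJSON, a custom parser could look roughly like the sketch below. This assumes a stream parser exposes a `feed(chunk)` method that returns complete events and buffers partial lines; check the package's actual `createStreamParser` type before copying, since the interface here is illustrative:

```ts
// Hypothetical NDJSON parser: feed() accepts raw text chunks and returns
// any complete JSON objects; a trailing partial line is buffered until
// the next chunk arrives.
type NDJSONEvent = { token?: string; done?: boolean };

export function createNDJSONParser() {
  let buffer = "";
  return {
    feed(chunk: string): NDJSONEvent[] {
      buffer += chunk;
      const lines = buffer.split("\n");
      buffer = lines.pop() ?? ""; // keep the trailing partial line
      return lines
        .filter((line) => line.trim().length > 0)
        .map((line) => JSON.parse(line) as NDJSONEvent);
    },
  };
}
```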
Architectural Decisions
This section explains the key choices behind this design and the tradeoffs.
Why a factory pattern instead of a React context provider
The hook is created once with a stable configuration and then used directly anywhere.
- Keeps React usage simple: no provider wiring or dependency on React context.
- Makes it easy to create multiple independent hooks with different configs in the same app or tests.
- Improves portability if you later reuse the hook in non-React-Context environments or libraries.
- Avoids implicit global state and makes config explicit at creation time.
Minimal example (no provider needed):
```ts
// llm.ts
import { createUseLLM, createSSEParser } from "./src";

export const useLLM = createUseLLM({
  endpoint: "/api/llm",
  model: "gpt-4o-mini",
  createStreamParser: createSSEParser,
});
```

```tsx
// Any component can just use the hook directly.
import { useLLM } from "./llm";

export function Chat() {
  const { send } = useLLM();
  return <button onClick={() => send("Hello")}>Send</button>;
}
```

When a provider can be better:
- If you want to switch configs dynamically at runtime per subtree.
- If you prefer a single global config without importing a preconfigured hook.
Provider wiring example (for comparison):
```tsx
// LLMProvider.tsx
import React, { createContext, useContext, useMemo } from "react";
import type { LLMConfig } from "./src/types";

type LLMProviderProps = {
  config: LLMConfig;
  children: React.ReactNode;
};

const LLMContext = createContext<LLMConfig | null>(null);

export const LLMProvider: React.FC<LLMProviderProps> = ({ config, children }) => {
  // Keep the context value reference stable unless config actually changes.
  const value = useMemo(() => config, [config]);
  return <LLMContext.Provider value={value}>{children}</LLMContext.Provider>;
};

export function useLLMConfig() {
  const ctx = useContext(LLMContext);
  if (!ctx) throw new Error("useLLMConfig must be used inside <LLMProvider>");
  return ctx;
}
```

```ts
// useLLM.ts
import { useLLMConfig } from "./LLMProvider";

export function useLLM() {
  const config = useLLMConfig();
  // use config.endpoint, config.model, config.headers, config.streamParser, etc.
  return {};
}
```

```tsx
// App.tsx
import { LLMProvider } from "./LLMProvider";
import { AppRoutes } from "./AppRoutes";

export default function App() {
  return (
    <LLMProvider
      config={{
        endpoint: "/api/llm",
        model: "gpt-4o-mini",
        systemPrompt: "You are a helpful assistant.",
      }}
    >
      <AppRoutes />
    </LLMProvider>
  );
}
```

What is SSE
SSE (Server-Sent Events) is a standard HTTP streaming format in which the server sends text events over a single long-lived response. The client reads events separated by blank lines and typically receives `data: ...` lines.
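To illustrate the wire format (this is not the package's actual parser), a minimal decoder for a fully buffered SSE payload might look like:

```ts
// Minimal SSE decoder sketch: events are separated by blank lines; each
// event collects its `event:` name (default "message") and `data:` lines.
type SSEEvent = { event: string; data: string };

export function parseSSE(raw: string): SSEEvent[] {
  const events: SSEEvent[] = [];
  for (const block of raw.split("\n\n")) {
    let event = "message"; // SSE default event name
    const data: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith("event:")) event = line.slice(6).trim();
      else if (line.startsWith("data:")) data.push(line.slice(5).trimStart());
    }
    if (data.length > 0) events.push({ event, data: data.join("\n") });
  }
  return events;
}
```

A real streaming parser must also buffer partial events across chunks, which this buffered sketch skips.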
Why SSE instead of WebSockets (for this use case)
- Simpler infrastructure: SSE uses plain HTTP, so it works with standard proxies, CDNs, and auth headers.
- One-way streaming fits LLM token streaming well; the request is a normal POST and the response is a stream.
- Lower operational overhead than WebSockets when you only need server → client streaming. This means no WS upgrade or special load-balancer rules, reuse of standard HTTP middleware and auth, and fewer moving parts to deploy and monitor.
When WebSockets can be better:
- If you need full-duplex, low-latency messaging in both directions.
- If you want to multiplex multiple interactive channels over one persistent connection.
Typical cases that need continuous back-and-forth in the same session:
- Live voice or audio where the client streams audio up while the model streams audio/text down.
- Real-time agent control where the client sends interrupts or new instructions mid-response.
- Collaborative experiences that mix chat with presence, cursor, or telemetry streams.
Examples:
- Agent control: the model is streaming a long reply, but the user says "stop, I only need the refund policy," so the client sends an interrupt or new instruction immediately without restarting the connection.
- Collaboration: multiple users co-edit a draft while the LLM streams suggestions, and the app also streams cursor positions, typing indicators, and presence updates in the same session.
SSE Connections and Concurrency
Each send() call opens a single streaming response and closes when the stream completes. The factory pattern
does not change that behavior; it only centralizes configuration. Multiple components can use useLLM() without
opening connections until they call send(). If several components call send() at the same time, multiple SSE
streams will be open concurrently.
To avoid too many concurrent streams, consider:
- One shared hook instance in a parent component, passing `send`/`messages`/`status` down as props.
- A simple queue or throttle so only one stream is active at a time.
- A shared store or singleton LLM client that components subscribe to.
- WebSocket multiplexing if you truly need many simultaneous streams.
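The queue option can be sketched in a few lines; all names here are hypothetical. Each enqueued task waits for the previous one to settle, so at most one stream is active at a time:

```ts
// Hypothetical serial queue: chains each task behind the previous one,
// so at most one SSE stream runs at any moment.
export function createSerialQueue() {
  let tail: Promise<unknown> = Promise.resolve();
  return function enqueue<T>(task: () => Promise<T>): Promise<T> {
    const next = tail.then(task, task); // run even if the previous task failed
    tail = next.catch(() => undefined); // keep the chain alive on errors
    return next;
  };
}
```

A component would wrap its `send()` calls in `enqueue(() => send(prompt))` so overlapping clicks serialize instead of opening parallel streams.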
-- Gemini response -- You’ve hit on a very real architectural pain point. If every component or hook in your app independently calls a factory to open a new SSE (Server-Sent Events) connection, you are definitely headed for a bottleneck.
Here is the breakdown of why this happens and how to fix it.
The Problem: Browser Connection Limits
Most modern browsers (like Chrome or Firefox) have a strict limit on the number of concurrent HTTP/1.1 connections to a single domain. This limit is usually 6 connections.
If you have a dashboard with seven different components each using a useLLM hook to open its own SSE stream:
- The first 6 will connect.
- The 7th will hang in "Pending" state.
- If a user opens another tab of your app, it will be completely unresponsive because the connection pool is already exhausted.
The Solution: The "Singleton" Hub Pattern
Instead of a Factory Pattern (which creates a new instance every time), you should move toward a Singleton or a Shared Provider Pattern. You want to multiplex your requests through a single management layer.
1. Centralized SSE Manager
Rather than the hook opening the connection, the hook should "register" an interest with a central manager.
- The manager maintains one persistent connection.
- It uses a unique `request_id` to route the incoming LLM chunks to the specific component that asked for them.
2. Use React Context or State Management
Wrap your application in an LLMProvider. This provider holds the actual EventSource logic.
- The Hook: `useLLM` now just sends a message to the Provider.
- The Provider: Listens to the single stream and updates the specific state associated with that `request_id`.
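The routing idea behind points 1 and 2 can be sketched as a small hub. Names like `StreamHub` and `route` are illustrative, not an API of any library:

```ts
// Hypothetical stream hub: components register a callback under a
// request id, and the single shared connection routes each chunk by id.
type Chunk = { requestId: string; token: string };

export class StreamHub {
  private subscribers = new Map<string, (token: string) => void>();

  subscribe(requestId: string, onToken: (token: string) => void): () => void {
    this.subscribers.set(requestId, onToken);
    return () => this.subscribers.delete(requestId); // call on unmount
  }

  // Called once per chunk arriving on the shared connection.
  route(chunk: Chunk): void {
    this.subscribers.get(chunk.requestId)?.(chunk.token);
  }
}
```

In a React app the hub would live in the Provider, and each hook instance would `subscribe` in a `useEffect` and return the unsubscribe function as cleanup.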
3. HTTP/2 is a Game Changer
If your server and infrastructure support HTTP/2, the 6-connection limit effectively disappears. HTTP/2 uses multiplexing, allowing hundreds of streams over a single TCP connection. However, relying solely on HTTP/2 is risky if your users are behind corporate proxies or older load balancers that downgrade connections to HTTP/1.1.
Comparison of Approaches
| Feature | Factory Pattern (Multiple Connections) | Shared Provider (Single Connection) |
| --- | --- | --- |
| Connection Usage | High (1 per component) | Low (1 per app/tab) |
| Complexity | Simple to implement | Moderate (requires routing logic) |
| Scalability | Fails quickly on one page | Scales across many components |
| Browser Support | Limited by 6-connection rule | Works universally |
Pro-Tip: If you are building a complex LLM app, consider using WebSockets instead of SSE if you need true bidirectional communication, as it also bypasses the standard SSE connection limits when managed correctly.
How do today's chat applications handle this issue from the client side?
In modern chat applications (like ChatGPT, Claude, or high-scale enterprise dashboards), the "many logical streams over one connection" problem is usually handled through three distinct layers of engineering.
If you are looking at how to solve this from the client side without a massive backend overhaul, here is the industry-standard "hierarchy" of solutions:
1. The "Infrastructure" Shortcut: HTTP/2 (Most Common)
Today’s chat apps almost universally run on HTTP/2.
- The Magic: Under HTTP/2, if you call `new EventSource('/api/sse')` six times, the browser does not open six TCP connections. It opens one and multiplexes the six streams inside it.
- Client Side Role: You don't have to write any multiplexing logic. You just keep your "Factory" pattern. As long as your server/load-balancer supports HTTP/2, the browser handles the "pooling" for you.
- Check it: Open DevTools > Network tab > right-click the column headers to show "Protocol." If you see `h2`, you are already multiplexing at the transport layer.
2. The "State Management" Layer (React/Frontend)
To prevent the "Factory" from going rogue and opening too many connections, modern apps use a Registry Pattern inside a React Context or a custom hook.
- Request Debouncing/Cleanup: Instead of a component opening a connection on `mount` and leaving it, the hook manages a `ref` to the `EventSource`.
- Singleton-per-Intent: If two components need the same data (e.g., a "Current Message" status), they don't both open an SSE stream. They subscribe to a shared `Observable` or `State` that is fed by a single SSE connection managed by a `Provider`.
3. The "Shared Worker" (Cross-Tab Multiplexing)
One of the biggest issues with SSE is that if a user opens 10 browser tabs, they hit the 6-connection limit (on HTTP/1.1) or just waste resources (on HTTP/2).
- Today's Fix: Apps use a `SharedWorker`.
- How it works: Tab 1 opens the SSE connection inside a Worker. Tabs 2, 3, and 4 connect to the same Worker. The Worker acts as the "Hub," receiving the LLM stream once and broadcasting it to all tabs via the `postMessage` API.
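The hub's broadcast logic can be sketched independently of the real `SharedWorker` API by modeling ports as plain objects. This is a sketch of the fan-out idea only; a real worker would receive `MessagePort`s from `onconnect` events:

```ts
// Sketch of the SharedWorker "hub": ports are modeled as objects with a
// postMessage method, so the broadcast logic runs without a real worker.
type Port = { postMessage: (msg: string) => void };

export class TabHub {
  private ports: Port[] = [];

  connect(port: Port): void {
    this.ports.push(port); // a new tab attached to the worker
  }

  // One incoming LLM chunk is fanned out to every connected tab.
  broadcast(msg: string): void {
    for (const port of this.ports) port.postMessage(msg);
  }
}
```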
Comparison of Modern Strategies
| Approach | Where it's implemented | Solves... |
| --- | --- | --- |
| HTTP/2 | Server/Infra | Browser connection limits (the 6-limit). |
| Shared Worker | Client (JS) | Multiple tabs/windows sharing one stream. |
| BroadcastChannel API | Client (JS) | Syncing LLM state across different tabs. |
| WebSockets | Server + Client | True bidirectional "Chat Hub" (replaces SSE entirely). |
Summary for your LLM Hook
If your components are all in one tab, your "Factory Pattern" is fine IF you ensure:
- HTTP/2 is enabled on your server (this is the #1 priority).
- Your `useLLM` hook has a `useEffect` cleanup that calls `eventSource.close()` immediately when the component unmounts or the request finishes.
- You wrap your LLM logic in a Context Provider so that if three components need the same LLM response, only one connection is actually created.
Would you like a code example of a useLLM hook that safely cleans up connections to prevent "maxing out"?
I am trying to design a hook for a React app that will support Q&A, RAG queries, and agentic capabilities as well. What is the best approach to implement it, and why? The idea is to create the hook once and reuse it across the application.
To design a reusable hook for Q&A, RAG, and Agentic workflows, you need to move away from a simple "fetch and stream" model toward a State-Machine Driven Hook.
Agents are fundamentally different from standard chat because they are "loopy": they reason, they call tools (RAG), and they eventually answer. Your hook needs to reflect these varying states.
1. The Recommended Approach: The "Action-State" Hook
The best approach is to create a hook that returns a State Object and a Dispatcher. This allows your UI to react differently if the agent is "Thinking," "Searching" (RAG), or "Streaming" (Final Answer).
Why this approach?
- Separation of Concerns: One hook handles the SSE logic, while your components only care about the UI state.
- Reusability: The same hook can power a simple search bar, a sidebar chatbot, or an autonomous agent page.
- Observability: By exposing `status` (e.g., `searching_docs`), you provide immediate feedback to the user, which is critical for slow Agent/RAG processes.
2. Implementation Architecture
You should use a useReducer inside your hook to manage the complexity of streaming tokens and metadata simultaneously.
```ts
import { useReducer } from "react";

// Define the states of your Agentic/RAG lifecycle
type AgentStatus = 'idle' | 'thinking' | 'searching' | 'streaming' | 'error';

interface Message {
  role: 'user' | 'assistant';
  content: string;
  sources?: any[];   // For RAG citations
  toolCalls?: any[]; // For Agentic tool tracking
}

// llmReducer and initialState are defined elsewhere.
const useLLM = (config: { endpoint: string }) => {
  const [state, dispatch] = useReducer(llmReducer, initialState);

  const query = async (prompt: string) => {
    // 1. Abort the previous connection if it exists
    // 2. Open a new SSE connection
    // 3. Listen for specialized events:
    //    - 'source': Update RAG citations
    //    - 'thought': Update the agent's internal reasoning UI
    //    - 'token': Append to the assistant message
  };

  return { ...state, query };
};
```
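The sketch above leaves `llmReducer` and `initialState` undefined. A minimal version might look like the following; the action names (`SET_STATUS`, `APPEND_TOKEN`, `SET_SOURCES`) are assumptions for illustration, not a fixed spec:

```ts
// Minimal reducer sketch for the action-state hook. Action names are
// hypothetical; extend with thought/tool-call actions for agent mode.
type AgentStatus = "idle" | "thinking" | "searching" | "streaming" | "error";

interface LLMState {
  status: AgentStatus;
  content: string;
  sources: unknown[];
}

type LLMAction =
  | { type: "SET_STATUS"; status: AgentStatus }
  | { type: "APPEND_TOKEN"; token: string }
  | { type: "SET_SOURCES"; sources: unknown[] };

export const initialState: LLMState = { status: "idle", content: "", sources: [] };

export function llmReducer(state: LLMState, action: LLMAction): LLMState {
  switch (action.type) {
    case "SET_STATUS":
      return { ...state, status: action.status };
    case "APPEND_TOKEN":
      // Streaming tokens accumulate into the assistant content.
      return { ...state, status: "streaming", content: state.content + action.token };
    case "SET_SOURCES":
      // RAG citations arrive once, typically before the token stream.
      return { ...state, status: "searching", sources: action.sources };
  }
}
```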
3. Key Design Features for Reusability
| Feature | Why it's needed for Agents/RAG | Implementation Detail |
| --- | --- | --- |
| Event Multiplexing | Agents send more than just text (e.g., tool logs). | Use event: thought or event: sources in your SSE stream to distinguish data types. |
| Abort Controller | Users often change their mind mid-agent-run. | Tie an AbortController to the hook's cleanup so eventSource.close() is called automatically. |
| Metadata Accumulator | RAG needs to show citations alongside text. | Have a separate state field for sources that updates once at the start of the stream. |
| Token Batching | High-frequency updates can lag React. | Use a ref to buffer text and only update state every 50-100ms. |
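The Token Batching row can be sketched as a tiny buffer. In a real hook you would keep the batcher in a `ref` and call `flush()` from a 50-100 ms interval; all names here are hypothetical:

```ts
// Hypothetical token batcher: append() collects high-frequency tokens
// cheaply; flush() delivers the whole batch in one UI update.
export function createTokenBatcher(onFlush: (batch: string) => void) {
  let buffer = "";
  return {
    append(token: string): void {
      buffer += token; // cheap: no React re-render here
    },
    flush(): void {
      if (buffer.length === 0) return; // nothing buffered, skip the update
      onFlush(buffer); // one setState per flush interval
      buffer = "";
    },
  };
}
```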
4. Handling the Connection Bottleneck
To prevent the "maxing out" issue we discussed earlier while keeping the hook reusable:
- Singleton Provider: Wrap your app in an `LLMProvider`.
- Request Registry: The provider maintains a map of active `request_id`s.
- The Hook as a Subscriber: The `useLLM` hook doesn't open the connection; it sends a message to the Provider and "subscribes" to updates for its specific ID.
Summary: The "Agentic" Hook Lifecycle
A truly reusable hook for 2026 should follow this flow:
- Trigger: `ask("How much did we spend on cloud?")`
- RAG Stage: Server sends `event: sources`. Hook updates the `sources` array. UI shows "Searching documents...".
- Agent Stage: Server sends `event: thought`. Hook updates `thinkingLog`. UI shows "Analyzing billing data...".
- Final Response: Server sends `event: message`. Hook streams tokens into `content`.
Would you like a full TypeScript implementation of this useLLM reducer and SSE listener to use as a template?
For a deep dive into building these types of agentic interfaces, this video explains the "Reason and Act" pattern in modern apps: Mastering ReAct Agentic Workflows. It covers how agents loop through multiple LLM calls and tools, which is exactly the logic your hook will need to surface to the UI.
Okay, if we need one single hook for Q&A queries and RAG, and a separate one for the agentic capabilities, will that work?
It can work, but from an architectural standpoint, splitting them into two completely separate hooks might actually create more work for you down the line.
The reason is that RAG and Agentic workflows share 90% of the same "plumbing" (SSE handling, token streaming, error management, and connection cleanup).
Here is the breakdown of why you might choose one over the other:
Option A: Two Separate Hooks (useQuery and useAgent)
This is the "cleaner" approach if your UI for a Q&A box is completely different from your Agentic dashboard.
- Pros: Smaller bundle size per component; simpler state logic for basic Q&A.
- Cons: You end up duplicating the SSE logic twice. If you fix a bug in how you handle connection timeouts in `useQuery`, you have to remember to fix it in `useAgent` too.
- The Problem: Modern RAG is becoming "Agentic." For example, if a RAG query fails to find a document, a "smart" RAG system might decide to try a different search query. Suddenly, your "simple" RAG hook needs Agentic "thinking" states.
Option B: The "Unified" Hook (Recommended)
Instead of two hooks, use one robust hook that accepts a mode or type configuration. This is how professional LLM libraries (like Vercel AI SDK or LangChain) handle it.
Why the Unified Hook is better for 2026:
- Shared SSE Manager: You only write the "max connection" protection logic once.
- Polymorphic UI: Your UI can "upgrade" a request. A user starts with a simple Q&A, but the backend realizes it needs an Agent. If you use one hook, the UI just naturally transitions from `status: 'streaming'` to `status: 'thinking'`.
- Consistent State: Both RAG and Agents need to display Sources/Citations. By using one hook, your `SourceDisplay` component can work across the entire app.
How to Structure the Unified Hook
To keep it from becoming a "messy" file, use a Reducer Pattern. This keeps the logic organized even if the hook handles multiple types of requests.
| State Feature | Q&A / RAG | Agentic |
| --- | --- | --- |
| data | Final Answer String | Incremental Thoughts + Final Answer |
| metadata | Document Citations | Tool Names & Arguments |
| status | loading -> streaming | thinking -> acting -> streaming |
The Verdict
If your "Agent" is a completely different backend service with a different protocol (e.g., WebSockets instead of SSE), then split them.
However, if they both talk to your LLM gateway via SSE, use one reusable hook. You can make it easy to use by providing "wrapper" functions:
```ts
// The internal "Power Hook"
const { stream, status, metadata } = useBaseLLM();

// Optional: Exported wrappers for developer convenience
export const useRAG = () => useBaseLLM({ mode: 'rag' });
export const useAgent = () => useBaseLLM({ mode: 'agent' });
```
Next Step
Would you like me to show you the "Unified Reducer" logic? It demonstrates how to handle event: 'thought' (Agent) and event: 'source' (RAG) in a single clean switch statement.
Let's focus on the Q&A mode, designed so it can expand to RAG and agentic use, but with a strategy for handling the maxing-out of SSE connections. Why a Provider-and-subscription approach and not a Shared Worker? And please provide a plan.
To build a professional-grade Q&A system that seamlessly scales into RAG and Agentic workflows, you need a "Request Orchestrator" pattern.
Why Provider + Subscription over Shared Workers?
In a 2026 React ecosystem, the Provider/Subscription pattern is almost always superior to a Shared Worker for your specific needs:
- Complexity vs. Benefit: Shared Workers are great for syncing data across different tabs, but they are notoriously difficult to debug and don't support all mobile environments perfectly.
- State Integration: A React Provider integrates natively with your UI state (React context, hooks, and signals). Shared Workers require a complex `postMessage` bridge that makes handling RAG metadata (like citations) or Agent "thinking" steps much clunkier.
- The HTTP/2 Reality: Modern infra uses HTTP/2. The "max connection" problem is mostly solved at the protocol level. A Provider handles logical multiplexing (making sure two components don't fetch the same thing), while HTTP/2 handles network multiplexing.
The Implementation Plan
Phase 1: The Registry (The Singleton Hub)
Create an LLMProvider that acts as the single source of truth for all active streams. It maintains a "Registry" of active requests.
- Key Logic: When a component calls `ask()`, it generates a `uuid`. The Provider opens the SSE stream and maps incoming data to that `uuid`.
- Safety: The Provider tracks the count of active `EventSource` instances. If it hits a limit (e.g., 5), it queues the next request instead of failing.
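The safety rule above can be sketched as a small concurrency limiter; this is a sketch, with the limit value and all names chosen for illustration (the limit of 5 follows the example in the text):

```ts
// Sketch of the Phase 1 registry safety rule: at most `limit` streams
// run at once; requests beyond the limit wait in a FIFO queue.
export function createLimiter(limit: number) {
  let active = 0;
  const waiting: (() => void)[] = [];

  const release = () => {
    const next = waiting.shift();
    if (next) next(); // hand the slot directly to the next waiter
    else active--;
  };

  return async function run<T>(task: () => Promise<T>): Promise<T> {
    if (active >= limit) {
      // Queue instead of failing: resume when a slot is handed over.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      release();
    }
  };
}
```

The Provider would wrap each `EventSource` open in `run(...)`, e.g. `const run = createLimiter(5)`.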
Phase 2: The Unified Schema (Q&A → RAG → Agent)
Design your SSE events to be "Type-Aware." This allows one hook to handle all three modes by just listening for different event names.
| Event Name | Role | Data Payload |
| --- | --- | --- |
| token | Q&A | The actual text chunks. |
| source | RAG | Array of document IDs, URLs, and snippets. |
| call | Agent | Tool name and arguments (e.g., get_weather). |
| thought | Agent | Internal reasoning (e.g., "I should check the database first"). |
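The schema above maps naturally onto a TypeScript discriminated union. The event names come from the table; payload field names beyond it (`text`, `documents`, `name`, `args`) are assumptions for illustration:

```ts
// Phase 2 event schema as a discriminated union, plus a small decoder
// from a raw SSE (event name, data) pair into a typed LLMEvent.
type LLMEvent =
  | { type: "token"; text: string }
  | { type: "source"; documents: { id: string; url: string; snippet: string }[] }
  | { type: "call"; name: string; args: Record<string, unknown> }
  | { type: "thought"; text: string };

export function decodeEvent(name: string, data: string): LLMEvent {
  switch (name) {
    case "token":
      return { type: "token", text: data };
    case "source":
      return { type: "source", documents: JSON.parse(data) };
    case "call": {
      const parsed = JSON.parse(data) as { name: string; args: Record<string, unknown> };
      return { type: "call", name: parsed.name, args: parsed.args };
    }
    case "thought":
      return { type: "thought", text: data };
    default:
      throw new Error(`Unknown event: ${name}`);
  }
}
```

A reducer can then `switch` on `event.type` and get exhaustive type checking for all three modes from one hook.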
Phase 3: The "Smart" Hook (useLLM)
This hook is a "Subscriber." It doesn't manage the EventSource; it asks the Provider to do it.
```ts
const { messages, status, sources, thoughts, ask } = useLLM({
  mode: 'agentic', // or 'rag' or 'qa'
  stream: true
});
```
Step-by-Step Execution Plan
| Step | Task | Why? |
| --- | --- | --- |
| 1 | Setup SSE Provider | Create a Context that holds a Map<string, RequestState>. |
| 2 | Define Action Types | Create a useReducer to handle APPEND_TOKEN, SET_SOURCES, and SET_AGENT_STEP. |
| 3 | Implement the Registry | Add logic to the Provider to manage new EventSource() calls and cleanup onClose. |
| 4 | Standardize the Backend | Ensure your LLM server sends event: source and event: thought headers. |
| 5 | Build the Hook | Create useLLM which dispatches "start" to the Provider and returns the reactive state. |
Why this works for you:
- Scalable: You can start with simple Q&A. When you add RAG, you just start sending `event: source` from the backend; your hook is already wired to listen for it.
- Resource Efficient: By centralizing the connections in the Provider, you can implement a "Global Kill Switch" to close all SSE connections if the user logs out or the tab goes to the background.
Would you like me to generate the LLMProvider code that includes the request registry and connection limits?
