@onkernel/cua-ai
v0.3.2
Kernel-curated computer-use model access built on pi-ai
@onkernel/cua-ai
An extension of @earendil-works/pi-ai's unified LLM API that adds
computer-use-specific models, providers, and tool schemas for building CUA
agents on Kernel.
Installation
npm install @onkernel/cua-ai
Prerequisites
You need an API key for each provider you call. The helpers in this package check these environment variables, in order:
| Provider | Environment variables (checked in order) |
| ----------- | ------------------------------------------- |
| openai | OPENAI_API_KEY |
| anthropic | ANTHROPIC_OAUTH_TOKEN, ANTHROPIC_API_KEY |
| google | GOOGLE_API_KEY, GEMINI_API_KEY |
| tzafon | TZAFON_API_KEY |
| yutori | YUTORI_API_KEY |
The exported helpers wrap this table:
- cuaApiKeyEnvVarsForProvider(provider): the env var names for a provider (accepts "gemini" as an alias for "google").
- getCuaEnvApiKey(provider): read the key, or undefined when unset.
- requireCuaEnvApiKey(provider): read the key, or throw naming the variables to set.
- getCuaEnvApiKeyForModel(refOrModel) / requireCuaEnvApiKeyForModel(refOrModel): the same, keyed by a model ref like "openai:gpt-5.5" or a concrete Model<Api>.
Pass the resolved key as the apiKey stream option (as in the Quick Start
below) so a missing key fails loudly before any request is made. If you omit
apiKey, pi-ai's built-in providers fall back to their own env lookup
(OPENAI_API_KEY; ANTHROPIC_OAUTH_TOKEN/ANTHROPIC_API_KEY; for google
only GEMINI_API_KEY, not GOOGLE_API_KEY), and this package's Tzafon/Yutori
stream adapters read TZAFON_API_KEY/YUTORI_API_KEY.
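The lookup order in the table can be sketched as a small standalone function. This is an illustrative re-implementation, not the package's source: ENV_VARS, firstSetKey, and requireKey are hypothetical names, the environment is passed in as a plain object, and treating an empty string as unset is an assumption.

```typescript
// Illustrative sketch only: mirrors the table above, not the package's code.
type Provider = "openai" | "anthropic" | "google" | "tzafon" | "yutori";

const ENV_VARS: Record<Provider, readonly string[]> = {
  openai: ["OPENAI_API_KEY"],
  anthropic: ["ANTHROPIC_OAUTH_TOKEN", "ANTHROPIC_API_KEY"],
  google: ["GOOGLE_API_KEY", "GEMINI_API_KEY"],
  tzafon: ["TZAFON_API_KEY"],
  yutori: ["YUTORI_API_KEY"],
};

type Env = Record<string, string | undefined>;

// Return the first variable that is set, checking in table order.
// Assumption: an empty string counts as unset.
function firstSetKey(provider: Provider, env: Env): string | undefined {
  for (const name of ENV_VARS[provider]) {
    const value = env[name];
    if (value) return value;
  }
  return undefined;
}

// Throwing variant: the error names every variable the caller could set.
function requireKey(provider: Provider, env: Env): string {
  const key = firstSetKey(provider, env);
  if (key === undefined) {
    const names = ENV_VARS[provider].join(", ");
    throw new Error(`Missing API key for ${provider}; set one of: ${names}`);
  }
  return key;
}
```

In a Node process you would pass process.env as the env argument. When both Anthropic variables are set, the sketch returns the OAuth token because it is listed first.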
Quick Start
import { readFile } from "node:fs/promises";
import { complete, getCuaModel, openai, requireCuaEnvApiKeyForModel } from "@onkernel/cua-ai";
const model = getCuaModel("openai:gpt-5.5");
const apiKey = requireCuaEnvApiKeyForModel("openai:gpt-5.5"); // throws unless OPENAI_API_KEY is set
// Any screenshot of the page you want to act on, resolved relative to this
// module so the snippet does not depend on the process working directory.
const screenshot = await readFile(new URL("./screenshot.png", import.meta.url));
const response = await complete(
  model,
  {
    systemPrompt: "You are a browser automation agent.",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Click the sign in / up link in this screenshot." },
          { type: "image", data: screenshot.toString("base64"), mimeType: "image/png" },
        ],
        timestamp: Date.now(),
      },
    ],
    tools: openai.computerTools({ actions: ["click"] }),
  },
  { apiKey },
);
if (response.stopReason === "error" || response.stopReason === "aborted") {
  throw new Error(response.errorMessage ?? `request ended with stopReason "${response.stopReason}"`);
}
for (const block of response.content) {
  if (block.type === "toolCall" && block.name === "click") {
    console.log("click:", block.arguments);
  }
}
A runnable version ships at examples/quickstart.ts
(with a sample screenshot). In this repo, run it from packages/ai with
npm run example:quickstart; switch providers with the CUA_MODEL env var,
e.g. CUA_MODEL=anthropic:claude-opus-4-7.
Error Handling
pi-ai's complete() and stream() resolve instead of throwing when a
request fails. The returned AssistantMessage carries the outcome on
stopReason:
- "stop", "length", "toolUse": success; content holds the response.
- "error": the provider call failed (bad API key, no model access, network error, …). content is empty and errorMessage holds the provider error.
- "aborted": the request was cancelled via the signal stream option.
Always check stopReason before reading content — otherwise a typo'd API
key looks like a successful run that produced nothing:
if (response.stopReason === "error" || response.stopReason === "aborted") {
  throw new Error(response.errorMessage ?? `request ended with stopReason "${response.stopReason}"`);
}
getCuaModel(), requireCuaEnvApiKey*(), and computerTools({ actions })
validate eagerly and throw regular errors.
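If you make several calls, the check can be wrapped once. The following is a minimal sketch, assuming only the two fields described above; OutcomeLike and assertCompleted are hypothetical names and not part of pi-ai or this package.

```typescript
// Minimal local shape for illustration; the real AssistantMessage has more fields.
type StopReason = "stop" | "length" | "toolUse" | "error" | "aborted";

interface OutcomeLike {
  stopReason: StopReason;
  errorMessage?: string;
}

// Throw on "error" or "aborted" so callers never read empty content by mistake.
// Returns the same object so the call can be chained.
function assertCompleted<T extends OutcomeLike>(response: T): T {
  if (response.stopReason === "error" || response.stopReason === "aborted") {
    throw new Error(
      response.errorMessage ?? `request ended with stopReason "${response.stopReason}"`,
    );
  }
  return response;
}
```

Because the helper is generic over the response type, a call such as assertCompleted(await complete(...)) should keep the full message type for the code that follows.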
Continuing the Loop
@onkernel/cua-agent
runs this loop for you — CuaAgent/CuaAgentHarness classes with browser
execution against a Kernel browser. Reach for it first; the rest of this
section is for driving the loop yourself against your own browser stack.
A computer-use session is a loop: the model calls a tool, you execute it
against a real browser, and you send the result (with a fresh screenshot) back
so the model can plan the next step. Tool results are pi-ai
ToolResultMessages:
type ToolResultMessage = {
  role: "toolResult";
  toolCallId: string; // ToolCall.id from the assistant message
  toolName: string; // ToolCall.name
  content: (TextContent | ImageContent)[];
  details?: unknown; // optional executor metadata, not sent to the model
  isError: boolean;
  timestamp: number;
};
A minimal two-turn loop:
import { complete, getCuaModel, openai, requireCuaEnvApiKeyForModel, type Message } from "@onkernel/cua-ai";
const model = getCuaModel("openai:gpt-5.5");
const apiKey = requireCuaEnvApiKeyForModel("openai:gpt-5.5");
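const tools = openai.computerTools({ actions: ["click", "type", "screenshot"] });
// Placeholders you supply: screenshotBase64 is the current page screenshot
// (base64 PNG); runInYourBrowser executes one tool call against your browser
// stack and resolves to a fresh base64 screenshot.
declare const screenshotBase64: string;
declare function runInYourBrowser(name: string, args: unknown): Promise<string>;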
const messages: Message[] = [
  {
    role: "user",
    content: [
      { type: "text", text: "Click the sign in / up link in this screenshot." },
      { type: "image", data: screenshotBase64, mimeType: "image/png" },
    ],
    timestamp: Date.now(),
  },
];
// Turn 1: the model responds with tool calls.
const first = await complete(model, { messages, tools }, { apiKey });
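if (first.stopReason === "error" || first.stopReason === "aborted") {
  // errorMessage may be unset (e.g. on abort), so fall back to the stop reason.
  throw new Error(first.errorMessage ?? `request ended with stopReason "${first.stopReason}"`);
}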
messages.push(first); // the AssistantMessage joins the transcript as-is
// Execute each tool call against your browser stack, then append a
// toolResult message carrying a fresh screenshot.
for (const block of first.content) {
  if (block.type !== "toolCall") continue;
  const freshScreenshotBase64 = await runInYourBrowser(block.name, block.arguments);
  messages.push({
    role: "toolResult",
    toolCallId: block.id,
    toolName: block.name,
    content: [
      { type: "text", text: "done" },
      { type: "image", data: freshScreenshotBase64, mimeType: "image/png" },
    ],
    isError: false,
    timestamp: Date.now(),
  });
}
// Turn 2: the model sees the results and plans the next action.
const second = await complete(model, { messages, tools }, { apiKey });
Core Concepts
@onkernel/cua-ai re-exports the full surface of
@earendil-works/pi-ai
(export * from "@earendil-works/pi-ai"), including the core primitives:
Model, Context, Message, Tool, complete, stream, completeSimple,
streamSimple, Type, Static, TSchema, and the event/validation helpers.
Some familiarity with pi-ai is assumed; Kernel adds the computer-use model
catalog and provider/tool metadata.
Model Refs
getCuaModel() accepts only provider-qualified model refs of the form
<provider>:<model-id>:
getCuaModel("openai:gpt-5.5");
getCuaModel("anthropic:claude-opus-4-7");
getCuaModel("google:gemini-3-flash-preview");
getCuaModel("tzafon:tzafon.northstar-cua-fast");
getCuaModel("yutori:n1.5-latest");
getCuaModel(ref) returns a pi-ai Model<Api> you can pass to complete()
or stream(). It throws when the ref names a model without a CUA-support
annotation.
See docs/supported-models.md for the current
list of CUA-supporting models per provider.
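The <provider>:<model-id> shape can be illustrated with a small standalone parser. This is a sketch under stated assumptions, not the package's parseCuaModelRef: the function name parseRef, the error messages, and the rule of splitting at the first colon are all hypothetical. The "gemini" alias for "google" comes from the helper descriptions in this README.

```typescript
// Illustrative sketch only; not the package's parseCuaModelRef.
type Provider = "openai" | "anthropic" | "google" | "tzafon" | "yutori";

const PROVIDERS: readonly string[] = ["openai", "anthropic", "google", "tzafon", "yutori"];

function parseRef(ref: string): { provider: Provider; model: string } {
  // Assumption: split at the FIRST colon, so model ids may contain dots or
  // further punctuation (e.g. "tzafon.northstar-cua-fast").
  const colon = ref.indexOf(":");
  if (colon <= 0 || colon === ref.length - 1) {
    throw new Error(`expected <provider>:<model-id>, got "${ref}"`);
  }
  const rawProvider = ref.slice(0, colon);
  // "gemini" is accepted as an alias for the "google" provider id.
  const provider = rawProvider === "gemini" ? "google" : rawProvider;
  if (!PROVIDERS.includes(provider)) {
    throw new Error(`unknown provider "${rawProvider}"`);
  }
  return { provider: provider as Provider, model: ref.slice(colon + 1) };
}
```

Note that the sketch only checks the shape of the ref. Whether a model actually carries a CUA-support annotation is a separate lookup that getCuaModel() performs.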
CuaProvider
CuaProvider is the string union of provider IDs this package targets:
type CuaProvider = "openai" | "anthropic" | "google" | "tzafon" | "yutori";
The IDs match pi-ai's Model.provider values exactly. providerForModel(model)
narrows a pi-ai Model<Api> to a CuaProvider.
Listing Models
listCuaModels(provider?) returns every CUA-supporting model, optionally
filtered to one provider:
interface CuaModelInfo {
  ref: CuaModelRef;
  provider: CuaProvider;
  model: string;
  name: string;
}
Exports
Everything below is importable from the package root. pi-ai's full surface is re-exported alongside (see Core Concepts).
Models and refs
- getCuaModel(ref: CuaModelRef): Model<Api>
- listCuaModels(provider?: CuaProvider): CuaModelInfo[]
- parseCuaModelRef(ref: string): { provider: CuaProvider; model: string } (accepts the "gemini:" alias)
- formatCuaModelRef(provider, model): CuaModelRef
- providerForModel(model: Model<Api>): CuaProvider
- isCuaProvider(value: string): value is CuaProvider
- findCuaAnnotation(provider, modelId): CuaModelAnnotation | undefined
- CUA_PROVIDERS: readonly CuaProvider[]
- CUA_MODEL_ANNOTATIONS: Record<CuaProvider, readonly CuaModelAnnotation[]> (the source-cited support table)
- Types: CuaProvider, CuaModelRef, CuaModelInfo, CuaModelAnnotation, CuaModelMatch
API keys
- cuaApiKeyEnvVarsForProvider(provider): readonly string[]
- getCuaEnvApiKey(provider): string | undefined
- requireCuaEnvApiKey(provider): string
- getCuaEnvApiKeyForModel(refOrModel): string | undefined
- requireCuaEnvApiKeyForModel(refOrModel): string
Runtime specs
- resolveCuaRuntimeSpec(input: CuaModelRef | Model<Api>, options?: ComputerToolsOptions): CuaRuntimeSpec
- Types: CuaRuntimeSpec, CuaRuntimeSpecInput, CuaProviderModule, CuaScreenshotSpec, CuaScreenshotTransformSpec, CuaPayloadHook, CuaPayloadContext
resolveCuaRuntimeSpec() centralizes provider-specific defaults for
runtime consumers:
- canonical provider id
- provider-facing CUA tool definitions used in model requests
- local execution adapters used by CuaAgent/CuaAgentHarness
- default system prompt text
- provider coordinate convention
- optional provider screenshot input policy
- optional provider payload middleware (for protocol quirks)
Pass options (e.g. { actions: ["click"] }) to narrow the resolved tool
definitions and executors; it is forwarded to the provider module's
toolDefinitions()/toolExecutors(), so providers with a restricted subset
(Anthropic, Yutori) throw on unsupported actions.
Canonical actions and tools
- CUA_ACTION_TYPES: readonly CuaActionType[] (the 16 canonical action names)
- computerTools(options?: ComputerToolsOptions): Tool[] / createCuaActionToolDefinitions(actions?): one Tool per canonical action (the full canonical superset; provider namespaces apply provider defaults and validation on top)
- computerToolExecutors(options?) / createCuaActionToolExecutors(actions?): matching CuaToolExecutorSpec[] execution adapters
- createCuaActionSchema(actions?), CuaActionSchema: TypeBox union schema
- createCuaBatchSchema(actions?), CuaBatchSchema, createCuaBatchToolDefinition(actions?, options?), createCuaBatchToolExecutor(actions?, options?), CUA_BATCH_TOOL_NAME ("computer_batch"), CUA_BATCH_TOOL_DESCRIPTION
- createCuaNavigationToolDefinition(), CuaNavigationSchema, CUA_NAVIGATION_TOOL_NAME ("computer_use_extra"), CUA_NAVIGATION_TOOL_DESCRIPTION
- createCuaPlaywrightToolDefinition(), CuaPlaywrightSchema, CUA_PLAYWRIGHT_TOOL_NAME ("playwright_execute"), CUA_PLAYWRIGHT_TOOL_DESCRIPTION
- canonicalToolCallName(action), canonicalToolCallArguments(action): map a normalized CuaAction back to its tool-call name/arguments
- normalizeGotoUrl(value): prefix bare hostnames with https://
- Types: CuaAction (plus the 16 per-action interfaces), CuaActionType, CuaMouseButton, CuaDragMouseButton, CuaBatchInput, CuaNavigationInput, CuaPlaywrightInput, CuaToolExecutorSpec, ComputerToolsOptions, ComputerToolCoordinateSystem
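To make the normalizeGotoUrl(value) description concrete, here is a rough sketch of what prefixing a bare hostname could look like. prefixBareHostname is a hypothetical stand-in, and the scheme-detection rule is an assumption; the real helper may handle more cases.

```typescript
// Hypothetical stand-in for normalizeGotoUrl; illustrative only.
function prefixBareHostname(value: string): string {
  const trimmed = value.trim();
  // Assumption: anything that already starts with "<scheme>://" is left alone.
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed)) return trimmed;
  // Otherwise treat the input as a bare hostname (optionally with port/path).
  // Limitation: scheme-only URLs without "//" (such as "about:blank") would
  // also be prefixed by this sketch.
  return `https://${trimmed}`;
}
```

This is the kind of normalization a goto action benefits from, since models often emit "example.com" rather than a full URL.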
Provider registration
registerCuaProviders(): void— re-register the Yutori/Tzafon stream providers with pi-ai's global registry (runs automatically on import; idempotent; call it after any pi-ai registry mutator)
Provider Tools
Provider namespaces expose computerTools({ actions? }) for
building the provider's default CUA Tool[] definitions. These are the tools
sent to the model when you call complete() or stream() directly. The
default set can differ by provider: Anthropic includes its computer_batch
tool from the computer-use best-practices reference, while providers such as
OpenAI currently expose individual canonical browser actions. Omit actions
for the provider's default computer tool set, or pass an action subset to narrow
the schema for a single complete() call:
import { openai } from "@onkernel/cua-ai";
const allComputerTools = openai.computerTools();
const clickOnlyTools = openai.computerTools({ actions: ["click"] });
When actions is provided, it must be a subset of that provider's supported
canonical action set; unsupported actions throw (e.g.
anthropic.computerTools({ actions: ["back"] }) throws
unsupported Anthropic canonical action(s): back).
Per-provider canonical action subsets (each namespace exports its list as
<PROVIDER>_CUA_ACTION_TYPES):
| Namespace | Canonical actions |
| ----------- | ---------------------------------------------------------------------------------- |
| openai | all 16 |
| anthropic | 13 — everything except back, forward, url; adds computer_batch by default |
| gemini | all 16 |
| tzafon | all 16 (replaced on the wire by Tzafon's native computer_use tool) |
| yutori | 13 — everything except screenshot, url, cursor_position (local mirrors only) |
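The subset rule can be sketched as a plain validation function. This is an illustration, not the package's implementation: assertSupported is a hypothetical name, the supported list below is deliberately abbreviated, and joining several unsupported names with ", " is an assumption (the README only shows the single-action message).

```typescript
// Illustrative sketch of eager action-subset validation.
function assertSupported(
  providerLabel: string,
  supported: readonly string[],
  requested: readonly string[],
): readonly string[] {
  const unsupported = requested.filter((action) => !supported.includes(action));
  if (unsupported.length > 0) {
    // Message format follows the Anthropic example quoted above.
    throw new Error(
      `unsupported ${providerLabel} canonical action(s): ${unsupported.join(", ")}`,
    );
  }
  return requested;
}

// Abbreviated stand-in for a provider's supported list (the real list has 13 entries).
const SUPPORTED_SAMPLE = ["click", "type", "screenshot"] as const;
```

Calling assertSupported("Anthropic", SUPPORTED_SAMPLE, ["back"]) throws before any request is built, which matches the eager validation described in Error Handling.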
Runtime specs also include toolExecutors: provider-owned adapters that use
the same tool-call names as the model-facing tools and translate their
arguments into canonical CUA actions for @onkernel/cua-agent. For most
providers, toolDefinitions and toolExecutors line up one-for-one. Some
providers are different on the wire: Yutori exposes browser actions through its
documented tool_set request field, so its runtime spec has no model-facing
toolDefinitions (yutori.providerModule.toolDefinitions() is []) but
still provides local toolExecutors for the canonical actions emitted after
Yutori's native tool calls are normalized. yutori.computerTools() builds
local mirrors of those canonical tools — they are never sent to the API
(streamYutori strips them from the outbound payload) and exist so the
normalized tool calls have matching local definitions/executors. Caller-provided
tools that should remain on the provider payload can be preserved by payload
middleware via CuaPayloadContext.keepToolNames.
Provider namespaces also expose coordinateSystem(), which returns the
coordinates the provider's computer tool calls are expected to emit:
openai.coordinateSystem()
// { type: "pixel" }
gemini.coordinateSystem()
// { type: "normalized", range: [0, 999] }
Current coordinate contracts:
- openai: pixel coordinates
- anthropic: pixel coordinates, matching Anthropic's computer-use quickstart
- gemini: normalized coordinates in the 0-999 range (source)
- yutori: normalized coordinates in the 0-1000 range (source, SDK helper)
- tzafon: normalized coordinates in the 0-999 range (source, model card)
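When you execute actions yourself, normalized coordinates have to be mapped onto your viewport. The sketch below assumes the two coordinateSystem() shapes shown above; toPixels, Point, and Viewport are hypothetical names, and scaling the top of the range to the last pixel (size - 1) is one possible convention rather than the package's documented behavior.

```typescript
// Illustrative sketch: convert a model-emitted point to viewport pixels.
type CoordinateSystem =
  | { type: "pixel" }
  | { type: "normalized"; range: [number, number] };

interface Point { x: number; y: number }
interface Viewport { width: number; height: number }

function toPixels(point: Point, system: CoordinateSystem, viewport: Viewport): Point {
  // Pixel systems need no conversion.
  if (system.type === "pixel") return { x: point.x, y: point.y };
  const [lo, hi] = system.range;
  // Assumption: the range endpoints map to the first and last pixel.
  const scale = (value: number, size: number): number =>
    Math.round(((value - lo) / (hi - lo)) * (size - 1));
  return { x: scale(point.x, viewport.width), y: scale(point.y, viewport.height) };
}
```

With a 0-999 range and a 1280x800 viewport, the point (999, 999) lands on pixel (1279, 799) under this convention. An executor that scales by the full width instead would differ by at most one pixel.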
The action vocabulary is intentionally provider-neutral and OpenAI-shaped because it maps cleanly to most browser computer-use APIs:
type CuaAction =
  | CuaActionClick
  | CuaActionDoubleClick
  | CuaActionMouseDown
  | CuaActionMouseUp
  | CuaActionTypeText
  | CuaActionKeypress
  | CuaActionScroll
  | CuaActionMove
  | CuaActionDrag
  | CuaActionWait
  | CuaActionScreenshot
  | CuaActionGoto
  | CuaActionBack
  | CuaActionForward
  | CuaActionUrl
  | CuaActionCursorPosition;
For example:
type CuaActionClick = {
  type: "click";
  x: number;
  y: number;
  button?: CuaMouseButton; // "left" | "right" | "middle" | "back" | "forward"
  hold_keys?: string[];
};
type CuaActionGoto = {
  type: "goto";
  url: string;
};
Mouse buttons are closed unions: CuaMouseButton for click/mouse_down/mouse_up
and CuaDragMouseButton ("left" | "right" | "middle") for drag. Executors
coerce anything outside the set to "left". keys stays string[]; the agent-side
key-alias table passes unrecognized keys through.
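The coercion rule for mouse buttons is small enough to sketch directly. coerceButton and MOUSE_BUTTONS are hypothetical names used for illustration; only the set of button values and the fallback to "left" come from the text above.

```typescript
// Illustrative sketch of coercing an arbitrary value into the closed union.
const MOUSE_BUTTONS = ["left", "right", "middle", "back", "forward"] as const;
type MouseButton = (typeof MOUSE_BUTTONS)[number];

function coerceButton(value: unknown): MouseButton {
  // Anything outside the closed set (including undefined) falls back to "left".
  return (MOUSE_BUTTONS as readonly unknown[]).includes(value)
    ? (value as MouseButton)
    : "left";
}
```

A drag executor would apply the same idea with the narrower "left" | "right" | "middle" set.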
createCuaBatchToolDefinition(actions?, options?) builds a batch tool schema
whose input is:
type CuaBatchInput = {
  actions: CuaAction[];
};
Providers can include a batch tool when their model is expected to use one.
Anthropic does this by default with computer_batch (also exported as
anthropic.ANTHROPIC_BATCH_TOOL_NAME, equal to the top-level
CUA_BATCH_TOOL_NAME); Yutori does not.
createCuaBatchToolExecutor() is the matching execution adapter for turning
that provider-defined batch input into canonical CUA actions.
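If you drive the loop yourself, a batch call has to be unpacked into individual actions before execution. The following is a rough sketch, not createCuaBatchToolExecutor: actionsFromToolCall and the ActionLike type are hypothetical, and treating a non-batch tool name as the action type is an assumption based on the one-tool-per-action layout described earlier.

```typescript
// Illustrative sketch: turn one tool call into an ordered list of actions.
interface ActionLike {
  type: string;
  [key: string]: unknown;
}

const BATCH_TOOL_NAME = "computer_batch";

function actionsFromToolCall(name: string, args: Record<string, unknown>): ActionLike[] {
  if (name !== BATCH_TOOL_NAME) {
    // Assumption: for single-action tools, the tool name is the action type.
    return [{ ...args, type: name }];
  }
  const actions = args["actions"];
  if (!Array.isArray(actions)) {
    throw new Error(`${BATCH_TOOL_NAME} expects an "actions" array`);
  }
  // Batch actions are executed in the order the model listed them.
  return actions as ActionLike[];
}
```

A caller would then run each returned action in order and send a single toolResult for the original tool call, as in the loop shown under Continuing the Loop.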
createCuaNavigationToolDefinition() can synthesize a computer_use_extra
navigation tool whose input is:
type CuaNavigationInput = {
  action: "goto" | "back" | "forward" | "url";
  url?: string;
};
Provider Namespaces
Every provider namespace (openai, anthropic, gemini, tzafon,
yutori) follows one convention:
- computerTools(options?) and computerToolExecutors(options?)
- createActionSchema(actions?): TypeBox schema for the provider's subset
- coordinateSystem()
- build<Provider>SystemPrompt({ suffix? }) and <PROVIDER>_COMPUTER_INSTRUCTIONS (the prompt text)
- <PROVIDER>_CUA_ACTION_TYPES: the supported canonical action subset
- <Provider>Action type: the canonical action union for that subset
- ComputerToolsOptions type (Anthropic's adds excludeBatch, also exported as AnthropicComputerToolsOptions)
- providerModule: the uniform CuaProviderModule object that resolveCuaRuntimeSpec looks up
Provider-specific extras:
- openai: openaiResponsesStoreOnPayload payload hook, plus the computer_use_extra navigation aliases OPENAI_EXTRA_TOOL_NAME, OPENAI_EXTRA_TOOL_DESCRIPTION, OpenAIExtraSchema, OpenAIExtraInput
- anthropic: ANTHROPIC_BATCH_TOOL_NAME ("computer_batch")
- tzafon: the tzafon-responses stream adapter (TZAFON_RESPONSES_API, streamTzafonResponses, streamSimpleTzafonResponses, TzafonResponsesOptions with keepToolNames), tzafonComputerUseOnPayload payload middleware, tzafonToolCallId, and the native-to-canonical normalizer toCanonicalActions (+ TzafonCanonicalAction)
- yutori: the yutori-chat-completions stream adapter (YUTORI_CHAT_COMPLETIONS_API, streamYutori, streamSimpleYutori, YutoriOptions with keepToolNames), yutoriNativeToolSetOnPayload payload middleware, the native Navigator action sets (YUTORI_N1_ACTION_TYPES, YUTORI_N15_CORE_ACTION_TYPES, YUTORI_N15_EXPANDED_ACTION_TYPES, YUTORI_N15_ACTION_TYPES, the YUTORI_N15_CORE_TOOL_SET / YUTORI_N15_EXPANDED_TOOL_SET tool-set ids, and the matching Yutori*ActionType types), yutoriToolSetForModel, yutoriNativeActionsForModel, and the native-to-canonical normalizer toCanonicalActions
This package does not execute browser actions. Use @onkernel/cua-agent when
you want model tool calls executed against a Kernel browser.
