

@ui-tars/sdk Guide (Experimental)


Overview

@ui-tars/sdk is a powerful cross-platform toolkit (any device/platform) for building GUI automation agents.

It provides a flexible framework for creating agents that interact with graphical user interfaces through various operators, and it runs in both Node.js and the browser.

classDiagram
    class GUIAgent~T extends Operator~ {
        +model: UITarsModel
        +operator: T
        +signal: AbortSignal
        +onData
        +run()
    }

    class UITarsModel {
        +invoke()
    }

    class Operator {
        <<interface>>
        +screenshot()
        +execute()
    }

    class NutJSOperator {
        +screenshot()
        +execute()
    }

    class WebOperator {
        +screenshot()
        +execute()
    }

    class MobileOperator {
        +screenshot()
        +execute()
    }

    GUIAgent --> UITarsModel
    GUIAgent ..> Operator
    Operator <|.. NutJSOperator
    Operator <|.. WebOperator
    Operator <|.. MobileOperator

Try it out

npx @ui-tars/cli start

Enter your UI-TARS model service config (baseURL, apiKey, model), and you can then control your computer from the CLI.

Need to install the following packages:
Ok to proceed? (y) y

│
◆  Input your instruction
│  _ Open Chrome
└

Agent Execution Process

sequenceDiagram
    participant user as User
    participant guiAgent as GUI Agent
    participant model as UI-TARS Model
    participant operator as Operator

    user -->> guiAgent: "`instruction` + <br /> `Operator.MANUAL.ACTION_SPACES`"

    activate user
    activate guiAgent

    loop while status === StatusEnum.RUNNING
        guiAgent ->> operator: screenshot()
        activate operator
        operator -->> guiAgent: base64, Physical screen size
        deactivate operator

        guiAgent ->> model: instruction + actionSpaces + screenshots.slice(-5)
        model -->> guiAgent: `prediction`: click(start_box='(27,496)')
        guiAgent -->> user: prediction, next action

        guiAgent ->> operator: execute(prediction)
        activate operator
        operator -->> guiAgent: success
        deactivate operator
    end

    deactivate guiAgent
    deactivate user

Basic Usage

Basic usage is built on the @ui-tars/sdk package; here's a minimal example:

Note: This example uses nut-js (a cross-platform computer-control tool) as the operator; you can also use or customize other operators. The NutJS operator supports common desktop automation actions:

  • Mouse actions: click, double click, right click, drag, hover
  • Keyboard input: typing, hotkeys
  • Scrolling
  • Screenshot capture
import { GUIAgent } from '@ui-tars/sdk';
import { NutJSOperator } from '@ui-tars/operator-nut-js';

const guiAgent = new GUIAgent({
  model: {
    baseURL: config.baseURL,
    apiKey: config.apiKey,
    model: config.model,
  },
  operator: new NutJSOperator(),
  onData: ({ data }) => {
    console.log(data)
  },
  onError: ({ data, error }) => {
    console.error(error, data);
  },
});

await guiAgent.run('send "hello world" to x.com');

Handling Abort Signals

You can abort the agent by passing an AbortSignal via the GUIAgent signal option.

const abortController = new AbortController();

const guiAgent = new GUIAgent({
  // ... other config
  signal: abortController.signal,
});

// ctrl/cmd + c to cancel operation
process.on('SIGINT', () => {
  abortController.abort();
});

Configuration Options

The GUIAgent constructor accepts the following configuration options:

  • model: Model configuration (OpenAI-compatible API) or a custom model instance
    • baseURL: API endpoint URL
    • apiKey: API authentication key
    • model: Model name to use
    • for more options, see the OpenAI API documentation
  • operator: Instance of an operator class that implements the required interface
  • signal: AbortSignal for canceling operations
  • onData: Callback for receiving agent data/status updates
    • data.conversations is an array of message objects. IMPORTANT: it is a delta (only the new messages), not the whole conversation history. Each object contains:
      • from: The role of the message, one of:
        • human: Human message
        • gpt: Agent response
        • screenshotBase64: Screenshot (base64-encoded)
      • value: The content of the message
    • data.status is the current status of the agent, one of:
      • StatusEnum.INIT: Initial state
      • StatusEnum.RUNNING: Agent is actively executing
      • StatusEnum.END: Operation completed
      • StatusEnum.MAX_LOOP: Maximum loop count reached
  • onError: Callback for error handling
  • systemPrompt: Optional custom system prompt
  • maxLoopCount: Maximum number of interaction loops (default: 25)
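Because onData delivers conversation deltas, a caller that wants the full history has to accumulate them. A minimal sketch of that pattern — the ConversationItem shape and payload here are assumptions based on the fields described above, not the SDK's exact types:

```typescript
// Sketch of accumulating onData deltas into a full history.
// Field names follow the description above; treat them as assumptions.
interface ConversationItem {
  from: 'human' | 'gpt' | 'screenshotBase64';
  value: string;
}

const history: ConversationItem[] = [];

// Hypothetical shape of the onData payload described above.
function onData({ data }: { data: { conversations: ConversationItem[]; status: string } }) {
  // data.conversations is a delta: append, don't replace.
  history.push(...data.conversations);
  if (data.status === 'END') {
    console.log(`finished with ${history.length} messages`);
  }
}

onData({ data: { conversations: [{ from: 'human', value: 'Open Chrome' }], status: 'RUNNING' } });
onData({ data: { conversations: [{ from: 'gpt', value: 'clicking the Chrome icon' }], status: 'END' } });
```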

Status flow

stateDiagram-v2
    [*] --> INIT
    INIT --> RUNNING
    RUNNING --> RUNNING: Execute Actions
    RUNNING --> END: Task Complete
    RUNNING --> MAX_LOOP: Loop Limit Reached
    END --> [*]
    MAX_LOOP --> [*]

Advanced Usage

Operator Interface

When implementing a custom operator, you need to implement two core methods: screenshot() and execute().

Initialize

Run npm init to create a new operator package; a typical configuration looks like this:

{
  "name": "your-operator-tool",
  "version": "1.0.0",
  "main": "./dist/index.js",
  "module": "./dist/index.mjs",
  "types": "./dist/index.d.ts",
  "scripts": {
    "dev": "rslib build --watch",
    "prepare": "npm run build",
    "build": "rslib build",
    "test": "vitest"
  },
  "files": [
    "dist"
  ],
  "publishConfig": {
    "access": "public",
    "registry": "https://registry.npmjs.org"
  },
  "dependencies": {
    "jimp": "^1.6.0"
  },
  "peerDependencies": {
    "@ui-tars/sdk": "^1.2.0-beta.17"
  },
  "devDependencies": {
    "@ui-tars/sdk": "^1.2.0-beta.17",
    "@rslib/core": "^0.5.4",
    "typescript": "^5.7.2",
    "vitest": "^3.0.2"
  }
}

screenshot()

This method captures the current screen state and returns a ScreenshotOutput:

interface ScreenshotOutput {
  // Base64 encoded image string
  base64: string;
  // Device pixel ratio (DPR)
  scaleFactor: number;
}
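As a sketch of producing this shape in Node.js — assuming you already have the raw screenshot bytes as a Buffer; the capture call itself is platform-specific and omitted:

```typescript
// Hypothetical helper: wrap raw image bytes into a ScreenshotOutput.
// The actual screen capture step is platform-specific and not shown.
interface ScreenshotOutput {
  base64: string;      // Base64-encoded image string
  scaleFactor: number; // Device pixel ratio (DPR)
}

function toScreenshotOutput(imageBytes: Buffer, devicePixelRatio = 1): ScreenshotOutput {
  return {
    base64: imageBytes.toString('base64'),
    scaleFactor: devicePixelRatio,
  };
}

// Example: a tiny fake payload stands in for real screenshot bytes.
const out = toScreenshotOutput(Buffer.from('fake-png-bytes'), 2);
```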

execute()

This method performs actions based on model predictions. It receives an ExecuteParams object:

interface ExecuteParams {
  /** Raw prediction string from the model */
  prediction: string;
  /** Parsed prediction object */
  parsedPrediction: {
    action_type: string;
    action_inputs: Record<string, any>;
    reflection: string | null;
    thought: string;
  };
  /** Device Physical Resolution */
  screenWidth: number;
  /** Device Physical Resolution */
  screenHeight: number;
  /** Device DPR */
  scaleFactor: number;
  /** model coordinates scaling factor [widthFactor, heightFactor] */
  factors: Factors;
}
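One common job of execute() is projecting model coordinates back onto physical pixels. A hedged sketch, assuming the model emits coordinates in a space bounded by factors (e.g. [1000, 1000]) that must be scaled to the device's physical resolution — verify your operator's conventions before relying on this:

```typescript
// Assumption: model coordinates live in a [0, widthFactor] x [0, heightFactor]
// space and are projected onto the device's physical resolution.
type Factors = [number, number];

function toPhysicalPixels(
  modelX: number,
  modelY: number,
  screenWidth: number,
  screenHeight: number,
  factors: Factors,
): { x: number; y: number } {
  const [widthFactor, heightFactor] = factors;
  return {
    x: Math.round((modelX / widthFactor) * screenWidth),
    y: Math.round((modelY / heightFactor) * screenHeight),
  };
}

// e.g. a prediction like click(start_box='(27,496)') on a 1920x1080 screen:
const point = toPhysicalPixels(27, 496, 1920, 1080, [1000, 1000]);
```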

Advanced SDK usage builds on @ui-tars/sdk/core; you can create custom operators by extending the base Operator class:

import {
  Operator,
  StatusEnum,
  type ScreenshotOutput,
  type ExecuteParams,
  type ExecuteOutput,
} from '@ui-tars/sdk/core';
import { Jimp } from 'jimp';

export class CustomOperator extends Operator {
  // Define the action spaces and description for UI-TARS System Prompt splice
  static MANUAL = {
    ACTION_SPACES: [
      'click(start_box="") # click on the element at the specified coordinates',
      'type(content="") # type the specified content into the current input field',
      'scroll(direction="") # scroll the page in the specified direction',
      'finished() # finish the task',
      // ...more_actions
    ],
  };

  public async screenshot(): Promise<ScreenshotOutput> {
    // Implement screenshot capture for your platform here, e.g. load the
    // raw bytes with Jimp and encode them to a base64 string.
    return {
      base64: 'base64-encoded-image',
      scaleFactor: 1
    };
  }

  async execute(params: ExecuteParams): Promise<ExecuteOutput> {
    const { parsedPrediction, screenWidth, screenHeight, scaleFactor } = params;
    // Implement action execution logic

    // For click actions, read the coordinates from parsedPrediction
    const [startX, startY] = parsedPrediction?.action_inputs?.start_coords || [];

    if (parsedPrediction?.action_type === 'finished') {
      // finish the GUIAgent task
      return { status: StatusEnum.END };
    }
  }
}

Required methods:

  • screenshot(): Captures the current screen state
  • execute(): Performs the requested action based on model predictions

Optional static properties:

  • MANUAL: Static metadata describing the operator for the UI-TARS model
    • ACTION_SPACES: The list of supported actions and their descriptions, spliced into the system prompt

Loaded into GUIAgent:

const guiAgent = new GUIAgent({
  // ... other config
  systemPrompt: `
  // ... other system prompt
  ${CustomOperator.MANUAL.ACTION_SPACES.join('\n')}
  `,
  operator: new CustomOperator(),
});

Custom Model Implementation

You can implement custom model logic by extending the UITarsModel class:

class CustomUITarsModel extends UITarsModel {
  constructor(modelConfig: { model: string }) {
    super(modelConfig);
  }

  async invoke(params: any) {
    // Implement custom model logic
    return {
      prediction: 'action description',
      parsedPredictions: [{
        action_type: 'click',
        action_inputs: { /* ... */ },
        reflection: null,
        thought: 'reasoning'
      }]
    };
  }
}

const agent = new GUIAgent({
  model: new CustomUITarsModel({ model: 'custom-model' }),
  // ... other config
});

Note: Implementing a custom model is not recommended, because the default model already contains substantial data-processing logic (image transformations, scaling factors, etc.).

Planning

You can combine planning/reasoning models (such as OpenAI o1 or DeepSeek-R1) to implement complex GUIAgent flows for planning, reasoning, and execution:

const guiAgent = new GUIAgent({
  // ... other config
});

const planningList = await reasoningModel.invoke({
  conversations: [
    {
      role: 'user',
      content: 'buy a ticket from beijing to shanghai',
    }
  ]
})
/**
 * [
 *  'open chrome',
 *  'open trip.com',
 *  'click "search" button',
 *  'select "beijing" in "from" input',
 *  'select "shanghai" in "to" input',
 *  'click "search" button',
 * ]
 */

for (const planning of planningList) {
  await guiAgent.run(planning);
}