pc-use

v0.1.8

Published

20 days ago

Coordinate-only desktop control wrapper for agent harnesses

karpatic

PCUse

Coordinate-only desktop control for agent harnesses.

Last updated: 2026-04-25

Packages

This repo now ships as:

a pure Python package: pcuse
a Python CLI: pcuse
a compatibility CLI: gui-interact / python3 gui_interact.py
an npm wrapper package: pc-use

The desktop-control core is Python. The npm package is only a small Node wrapper that calls the Python CLI.

Install

Local Python development install:

python3 -m pip install -e .

If pip refuses the install because the Python environment is externally managed (PEP 668), override the check explicitly:

python3 -m pip install --user --break-system-packages -e .

Install from PyPI after publishing:

python3 -m pip install pcuse

Install from npm after publishing:

npm install pc-use

Python API

from pcuse import CoordinateComputer

computer = CoordinateComputer()
print(computer.position())
print(computer.screenshot(output="output/current-screen.png"))

# Real actions:
# computer.click(842, 513)
# computer.drag(842, 513, to_x=1020, to_y=700)
# computer.drag(path=[{"x": 842, "y": 513}, {"x": 930, "y": 560}, {"x": 1020, "y": 700}])
# computer.type_text("hello world")

CLI

pcuse position
pcuse screenshot --output output/current-screen.png
pcuse describe --prompt "Describe the current screen and the next likely action."
pcuse assist --goal "Verify npm trusted publishing for pc-use" --context "Chrome already has relevant GitHub and npm tabs open"
pcuse click --x 842 --y 513
pcuse drag --x 842 --y 513 --to-x 1020 --to-y 700
pcuse keypress --key ctrl --key l
pcuse type --text "hello world"
pcuse keydown --key shift
pcuse keyup --key shift

The CLI accepts both the legacy action names and the computer-tool-style aliases such as double_click and keypress. For richer pointer motion, it also accepts --to-x/--to-y, repeated --path x,y, and split --scroll-x/--scroll-y scroll axes.

The old script path is still supported:

python3 gui_interact.py position

npm Wrapper

import { assist, click, describe, drag, keypress, position, screenshot, typeText } from "pc-use";

console.log(await position());
await screenshot({ output: "output/current-screen.png" });
console.log(await describe({ prompt: "Describe the current screen and the next likely action." }));
console.log(await assist({ goal: "Verify npm trusted publishing for pc-use", context: "Chrome already has relevant GitHub and npm tabs open", conversationId: "npm-trusted-check" }));
await click({ x: 842, y: 513 });
await drag({ x: 842, y: 513, toX: 1020, toY: 700 });
await keypress(["ctrl", "l"]);
await typeText("secret", { envName: "OPENCLAW_SECRET", sensitiveText: true });

You can point the wrapper at a specific Python interpreter:

PCUSE_PYTHON=/path/to/python node npm/bin/pcuse.js position

Included webapp

This repo now includes a copy of the vision-compare screenshot-targeting webapp under webapp/:

webapp/vision-compare.html
webapp/vision-compare-app.js
webapp/grid-mobile-check.png

The copied webapp is configured to call the public Ollama/Getfrom Chat tunnel instead of a local Ollama URL:

backend/API base: https://llm.getfrom.net/getfrom-chat

That makes the frontend portable for repo use or static hosting while still hitting the existing tunnel-backed async API. In practice, the webapp comes preconfigured with a public API endpoint, so you can explore the hosted vision stack without bringing up your own local backend first.

Coordinate-only OpenClaw Computer Control

For OpenClaw agents, keep the desktop interface as a simple observe/act loop:

Capture the current screen.
Send the screenshot and screen size to the agent.
Have the agent return one coordinate action.
Execute that action with pcuse or gui_interact.py.
Capture a fresh screenshot and repeat until the task is done.

This mode does not use Playwright, Chrome DevTools, DOM selectors, accessibility APIs, or an HTTP framework. It only uses the current PyAutoGUI-based coordinate path.
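The loop above can be sketched in Python as follows. This is a minimal illustration, not the package's own harness: screenshot(output=...), click(x, y), and type_text(text) are taken from the Python API section, while run_loop, the decide callback, the action dictionary shape, and the output/step-NNN.png naming are assumptions made for this example. Sending the screen size to the agent is left to the decide callback.

```python
from typing import Callable, Optional, Protocol


class Computer(Protocol):
    """The subset of the CoordinateComputer surface this sketch relies on."""

    def screenshot(self, output: str) -> object: ...
    def click(self, x: int, y: int) -> object: ...
    def type_text(self, text: str) -> object: ...


# Hypothetical action shape returned by the agent:
#   {"action": "click", "x": 842, "y": 513} or {"action": "type", "text": "hi"}
# Returning None means the agent considers the task done.
Decide = Callable[[str], Optional[dict]]


def run_loop(computer: Computer, decide: Decide, max_steps: int = 20) -> int:
    """Capture, decide, act, repeat. Returns the number of actions executed."""
    for step in range(max_steps):
        path = f"output/step-{step:03d}.png"
        computer.screenshot(output=path)   # capture the current screen
        action = decide(path)              # agent returns one coordinate action
        if action is None:
            return step
        if action["action"] == "click":    # execute exactly that action
            computer.click(action["x"], action["y"])
        elif action["action"] == "type":
            computer.type_text(action["text"])
        else:
            raise ValueError(f"unsupported action: {action['action']!r}")
    return max_steps                       # step budget exhausted
```

With the real package, the first argument would be CoordinateComputer(); the max_steps cap is there so a confused agent cannot drive the desktop indefinitely.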

Observe

pcuse screenshot --output output/current-screen.png

To return the PNG in the JSON response:

pcuse screenshot --include-image-base64

Describe

Use describe when you want screenshot-aware guidance before the next click. It captures the current screen, sends it to the configured vision backend, and returns a practical text answer.

pcuse describe --prompt "Describe the visible browser tabs and tell me which page I am on."
pcuse describe --prompt "What is unusual or blocking on this screen?" --provider molmo --model MolmoWeb4B
pcuse describe --prompt "Summarize this screen for navigation." --output output/current-screen.png
pcuse describe --prompt "Summarize this screen for navigation." --usage-csv output/vision-usage.csv

Defaults:

provider: openai when OPENAI_API_KEY is set, otherwise molmo
OpenAI model: gpt-5.4-mini
Molmo model: MolmoWeb4B
API base: https://llm.getfrom.net/getfrom-chat
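The provider and model defaults listed above amount to a small fallback rule. The sketch below restates that rule in Python for clarity; resolve_vision_defaults and DEFAULT_MODELS are names invented for this example and are not claimed to match the package's internals.

```python
import os
from typing import Mapping, Optional, Tuple

# Default model per provider, as listed in this README.
DEFAULT_MODELS = {"openai": "gpt-5.4-mini", "molmo": "MolmoWeb4B"}


def resolve_vision_defaults(
    provider: Optional[str] = None,
    model: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Pick provider and model, honoring explicit values first."""
    env = os.environ if env is None else env
    if provider is None:
        # openai when OPENAI_API_KEY is set (and non-empty), otherwise molmo
        provider = "openai" if env.get("OPENAI_API_KEY") else "molmo"
    if model is None:
        model = DEFAULT_MODELS[provider]
    return provider, model
```

An explicit --provider or --model on the command line corresponds to passing provider or model here, which bypasses the environment check.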

describe is meant to help choose the next action. It does not click by itself.

Assist

Use assist when you want the model to keep a small running conversation about the task instead of answering one screenshot in isolation. assist captures the current screen, includes your goal plus saved conversation history, and asks the model for structured next-step guidance.

pcuse assist --goal "Verify npm trusted publishing for pc-use" \
  --context "Chrome already has the GitHub repo and npm pages open" \
  --conversation-id npm-trusted-check

pcuse assist --conversation-id npm-trusted-check \
  --user-input "The browser address bar is focused now. What next?"

Typical response shape:

{
  "analysis": "You appear to be on ...",
  "next_action": {"action": "hotkey", "keys": ["ctrl", "l"]},
  "needs_user_input": false,
  "question": null,
  "confidence": 0.82,
  "memory": "Still trying to reach npm package settings"
}

Notes:

conversation state is stored locally under .pcuse-assist/
pass --conversation-id to continue the same thread across turns
assist suggests the next action; it does not auto-execute it for you
pass --usage-csv output/vision-usage.csv to append provider/model/token usage per describe or assist turn
you can also set PCUSE_USAGE_CSV=/path/to/vision-usage.csv to enable logging by default
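Because assist only suggests an action, the harness has to decide what to do with each reply. The sketch below shows one way to gate on the fields in the typical response shape above. The AssistDecision type, the decide_from_assist name, and the 0.6 confidence threshold are assumptions for illustration, not part of pcuse.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssistDecision:
    kind: str                  # "act", "ask_user", or "hold"
    action: Optional[dict]     # the suggested next_action when kind == "act"
    question: Optional[str]    # the model's question when kind == "ask_user"


def decide_from_assist(raw: str, min_confidence: float = 0.6) -> AssistDecision:
    """Turn one assist JSON reply into a harness-level decision."""
    reply = json.loads(raw)
    if reply.get("needs_user_input"):
        return AssistDecision("ask_user", None, reply.get("question"))
    action = reply.get("next_action")
    confidence = float(reply.get("confidence") or 0.0)
    if action is None or confidence < min_confidence:
        # Nothing actionable, or the model is unsure: take a fresh look first.
        return AssistDecision("hold", None, None)
    return AssistDecision("act", action, None)
```

A "hold" result is a reasonable point to capture a new screenshot and call assist again with the same --conversation-id.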

Act

pcuse click --x 842 --y 513
pcuse doubleclick --x 842 --y 513
pcuse double_click --x 842 --y 513
pcuse drag --x 842 --y 513 --to-x 1020 --to-y 700
pcuse drag --path 842,513 --path 930,560 --path 1020,700
pcuse scroll --scroll-x 240 --scroll-y -600 --x 900 --y 700
pcuse type --text "hello world"
pcuse press --key enter
pcuse keypress --key ctrl --key l
pcuse keydown --key shift
pcuse click --x 500 --y 500
pcuse keyup --key shift
pcuse hotkey --key ctrl --key l
pcuse wait --seconds 1.5

At the coordinate layer, password fields and protected application regions are not treated differently from any other focused input or screen location. If the agent clicks the right coordinate and sends keystrokes, the OS/application receives those keystrokes.

For sensitive text, prefer environment input so the typed value is not echoed by the command output:

OPENCLAW_SECRET='correct horse battery staple' \
  pcuse type --text-env OPENCLAW_SECRET --sensitive-text
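When the caller is a Python harness rather than a shell, the same pattern can be kept by placing the secret in the child process environment instead of on the command line. This is a sketch under that assumption: build_sensitive_type_call is a hypothetical helper, and only the --text-env and --sensitive-text flags come from the command above.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple


def build_sensitive_type_call(
    secret: str,
    env_name: str = "OPENCLAW_SECRET",
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a pcuse call that reads the text from env_name."""
    env = dict(os.environ if base_env is None else base_env)
    env[env_name] = secret  # the secret travels in the environment only
    argv = ["pcuse", "type", "--text-env", env_name, "--sensitive-text"]
    return argv, env
```

The pair can then be handed to subprocess.run(argv, env=env, check=True). The secret never appears in argv, so it does not show up in process listings or in logs of the command line.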

PyAutoGUI's fail-safe remains enabled, so moving the pointer to the upper-left corner can abort runaway automation.

References

Image analysis / vision: https://developers.openai.com/api/docs/guides/images-vision#analyze-images
Computer use tools: https://developers.openai.com/api/docs/guides/tools-computer-use

Analyze images

Vision lets a model see and understand images, including text inside images. It can interpret objects, shapes, colors, and textures, with some limitations.

You can send multiple images in one request, but images count as tokens and are billed accordingly.

Image input requirements

Supported file types

PNG (.png)
JPEG (.jpeg, .jpg)
WEBP (.webp)
Non-animated GIF (.gif)

Size limits

Up to 512 MB total payload size per request
Up to 1500 individual image inputs per request

Other requirements

No watermarks or logos
No NSFW content
Clear enough for a human to understand

Detail levels

The detail parameter controls how much visual detail the model uses:

low: Fast, low-cost understanding using a 512x512 version of the image
high: Standard high-fidelity image understanding
original: Best for large, dense, spatially sensitive, or computer-use images. Available on GPT-5.4 and future models
auto: Let the model choose

For computer use, localization, and click-accuracy tasks on GPT-5.4-family and future models, prefer detail: "original".
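As an illustration of where the detail setting goes, the sketch below builds one image content part from PNG bytes. The input_image / image_url / detail layout follows the Responses API image input shape as understood at the time of writing and should be checked against the current docs; image_part is a name invented for this example.

```python
import base64

ALLOWED_DETAIL = {"low", "high", "original", "auto"}


def image_part(png_bytes: bytes, detail: str = "original") -> dict:
    """Build one image content part carrying a base64 data URL and a detail level."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"unknown detail level: {detail!r}")
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "type": "input_image",  # assumed Responses API content-part type
        "image_url": f"data:image/png;base64,{encoded}",
        "detail": detail,
    }
```

The default is "original" here because this repo's use case is click accuracy; a general-purpose helper might default to "auto" instead.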

Prompting for coordinates vs computer use

The docs support two related but different patterns.

1. Coordinate localization from a screenshot

If you only need a click point or bounding-point estimate, the strongest guidance, combining the docs with this repo's testing, is:

default to gpt-5.4-mini for this workflow unless you specifically want the larger gpt-5.4 model
in practical testing, gpt-5.4-mini is the cheapest broadly usable click-accurate OpenAI model for this workflow
send the screenshot with detail: "original"
avoid detail: "high" or "low" for this use case
if you downscale before sending, remap the returned coordinates back to the original image space
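The remapping in the last point is a per-axis scale by the ratio of original size to sent size. A minimal sketch, assuming the screenshot was resized without cropping or padding (remap_to_original is a name chosen for this example):

```python
from typing import Tuple


def remap_to_original(
    x: float,
    y: float,
    sent_size: Tuple[int, int],
    original_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a point from the downscaled image back to original screen pixels."""
    sent_w, sent_h = sent_size
    orig_w, orig_h = original_size
    # Scale each axis independently so non-uniform resizes are handled too.
    px = round(x * orig_w / sent_w)
    py = round(y * orig_h / sent_h)
    # Clamp so a point on the far edge still lands inside the screen.
    px = min(max(px, 0), orig_w - 1)
    py = min(max(py, 0), orig_h - 1)
    return px, py
```

For example, a point at (480, 270) in a 960x540 copy of a 1920x1080 screenshot maps back to (960, 540).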

I do not have a strong claim here yet about Claude; it has not been the focus of the side-by-side testing captured in this repo.

Also worth noting: Carlos's office machine has a 12 GB GPU running a Molmo2-based model that has been competitive for this use case. The public endpoint shipped with this repo exists partly so you can explore that path too.

Prompt style should stay plain and task-focused. A practical pattern is:

identify the target in plain language
ask for image pixel coordinates measured from the top-left corner
request JSON only if your downstream code needs a rigid schema

Example:

Find the username input box and return JSON only in this exact shape:
{"target":"username input","x":123,"y":456,"confidence":0.0,"rationale":"brief reason"}
Coordinates must be image pixel coordinates measured from the screenshot top-left corner.

That JSON schema is a useful app-side convention, not a special API requirement.
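Since the schema is only an app-side convention, the app has to validate it before clicking. The sketch below parses a reply in the shape shown above and rejects coordinates outside the screenshot. ClickTarget and parse_click_target are hypothetical names for this example.

```python
import json
from dataclasses import dataclass


@dataclass
class ClickTarget:
    target: str
    x: int
    y: int
    confidence: float
    rationale: str


def parse_click_target(raw: str, width: int, height: int) -> ClickTarget:
    """Parse the model's JSON reply and check it against the screenshot size."""
    data = json.loads(raw)
    result = ClickTarget(
        target=str(data["target"]),
        x=int(data["x"]),
        y=int(data["y"]),
        confidence=float(data["confidence"]),
        rationale=str(data.get("rationale", "")),
    )
    if not (0 <= result.x < width and 0 <= result.y < height):
        raise ValueError(f"point ({result.x}, {result.y}) is outside {width}x{height}")
    if not 0.0 <= result.confidence <= 1.0:
        raise ValueError(f"confidence out of range: {result.confidence}")
    return result
```

Rejecting out-of-bounds points before acting is cheap insurance against a model that answers in the wrong coordinate space, for example in a downscaled image's pixels.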

2. True computer use

If you want the model to navigate the UI itself, the docs recommend the built-in computer tool loop rather than asking for raw x,y clicks.

Prompt style should again be plain-language and goal-oriented:

Check whether the Filters panel is open. If it is not open, click Show filters.
Then type penguin in the search box. Use the computer tool for UI interaction.

The documented loop is:

send the task with the computer tool enabled
inspect the returned computer_call
execute every action in actions[] in order
send back a fresh screenshot as computer_call_output
repeat until the model stops returning computer_call

Computer-use vocabulary and integration modes

Documented action types include:

screenshot
click
double_click
drag
move
scroll
keypress
type
wait

Common action fields shown in the docs include:

x, y
button
path
scrollX, scrollY
keys
text

pcuse now supports those interaction shapes directly at the coordinate layer:

double_click is accepted alongside doubleclick
keypress is accepted alongside press / hotkey
drag can use --to-x/--to-y or repeated --path
scroll can use --scroll-x and --scroll-y
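To make the mapping concrete, the sketch below translates one computer-tool-style action dictionary into a pcuse command line using only flags shown in this README. The "type" key as the action discriminator and the action_to_argv name are assumptions for illustration; the button field and the screenshot, move, and wait actions are deliberately left out because their CLI equivalents are not spelled out here.

```python
from typing import List


def action_to_argv(action: dict) -> List[str]:
    """Translate one action dict into an argv list for the pcuse CLI."""
    kind = action["type"]  # assumed discriminator key
    if kind in ("click", "double_click"):
        return ["pcuse", kind, "--x", str(action["x"]), "--y", str(action["y"])]
    if kind == "drag":
        argv = ["pcuse", "drag"]
        for point in action["path"]:  # one --path x,y per waypoint
            argv += ["--path", f"{point['x']},{point['y']}"]
        return argv
    if kind == "scroll":
        return [
            "pcuse", "scroll",
            "--scroll-x", str(action.get("scrollX", 0)),
            "--scroll-y", str(action.get("scrollY", 0)),
            "--x", str(action["x"]), "--y", str(action["y"]),
        ]
    if kind == "keypress":
        argv = ["pcuse", "keypress"]
        for key in action["keys"]:  # one --key per key in the chord
            argv += ["--key", key]
        return argv
    if kind == "type":
        return ["pcuse", "type", "--text", action["text"]]
    raise ValueError(f"no pcuse mapping in this sketch for action type {kind!r}")
```

Returning an argv list rather than a shell string keeps typed text with spaces or quotes intact when the list is passed to subprocess.run.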

The docs describe three main integration shapes:

built-in computer tool loop
custom harness or tool layer, such as Playwright, Selenium, VNC, or MCP
code-execution harness that mixes screenshots with scripts or DOM-based automation

Legacy preview note

Older integrations may still refer to:

model: computer-use-preview
tool: computer_use_preview

The preview request shape exposed fields such as:

display_width
display_height
environment: "browser"

For new implementations, prefer the GA computer tool flow.

Limitations

Medical images: not suitable for specialized medical interpretation or advice
Non-English text: may perform worse on non-Latin alphabets
Small text: enlarge text when possible; detail: "original" can help
Rotation: may misread rotated or upside-down text and images
Visual elements: may struggle with graphs or styling differences like dashed vs dotted lines
Spatial reasoning: can struggle with precise localization tasks
Accuracy: may generate incorrect descriptions or captions
Image shape: panoramic and fisheye images can be difficult
Metadata and resizing: original filenames and metadata are not processed; resizing may affect dimensions during analysis
Counting: object counts may be approximate
CAPTCHAs: blocked for safety reasons

Model pricing notes

These are list prices, not capability rankings. gpt-5.4-nano is cheaper on paper, but in this repo's testing gpt-5.4-mini is the cheapest OpenAI model that is still reliably usable for click-accuracy work.

GPT-5.4

Input: $2.50 / 1M tokens
Cached input: $0.25 / 1M tokens
Output: $15.00 / 1M tokens

GPT-5.4 mini

Input: $0.75 / 1M tokens
Cached input: $0.075 / 1M tokens
Output: $4.50 / 1M tokens

GPT-5.4 nano

Input: $0.20 / 1M tokens
Cached input: $0.02 / 1M tokens
Output: $1.25 / 1M tokens
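The list prices above turn into a per-call estimate with simple arithmetic. The sketch below uses the figures from this section; estimate_cost is a name chosen for this example, and the token counts have to come from your own usage data (for instance the usage CSV), since image token counts depend on the screenshot and detail level.

```python
# USD per 1M tokens, copied from the list prices in this section.
PRICES_PER_MILLION = {
    "gpt-5.4": {"input": 2.50, "cached_input": 0.25, "output": 15.00},
    "gpt-5.4-mini": {"input": 0.75, "cached_input": 0.075, "output": 4.50},
    "gpt-5.4-nano": {"input": 0.20, "cached_input": 0.02, "output": 1.25},
}


def estimate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
) -> float:
    """Estimate USD cost; input_tokens counts only the uncached input."""
    prices = PRICES_PER_MILLION[model]
    total = (
        input_tokens * prices["input"]
        + cached_input_tokens * prices["cached_input"]
        + output_tokens * prices["output"]
    )
    return total / 1_000_000
```

For a hypothetical turn with 2,000 uncached input tokens and 200 output tokens, gpt-5.4-mini comes to (2,000 x 0.75 + 200 x 4.50) / 1,000,000 = $0.0024.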