pc-use
v0.1.8
Coordinate-only desktop control wrapper for agent harnesses
PCUse
Coordinate-only desktop control for agent harnesses.
Last updated: 2026-04-25
Packages
This repo now ships as:
- a pure Python package: pcuse
- a Python CLI: pcuse
- a compatibility CLI: gui-interact / python3 gui_interact.py
- an npm wrapper package: pc-use
The desktop-control core is Python. The npm package is only a small Node wrapper that calls the Python CLI.
Install
Local Python development install:
python3 -m pip install -e .
If the system blocks user installs with PEP 668, use the same override you used for PyAutoGUI:
python3 -m pip install --user --break-system-packages -e .
Install from PyPI after publishing:
python3 -m pip install pcuse
Install from npm after publishing:
npm install pc-use
Python API
from pcuse import CoordinateComputer
computer = CoordinateComputer()
print(computer.position())
print(computer.screenshot(output="output/current-screen.png"))
# Real actions:
# computer.click(842, 513)
# computer.drag(842, 513, to_x=1020, to_y=700)
# computer.drag(path=[{"x": 842, "y": 513}, {"x": 930, "y": 560}, {"x": 1020, "y": 700}])
# computer.type_text("hello world")
CLI
pcuse position
pcuse screenshot --output output/current-screen.png
pcuse describe --prompt "Describe the current screen and the next likely action."
pcuse assist --goal "Verify npm trusted publishing for pc-use" --context "Chrome already has relevant GitHub and npm tabs open"
pcuse click --x 842 --y 513
pcuse drag --x 842 --y 513 --to-x 1020 --to-y 700
pcuse keypress --key ctrl --key l
pcuse type --text "hello world"
pcuse keydown --key shift
pcuse keyup --key shift
The CLI accepts both the legacy action names and the computer-tool-style aliases such as double_click and keypress. For richer pointer motion, it also accepts --to-x/--to-y, repeated --path x,y, and split --scroll-x/--scroll-y scroll axes.
The old script path is still supported:
python3 gui_interact.py position
npm Wrapper
import { assist, click, describe, drag, keypress, position, screenshot, typeText } from "pc-use";
console.log(await position());
await screenshot({ output: "output/current-screen.png" });
console.log(await describe({ prompt: "Describe the current screen and the next likely action." }));
console.log(await assist({ goal: "Verify npm trusted publishing for pc-use", context: "Chrome already has relevant GitHub and npm tabs open", conversationId: "npm-trusted-check" }));
await click({ x: 842, y: 513 });
await drag({ x: 842, y: 513, toX: 1020, toY: 700 });
await keypress(["ctrl", "l"]);
await typeText("secret", { envName: "OPENCLAW_SECRET", sensitiveText: true });
You can point the wrapper at a specific Python interpreter:
PCUSE_PYTHON=/path/to/python node npm/bin/pcuse.js position
Included webapp
This repo now includes a copy of the vision-compare screenshot-targeting webapp under webapp/:
- webapp/vision-compare.html
- webapp/vision-compare-app.js
- webapp/grid-mobile-check.png
The copied webapp is configured to call the public Ollama/Getfrom Chat tunnel instead of a local Ollama URL:
- backend/API base: https://llm.getfrom.net/getfrom-chat
That makes the frontend portable for repo use or static hosting while still hitting the existing tunnel-backed async API. In practice, this repo includes public API endpoints you can use to explore the hosted vision stack without bringing up your own local backend first.
Coordinate-only OpenClaw Computer Control
For OpenClaw agents, keep the desktop interface as a simple observe/act loop:
- Capture the current screen.
- Send the screenshot and screen size to the agent.
- Have the agent return one coordinate action.
- Execute that action with pcuse or gui_interact.py.
- Capture a fresh screenshot and repeat until the task is done.
This mode does not use Playwright, Chrome DevTools, DOM selectors, accessibility APIs, or an HTTP framework. It only uses the current PyAutoGUI-based coordinate path.
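The observe/act loop above can be sketched in Python with the three steps passed in as callables. This is a minimal sketch, not the package's own loop: run_observe_act_loop, the observation and action dict shapes, and the max_steps cap are all assumptions made for illustration. The callables would wrap whatever you use to capture the screen, query the agent, and run pcuse.

```python
from typing import Callable, Optional

# Hypothetical shapes for this sketch; neither comes from pcuse itself.
# Observation example: {"screenshot": "output/current-screen.png", "width": 1920, "height": 1080}
# Action example:      {"action": "click", "x": 842, "y": 513}
Observation = dict
Action = dict


def run_observe_act_loop(
    observe: Callable[[], Observation],
    decide: Callable[[Observation], Optional[Action]],
    act: Callable[[Action], None],
    max_steps: int = 20,
) -> list:
    """Run observe -> decide -> act until decide() returns None or max_steps is reached.

    Returns the actions that were executed, in order.
    """
    executed = []
    for _ in range(max_steps):
        observation = observe()        # capture the current screen
        action = decide(observation)   # agent returns one coordinate action, or None when done
        if action is None:
            break
        act(action)                    # execute exactly one action, then observe again
        executed.append(action)
    return executed
```

The step cap is there so a confused agent cannot loop forever; pick a value that suits the task.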
Observe
pcuse screenshot --output output/current-screen.png
To return the PNG in the JSON response:
pcuse screenshot --include-image-base64
Describe
Use describe when you want screenshot-aware guidance before the next click. It captures the current screen, sends it to the configured vision backend, and returns a practical text answer.
pcuse describe --prompt "Describe the visible browser tabs and tell me which page I am on."
pcuse describe --prompt "What is unusual or blocking on this screen?" --provider molmo --model MolmoWeb4B
pcuse describe --prompt "Summarize this screen for navigation." --output output/current-screen.png
pcuse describe --prompt "Summarize this screen for navigation." --usage-csv output/vision-usage.csv
Defaults:
- provider: openai when OPENAI_API_KEY is set, otherwise molmo
- OpenAI model: gpt-5.4-mini
- Molmo model: MolmoWeb4B
- API base: https://llm.getfrom.net/getfrom-chat
describe is meant to help choose the next action. It does not click by itself.
Assist
Use assist when you want the model to keep a small running conversation about the task instead of answering one screenshot in isolation. assist captures the current screen, includes your goal plus saved conversation history, and asks the model for structured next-step guidance.
pcuse assist --goal "Verify npm trusted publishing for pc-use" \
--context "Chrome already has the GitHub repo and npm pages open" \
--conversation-id npm-trusted-check
pcuse assist --conversation-id npm-trusted-check \
--user-input "The browser address bar is focused now. What next?"
Typical response shape:
{
"analysis": "You appear to be on ...",
"next_action": {"action": "hotkey", "keys": ["ctrl", "l"]},
"needs_user_input": false,
"question": null,
"confidence": 0.82,
"memory": "Still trying to reach npm package settings"
}
Notes:
- conversation state is stored locally under .pcuse-assist/
- pass --conversation-id to continue the same thread across turns
- assist suggests the next action; it does not auto-execute it for you
- pass --usage-csv output/vision-usage.csv to append provider/model/token usage per describe or assist turn
- you can also set PCUSE_USAGE_CSV=/path/to/vision-usage.csv to enable logging by default
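Because assist only suggests an action, the caller has to decide what to do with each response. The sketch below shows one way to triage the typical response shape shown above. The function name choose_step, the three result kinds, and the 0.6 confidence threshold are assumptions for illustration, not part of pcuse.

```python
import json


def choose_step(response_text: str, min_confidence: float = 0.6) -> dict:
    """Classify an assist response as 'ask', 'act', or 'review'.

    - 'ask':    the model wants user input before continuing
    - 'act':    a next_action is present and confidence meets the threshold
    - 'review': anything else, so a human or a stricter check should look first
    """
    response = json.loads(response_text)
    if response.get("needs_user_input"):
        return {"kind": "ask", "question": response.get("question")}
    action = response.get("next_action")
    confidence = response.get("confidence")
    confident = isinstance(confidence, (int, float)) and confidence >= min_confidence
    if action is None or not confident:
        return {"kind": "review", "analysis": response.get("analysis")}
    return {"kind": "act", "action": action}
```

Only an "act" result should be handed to the coordinate layer; the other two kinds are a cue to pause and re-observe or ask.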
Act
pcuse click --x 842 --y 513
pcuse doubleclick --x 842 --y 513
pcuse double_click --x 842 --y 513
pcuse drag --x 842 --y 513 --to-x 1020 --to-y 700
pcuse drag --path 842,513 --path 930,560 --path 1020,700
pcuse scroll --scroll-x 240 --scroll-y -600 --x 900 --y 700
pcuse type --text "hello world"
pcuse press --key enter
pcuse keypress --key ctrl --key l
pcuse keydown --key shift
pcuse click --x 500 --y 500
pcuse keyup --key shift
pcuse hotkey --key ctrl --key l
pcuse wait --seconds 1.5
At the coordinate layer, password fields and protected application regions are not treated differently from any other focused input or screen location. If the agent clicks the right coordinate and sends keystrokes, the OS/application receives those keystrokes.
For sensitive text, prefer environment input so the typed value is not echoed by the command output:
OPENCLAW_SECRET='correct horse battery staple' \
pcuse type --text-env OPENCLAW_SECRET --sensitive-text
PyAutoGUI's fail-safe remains enabled, so moving the pointer to the upper-left corner can abort runaway automation.
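The environment-input pattern for sensitive text can also be driven from Python. The sketch below builds the same pcuse type --text-env ... --sensitive-text command shown above and passes the secret only through the child process environment. The helper names build_sensitive_type_call and type_sensitive are hypothetical, and the sketch assumes pcuse is on PATH.

```python
import os
import subprocess


def build_sensitive_type_call(secret: str, env_name: str = "OPENCLAW_SECRET"):
    """Return (argv, env) so the secret travels in the environment, not in argv."""
    argv = ["pcuse", "type", "--text-env", env_name, "--sensitive-text"]
    env = dict(os.environ)   # copy, so the parent environment is left unchanged
    env[env_name] = secret
    return argv, env


def type_sensitive(secret: str, env_name: str = "OPENCLAW_SECRET"):
    """Run the command; assumes the pcuse CLI is installed and on PATH."""
    argv, env = build_sensitive_type_call(secret, env_name)
    return subprocess.run(argv, env=env, check=True, capture_output=True, text=True)
```

Keeping the value out of argv means it does not show up in process listings or in logged command lines.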
References
- Image analysis / vision: https://developers.openai.com/api/docs/guides/images-vision#analyze-images
- Computer use tools: https://developers.openai.com/api/docs/guides/tools-computer-use
Analyze images
Vision lets a model see and understand images, including text inside images. It can interpret objects, shapes, colors, and textures, with some limitations.
You can send multiple images in one request, but images count as tokens and are billed accordingly.
Image input requirements
Supported file types
- PNG (.png)
- JPEG (.jpeg, .jpg)
- WEBP (.webp)
- Non-animated GIF (.gif)
Size limits
- Up to 512 MB total payload size per request
- Up to 1500 individual image inputs per request
Other requirements
- No watermarks or logos
- No NSFW content
- Clear enough for a human to understand
Detail levels
The detail parameter controls how much visual detail the model uses:
- low: Fast, low-cost understanding using a 512x512 version of the image
- high: Standard high-fidelity image understanding
- original: Best for large, dense, spatially sensitive, or computer-use images. Available on GPT-5.4 and future models
- auto: Let the model choose
For computer use, localization, and click-accuracy tasks on GPT-5.4-family and future models, prefer detail: "original".
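As a rough illustration of where the detail setting goes, the sketch below builds one image content item for a request. The item layout (type "input_image" with a base64 data URL in image_url) is an assumption based on the OpenAI image-input format and should be checked against the current API reference. The helper name image_content_item is hypothetical.

```python
import base64

# The four detail values listed above.
ALLOWED_DETAIL = {"low", "high", "original", "auto"}


def image_content_item(png_bytes: bytes, detail: str = "original") -> dict:
    """Build one image content item with an explicit detail level.

    The dict layout is an assumed request shape, not something pcuse defines.
    """
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"unsupported detail level: {detail!r}")
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "type": "input_image",
        "image_url": "data:image/png;base64," + encoded,
        "detail": detail,
    }
```

Defaulting to "original" matches the recommendation above for click-accuracy work.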
Prompting for coordinates vs computer use
The docs support two related but different patterns.
1. Coordinate localization from a screenshot
If you only need a click point or bounding-point estimate, the docs' strongest guidance is:
- default to gpt-5.4-mini for this workflow unless you specifically want the larger gpt-5.4 model
- in practical testing, gpt-5.4-mini is the cheapest broadly usable click-accurate OpenAI model for this workflow
- send the screenshot with detail: "original"
- avoid detail: "high" or "low" for this use case
- if you downscale before sending, remap the returned coordinates back to the original image space
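The remapping in the last bullet is a per-axis scale. A minimal sketch, assuming the screenshot was resized without cropping or padding; remap_to_original is a hypothetical helper, not a pcuse function.

```python
def remap_to_original(x, y, sent_size, original_size):
    """Map a point from the downscaled image back to original screen pixels.

    sent_size and original_size are (width, height) tuples. Assumes a plain
    resize with no cropping or letterboxing.
    """
    sent_w, sent_h = sent_size
    orig_w, orig_h = original_size
    mapped_x = round(x * orig_w / sent_w)
    mapped_y = round(y * orig_h / sent_h)
    # Clamp so a point on the far edge still lands inside the screen.
    mapped_x = min(max(mapped_x, 0), orig_w - 1)
    mapped_y = min(max(mapped_y, 0), orig_h - 1)
    return mapped_x, mapped_y
```

For example, a point at (480, 270) in a 960x540 copy of a 1920x1080 screenshot maps back to (960, 540).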
I do not have a strong claim here yet about Claude; it has not been the focus of the side-by-side testing captured in this repo.
Also worth noting: Carlos's office machine has a 12 GB GPU running a Molmo2-based model that has been competitive for this use case. The public endpoint shipped with this repo exists partly so you can explore that path too.
Prompt style should stay plain and task-focused. A practical pattern is:
- identify the target in plain language
- ask for image pixel coordinates measured from the top-left corner
- request JSON only if your downstream code needs a rigid schema
Example:
Find the username input box and return JSON only in this exact shape:
{"target":"username input","x":123,"y":456,"confidence":0.0,"rationale":"brief reason"}
Coordinates must be image pixel coordinates measured from the screenshot top-left corner.
That JSON schema is a useful app-side convention, not a special API requirement.
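Since the schema is only an app-side convention, the app has to check it before clicking. The sketch below validates a reply in the example shape and rejects points outside the screenshot. parse_click_point is a hypothetical helper, and the bounds check is an added assumption rather than something the docs require.

```python
import json

# Keys from the example shape above.
REQUIRED_KEYS = ("target", "x", "y", "confidence", "rationale")


def parse_click_point(reply: str, image_width: int, image_height: int) -> dict:
    """Parse a JSON-only localization reply and check the point is on the image."""
    data = json.loads(reply)  # raises ValueError (JSONDecodeError) on non-JSON text
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    x, y = data["x"], data["y"]
    if not (isinstance(x, (int, float)) and isinstance(y, (int, float))):
        raise ValueError("x and y must be numbers")
    if not (0 <= x < image_width and 0 <= y < image_height):
        raise ValueError(f"point ({x}, {y}) is outside {image_width}x{image_height}")
    return data
```

Rejecting an off-image point before it reaches the click command is cheaper than recovering from a stray click.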
2. True computer use
If you want the model to navigate the UI itself, the docs recommend the built-in computer tool loop rather than asking for raw x,y clicks.
Prompt style should again be plain-language and goal-oriented:
Check whether the Filters panel is open. If it is not open, click Show filters.
Then type penguin in the search box. Use the computer tool for UI interaction.
The documented loop is:
- send the task with the computer tool enabled
- inspect the returned computer_call
- execute every action in actions[] in order
- send back a fresh screenshot as computer_call_output
- repeat until the model stops returning computer_call
Computer-use vocabulary and integration modes
Documented action types include:
- screenshot
- click
- double_click
- drag
- move
- scroll
- keypress
- type
- wait
Common action fields shown in the docs include:
- x, y
- button
- path
- scrollX, scrollY
- keys
- text
pcuse now supports those interaction shapes directly at the coordinate layer:
- double_click is accepted alongside doubleclick
- keypress is accepted alongside press/hotkey
- drag can use --to-x/--to-y or repeated --path
- scroll can use --scroll-x and --scroll-y
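To show how those shapes line up, the sketch below translates one computer-tool-style action dict into a pcuse command line using only the flags shown in this README. It assumes each action is a dict with a "type" key plus the fields listed above. The function name to_pcuse_argv, the default wait duration, and the default screenshot path are assumptions. The button field is ignored and move is rejected, because this README shows no pcuse flags for them.

```python
def to_pcuse_argv(action: dict, wait_seconds: float = 1.0,
                  screenshot_path: str = "output/current-screen.png") -> list:
    """Translate one computer-tool-style action into pcuse CLI arguments."""
    kind = action["type"]
    if kind in ("click", "double_click"):
        return ["pcuse", kind, "--x", str(action["x"]), "--y", str(action["y"])]
    if kind == "drag":
        argv = ["pcuse", "drag"]
        for point in action["path"]:          # repeated --path x,y, in order
            argv += ["--path", f"{point['x']},{point['y']}"]
        return argv
    if kind == "scroll":
        return ["pcuse", "scroll",
                "--scroll-x", str(action.get("scrollX", 0)),
                "--scroll-y", str(action.get("scrollY", 0)),
                "--x", str(action["x"]), "--y", str(action["y"])]
    if kind == "keypress":
        argv = ["pcuse", "keypress"]
        for key in action["keys"]:            # repeated --key, in order
            argv += ["--key", key]
        return argv
    if kind == "type":
        return ["pcuse", "type", "--text", action["text"]]
    if kind == "wait":
        return ["pcuse", "wait", "--seconds", str(wait_seconds)]
    if kind == "screenshot":
        return ["pcuse", "screenshot", "--output", screenshot_path]
    raise ValueError(f"no pcuse mapping in this sketch for action type {kind!r}")
```

A harness following the documented loop would call this once per entry in actions[], run each command in order, and then capture a fresh screenshot.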
The docs describe three main integration shapes:
- built-in computer tool loop
- custom harness or tool layer, such as Playwright, Selenium, VNC, or MCP
- code-execution harness that mixes screenshots with scripts or DOM-based automation
Legacy preview note
Older integrations may still refer to:
- model: computer-use-preview
- tool: computer_use_preview
The preview request shape exposed fields such as:
- display_width
- display_height
- environment: "browser"
For new implementations, prefer the GA computer tool flow.
Limitations
- Medical images: not suitable for specialized medical interpretation or advice
- Non-English text: may perform worse on non-Latin alphabets
- Small text: enlarge text when possible; detail: "original" can help
- Rotation: may misread rotated or upside-down text and images
- Visual elements: may struggle with graphs or styling differences like dashed vs dotted lines
- Spatial reasoning: can struggle with precise localization tasks
- Accuracy: may generate incorrect descriptions or captions
- Image shape: panoramic and fisheye images can be difficult
- Metadata and resizing: original filenames and metadata are not processed; resizing may affect dimensions during analysis
- Counting: object counts may be approximate
- CAPTCHAs: blocked for safety reasons
Model pricing notes
These are list prices, not capability rankings. gpt-5.4-nano is cheaper on paper, but in this repo's testing gpt-5.4-mini is the cheapest OpenAI model that is still reliably usable for click-accuracy work.
GPT-5.4
- Input: $2.50 / 1M tokens
- Cached input: $0.25 / 1M tokens
- Output: $15.00 / 1M tokens
GPT-5.4 mini
- Input: $0.75 / 1M tokens
- Cached input: $0.075 / 1M tokens
- Output: $4.50 / 1M tokens
GPT-5.4 nano
- Input: $0.20 / 1M tokens
- Cached input: $0.02 / 1M tokens
- Output: $1.25 / 1M tokens
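For a rough per-turn budget, the list prices above can be turned into a small calculator. This is a sketch: estimate_cost_usd is a hypothetical helper, the table only mirrors the figures listed here, and it assumes input_tokens counts uncached input while cached_input_tokens is billed separately at the cached rate.

```python
# USD per 1M tokens, copied from the list prices above.
PRICES_PER_MILLION = {
    "gpt-5.4": {"input": 2.50, "cached_input": 0.25, "output": 15.00},
    "gpt-5.4-mini": {"input": 0.75, "cached_input": 0.075, "output": 4.50},
    "gpt-5.4-nano": {"input": 0.20, "cached_input": 0.02, "output": 1.25},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int,
                      cached_input_tokens: int = 0) -> float:
    """Estimate the list-price cost of one request in USD."""
    price = PRICES_PER_MILLION[model]
    total = (input_tokens * price["input"]
             + cached_input_tokens * price["cached_input"]
             + output_tokens * price["output"])
    return total / 1_000_000
```

As a worked example, a gpt-5.4-mini turn with 2,000 input tokens and 200 output tokens comes to (2,000 x 0.75 + 200 x 4.50) / 1,000,000, or about $0.0024. Image inputs count toward input tokens, so real screenshot turns will usually be larger.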
