pc-use

v0.1.8

Coordinate-only desktop control wrapper for agent harnesses

PCUse

Coordinate-only desktop control for agent harnesses.

Last updated: 2026-04-25

Packages

This repo now ships as:

  • a pure Python package: pcuse
  • a Python CLI: pcuse
  • a compatibility CLI: gui-interact / python3 gui_interact.py
  • an npm wrapper package: pc-use

The desktop-control core is Python. The npm package is only a small Node wrapper that calls the Python CLI.

Install

Local Python development install:

python3 -m pip install -e .

If pip refuses the install because the system Python is marked as externally managed (PEP 668), use the same override you used for PyAutoGUI:

python3 -m pip install --user --break-system-packages -e .

Install from PyPI after publishing:

python3 -m pip install pcuse

Install from npm after publishing:

npm install pc-use

Python API

from pcuse import CoordinateComputer

computer = CoordinateComputer()
print(computer.position())
print(computer.screenshot(output="output/current-screen.png"))

# Real actions:
# computer.click(842, 513)
# computer.drag(842, 513, to_x=1020, to_y=700)
# computer.drag(path=[{"x": 842, "y": 513}, {"x": 930, "y": 560}, {"x": 1020, "y": 700}])
# computer.type_text("hello world")

CLI

pcuse position
pcuse screenshot --output output/current-screen.png
pcuse describe --prompt "Describe the current screen and the next likely action."
pcuse assist --goal "Verify npm trusted publishing for pc-use" --context "Chrome already has relevant GitHub and npm tabs open"
pcuse click --x 842 --y 513
pcuse drag --x 842 --y 513 --to-x 1020 --to-y 700
pcuse keypress --key ctrl --key l
pcuse type --text "hello world"
pcuse keydown --key shift
pcuse keyup --key shift

The CLI accepts both the legacy action names and the computer-tool-style aliases such as double_click and keypress. For richer pointer motion, it also accepts --to-x/--to-y, repeated --path x,y, and split --scroll-x/--scroll-y scroll axes.

The old script path is still supported:

python3 gui_interact.py position

npm Wrapper

import { assist, click, describe, drag, keypress, position, screenshot, typeText } from "pc-use";

console.log(await position());
await screenshot({ output: "output/current-screen.png" });
console.log(await describe({ prompt: "Describe the current screen and the next likely action." }));
console.log(await assist({ goal: "Verify npm trusted publishing for pc-use", context: "Chrome already has relevant GitHub and npm tabs open", conversationId: "npm-trusted-check" }));
await click({ x: 842, y: 513 });
await drag({ x: 842, y: 513, toX: 1020, toY: 700 });
await keypress(["ctrl", "l"]);
await typeText("secret", { envName: "OPENCLAW_SECRET", sensitiveText: true });

You can point the wrapper at a specific Python interpreter:

PCUSE_PYTHON=/path/to/python node npm/bin/pcuse.js position

Included webapp

This repo now includes a copy of the vision-compare screenshot-targeting webapp under webapp/:

  • webapp/vision-compare.html
  • webapp/vision-compare-app.js
  • webapp/grid-mobile-check.png

The copied webapp is configured to call the public Ollama/Getfrom Chat tunnel instead of a local Ollama URL:

  • backend/API base: https://llm.getfrom.net/getfrom-chat

That makes the frontend portable for repo use or static hosting while still hitting the existing tunnel-backed async API. In practice, this repo includes public API endpoints you can use to explore the hosted vision stack without bringing up your own local backend first.

Coordinate-only OpenClaw Computer Control

For OpenClaw agents, keep the desktop interface as a simple observe/act loop:

  1. Capture the current screen.
  2. Send the screenshot and screen size to the agent.
  3. Have the agent return one coordinate action.
  4. Execute that action with pcuse or gui_interact.py.
  5. Capture a fresh screenshot and repeat until the task is done.

This mode does not use Playwright, Chrome DevTools, DOM selectors, accessibility APIs, or an HTTP framework. It only uses the current PyAutoGUI-based coordinate path.
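
The five steps above can be sketched as a small driver loop. This is a minimal sketch, not the harness that ships with pcuse: `run_loop`, `observe`, `decide`, and `act` are hypothetical names, and the action dictionaries are only illustrative. The callables are injected so the loop stays independent of any particular agent or backend.

```python
from typing import Callable, Optional

# Illustrative action shape (an assumption), e.g. {"action": "click", "x": 842, "y": 513}
Action = dict


def run_loop(
    observe: Callable[[], str],
    decide: Callable[[str], Optional[Action]],
    act: Callable[[Action], None],
    max_steps: int = 20,
) -> int:
    """Run observe -> decide -> act until decide() returns None.

    Returns the number of actions that were executed.
    """
    executed = 0
    for _ in range(max_steps):
        screenshot_path = observe()        # step 1: capture the current screen
        action = decide(screenshot_path)   # steps 2-3: agent returns one action
        if action is None:                 # agent reports the task is done
            break
        act(action)                        # step 4: execute the action
        executed += 1                      # step 5: loop back for a fresh capture
    return executed
```

In a real harness, `observe` could wrap `computer.screenshot(output=...)` and `act` could dispatch to `computer.click(...)` or `computer.type_text(...)`. The `max_steps` cap is a design choice that keeps a confused agent from looping forever.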

Observe

pcuse screenshot --output output/current-screen.png

To return the PNG in the JSON response:

pcuse screenshot --include-image-base64

Describe

Use describe when you want screenshot-aware guidance before the next click. It captures the current screen, sends it to the configured vision backend, and returns a practical text answer.

pcuse describe --prompt "Describe the visible browser tabs and tell me which page I am on."
pcuse describe --prompt "What is unusual or blocking on this screen?" --provider molmo --model MolmoWeb4B
pcuse describe --prompt "Summarize this screen for navigation." --output output/current-screen.png
pcuse describe --prompt "Summarize this screen for navigation." --usage-csv output/vision-usage.csv

Defaults:

  • provider: openai when OPENAI_API_KEY is set, otherwise molmo
  • OpenAI model: gpt-5.4-mini
  • Molmo model: MolmoWeb4B
  • API base: https://llm.getfrom.net/getfrom-chat

describe is meant to help choose the next action. It does not click by itself.

Assist

Use assist when you want the model to keep a small running conversation about the task instead of answering one screenshot in isolation. assist captures the current screen, includes your goal plus saved conversation history, and asks the model for structured next-step guidance.

pcuse assist --goal "Verify npm trusted publishing for pc-use" \
  --context "Chrome already has the GitHub repo and npm pages open" \
  --conversation-id npm-trusted-check

pcuse assist --conversation-id npm-trusted-check \
  --user-input "The browser address bar is focused now. What next?"

Typical response shape:

{
  "analysis": "You appear to be on ...",
  "next_action": {"action": "hotkey", "keys": ["ctrl", "l"]},
  "needs_user_input": false,
  "question": null,
  "confidence": 0.82,
  "memory": "Still trying to reach npm package settings"
}

Notes:

  • conversation state is stored locally under .pcuse-assist/
  • pass --conversation-id to continue the same thread across turns
  • assist suggests the next action; it does not auto-execute it for you
  • pass --usage-csv output/vision-usage.csv to append provider/model/token usage per describe or assist turn
  • you can also set PCUSE_USAGE_CSV=/path/to/vision-usage.csv to enable logging by default
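
Because assist only suggests an action, the caller decides what to do with the reply. The sketch below shows one possible way to triage the response shape shown above. `next_step` and the `min_confidence` threshold are hypothetical and not part of pcuse. Only the field names come from the typical response shape.

```python
import json
from typing import Any, Tuple


def next_step(raw: str, min_confidence: float = 0.6) -> Tuple[str, Any]:
    """Classify an assist reply as 'ask', 'review', or 'act'.

    - 'ask':    the model needs user input; payload is its question
    - 'review': confidence is below the threshold; payload is the analysis
    - 'act':    payload is the suggested next_action dict
    """
    reply = json.loads(raw)
    if reply.get("needs_user_input"):
        return ("ask", reply.get("question"))
    if float(reply.get("confidence", 0.0)) < min_confidence:
        return ("review", reply.get("analysis"))
    return ("act", reply.get("next_action"))
```

Routing low-confidence replies to a human instead of executing them is a conservative default. Tune the threshold to your own tolerance for wrong clicks.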

Act

pcuse click --x 842 --y 513
pcuse doubleclick --x 842 --y 513
pcuse double_click --x 842 --y 513
pcuse drag --x 842 --y 513 --to-x 1020 --to-y 700
pcuse drag --path 842,513 --path 930,560 --path 1020,700
pcuse scroll --scroll-x 240 --scroll-y -600 --x 900 --y 700
pcuse type --text "hello world"
pcuse press --key enter
pcuse keypress --key ctrl --key l
pcuse keydown --key shift
pcuse click --x 500 --y 500
pcuse keyup --key shift
pcuse hotkey --key ctrl --key l
pcuse wait --seconds 1.5

At the coordinate layer, password fields and protected application regions are not treated differently from any other focused input or screen location. If the agent clicks the right coordinate and sends keystrokes, the OS/application receives those keystrokes.

For sensitive text, prefer environment input so the typed value is not echoed by the command output:

OPENCLAW_SECRET='correct horse battery staple' \
  pcuse type --text-env OPENCLAW_SECRET --sensitive-text
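
The same pattern can be driven from Python. This is a sketch that only reuses the `--text-env` and `--sensitive-text` flags shown above. `build_secret_command` is a hypothetical helper name. The point is that the secret travels in the child process environment rather than in the argument list.

```python
import os
from typing import Dict, List, Tuple


def build_secret_command(
    secret: str, env_name: str = "OPENCLAW_SECRET"
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for typing a secret without putting it in argv."""
    env = dict(os.environ)   # copy so the parent environment is not modified
    env[env_name] = secret   # the secret is only visible to the child process
    argv = ["pcuse", "type", "--text-env", env_name, "--sensitive-text"]
    return argv, env


# To actually type the value (requires pcuse on PATH):
#   import subprocess
#   argv, env = build_secret_command("correct horse battery staple")
#   subprocess.run(argv, env=env, check=True)
```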

PyAutoGUI's fail-safe remains enabled, so moving the pointer to the upper-left corner can abort runaway automation.

References

  • Image analysis / vision: https://developers.openai.com/api/docs/guides/images-vision#analyze-images
  • Computer use tools: https://developers.openai.com/api/docs/guides/tools-computer-use

Analyze images

Vision lets a model see and understand images, including text inside images. It can interpret objects, shapes, colors, and textures, with some limitations.

You can send multiple images in one request, but images count as tokens and are billed accordingly.

Image input requirements

Supported file types

  • PNG (.png)
  • JPEG (.jpeg, .jpg)
  • WEBP (.webp)
  • Non-animated GIF (.gif)

Size limits

  • Up to 512 MB total payload size per request
  • Up to 1500 individual image inputs per request

Other requirements

  • No watermarks or logos
  • No NSFW content
  • Clear enough for a human to understand

Detail levels

The detail parameter controls how much visual detail the model uses:

  • low: Fast, low-cost understanding using a 512x512 version of the image
  • high: Standard high-fidelity image understanding
  • original: Best for large, dense, spatially sensitive, or computer-use images. Available on GPT-5.4 and future models
  • auto: Let the model choose

For computer use, localization, and click-accuracy tasks on GPT-5.4-family and future models, prefer detail: "original".

Prompting for coordinates vs computer use

The docs support two related but different patterns.

1. Coordinate localization from a screenshot

If you only need a click point or bounding-point estimate, the strongest guidance, combining the docs with this repo's testing, is:

  • default to gpt-5.4-mini for this workflow unless you specifically want the larger gpt-5.4 model
  • in practical testing, gpt-5.4-mini is the cheapest broadly usable click-accurate OpenAI model for this workflow
  • send the screenshot with detail: "original"
  • avoid detail: "high" or "low" for this use case
  • if you downscale before sending, remap the returned coordinates back to the original image space
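
The remapping in the last bullet is a per-axis scale. The sketch below assumes the image was resized without cropping or padding. `remap_to_original`, `sent_size`, and `original_size` are hypothetical names used only for illustration.

```python
from typing import Tuple


def remap_to_original(
    x: float,
    y: float,
    sent_size: Tuple[int, int],
    original_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a point from the downscaled image back to original pixel space.

    Sizes are (width, height). The result is clamped to valid pixel indices.
    """
    sent_w, sent_h = sent_size
    orig_w, orig_h = original_size
    mapped_x = round(x * orig_w / sent_w)   # scale each axis independently
    mapped_y = round(y * orig_h / sent_h)
    return (
        min(max(mapped_x, 0), orig_w - 1),
        min(max(mapped_y, 0), orig_h - 1),
    )


# A click at (421, 256) on a 960x540 copy of a 1920x1080 screenshot:
print(remap_to_original(421, 256, (960, 540), (1920, 1080)))  # (842, 512)
```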

I do not have a strong claim here yet about Claude; it has not been the focus of the side-by-side testing captured in this repo.

Also worth noting: Carlos's office machine has a 12 GB GPU running a Molmo2-based model that has been competitive for this use case. The public endpoint shipped with this repo exists partly so you can explore that path too.

Prompt style should stay plain and task-focused. A practical pattern is:

  • identify the target in plain language
  • ask for image pixel coordinates measured from the top-left corner
  • request JSON only if your downstream code needs a rigid schema

Example:

Find the username input box and return JSON only in this exact shape:
{"target":"username input","x":123,"y":456,"confidence":0.0,"rationale":"brief reason"}
Coordinates must be image pixel coordinates measured from the screenshot top-left corner.

That JSON schema is a useful app-side convention, not a special API requirement.
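
If you do ask for that shape, validate it before clicking. This is a minimal sketch assuming the model returned bare JSON matching the example above. `parse_target` is a hypothetical helper. The bounds check catches coordinates that fall outside the screenshot.

```python
import json
from typing import Tuple


def parse_target(raw: str, width: int, height: int) -> Tuple[int, int, float]:
    """Parse the app-side JSON convention and return (x, y, confidence).

    Raises ValueError if the point lies outside a width x height screenshot.
    """
    data = json.loads(raw)
    x, y = int(data["x"]), int(data["y"])
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"point ({x}, {y}) is outside {width}x{height}")
    return x, y, float(data.get("confidence", 0.0))
```

Rejecting out-of-bounds points up front is cheaper than discovering a bad click one screenshot later.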

2. True computer use

If you want the model to navigate the UI itself, the docs recommend the built-in computer tool loop rather than asking for raw x,y clicks.

Prompt style should again be plain-language and goal-oriented:

Check whether the Filters panel is open. If it is not open, click Show filters.
Then type penguin in the search box. Use the computer tool for UI interaction.

The documented loop is:

  1. send the task with the computer tool enabled
  2. inspect the returned computer_call
  3. execute every action in actions[] in order
  4. send back a fresh screenshot as computer_call_output
  5. repeat until the model stops returning computer_call
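
The control flow of that loop can be sketched without committing to a specific client library. Everything here is an assumption for illustration: `run_computer_loop` and its injected callables are hypothetical, and a computer_call is modelled as a plain dict with `call_id` and `actions` keys. Check the linked docs for the real request and response shapes.

```python
from typing import Any, Callable, Dict, Optional

Call = Dict[str, Any]  # assumed shape: {"call_id": str, "actions": [dict, ...]}


def run_computer_loop(
    send_task: Callable[[], Optional[Call]],
    send_output: Callable[[str, Any], Optional[Call]],
    execute: Callable[[Dict[str, Any]], None],
    capture: Callable[[], Any],
    max_turns: int = 25,
) -> int:
    """Drive the loop until no computer_call is returned. Returns turns run."""
    call = send_task()                        # 1. send the task, tool enabled
    turns = 0
    while call is not None and turns < max_turns:
        for action in call["actions"]:        # 2-3. run every action in order
            execute(action)
        # 4. answer with a fresh screenshot; 5. repeat while calls keep coming
        call = send_output(call["call_id"], capture())
        turns += 1
    return turns
```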

Computer-use vocabulary and integration modes

Documented action types include:

  • screenshot
  • click
  • double_click
  • drag
  • move
  • scroll
  • keypress
  • type
  • wait

Common action fields shown in the docs include:

  • x, y
  • button
  • path
  • scrollX, scrollY
  • keys
  • text

pcuse now supports those interaction shapes directly at the coordinate layer:

  • double_click is accepted alongside doubleclick
  • keypress is accepted alongside press / hotkey
  • drag can use --to-x/--to-y or repeated --path
  • scroll can use --scroll-x and --scroll-y
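
One way to bridge the two vocabularies is to translate a computer-tool action into a pcuse command line. This sketch uses only the CLI flags shown earlier in this README. `to_pcuse_argv` is a hypothetical helper, the `type` key and the 1.0 second default wait are assumptions, and the `button` field is ignored because no matching flag is shown here.

```python
from typing import Any, Dict, List


def to_pcuse_argv(action: Dict[str, Any], default_wait: float = 1.0) -> List[str]:
    """Translate one computer-tool style action dict into pcuse CLI arguments."""
    kind = action["type"]
    if kind in ("click", "double_click"):
        return ["pcuse", kind, "--x", str(action["x"]), "--y", str(action["y"])]
    if kind == "drag":
        argv = ["pcuse", "drag"]
        for point in action["path"]:          # each point becomes --path x,y
            argv += ["--path", f"{point['x']},{point['y']}"]
        return argv
    if kind == "scroll":
        return [
            "pcuse", "scroll",
            "--scroll-x", str(action.get("scrollX", 0)),
            "--scroll-y", str(action.get("scrollY", 0)),
            "--x", str(action["x"]), "--y", str(action["y"]),
        ]
    if kind == "keypress":
        argv = ["pcuse", "keypress"]
        for key in action["keys"]:            # repeated --key, one per key
            argv += ["--key", key]
        return argv
    if kind == "type":
        return ["pcuse", "type", "--text", action["text"]]
    if kind == "wait":
        return ["pcuse", "wait", "--seconds", str(default_wait)]
    if kind == "screenshot":
        return ["pcuse", "screenshot", "--include-image-base64"]
    raise ValueError(f"no pcuse mapping shown in this README for {kind!r}")
```

The resulting list can be passed to `subprocess.run(argv, check=True)`. Actions without a documented command here, such as move, raise instead of guessing.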

The docs describe three main integration shapes:

  1. built-in computer tool loop
  2. custom harness or tool layer, such as Playwright, Selenium, VNC, or MCP
  3. code-execution harness that mixes screenshots with scripts or DOM-based automation

Legacy preview note

Older integrations may still refer to:

  • model: computer-use-preview
  • tool: computer_use_preview

The preview request shape exposed fields such as:

  • display_width
  • display_height
  • environment: "browser"

For new implementations, prefer the GA computer tool flow.

Limitations

  • Medical images: not suitable for specialized medical interpretation or advice
  • Non-English text: may perform worse on non-Latin alphabets
  • Small text: enlarge text when possible; detail: "original" can help
  • Rotation: may misread rotated or upside-down text and images
  • Visual elements: may struggle with graphs or styling differences like dashed vs dotted lines
  • Spatial reasoning: can struggle with precise localization tasks
  • Accuracy: may generate incorrect descriptions or captions
  • Image shape: panoramic and fisheye images can be difficult
  • Metadata and resizing: original filenames and metadata are not processed; resizing may affect dimensions during analysis
  • Counting: object counts may be approximate
  • CAPTCHAs: blocked for safety reasons

Model pricing notes

These are list prices, not capability rankings. gpt-5.4-nano is cheaper on paper, but in this repo's testing gpt-5.4-mini is the cheapest OpenAI model that is still reliably usable for click-accuracy work.

GPT-5.4

  • Input: $2.50 / 1M tokens
  • Cached input: $0.25 / 1M tokens
  • Output: $15.00 / 1M tokens

GPT-5.4 mini

  • Input: $0.75 / 1M tokens
  • Cached input: $0.075 / 1M tokens
  • Output: $4.50 / 1M tokens

GPT-5.4 nano

  • Input: $0.20 / 1M tokens
  • Cached input: $0.02 / 1M tokens
  • Output: $1.25 / 1M tokens
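
To compare models for a given workload, the list prices above can be turned into a rough per-run estimate. This is a sketch: `estimate_cost` is a hypothetical helper, and it assumes `input_tokens` counts only uncached input, with cached input passed separately. Image token counts depend on the detail level and are not modelled here.

```python
# USD per 1M tokens, copied from the list prices above.
PRICES = {
    "gpt-5.4": {"input": 2.50, "cached_input": 0.25, "output": 15.00},
    "gpt-5.4-mini": {"input": 0.75, "cached_input": 0.075, "output": 4.50},
    "gpt-5.4-nano": {"input": 0.20, "cached_input": 0.02, "output": 1.25},
}


def estimate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
) -> float:
    """Return the estimated cost in USD for one workload at list price."""
    price = PRICES[model]
    total = (
        input_tokens * price["input"]
        + cached_input_tokens * price["cached_input"]
        + output_tokens * price["output"]
    )
    return total / 1_000_000


# 1M uncached input tokens plus 100k output tokens on each model:
for name in PRICES:
    print(name, round(estimate_cost(name, 1_000_000, 100_000), 4))
```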