ocr-click-plugin

v2.2.4

An Appium plugin that uses OCR (Optical Character Recognition) to find and click text elements on mobile device screens with AI-powered screen analysis

OCR Click Plugin

An Appium plugin that uses OCR (Optical Character Recognition) to find and click text elements on mobile device screens. This plugin leverages Tesseract.js for text recognition, Sharp for image enhancement, and Google Cloud Vertex AI for intelligent screen analysis.

Features

  • 🔍 Advanced OCR: Uses Tesseract.js with optimized configuration for mobile screens
  • 🖼️ Image Enhancement: Preprocessing with Sharp for better text recognition
  • 🤖 AI-Powered Analysis: Google Cloud Vertex AI integration for intelligent screen understanding
  • 🎯 Confidence Filtering: Only considers text matches above configurable confidence threshold
  • 📱 Cross-Platform: Works with both iOS (XCUITest) and Android (UiAutomator2) drivers
  • 🔧 Configurable: Customizable OCR parameters and image processing options
  • 📊 Detailed Logging: Progress tracking and confidence scores for debugging

Installation

Prerequisites

  • Node.js 14+
  • Appium 2.x
  • iOS/Android drivers installed
  • Google Cloud Project with Vertex AI API enabled (for AI features)

Install the Plugin

# Clone the repository
git clone <your-repo-url>
cd ocr-click-plugin

# Install dependencies
npm install

# Build the plugin
npm run build

# Install plugin to Appium
npm run install-plugin

Google Cloud Setup (for AI Features)

  1. Create a Google Cloud Project
  2. Enable the Vertex AI API
  3. Set up authentication (Service Account or Application Default Credentials)
  4. Set environment variables:
export GOOGLE_PROJECT_ID="your-project-id"
export GOOGLE_LOCATION="us-central1"  # or your preferred location
export GOOGLE_MODEL="gemini-1.5-flash"  # or gemini-1.5-pro
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

Development Setup

# Run development server (uninstall, build, install, and start server)
npm run dev

# Or run individual commands
npm run build
npm run reinstall-plugin
npm run run-server

API Endpoints

1. Text Click API

Find and click text elements using OCR.

POST /session/{sessionId}/appium/plugin/textclick

Parameters:

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| text | string | Yes | - | Text to search for and click |
| index | number | No | 0 | Index of match to click (if multiple matches found) |

Response:

{
  "success": true,
  "message": "Clicked on text 'Login' at index 0",
  "totalMatches": 2,
  "confidence": 87.5,
  "imageEnhanced": true
}
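The `index` parameter only matters when the same text appears more than once. The sketch below shows one way such a selection could work, assuming matches are kept in the order OCR returns them and that `index` is zero-based (the default of 0 and the sample message suggest this). `OcrMatch` and `selectMatch` are hypothetical names for illustration, not the plugin's API.

```typescript
// Hypothetical shape of one OCR match; the field names are assumptions.
interface OcrMatch {
  text: string;
  confidence: number;
}

// Return the match at `index` (default 0), or throw a descriptive error
// when the index is not a non-negative integer or is out of range.
function selectMatch(matches: OcrMatch[], index: number = 0): OcrMatch {
  if (index < 0 || Math.floor(index) !== index) {
    throw new Error(`index must be a non-negative integer, got ${index}`);
  }
  const match = matches[index];
  if (match === undefined) {
    throw new Error(`index ${index} is out of range: ${matches.length} match(es) found`);
  }
  return match;
}

// Two "Login" labels on screen; index 1 selects the second one.
const loginMatches: OcrMatch[] = [
  { text: "Login", confidence: 87.5 },
  { text: "Login", confidence: 71.2 },
];
const chosenMatch = selectMatch(loginMatches, 1);
```

Failing loudly on an out-of-range index, rather than falling back to the first match, keeps a test from silently tapping the wrong element.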

2. Text Check API

Check if text is present on screen without clicking.

POST /session/{sessionId}/appium/plugin/checktext

Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| text | string | Yes | Text to search for |

Response:

{
  "success": true,
  "isPresent": true,
  "totalMatches": 1,
  "searchText": "Submit",
  "matches": [
    {
      "text": "Submit",
      "confidence": 92.3,
      "coordinates": { "x": 200, "y": 400 },
      "bbox": { "x0": 150, "y0": 380, "x1": 250, "y1": 420 }
    }
  ],
  "imageEnhanced": true,
  "message": "Text 'Submit' found with 1 match(es)"
}
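In the sample response, `coordinates` is the midpoint of `bbox`. The source does not say how the tap point is derived, so the following is only a sketch under the assumption that it is the center of the bounding box; `bboxCenter` is a hypothetical helper name.

```typescript
// Bounding box and point shapes as they appear in the sample response.
interface BBox { x0: number; y0: number; x1: number; y1: number; }
interface Point { x: number; y: number; }

// Midpoint of the box, rounded to whole pixels.
function bboxCenter(bbox: BBox): Point {
  return {
    x: Math.round((bbox.x0 + bbox.x1) / 2),
    y: Math.round((bbox.y0 + bbox.y1) / 2),
  };
}

// The "Submit" bbox from the sample response.
const submitCenter = bboxCenter({ x0: 150, y0: 380, x1: 250, y1: 420 });
// submitCenter is { x: 200, y: 400 }, the same as "coordinates" in the sample
```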

3. AI Analysis API (NEW)

Analyze screen content using Google Cloud Vertex AI.

POST /session/{sessionId}/appium/plugin/askllm

Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| instruction | string | Yes | Natural language instruction for AI analysis |

Response:

{
  "success": true,
  "instruction": "What buttons are visible on this screen?",
  "response": {
    "candidates": [
      {
        "content": {
          "parts": [
            {
              "text": "I can see several buttons on this screen: 'Login', 'Sign Up', 'Forgot Password', and 'Help'. The Login button appears to be the primary action button."
            }
          ]
        }
      }
    ]
  },
  "message": "AI analysis completed successfully"
}
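The answer text sits several levels deep in the response. A small helper can read it without throwing when a level is missing. This is a sketch based only on the fields shown in the sample response; the type names and `extractLlmText` are hypothetical.

```typescript
// Minimal types covering only the fields shown in the sample response.
interface LlmPart { text?: string; }
interface LlmCandidate { content?: { parts?: LlmPart[] }; }
interface AskLlmResult {
  success: boolean;
  response?: { candidates?: LlmCandidate[] };
}

// Join the text parts of the first candidate; return "" if any level is absent.
function extractLlmText(result: AskLlmResult): string {
  const parts = result.response?.candidates?.[0]?.content?.parts ?? [];
  return parts.map((part) => part.text ?? "").join("");
}

const sampleResult: AskLlmResult = {
  success: true,
  response: {
    candidates: [{ content: { parts: [{ text: "Login, " }, { text: "Sign Up" }] } }],
  },
};
const answerText = extractLlmText(sampleResult); // "Login, Sign Up"
```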

Usage Examples

Mobile Commands (Recommended)

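// JavaScript/TypeScript, assuming the WebdriverIO client
import { remote } from 'webdriverio';

// `capabilities` stands for your own WebdriverIO options object
// (server connection settings plus device capabilities)
const driver = await remote(capabilities);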

// Click text using mobile command
await driver.execute('mobile: textclick', { text: 'Login', index: 0 });

// Check if text exists
const result = await driver.execute('mobile: checktext', { text: 'Welcome' });
console.log(result.isPresent); // true/false

// AI screen analysis
const aiResult = await driver.execute('mobile: askllm', { 
  instruction: 'What are the main actions a user can take on this screen?' 
});
console.log(aiResult.response.candidates[0].content.parts[0].text);

Java Examples

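import io.appium.java_client.android.AndroidDriver;
import org.openqa.selenium.remote.DesiredCapabilities;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class OCRClickExample {
    public static void main(String[] args) throws Exception {
        // Example values; adjust the server URL (including any base path)
        // and the capabilities for your setup
        URL serverUrl = new URL("http://localhost:4723/wd/hub");
        DesiredCapabilities capabilities = new DesiredCapabilities();
        capabilities.setCapability("platformName", "Android");
        capabilities.setCapability("appium:automationName", "UiAutomator2");

        AndroidDriver driver = new AndroidDriver(serverUrl, capabilities);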
        
        // Click text
        Map<String, Object> clickParams = new HashMap<>();
        clickParams.put("text", "Submit");
        clickParams.put("index", 0);
        Object result = driver.executeScript("mobile: textclick", clickParams);
        
        // Check text presence
        Map<String, Object> checkParams = new HashMap<>();
        checkParams.put("text", "Error");
        Object checkResult = driver.executeScript("mobile: checktext", checkParams);
        
        // AI analysis
        Map<String, Object> aiParams = new HashMap<>();
        aiParams.put("instruction", "Describe the layout and main elements of this screen");
        Object aiResult = driver.executeScript("mobile: askllm", aiParams);
        
        System.out.println("AI Response: " + aiResult);
    }
}

Python Examples

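from appium import webdriver
from appium.options.android import UiAutomator2Options

# Example capabilities; adjust for your device and app
options = UiAutomator2Options().load_capabilities({'platformName': 'Android'})

# Appium 2.x serves at "/" by default; keep "/wd/hub" only if the server
# was started with --base-path /wd/hub
driver = webdriver.Remote('http://localhost:4723/wd/hub', options=options)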

# Click text
result = driver.execute_script('mobile: textclick', {'text': 'Login'})
print(f"Click result: {result}")

# Check text
check_result = driver.execute_script('mobile: checktext', {'text': 'Welcome'})
print(f"Text present: {check_result['isPresent']}")

# AI analysis
ai_result = driver.execute_script('mobile: askllm', {
    'instruction': 'What form fields are visible and what information do they require?'
})
print(f"AI Analysis: {ai_result['response']['candidates'][0]['content']['parts'][0]['text']}")

Direct HTTP API

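# Replace {sessionId} with the ID of an active session. Appium 2.x serves at "/"
# by default, so drop the /wd/hub prefix unless the server was started with
# --base-path /wd/hub.

# Text click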
curl -X POST http://localhost:4723/wd/hub/session/{sessionId}/appium/plugin/textclick \
  -H "Content-Type: application/json" \
  -d '{"text": "Sign Up", "index": 0}'

# Text check
curl -X POST http://localhost:4723/wd/hub/session/{sessionId}/appium/plugin/checktext \
  -H "Content-Type: application/json" \
  -d '{"text": "Error Message"}'

# AI analysis
curl -X POST http://localhost:4723/wd/hub/session/{sessionId}/appium/plugin/askllm \
  -H "Content-Type: application/json" \
  -d '{"instruction": "What are the key UI elements and their purposes on this screen?"}'
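The three curl calls differ only in the endpoint name and the JSON body, so the request can be assembled by one function. The sketch below builds the URL and body without sending anything; `buildPluginRequest` and the session ID `abc123` are made up for illustration.

```typescript
type PluginEndpoint = "textclick" | "checktext" | "askllm";

interface PluginRequest {
  url: string;
  method: "POST";
  headers: { "Content-Type": string };
  body: string;
}

// Assemble a request for one of the plugin endpoints listed above.
function buildPluginRequest(
  baseUrl: string,
  sessionId: string,
  endpoint: PluginEndpoint,
  payload: object,
): PluginRequest {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${root}/session/${encodeURIComponent(sessionId)}/appium/plugin/${endpoint}`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}

const signUpRequest = buildPluginRequest(
  "http://localhost:4723/wd/hub/",
  "abc123", // placeholder session ID
  "textclick",
  { text: "Sign Up", index: 0 },
);
```

The result can be handed to any HTTP client; on Node.js 18+, for example, `fetch(signUpRequest.url, signUpRequest)` would send it.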

AI Analysis Use Cases

The askllm API answers free-form questions about the current screen. Typical uses:

Screen Understanding

await driver.execute('mobile: askllm', { 
  instruction: 'Describe the main purpose of this screen and its key components' 
});

Element Identification

await driver.execute('mobile: askllm', { 
  instruction: 'List all clickable buttons and their likely functions' 
});

Form Analysis

await driver.execute('mobile: askllm', { 
  instruction: 'What form fields are present and what type of information do they expect?' 
});

Error Detection

await driver.execute('mobile: askllm', { 
  instruction: 'Are there any error messages or warnings visible on this screen?' 
});

Navigation Guidance

await driver.execute('mobile: askllm', { 
  instruction: 'How would a user navigate to the settings page from this screen?' 
});

Environment Variables

Required for AI Features

# Google Cloud Configuration
GOOGLE_PROJECT_ID=your-gcp-project-id
GOOGLE_LOCATION=us-central1
GOOGLE_MODEL=gemini-1.5-flash
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

# Alternative: Use gcloud CLI authentication
# gcloud auth application-default login
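A missing variable usually shows up only when the first AI call fails, so it can help to check the environment up front. The sketch below treats the project, location, and model variables as required and leaves `GOOGLE_APPLICATION_CREDENTIALS` out, since the gcloud login above is an alternative to it. `missingAiVars` is a hypothetical helper, and whether the plugin performs a similar check is not stated in the source.

```typescript
// Variables treated as required here; credentials are omitted because
// gcloud application-default login can supply them instead.
const REQUIRED_AI_VARS = ["GOOGLE_PROJECT_ID", "GOOGLE_LOCATION", "GOOGLE_MODEL"];

type AiEnv = { [name: string]: string | undefined };

// Return the names of required variables that are unset or blank.
function missingAiVars(env: AiEnv): string[] {
  return REQUIRED_AI_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

const missingVars = missingAiVars({ GOOGLE_PROJECT_ID: "your-gcp-project-id", GOOGLE_LOCATION: "" });
// missingVars is ["GOOGLE_LOCATION", "GOOGLE_MODEL"]
```

In a Node.js process the function would be called as `missingAiVars(process.env)`.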

Optional Configuration

# OCR Configuration
OCR_CONFIDENCE_THRESHOLD=60
OCR_LANGUAGE=eng

# Image Processing
ENABLE_IMAGE_ENHANCEMENT=true
SHARP_IGNORE_GLOBAL_LIBVIPS=1

Configuration

OCR Settings

The plugin uses a Tesseract configuration tuned for mobile screens:

const TESSERACT_CONFIG = {
  lang: 'eng',
  tessedit_char_whitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?-_@#$%^&*()',
  tessedit_pageseg_mode: '6', // Uniform text block
  preserve_interword_spaces: '1',
  // ... other optimizations
};
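One consequence of the whitelist is that characters outside it, such as apostrophes or colons, should never appear in recognized text, so a search string containing them cannot match exactly. The sketch below flags such characters before a search. It assumes the whitelist is applied as configured above; `charsOutsideWhitelist` is a hypothetical helper.

```typescript
// Whitelist copied from the tessedit_char_whitelist value above.
const CHAR_WHITELIST =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?-_@#$%^&*()";

// Return the distinct characters of `text` that the whitelist excludes,
// in order of first appearance.
function charsOutsideWhitelist(text: string): string[] {
  const outside: string[] = [];
  for (const ch of text.split("")) {
    if (CHAR_WHITELIST.indexOf(ch) === -1 && outside.indexOf(ch) === -1) {
      outside.push(ch);
    }
  }
  return outside;
}

const blockedChars = charsOutsideWhitelist("Don't have an account? Sign up: it's free");
// blockedChars is ["'", ":"]
```

When the list is non-empty, searching for a shorter fragment made only of whitelisted characters (here, `Sign up`) is more likely to match.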

Confidence Threshold

The default minimum confidence threshold is 60%. Words recognized with lower confidence are filtered out:

const MIN_CONFIDENCE_THRESHOLD = 60;
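As an illustration of that filter, the sketch below keeps only the words at or above the threshold. `RecognizedWord` and `filterByConfidence` are hypothetical names, and treating a word at exactly 60 as kept is an assumption drawn from the phrase "below this confidence".

```typescript
interface RecognizedWord {
  text: string;
  confidence: number; // percent, as reported by OCR
}

const MIN_CONFIDENCE_THRESHOLD = 60;

// Keep words whose confidence is at or above the threshold.
function filterByConfidence(
  words: RecognizedWord[],
  minConfidence: number = MIN_CONFIDENCE_THRESHOLD,
): RecognizedWord[] {
  return words.filter((word) => word.confidence >= minConfidence);
}

const keptWords = filterByConfidence([
  { text: "Submit", confidence: 92.3 },
  { text: "Subrnit", confidence: 41.0 }, // low-confidence misread, dropped
  { text: "Cancel", confidence: 60 },
]);
// keptWords contains "Submit" and "Cancel"
```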

Image Enhancement

The plugin applies several image processing steps:

  1. Grayscale conversion - Reduces noise
  2. Normalization - Enhances contrast
  3. Sharpening - Improves text clarity
  4. Gamma correction - Better text contrast
  5. Median filtering - Removes noise
  6. Binary thresholding - Clear text separation
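To make steps 1 and 6 concrete, the sketch below applies them to a few pixels in plain TypeScript. It is an illustration of the operations only: the plugin does this work with Sharp, and the luma weights (ITU-R BT.601) and the cutoff of 128 are assumptions rather than the plugin's settings.

```typescript
type Rgb = [number, number, number];

// Step 1: collapse each RGB pixel to a single brightness value.
function toGrayscale(pixels: Rgb[]): number[] {
  return pixels.map(([r, g, b]) => Math.round(0.299 * r + 0.587 * g + 0.114 * b));
}

// Step 6: map every value to pure white (255) or pure black (0).
function binarize(gray: number[], cutoff: number = 128): number[] {
  return gray.map((value) => (value >= cutoff ? 255 : 0));
}

const samplePixels: Rgb[] = [
  [255, 255, 255], // white background
  [20, 20, 20],    // dark text
  [200, 60, 60],   // red accent
];
const grayValues = toGrayscale(samplePixels); // [255, 20, 102]
const binaryValues = binarize(grayValues);    // [255, 0, 0]
```

After thresholding, the dark text and the red accent both become black on a white background, which is the kind of high-contrast input OCR engines handle best.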

Troubleshooting

Google Cloud Setup Issues

Authentication Error:

# Set up application default credentials
gcloud auth application-default login

# Or use service account
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

API Not Enabled:

# Enable Vertex AI API
gcloud services enable aiplatform.googleapis.com

Model Not Available: Try different model names:

  • gemini-1.5-flash (faster, cheaper)
  • gemini-1.5-pro (more capable)
  • gemini-1.0-pro-vision (legacy)

Sharp Installation Issues

If you encounter Sharp compilation errors during installation, especially with Node.js v24+:

# Method 1: Use environment variable
SHARP_IGNORE_GLOBAL_LIBVIPS=1 npm install ocr-click-plugin

# Method 2: Install Sharp separately first
SHARP_IGNORE_GLOBAL_LIBVIPS=1 npm install --include=optional sharp
npm install ocr-click-plugin

# Method 3: For Appium plugin installation
SHARP_IGNORE_GLOBAL_LIBVIPS=1 appium plugin install ocr-click-plugin

Text Not Found

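  • Check confidence threshold: Lower the threshold if text is not being detected (the MIN_CONFIDENCE_THRESHOLD constant, default 60; see also OCR_CONFIDENCE_THRESHOLD under Optional Configuration)
  • Verify text spelling: The search text must match the recognized text exactly, apart from letter case
  • Check image quality: Blurry or low-contrast screenshots reduce OCR accuracy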

Inconsistent Results

  • Image enhancement: The plugin includes advanced preprocessing to improve consistency
  • Confidence filtering: Only high-confidence matches are considered
  • Character whitelist: Limits recognition to expected characters

Performance Issues

  • Reduce image size: Large screenshots take longer to process
  • Optimize configuration: Adjust Tesseract parameters for your use case
  • Check device performance: Ensure adequate resources

Development

Project Structure

ocr-click-plugin/
├── src/
│   └── index.ts          # Main plugin implementation
├── dist/                 # Compiled JavaScript
├── package.json          # Dependencies and scripts
├── tsconfig.json         # TypeScript configuration
└── README.md            # This file

Building

npm run build

Testing

npm test

Available Scripts

npm run dev          # Full development workflow
npm run build        # Compile TypeScript
npm run install-plugin    # Install to Appium
npm run reinstall-plugin  # Uninstall and reinstall
npm run run-server   # Start Appium server
npm run uninstall    # Remove from Appium

Technical Details

Dependencies

  • @appium/base-plugin: Appium plugin framework
  • tesseract.js: OCR engine
  • sharp: Image processing
  • typescript: Development language

Supported Platforms

  • ✅ Android (UiAutomator2)
  • ✅ iOS (XCUITest)

Image Processing Pipeline

  1. Capture screenshot via Appium driver
  2. Convert to grayscale for better OCR
  3. Apply normalization and sharpening
  4. Gamma correction for text contrast
  5. Noise reduction with median filter
  6. Binary threshold for clear text separation
  7. OCR recognition with Tesseract
  8. Confidence filtering and text matching
  9. Coordinate calculation and click action

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the ISC License - see the LICENSE file for details.

Changelog

Version 1.0.0

  • Initial release with OCR text detection and clicking
  • Advanced image preprocessing for better accuracy
  • Confidence-based filtering for consistent results
  • Support for multiple text matches with index selection
  • Comprehensive logging and error handling