# gal-scraper
TypeScript SDK for the docs-scraper API.
## Installation
```sh
npm install gal-scraper
```

## Requirements
- Node.js 18+ (uses native fetch)
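Because the SDK relies on the runtime's global `fetch`, it can help to fail fast on older runtimes instead of hitting an opaque `ReferenceError` on the first request. A minimal guard (not part of the SDK; the `fetch` client option shown under Configuration Options is the supported alternative):

```ts
// Fail fast when the runtime lacks a global fetch (Node.js < 18).
if (typeof globalThis.fetch !== 'function') {
  throw new Error(
    'gal-scraper needs Node.js 18+ (global fetch) or a custom fetch implementation',
  );
}
```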
## Usage

### Basic Example
```ts
import { DocsScraperClient } from 'gal-scraper';

const client = new DocsScraperClient({
  baseUrl: 'https://your-docs-scraper-instance.com',
  apiSecret: 'your-api-secret',
});

// Submit a scrape job
const { jobId } = await client.scrape({
  url: 'https://docs.example.com/page',
});

// Check job status
const job = await client.getJob(jobId);
console.log(job.status); // 'queued' | 'processing' | 'blocked' | 'completed' | 'failed' | 'cancelled'

// Get results when completed
if (job.status === 'completed') {
  const { job: completedJob } = await client.getResults(jobId);
  console.log(completedJob.result?.signedUrl);
}
```

### Handling Blocked Jobs
When a scrape job encounters authentication or other blocking requirements:
```ts
const job = await client.getJob(jobId);

if (job.status === 'blocked' && job.blockingReason) {
  console.log(job.blockingReason.type); // 'auth' | 'captcha' | 'custom'
  console.log(job.blockingReason.fields); // Fields that need to be filled

  // Provide the required data with type safety
  await client.update<{ email: string; password: string }>({
    jobId,
    data: {
      email: '[email protected]',
      password: 'password123',
    },
  });
}
```

### Providing Data Upfront
You can provide authentication data upfront when submitting a scrape job:
```ts
interface LoginCredentials {
  username: string;
  password: string;
}

const { jobId } = await client.scrape<LoginCredentials>({
  url: 'https://protected-docs.example.com/page',
  data: {
    username: 'user',
    password: 'pass',
  },
});
```

### Session Management
Sessions allow reusing browser state across scrape jobs:
```ts
// List all sessions
const { sessions } = await client.listSessions();

// Use an existing session
const { jobId } = await client.scrape({
  url: 'https://docs.example.com/another-page',
  sessionId: sessions[0].id,
});

// Delete a specific session
await client.deleteSession(sessions[0].id);

// Clear all sessions
await client.clearSessions();
```

### Job Management
```ts
// List jobs with pagination
const { jobs, total, page, limit } = await client.listJobs({
  page: 1,
  limit: 20,
});

// Cancel a job
const cancelledJob = await client.cancelJob(jobId);

// Delete a job
await client.deleteJob(jobId);
```

### Health Check
```ts
// No authentication required
const health = await client.health();
console.log(health.status); // 'ok'
```

## Error Handling
```ts
import {
  DocsScraperClient,
  AuthenticationError,
  NotFoundError,
  ValidationError,
  ConflictError,
  NetworkError,
  TimeoutError,
} from 'gal-scraper';

try {
  await client.getJob('non-existent-job');
} catch (error) {
  if (error instanceof NotFoundError) {
    console.log('Job not found');
  } else if (error instanceof AuthenticationError) {
    console.log('Invalid API secret');
  } else if (error instanceof ValidationError) {
    console.log('Invalid request:', error.message);
  } else if (error instanceof ConflictError) {
    console.log('Conflict:', error.message);
  } else if (error instanceof NetworkError) {
    console.log('Network error:', error.message);
  } else if (error instanceof TimeoutError) {
    console.log('Request timed out');
  }
}
```

## Configuration Options
```ts
const client = new DocsScraperClient({
  // Required
  baseUrl: 'https://your-instance.com',
  apiSecret: 'your-secret',

  // Optional
  timeout: 60000, // Request timeout in ms (default: 30000)
  fetch: customFetch, // Custom fetch implementation
});
```

## API Reference
### Client Methods

#### Scrape

- `scrape<TData>(options)` - Submit a new scrape job
- `update<TData>(options)` - Update a blocked job with provided data
- `getResults(jobId)` - Get the results of a scrape job
#### Jobs

- `listJobs(options?)` - List all jobs with pagination
- `getJob(jobId)` - Get a specific job by ID
- `cancelJob(jobId)` - Cancel a job
- `deleteJob(jobId)` - Delete a job
#### Sessions

- `listSessions()` - List all sessions
- `deleteSession(sessionId)` - Delete a specific session
- `clearSessions()` - Clear all sessions
#### Health

- `health()` - Check API health (no auth required)
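Putting the job methods above together, a typical submit-poll-fetch flow might look like the sketch below. `scrapeAndWait` and `MinimalClient` are illustrative names, not part of the SDK; the interface narrows the client to just the methods this helper touches, so it works against any object with the same shape.

```ts
// Sketch: drive a scrape job from submission to a signed result URL.
interface MinimalClient {
  scrape(options: { url: string }): Promise<{ jobId: string }>;
  getJob(jobId: string): Promise<{ status: string; result?: { signedUrl?: string } }>;
}

async function scrapeAndWait(
  client: MinimalClient,
  url: string,
  intervalMs = 1000,
): Promise<string | undefined> {
  const { jobId } = await client.scrape({ url });
  for (;;) {
    const job = await client.getJob(jobId);
    if (job.status === 'completed') return job.result?.signedUrl;
    if (job.status === 'failed' || job.status === 'cancelled' || job.status === 'blocked') {
      throw new Error(`Job ${jobId} stopped with status '${job.status}'`);
    }
    // Still queued or processing: wait before polling again
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With a real client this is just `await scrapeAndWait(client, 'https://docs.example.com/page')`. Note that a blocked job surfaces as an error here; a fuller helper would call `update` as shown under Handling Blocked Jobs.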
### Types
All types are exported from the package:
```ts
import type {
  Job,
  JobStatus,
  JobResult,
  BlockingReason,
  BlockingField,
  Session,
  ScrapeOptions,
  UpdateOptions,
  ClientOptions,
  // ... and more
} from 'gal-scraper';
```

## License
MIT
