greatescrape v0.1.1
👀GreatEscrape CLI for geographically decomposed web scraping with OSM, DuckDuckGo, Ollama, PinchTab, and Obsidian-friendly output.
# 👀GreatEscrape
👀GreatEscrape is a CLI that fans a search out across geographic subdivisions, expands the search phrase with lightweight synonym generation, collects candidate pages from DuckDuckGo, deterministically extracts contact details, and writes SQLite state, Obsidian-friendly notes, and raw text logs.
It now also supports PinchTab as an optional rendered-page backend, so JavaScript-heavy targets can be extracted through a real browser without giving up the default fast HTTP path.
- Brand: 👀GreatEscrape
- Command: `greatescrape`
## What it does
For a command like:
```
greatescrape bars toronto contact
```

the pipeline is:
- Resolve `toronto` with OpenStreetMap Nominatim.
- Expand Toronto into nearby subdivisions with Overpass.
- Generate search variants for `bars` through Ollama, with a fallback if Ollama is unavailable.
- Search DuckDuckGo for every `(variant x subdivision)` query.
- Fetch the pages and extract emails, phones, hours, addresses, and relevant snippets.
- Save structured JSON, Obsidian markdown notes, and raw text logs.
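The fan-out step above can be sketched roughly like this. The function name and shapes are illustrative, not the actual greatescrape internals:

```typescript
// Hypothetical sketch: every (variant x subdivision) pair becomes one
// DuckDuckGo query, so total queries = variants.length * subdivisions.length.
function buildQueries(variants: string[], subdivisions: string[]): string[] {
  const queries: string[] = [];
  for (const variant of variants) {
    for (const subdivision of subdivisions) {
      queries.push(`${variant} ${subdivision}`);
    }
  }
  return queries;
}

// Example: 2 variants x 3 subdivisions = 6 search queries.
const queries = buildQueries(
  ["bars", "pubs"],
  ["Toronto", "Etobicoke", "Scarborough"],
);
```

This is why `--max-locations` and `--max-results` matter: the query count grows multiplicatively with subdivisions and variants.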
If PinchTab is configured, page extraction can:
- stay on the fast HTTP extractor when direct fetches already expose contact signals
- fall back to a rendered browser capture for JS-heavy pages
- run fully through PinchTab when you need browser-only extraction
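The decision between those paths can be sketched as follows (an illustrative model of the behavior described above, not the actual source):

```typescript
// "always" routes every page through the PinchTab browser; "auto" only
// renders when the fast HTTP extractor found no contact signals.
type PinchTabMode = "auto" | "always";

function shouldRenderWithPinchTab(
  mode: PinchTabMode,
  httpFoundContactSignals: boolean,
): boolean {
  if (mode === "always") return true; // browser-only extraction
  return !httpFoundContactSignals;    // auto: fall back only when HTTP came up empty
}
```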
## Install
```
npm install -g greatescrape
```

From source:

```
git clone https://github.com/lmtlssss/greatescrape.git
cd greatescrape
npm install
npm run build
npm link
```

## Usage
```
greatescrape <term> <location> [detail]
greatescrape /term ... /detail ... /location ...
greatescrape watch <job-path>
greatescrape status <job-path>
greatescrape token <job-path>
greatescrape serve <job-path>
greatescrape vps <job-path> --ssh-host greatescrape.lmtlssss.fun
```

Examples:
```
greatescrape bars toronto contact
greatescrape /term dog walker /detail contact /location parkdale toronto
greatescrape /term cafe /detail menu /location germany
greatescrape "ski resort" montana contact --max-locations 100 --max-results 4
greatescrape brewery quebec contact --vault ~/Obsidian/MyVault
greatescrape "roofing company" chicago contact --pinchtab-url http://127.0.0.1:9867
greatescrape "private dentist" london contact --pinchtab-url http://127.0.0.1:9867 --pinchtab-mode always
greatescrape watch ./greatescrape-output/my-job
greatescrape token ./greatescrape-output/my-job
greatescrape serve ./greatescrape-output/my-job
```

The older `greatescrape run ...` form still works as a compatibility alias.
Slash sections capture every word until the next slash section, so multi-word locations and terms do not need rigid positional parsing.
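That parsing rule can be sketched like this (a hypothetical helper, not the actual CLI parser):

```typescript
// Each /key starts a section; every following word belongs to it until
// the next /key, so multi-word values need no quoting.
function parseSlashSections(argv: string[]): Record<string, string> {
  const sections: Record<string, string> = {};
  let current: string | null = null;
  for (const word of argv) {
    if (word.startsWith("/")) {
      current = word.slice(1);
      sections[current] = "";
    } else if (current !== null) {
      sections[current] = sections[current] ? `${sections[current]} ${word}` : word;
    }
  }
  return sections;
}

// parseSlashSections(["/term", "dog", "walker", "/detail", "contact",
//   "/location", "parkdale", "toronto"])
// -> { term: "dog walker", detail: "contact", location: "parkdale toronto" }
```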
## Options
- `--detail <type>`: override the optional positional detail value.
- `--vault <path>`: write Obsidian notes into an existing vault location.
- `--output <path>`: base output directory for the run.
- `--ollama-model <name>`: model to use for synonyms and parsing.
- `--max-locations <number>`: cap geographic subdivisions considered.
- `--max-results <number>`: cap DuckDuckGo results fetched per query.
- `--synonyms <number>`: number of synonym variants requested from Ollama.
- `--concurrency <number>`: page fetch concurrency.
- `--pinchtab-url <url>`: optional PinchTab server used for rendered-page extraction.
- `--pinchtab-instance <id>`: reuse a specific PinchTab instance instead of auto-selecting a running one.
- `--pinchtab-mode <auto|always>`: `auto` keeps the fast HTTP extractor first and falls back to PinchTab when needed; `always` captures every page through PinchTab.
Environment equivalents:
- `GREATESCRAPE_PINCHTAB_URL`
- `GREATESCRAPE_PINCHTAB_INSTANCE_ID`
- `GREATESCRAPE_PINCHTAB_MODE`
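A minimal sketch of how such an equivalent is typically resolved, assuming the CLI flag takes precedence over the environment variable when both are set (an assumption; the docs only state they are equivalents):

```typescript
// Hypothetical resolver: explicit flag value wins, otherwise fall back
// to the environment variable, otherwise undefined (PinchTab disabled).
function resolvePinchTabUrl(
  flagValue: string | undefined,
  env: Record<string, string | undefined>,
): string | undefined {
  return flagValue ?? env.GREATESCRAPE_PINCHTAB_URL;
}
```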
## Output
Each run creates:
- `job.sqlite`
- `job.log`
- `run-summary.json`
- `search-results.json`
- `entities.json`
- `raw/*.txt`
- `obsidian/entities/*.md`
The markdown notes use YAML frontmatter so they work well with Obsidian Dataview-style queries.
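A generated note might look roughly like this (field names and values are illustrative; the real schema may differ):

```markdown
---
name: Example Bar
location: Toronto
emails:
  - hello@examplebar.ca
phones:
  - "+1 416 555 0100"
source: https://examplebar.ca/contact
---

Extracted hours, addresses, and snippets follow as plain markdown body text.
```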
## Web Shell
The static public dashboard shell lives in `site/`. It is designed to deploy directly to Netlify and stay zero-knowledge by default:
- Run `greatescrape ...` and keep the access token the CLI prints.
- Run `greatescrape serve <job-path>` to expose the local gateway for that job.
- If you need the token again later, run `greatescrape token <job-path>` locally.
- Open the hosted shell and paste your gateway URL plus access token into the form.
The browser talks directly to your gateway using `x-greatescrape-token`, so the hosted shell does not need your raw crawl data copied into Netlify. The hosted shell no longer uses token-in-URL launch links.
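A sketch of the client side of this: the token travels in the header, never in the URL. The endpoint path and request shape here are assumptions for illustration:

```typescript
// Hypothetical helper: build fetch() options carrying the access token
// in the x-greatescrape-token header.
function gatewayRequestInit(token: string): { headers: Record<string, string> } {
  return { headers: { "x-greatescrape-token": token } };
}

// Usage (hypothetical endpoint):
//   fetch(`${gatewayUrl}/status`, gatewayRequestInit(token))
```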
## PinchTab
PinchTab fits into GreatEscrape as a browser-sidecar, not as a replacement for the whole crawl engine.
Typical setup:
```
pinchtab
greatescrape "family lawyer" toronto contact --pinchtab-url http://127.0.0.1:9867
```

Behavior:
- GreatEscrape reuses a running PinchTab instance when one is already available.
- If the server exposes orchestrator endpoints and no instance is running, GreatEscrape starts a temporary headless instance and stops it when the job ends.
- If you need a specific logged-in browser session, pass `--pinchtab-instance <id>`.
## Notes
- The current MVP uses DuckDuckGo HTML results and deterministic extraction first.
- Known directory and social pages are kept in `search-results.json`, but the final entity dataset is biased toward pages with extracted contact signals.
- Ollama is optional. If the local model is unavailable, 👀GreatEscrape falls back to simple term variants and regex/schema extraction.
- PinchTab extraction uses rendered page text, so it is especially helpful for client-rendered sites and browser-only content.
- Overpass queries can still hit practical limits on extremely large geographies. The code is structured so deeper recursive chunking can be added cleanly next.
