weixin-sogou-cdp-cli
v0.1.0
Published
CDP CLI for Sogou Weixin.
Readme
Sogou Weixin CDP CLI
CLI tool for Sogou Weixin article and account search workflows.
The project was generated from the online template repository https://github.com/meomeo-dev/cdp-cli-template and now includes a Sogou Weixin-specific adapter for searching WeChat articles and official accounts.
Features
- Inspect the Sogou Weixin homepage and verify the live search form selector.
- Search WeChat articles with Sogou's
type=2result page. - Search official accounts with Sogou's
type=1result page. - Save readable rendered pages into editable Markdown with Firefox Readability-style extraction.
- Identify visible page controls and run guarded element activation when explicitly requested.
- Return normalized JSON for titles, summaries, URLs, account names, publish text, images, and diagnostics.
- Expose the same workflows through JSON-RPC over stdin/stdout.
Install
cd ~/projects/weixin-sogou-cdp-cli
npm install
npm run buildRun from source:
npm run dev -- describeRun after build:
node dist/src/cli.js describeBrowser Options
Common options work on page-backed commands:
--session <slug>is required for page-backed commands and isolates state by session slug.--cdp-url <url>attaches to an existing Chrome CDP endpoint and does not close that browser.--chrome-path <path>selects Google Chrome for managed launches.--user-data-dir <path>is an advanced root directory for session browser data.--headlesslaunches Chrome in headless mode when--cdp-urlis not provided.--timeout-ms <ms>controls CDP operation timeout.
--headless is covered by the same mainline 搜索 → 翻页 → 点击 → 阅读 workflow as headed Chrome. The browser runtime aligns UA, locale, timezone, platform, viewport, and screen geometry before page logic runs; live e2e coverage exercises page-2 finance/code/paper samples in headless mode. Site availability and live anti-spider decisions can still vary, so use headed or attached Chrome when you need visual debugging.
Managed launches only auto-detect Google Chrome or Google Chrome Canary. Chromium is not used as a fallback. If Google Chrome is not installed in a standard location, set --chrome-path or CHROME_PATH.
Use a different --session value for each parallel search. Missing sessions are rejected to avoid accidentally sharing browser state.
Attach to an existing Chrome:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--remote-debugging-port=9222 \
--user-data-dir=/tmp/weixin-sogou-cdp-attach-demo
npm run dev -- --session attach-demo --cdp-url http://127.0.0.1:9222 inspect-homeCommands
wxsg is the short command alias. weixin-sogou-cdp remains available as the descriptive long command.
wxsg describe
wxsg --session demo --headless inspect-home
wxsg --session demo --headless search "AI" --type article --limit 5
wxsg --session demo --headless search "AI" --type article --limit 20 --pages 2
wxsg --session demo --headless search "AI" --type article --limit 5 --page 2
wxsg --session demo --headless search "人民日报" --type account --limit 5
ARTICLE_URL=$(wxsg --session demo --headless search "AI" --type article --limit 20 --pages 2 --format json | jq -r '.items[] | select(.source.page == 2) | .href' | head -n 1)
wxsg --session demo --headless read "$ARTICLE_URL" --from-search
wxsg --session demo --headless identify-elements "https://example.com" --intent "Learn more"
wxsg --session demo --headless activate "https://example.com" --intent "Learn more" --dry-run
wxsg --session demo --headless click "https://example.com" --intent "Learn more" --dry-run
wxsg --session demo --headless rpcSource-mode equivalents:
npm run dev -- --session demo --headless inspect-home --format json
npm run dev -- --session demo --headless search "AI" --type article --limit 5 --format json
npm run dev -- --session demo --headless search "AI" --type article --limit 20 --pages 2 --format json
npm run dev -- --session demo --headless search "AI" --type article --limit 5 --page 2 --format json
npm run dev -- --session demo --headless search "人民日报" --type account --limit 5 --format json
ARTICLE_URL=$(npm run --silent dev -- --session demo --headless search "AI" --type article --limit 20 --pages 2 --format json | jq -r '.items[] | select(.source.page == 2) | .href' | head -n 1)
npm run --silent dev -- --session demo --headless read "$ARTICLE_URL" --from-search --format json
npm run dev -- --session demo --headless identify-elements "https://example.com" --intent "Learn more"
npm run dev -- --session demo --headless activate "https://example.com" --intent "Learn more" --dry-runJSON-RPC
wxsg rpc reads one JSON-RPC 2.0 request per line from stdin.
printf '%s\n' '{"jsonrpc":"2.0","id":1,"method":"system.describe"}' | npm run dev -- --session rpc-demo rpc
printf '%s\n' '{"jsonrpc":"2.0","id":2,"method":"site.search","params":{"query":"AI","type":"article","limit":3,"pages":2}}' | npm run dev -- --session rpc-demo --headless rpc
printf '%s\n' '{"jsonrpc":"2.0","id":3,"method":"page.read","params":{"url":"<href returned by site.search>","fromSearch":true}}' | npm run dev -- --session rpc-demo --headless rpc
printf '%s\n' '{"jsonrpc":"2.0","id":5,"method":"ui.identify","params":{"url":"https://example.com","intent":"Learn more","limit":3}}' | npm run dev -- --session rpc-demo --headless rpc
printf '%s\n' '{"jsonrpc":"2.0","id":6,"method":"ui.activate","params":{"url":"https://example.com","intent":"Learn more","dryRun":true}}' | npm run dev -- --session rpc-demo --headless rpc
printf '%s\n' '{"jsonrpc":"2.0","id":7,"method":"ui.click","params":{"url":"https://example.com","intent":"Learn more","dryRun":true}}' | npm run dev -- --session rpc-demo --headless rpcMethods:
system.describesite.inspectHomesite.searchwith{ "query": "AI", "type": "article", "limit": 3, "pages": 2 }page.readwith{ "url": "<href returned by site.search>", "fromSearch": true, "includeHtml": false }ui.identifywith{ "url": "https://example.com", "intent": "Learn more", "limit": 3 }ui.activatewith{ "url": "https://example.com", "intent": "Learn more", "dryRun": true }ui.click, a compatibility alias forui.activate
type defaults to article; supported values are article and account.
Search starts from the Sogou Weixin homepage, fills the search box, and submits the search form. It does not directly construct the first results URL. Pagination uses the result page's next-page control.
--pages <count>caps automatic pagination and is capped at 10. The default maximum is 10 pages.--page <number>navigates to a specific result page and is also capped at 10.- Pagination diagnostics include each visited page plus visible page-number window fields:
currentPage,visibleMin,visibleMax, andnextActionable. - Each search stores a session snapshot under the session root and adds per-item breadcrumbs: result page, result index, result page URL, item selector, and link selector.
Readable Markdown
wxsg read replays a stored Sogou result click in Chrome, waits for the resolved WeChat article DOM, extracts the main readable article with @mozilla/readability, normalizes article semantics, and converts the cleaned article HTML to Markdown with turndown plus GFM table rules.
This is the closest CLI equivalent to browser-extension “clip article” behavior while preserving the real 搜索 → 翻页 → 点击 → 阅读 path: it keeps the main title/body and drops most navigation, ads, and chrome. The Markdown output starts with the article title and source metadata so it is ready to save or edit. The exporter preserves fenced code blocks, inline code, ATX headings, bold text, and GFM tables where article HTML semantics allow it; it also annotates image captions and long-image notes when detectable. Use --format json --include-html if you need metadata and the cleaned Readability HTML alongside Markdown.
SEARCH_JSON=$(wxsg --session read-demo --headless search "AI" --type article --limit 20 --pages 2 --format json)
ARTICLE_URL=$(printf '%s' "$SEARCH_JSON" | jq -r '.items[] | select(.source.page == 2) | .href' | head -n 1)
wxsg --session read-demo --headless read "$ARTICLE_URL" --from-search > article.md
wxsg --session read-demo --headless read "$ARTICLE_URL" --from-search --format json --include-html
wxsg --session read-demo --headless read "$ARTICLE_URL" --from-search --output-dir ./article-bundle --save-assets
wxsg --session read-demo --headless read "$ARTICLE_URL" --from-search --save-assets --format jsonWhen --save-assets is provided, read writes a bundle. If --output-dir is omitted, the default bundle root is:
./微信文章/<YYYY-MM-DD>/<账号>_<YYYY-MM-DD>_<标题>/Path segments are normalized to strip filesystem-reserved characters, collapse repeated whitespace, and keep readable Chinese text where possible.
When an explicit --output-dir is provided, read writes a bundle with this structure:
article-bundle/
article.md
metadata.json
readable.html
preview.html
assets/
image-001.png
image-002.jpgarticle.mdis the rewritten Markdown for local reading.metadata.jsonstores source URL, export time, bundle paths, asset summary, and per-asset save status.readable.htmlstores the cleaned Readability article HTML in a readable local wrapper.preview.htmlrendersarticle.mdas a lightweight local preview for layout review.assets/contains downloaded article media when--save-assetsis enabled.- If
--save-assetsis omitted, image links stay remote and no local bundle is written unless--output-diris explicitly provided. - Direct opens of
weixin.sogou.com/*andmp.weixin.qq.com/*are intentionally disabled forread; WeChat exports must go throughsearch+read --from-search.
For Sogou result URLs, use --from-search after running search with the same --session. This keeps reading based on the real 搜索 → 翻页 → 点击 → 阅读 workflow(基于真实搜索→翻页→点击→阅读流程), instead of opening transient result links directly: the reader loads the stored snapshot, replays the original search, paginates to the recorded result page, finds the recorded clickable result link and click rectangle, clicks it, follows same-tab or new-tab navigation, then runs Readability on the resolved content page.
SEARCH_JSON=$(wxsg --session bridge-demo --headless search "AI" --type article --limit 30 --pages 10)
ARTICLE_URL=$(printf '%s' "$SEARCH_JSON" | jq -r '.items[20].href')
wxsg --session bridge-demo --headless read "$ARTICLE_URL" --from-search > article.mdIf search traversed to page 10 but the chosen URL came from page 3, read --from-search uses the stored page-3 breadcrumb instead of the current page, replays the original query to page 3, and fails if the clicked result can no longer be found there.
Implementation notes from tool research:
@mozilla/readabilityis mature, Apache-2.0, and matches Firefox Reader View behavior; it extracts and cleans article HTML, not Markdown.turndownis MIT-licensed and keeps the dependency footprint small for HTML-to-Markdown conversion.turndown-plugin-gfmadds GitHub Flavored Markdown table support on top of Turndown.- Chrome/Puppeteer remains necessary for JavaScript-rendered pages because extractors work best on rendered DOM.
defuddleis a promising MIT-licensed alternative that can emit Markdown directly, but this CLI currently uses the more established Readability/Turndown pair.
Element Actions
identify-elements and activate provide a guarded element action layer for explicit user-selected page controls. click remains available as a compatibility alias for activate.
The flow is:
- Collect visible candidates from links, buttons, inputs, ARIA roles, tabindex targets, contenteditable nodes, and labels.
- Score candidates by visibility, enabled/disabled state, actionability, selector match, and intent text match.
- For
activate, choose the best actionable candidate, scroll/focus it, then run the guarded activation. - Report the chosen candidate, all considered candidates, URL before/after, and whether navigation was observed.
wxsg --session ui-demo --headless identify-elements "https://example.com" --intent "Learn more" --limit 3
wxsg --session ui-demo --headless activate "https://example.com" --intent "Learn more" --dry-run
wxsg --session ui-demo --headless click "https://example.com" --intent "Learn more" --dry-runUse --selector to constrain recognition when the intent is ambiguous:
wxsg --session ui-demo --headless activate "https://example.com" --selector "a" --intent "Learn more" --dry-runOutput Shape
Search returns:
{
"query": "AI",
"type": "article",
"items": [
{
"kind": "article",
"title": "Result title",
"text": "Result summary",
"href": "https://weixin.sogou.com/link?...",
"accountName": "Account name",
"publishedAt": "Published text",
"imageUrl": "https://...",
"source": {
"page": 3,
"resultIndex": 1,
"resultPageUrl": "https://weixin.sogou.com/weixin?...&page=3",
"itemSelector": "div > ul > li:nth-of-type(1)",
"linkSelector": "div > ul > li:nth-of-type(1) > div > h3 > a"
}
}
],
"session": {
"searchId": "20260430T120000000Z-12345678",
"storedAt": "2026-04-30T12:00:00.000Z"
},
"diagnostics": {
"networkIdle": "observed",
"noResult": false,
"verificationDetected": false,
"searchMode": "ui-form",
"maxPages": 2,
"visitedPages": [
{
"page": 1,
"items": 10,
"pagination": {
"currentPage": 1,
"visibleMin": 1,
"visibleMax": 10,
"nextActionable": true,
"nextSelector": "#sogou_next"
}
}
]
}
}Read returns:
{
"url": "https://example.com/article",
"title": "Article title",
"byline": "Author",
"length": 1234,
"markdown": "# Article title\n\nSource: https://example.com/article\n\n...",
"textContent": "Plain extracted text...",
"diagnostics": {
"reader": "readability",
"markdown": "turndown",
"source": "rendered-dom"
}
}If Sogou Weixin presents a site verification page, the adapter reports SITE_ACTION_FAILED and stops.
Site Contract
Default selectors live in src/infrastructure/site/siteRegistry.ts:
- Ready selector:
form[name="searchForm"] - Search input selector:
#query - Article result selector:
.news-box .news-list > li - No-result selector:
.b404-box - Verification selector:
#seccodeInput, .seccodeImage, input[name="c"]
Environment variables can override selectors during exploration:
SITE_READY_SELECTOR='form[name="searchForm"]' \
SITE_SEARCH_INPUT_SELECTOR="#query" \
SITE_RESULT_ITEMS_SELECTOR=".news-box .news-list > li" \
npm run dev -- --session selector-demo --headless search "AI"Development
npm run lint
npm run typecheck
npm run spec:validate
npm run test
npm run test:contract
npm run quality:checkRun live e2e checks only when network access and Sogou Weixin availability are acceptable:
npm run test:e2eRelease Smoke
npm run release:preflight
npm pack --pack-destination /tmp .
TMPDIR=$(mktemp -d /tmp/weixin-sogou-cdp-install-XXXXXX)
cd "$TMPDIR"
npm init -y >/dev/null
npm install /tmp/weixin-sogou-cdp-cli-0.1.0.tgz
./node_modules/.bin/weixin-sogou-cdp --version
./node_modules/.bin/wxsg describe --format json