weixin-sogou-cdp-cli

v0.1.0

Published

16 hours ago

CDP CLI for Sogou Weixin.

0High
0Medium
0Low

meomeodev

Sogou Weixin CDP CLI

CLI tool for Sogou Weixin article and account search workflows.

The project was generated from the online template repository https://github.com/meomeo-dev/cdp-cli-template and now includes a Sogou Weixin-specific adapter for searching WeChat articles and official accounts.

Features

Inspect the Sogou Weixin homepage and verify the live search form selector.
Search WeChat articles with Sogou's type=2 result page.
Search official accounts with Sogou's type=1 result page.
Save readable rendered pages into editable Markdown with Firefox Readability-style extraction.
Identify visible page controls and run guarded element activation when explicitly requested.
Return normalized JSON for titles, summaries, URLs, account names, publish text, images, and diagnostics.
Expose the same workflows through JSON-RPC over stdin/stdout.

Install

cd ~/projects/weixin-sogou-cdp-cli
npm install
npm run build

Run from source:

npm run dev -- describe

Run after build:

node dist/src/cli.js describe

Browser Options

Common options work on page-backed commands:

--session <slug> is required for page-backed commands and isolates state by session slug.
--cdp-url <url> attaches to an existing Chrome CDP endpoint and does not close that browser.
--chrome-path <path> selects Google Chrome for managed launches.
--user-data-dir <path> is an advanced root directory for session browser data.
--headless launches Chrome in headless mode when --cdp-url is not provided.
--timeout-ms <ms> controls CDP operation timeout.

--headless is covered by the same mainline 搜索 → 翻页 → 点击 → 阅读 workflow as headed Chrome. The browser runtime aligns UA, locale, timezone, platform, viewport, and screen geometry before page logic runs; live e2e coverage exercises page-2 finance/code/paper samples in headless mode. Site availability and live anti-spider decisions can still vary, so use headed or attached Chrome when you need visual debugging.

Managed launches only auto-detect Google Chrome or Google Chrome Canary. Chromium is not used as a fallback. If Google Chrome is not installed in a standard location, set --chrome-path or CHROME_PATH.

Use a different --session value for each parallel search. Missing sessions are rejected to avoid accidentally sharing browser state.

Attach to an existing Chrome:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --remote-debugging-port=9222 \
  --user-data-dir=/tmp/weixin-sogou-cdp-attach-demo

npm run dev -- --session attach-demo --cdp-url http://127.0.0.1:9222 inspect-home

Commands

wxsg is the short command alias. weixin-sogou-cdp remains available as the descriptive long command.

wxsg describe
wxsg --session demo --headless inspect-home
wxsg --session demo --headless search "AI" --type article --limit 5
wxsg --session demo --headless search "AI" --type article --limit 20 --pages 2
wxsg --session demo --headless search "AI" --type article --limit 5 --page 2
wxsg --session demo --headless search "人民日报" --type account --limit 5
ARTICLE_URL=$(wxsg --session demo --headless search "AI" --type article --limit 20 --pages 2 --format json | jq -r '.items[] | select(.source.page == 2) | .href' | head -n 1)
wxsg --session demo --headless read "$ARTICLE_URL" --from-search
wxsg --session demo --headless identify-elements "https://example.com" --intent "Learn more"
wxsg --session demo --headless activate "https://example.com" --intent "Learn more" --dry-run
wxsg --session demo --headless click "https://example.com" --intent "Learn more" --dry-run
wxsg --session demo --headless rpc

Source-mode equivalents:

npm run dev -- --session demo --headless inspect-home --format json
npm run dev -- --session demo --headless search "AI" --type article --limit 5 --format json
npm run dev -- --session demo --headless search "AI" --type article --limit 20 --pages 2 --format json
npm run dev -- --session demo --headless search "AI" --type article --limit 5 --page 2 --format json
npm run dev -- --session demo --headless search "人民日报" --type account --limit 5 --format json
ARTICLE_URL=$(npm run --silent dev -- --session demo --headless search "AI" --type article --limit 20 --pages 2 --format json | jq -r '.items[] | select(.source.page == 2) | .href' | head -n 1)
npm run --silent dev -- --session demo --headless read "$ARTICLE_URL" --from-search --format json
npm run dev -- --session demo --headless identify-elements "https://example.com" --intent "Learn more"
npm run dev -- --session demo --headless activate "https://example.com" --intent "Learn more" --dry-run

JSON-RPC

wxsg rpc reads one JSON-RPC 2.0 request per line from stdin.

printf '%s\n' '{"jsonrpc":"2.0","id":1,"method":"system.describe"}' | npm run dev -- --session rpc-demo rpc
printf '%s\n' '{"jsonrpc":"2.0","id":2,"method":"site.search","params":{"query":"AI","type":"article","limit":3,"pages":2}}' | npm run dev -- --session rpc-demo --headless rpc
printf '%s\n' '{"jsonrpc":"2.0","id":3,"method":"page.read","params":{"url":"<href returned by site.search>","fromSearch":true}}' | npm run dev -- --session rpc-demo --headless rpc
printf '%s\n' '{"jsonrpc":"2.0","id":5,"method":"ui.identify","params":{"url":"https://example.com","intent":"Learn more","limit":3}}' | npm run dev -- --session rpc-demo --headless rpc
printf '%s\n' '{"jsonrpc":"2.0","id":6,"method":"ui.activate","params":{"url":"https://example.com","intent":"Learn more","dryRun":true}}' | npm run dev -- --session rpc-demo --headless rpc
printf '%s\n' '{"jsonrpc":"2.0","id":7,"method":"ui.click","params":{"url":"https://example.com","intent":"Learn more","dryRun":true}}' | npm run dev -- --session rpc-demo --headless rpc

Methods:

system.describe
site.inspectHome
site.search with { "query": "AI", "type": "article", "limit": 3, "pages": 2 }
page.read with { "url": "<href returned by site.search>", "fromSearch": true, "includeHtml": false }
ui.identify with { "url": "https://example.com", "intent": "Learn more", "limit": 3 }
ui.activate with { "url": "https://example.com", "intent": "Learn more", "dryRun": true }
ui.click, a compatibility alias for ui.activate

type defaults to article; supported values are article and account.

Search starts from the Sogou Weixin homepage, fills the search box, and submits the search form. It does not directly construct the first results URL. Pagination uses the result page's next-page control.

--pages <count> caps automatic pagination and is capped at 10. The default maximum is 10 pages.
--page <number> navigates to a specific result page and is also capped at 10.
Pagination diagnostics include each visited page plus visible page-number window fields: currentPage, visibleMin, visibleMax, and nextActionable.
Each search stores a session snapshot under the session root and adds per-item breadcrumbs: result page, result index, result page URL, item selector, and link selector.

Readable Markdown

wxsg read replays a stored Sogou result click in Chrome, waits for the resolved WeChat article DOM, extracts the main readable article with @mozilla/readability, normalizes article semantics, and converts the cleaned article HTML to Markdown with turndown plus GFM table rules.

This is the closest CLI equivalent to browser-extension “clip article” behavior while preserving the real 搜索 → 翻页 → 点击 → 阅读 path: it keeps the main title/body and drops most navigation, ads, and chrome. The Markdown output starts with the article title and source metadata so it is ready to save or edit. The exporter preserves fenced code blocks, inline code, ATX headings, bold text, and GFM tables where article HTML semantics allow it; it also annotates image captions and long-image notes when detectable. Use --format json --include-html if you need metadata and the cleaned Readability HTML alongside Markdown.

SEARCH_JSON=$(wxsg --session read-demo --headless search "AI" --type article --limit 20 --pages 2 --format json)
ARTICLE_URL=$(printf '%s' "$SEARCH_JSON" | jq -r '.items[] | select(.source.page == 2) | .href' | head -n 1)
wxsg --session read-demo --headless read "$ARTICLE_URL" --from-search > article.md
wxsg --session read-demo --headless read "$ARTICLE_URL" --from-search --format json --include-html
wxsg --session read-demo --headless read "$ARTICLE_URL" --from-search --output-dir ./article-bundle --save-assets
wxsg --session read-demo --headless read "$ARTICLE_URL" --from-search --save-assets --format json

When --save-assets is provided, read writes a bundle. If --output-dir is omitted, the default bundle root is:

./微信文章/<YYYY-MM-DD>/<账号>_<YYYY-MM-DD>_<标题>/

Path segments are normalized to strip filesystem-reserved characters, collapse repeated whitespace, and keep readable Chinese text where possible.

When an explicit --output-dir is provided, read writes a bundle with this structure:

article-bundle/
  article.md
  metadata.json
  readable.html
  preview.html
  assets/
    image-001.png
    image-002.jpg

article.md is the rewritten Markdown for local reading.
metadata.json stores source URL, export time, bundle paths, asset summary, and per-asset save status.
readable.html stores the cleaned Readability article HTML in a readable local wrapper.
preview.html renders article.md as a lightweight local preview for layout review.
assets/ contains downloaded article media when --save-assets is enabled.
If --save-assets is omitted, image links stay remote and no local bundle is written unless --output-dir is explicitly provided.
Direct opens of weixin.sogou.com/* and mp.weixin.qq.com/* are intentionally disabled for read; WeChat exports must go through search + read --from-search.

For Sogou result URLs, use --from-search after running search with the same --session. This keeps reading based on the real 搜索 → 翻页 → 点击 → 阅读 workflow（基于真实搜索→翻页→点击→阅读流程）, instead of opening transient result links directly: the reader loads the stored snapshot, replays the original search, paginates to the recorded result page, finds the recorded clickable result link and click rectangle, clicks it, follows same-tab or new-tab navigation, then runs Readability on the resolved content page.

SEARCH_JSON=$(wxsg --session bridge-demo --headless search "AI" --type article --limit 30 --pages 10)
ARTICLE_URL=$(printf '%s' "$SEARCH_JSON" | jq -r '.items[20].href')
wxsg --session bridge-demo --headless read "$ARTICLE_URL" --from-search > article.md

If search traversed to page 10 but the chosen URL came from page 3, read --from-search uses the stored page-3 breadcrumb instead of the current page, replays the original query to page 3, and fails if the clicked result can no longer be found there.

Implementation notes from tool research:

@mozilla/readability is mature, Apache-2.0, and matches Firefox Reader View behavior; it extracts and cleans article HTML, not Markdown.
turndown is MIT-licensed and keeps the dependency footprint small for HTML-to-Markdown conversion.
turndown-plugin-gfm adds GitHub Flavored Markdown table support on top of Turndown.
Chrome/Puppeteer remains necessary for JavaScript-rendered pages because extractors work best on rendered DOM.
defuddle is a promising MIT-licensed alternative that can emit Markdown directly, but this CLI currently uses the more established Readability/Turndown pair.

Element Actions

identify-elements and activate provide a guarded element action layer for explicit user-selected page controls. click remains available as a compatibility alias for activate.

The flow is:

Collect visible candidates from links, buttons, inputs, ARIA roles, tabindex targets, contenteditable nodes, and labels.
Score candidates by visibility, enabled/disabled state, actionability, selector match, and intent text match.
For activate, choose the best actionable candidate, scroll/focus it, then run the guarded activation.
Report the chosen candidate, all considered candidates, URL before/after, and whether navigation was observed.

wxsg --session ui-demo --headless identify-elements "https://example.com" --intent "Learn more" --limit 3
wxsg --session ui-demo --headless activate "https://example.com" --intent "Learn more" --dry-run
wxsg --session ui-demo --headless click "https://example.com" --intent "Learn more" --dry-run

Use --selector to constrain recognition when the intent is ambiguous:

wxsg --session ui-demo --headless activate "https://example.com" --selector "a" --intent "Learn more" --dry-run

Output Shape

Search returns:

{
  "query": "AI",
  "type": "article",
  "items": [
    {
      "kind": "article",
      "title": "Result title",
      "text": "Result summary",
      "href": "https://weixin.sogou.com/link?...",
      "accountName": "Account name",
      "publishedAt": "Published text",
      "imageUrl": "https://...",
      "source": {
        "page": 3,
        "resultIndex": 1,
        "resultPageUrl": "https://weixin.sogou.com/weixin?...&page=3",
        "itemSelector": "div > ul > li:nth-of-type(1)",
        "linkSelector": "div > ul > li:nth-of-type(1) > div > h3 > a"
      }
    }
  ],
  "session": {
    "searchId": "20260430T120000000Z-12345678",
    "storedAt": "2026-04-30T12:00:00.000Z"
  },
  "diagnostics": {
    "networkIdle": "observed",
    "noResult": false,
    "verificationDetected": false,
    "searchMode": "ui-form",
    "maxPages": 2,
    "visitedPages": [
      {
        "page": 1,
        "items": 10,
        "pagination": {
          "currentPage": 1,
          "visibleMin": 1,
          "visibleMax": 10,
          "nextActionable": true,
          "nextSelector": "#sogou_next"
        }
      }
    ]
  }
}

Read returns:

{
  "url": "https://example.com/article",
  "title": "Article title",
  "byline": "Author",
  "length": 1234,
  "markdown": "# Article title\n\nSource: https://example.com/article\n\n...",
  "textContent": "Plain extracted text...",
  "diagnostics": {
    "reader": "readability",
    "markdown": "turndown",
    "source": "rendered-dom"
  }
}

If Sogou Weixin presents a site verification page, the adapter reports SITE_ACTION_FAILED and stops.

Site Contract

Default selectors live in src/infrastructure/site/siteRegistry.ts:

Ready selector: form[name="searchForm"]
Search input selector: #query
Article result selector: .news-box .news-list > li
No-result selector: .b404-box
Verification selector: #seccodeInput, .seccodeImage, input[name="c"]

Environment variables can override selectors during exploration:

SITE_READY_SELECTOR='form[name="searchForm"]' \
SITE_SEARCH_INPUT_SELECTOR="#query" \
SITE_RESULT_ITEMS_SELECTOR=".news-box .news-list > li" \
npm run dev -- --session selector-demo --headless search "AI"

Development

npm run lint
npm run typecheck
npm run spec:validate
npm run test
npm run test:contract
npm run quality:check

Run live e2e checks only when network access and Sogou Weixin availability are acceptable:

npm run test:e2e

Release Smoke

npm run release:preflight
npm pack --pack-destination /tmp .
TMPDIR=$(mktemp -d /tmp/weixin-sogou-cdp-install-XXXXXX)
cd "$TMPDIR"
npm init -y >/dev/null
npm install /tmp/weixin-sogou-cdp-cli-0.1.0.tgz
./node_modules/.bin/weixin-sogou-cdp --version
./node_modules/.bin/wxsg describe --format json