agentic-workflow-search-engine-mcp

v1.2.1

Published

a month ago

Agentic Workflow Search Engine MCP & CLI


Agentic Workflow Search Engine MCP

[JPN] OllamaとPlaywrightを活用した、自律的で多段階なWeb・学術論文検索とコンテンツ要約を行うModel Context Protocol (MCP) サーバーです。
[ENG] An autonomous, multi-stage web and academic search engine MCP server utilizing Ollama and Playwright.

日本語 | English

日本語

本プロジェクトは、AIエージェント（Claude Desktop、Cursor、Roo Codeなど）が、単なる検索キーワードのクエリを超えて、自律的かつ高度なWeb・学術調査タスクを実行できるように設計されたMCPサーバーです。

開発の背景・設計思想

本ツールは、単なる利便性だけでなく「安全性」と「透明性」を重視して設計されています。

サイト運営者への配慮: 大量リクエストを送りつけるスクレイピングではなく、あえてPlaywrightによる「ブラウザ経由での通常のアクセス手法」を採用し、サイト運営側のサーバー負荷を最小限に抑えています。
透明性の確保: ブラウザをHeadlessモードにせず意図的に可視化し、さらにOllamaの推論過程を別ウィンドウ（Viewer）で表示する仕様にしています。これにより、「AIが今どのページを見て、何を考えているのか」がブラックボックス化せず、ユーザーが常に監視・把握できるようになっています。
コンテキスト汚染の防止: ページごとにセッションを完全に区切り、長文は適切に文字数制限でチャンク分割して処理することで、LLMのハルシネーション（情報の混同）やメモリ溢れをシステムレベルで防いでいます。

主な機能

ブラウザ経由のクローリング＆Markdown/PDF抽出: Playwrightで対象のWebページにブラウザとしてアクセスし、ヘッダーや広告などを除外したMarkdownテキストへ抽出します。対象URLがPDFファイルの場合はダウンロードしてテキスト解析を行います。
ローカルOllamaによる情報抽出: ローカルで動作するOllamaのモデルを用いて、ユーザーの「検索意図 (intent)」に基づいた情報抽出と要約の生成を行います。
学術検索モード: arXivやPubMedの公式APIを利用して学術論文を直接検索・取得する専用モード（mode="academic"）を搭載しています。スクレイピング不要のため安定した論文検索が可能です。
自律的二段階検索: 最初の調査結果をAIが評価し、さらに深掘りすべきトピックが見つかった場合、自ら次の検索キーワードを設定して二次調査を実行します。
逆引き物理インデックス機能: 調査結果はすべて artifacts/ 配下に「検索キーワード」ごとに整理されて保存されます。各調査には index.csv というマップテーブルが自動生成され、MCPのリソース機能 (mcp://artifacts/{keyword}/index.csv) 経由でAIがこれらの生データへアクセスできます。

前提条件

動作には以下の環境が必要です：

Node.js: v18.0.0 以上
Ollama: ローカル環境で起動していること（デフォルト: http://127.0.0.1:11434）
- 使用するモデル（例: gemma4-e4b-custom-uncensored:latest や gemma:latest など）が事前にプルされている必要があります。
- モデル名やホストURLは config.json で自由に変更可能です。

インストール・実行方法

本ツールはnpmパッケージとして公開されているため、事前のインストールやGitクローンは不要です。npx コマンドで直接実行できます。

# 初回実行時にパッケージとPlaywrightブラウザが自動セットアップされます
npx aw-se --help

(※ソースコードを直接編集したい開発者の場合は、従来通り git clone してご利用ください)

AIアシスタントへの導入方法

npmパッケージとして公開されているため、各AIアシスタントのCLIや設定から簡単に導入可能です。

1. CLIコマンドを用いた導入方法 (推奨)

Claude Code / OpenAI Codex: 専用のコマンドを使うことで、自動で設定が完了します。

claude mcp add aw-se-mcp -- npx -y agentic-workflow-search-engine-mcp

codex mcp add aw-se-mcp -- npx -y agentic-workflow-search-engine-mcp

Cursor / Windsurf / Roo Code など: Universal MCP Installerを使用すると自動で設定ファイルに書き込まれます。

npx universal-mcp-installer install agentic-workflow-search-engine-mcp

2. 手動で設定ファイルに追記する場合 (GUIアプリ等)

Claude Desktop / Antigravity 2.0 などは、手動で設定ファイルに追記する必要があります。本リポジトリに同梱されている mcp_config_sample.json の内容をコピーし、各ツールの設定ファイルにある mcpServers 内に貼り付けてください。

Claude Desktop (Windows): %APPDATA%\Claude\claude_desktop_config.json
Claude Desktop (macOS): ~/Library/Application Support/Claude/claude_desktop_config.json
Antigravity 2.0: ~/.gemini/config/mcp_config.json

これだけで、次回起動時に自動で最新版がダウンロードされ、MCPサーバーとして認識されます。

提供されるツールとパラメータ仕様

MCPサーバーを登録すると、AIは以下の search_and_extract ツールを利用できるようになります。

ツール名: `search_and_extract`

キーワード検索を実行し、Playwrightで各ページの中身を取得・クレンジングしたのち、ローカルOllamaを用いて「検索意図」に沿って情報を抽出します。

パラメータ仕様:

| パラメータ名 | 型 | 必須 | デフォルト値 | 説明 |
| :-- | :-: | :-: | :-: | :-- |
| keywords | string | Yes | - | 検索したいキーワードを入力します。（例: "量子コンピューター実用化 2026"） |
| intent | string | Yes | - | 検索で『どういう情報を取捨選択し、どう整理してほしいか』という具体的な目的を入力します。（例: "企業別のロードマップと実用化予定時期を抽出"） |
| limit | number | No | 5 | 検索結果から実際に巡回・クローリングする最大ページ数を指定します。 |
| final_summary | boolean | No | false | 全ページを巡回し個別の抽出が終わった後に、全体の情報を総合した最終まとめレポートを生成するかどうか。 |
| mode | string | No | "web" | 検索のモード。"web" (通常のWebクローリング) または "academic" (arXiv & PubMedの公式API経由の論文検索)。 |
| deep_dive | string | No | "auto" | 自律的な二段階検索の挙動。"auto" (一次結果をAIが評価し自動で深掘り検索を実行)、または "none" (深掘りを行わず、推奨検索キーワードの提示に留める)。 |

設定ファイル (config.json) のカスタマイズ

本ツールはダウンロード時に同梱されている config.json の設定をデフォルト値として動作します。環境に合わせて適宜書き換えて使用してください。設定ファイルが存在しない場合（npx での初回起動時など）は、自動的にデフォルト設定ファイルが生成されます。

スタンドアロンCLIでの使い方

MCPサーバーとしてではなく、単体のコマンドラインツール（CLI）としてターミナルから直接実行し、調査レポートを出力させることも可能です。

# npx経由で実行（推奨・多機能）
npx aw-se --keywords "AIスマートグラス 最新動向" --intent "各メーカーのスペックと価格情報の抽出" --limit 3 --mode web --deep-dive auto --final-summary

# 位置引数による指定（簡易的）
# npx aw-se <キーワード> <検索意図> <件数> <最終要約フラグ: true/false>
npx aw-se "量子コンピューター 実用化 2026" "ロードマップの抽出" 3 true

成果物の構造

調査が完了すると、プロジェクトのルートにある artifacts/ ディレクトリ配下に、検索クエリに基づいた以下のフォルダおよびファイルが物理的に出力されます。

artifacts/
└── [検索キーワード]/
    ├── index.csv                   # クロールした全ページのタイトル、URL、対応ファイルパスの逆引きマップテーブル
    ├── summary.md                  # 全ページを横断した総合要約レポート（final_summaryが有効な場合）
    ├── page1_[ページタイトル].md   # キャプチャしたページの生テキストをクレンジングしたMarkdown
    └── page1_[ページタイトル].json # Ollamaによって検索意図に沿って構造化・抽出されたJSONデータ

ライセンス

English

This project is an MCP server that lets AI agents (such as Claude Desktop, Cursor, and Roo Code) carry out autonomous, multi-step web and academic research tasks rather than simple keyword queries.

Motivation & Philosophy

This tool is designed with an emphasis on safety and transparency, not just convenience.

Be gentle to web servers: Instead of aggressive scraping that floods servers with requests, it uses standard browser automation via Playwright to navigate pages, minimizing the load on site operators.
Transparency: The browser is intentionally kept visible (not headless), and Ollama's reasoning process is displayed in a separate viewer window. This prevents the AI's actions from becoming a black box, allowing users to monitor exactly what the AI is viewing and thinking in real-time.
Preventing Context Pollution: Each page is processed in a fully isolated session, and long texts are split into chunks under a character limit. This guards against LLM hallucinations (mixing up information across pages) and memory overflows at the system level.
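The chunking idea in the last point can be illustrated with a minimal sketch. The function name `chunk_text` and the 4,000-character default are assumptions for illustration only; the tool's actual limit and splitting strategy are not documented here.

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks.

    Hypothetical helper: the name and the default limit are illustrative.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        # A single paragraph longer than the limit is hard-split into pieces.
        pieces = [
            paragraph[i:i + max_chars]
            for i in range(0, len(paragraph), max_chars)
        ] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate  # still fits, keep accumulating
            else:
                chunks.append(current)  # flush and start a new chunk
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to the model in its own request, so that no single prompt exceeds the chosen size.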

Key Features

Browser-based Crawling & Markdown/PDF Extraction: Accesses target web pages via Playwright and extracts content into Markdown, filtering out noise like headers and ads. If the target URL is a PDF, it downloads and parses the text directly.
Local Ollama-Driven Content Refinement: Uses a local Ollama model to extract information and generate summaries based on the user's specific "search intent."
Academic Search Mode: A dedicated mode (mode="academic") searches and retrieves academic papers directly through the official arXiv and PubMed APIs. Because no scraping is involved, paper search is more stable.
Autonomous Secondary Deep-Dive: The AI evaluates the initial research findings and, if it identifies topics requiring further investigation, autonomously formulates the next search queries to execute a secondary deep-dive search.
Dynamic Resource Indexing: All research results are organized under the artifacts/ folder by search keyword, automatically generating an index.csv mapping table. AI agents can access these raw data files via the MCP resource URI scheme (mcp://artifacts/{keyword}/index.csv).
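The control flow of the autonomous secondary deep-dive can be sketched roughly as follows. This is not the package's implementation: `two_stage_search`, `search`, and `suggest_followups` are hypothetical names, with the search and the AI evaluation passed in as plain callables. Only the `deep_dive` values `"auto"` and `"none"` come from the tool's parameter specification.

```python
from typing import Callable


def two_stage_search(
    keywords: str,
    search: Callable[[str], list[str]],
    suggest_followups: Callable[[list[str]], list[str]],
    deep_dive: str = "auto",
) -> dict:
    """Run a primary search, then optionally one round of follow-up searches."""
    primary = search(keywords)
    # In the real tool an LLM evaluates the primary findings; here it is a callable.
    followups = suggest_followups(primary)
    result = {"primary": primary, "suggested_keywords": followups, "secondary": {}}
    if deep_dive == "auto":
        for query in followups:
            result["secondary"][query] = search(query)
    # With deep_dive == "none" the follow-up keywords are only reported.
    return result
```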

Prerequisites

Node.js: v18.0.0 or higher
Ollama: Must be running locally (default: http://127.0.0.1:11434)
- The model you plan to use (e.g., gemma4-e4b-custom-uncensored:latest or gemma:latest) must be pulled in advance.
- You can customize the model name and endpoint host in config.json.
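To confirm the prerequisites before the first run, a small check against Ollama's model-listing endpoint (`GET /api/tags`) can help. The sketch below assumes the default host shown above; the helper names `model_names` and `installed_models` are illustrative and not part of this package.

```python
import json
import urllib.request


def model_names(payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [model["name"] for model in payload.get("models", [])]


def installed_models(host: str = "http://127.0.0.1:11434") -> list[str]:
    """Return the names of the models available on a running Ollama server."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as response:
        return model_names(json.load(response))
```

If the model you configured is missing from `installed_models()`, pull it with `ollama pull <model>` before starting the server.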

Installation & Usage

Because this tool is published as an npm package, you do not need to clone the repository. You can run it instantly using npx.

# Run directly (the Playwright browser will be set up automatically on first run)
npx aw-se --help

(If you wish to modify the source code, you can still git clone the repository as usual.)

Installation for AI Assistants

Since the package is published on npm, you can install it instantly using CLI commands or configuration files.

1. Quick Installation via CLI (Recommended)

Claude Code / OpenAI Codex: You can use their dedicated commands to automatically configure the server.

claude mcp add aw-se-mcp -- npx -y agentic-workflow-search-engine-mcp

codex mcp add aw-se-mcp -- npx -y agentic-workflow-search-engine-mcp

Cursor / Windsurf / Roo Code, etc.: Use the Universal MCP Installer to automatically configure your editor.

npx universal-mcp-installer install agentic-workflow-search-engine-mcp

2. Manual Configuration (for GUI Apps)

For apps like Claude Desktop and Antigravity 2.0, you need to manually add the server to their configuration files. Copy the contents of the included mcp_config_sample.json file and paste it into the mcpServers object in your tool's config file.

Claude Desktop (Windows): %APPDATA%\Claude\claude_desktop_config.json
Claude Desktop (macOS): ~/Library/Application Support/Claude/claude_desktop_config.json
Antigravity 2.0: ~/.gemini/config/mcp_config.json

That's it! The client will automatically download and run the latest version on startup.
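For reference, the manual step can also be scripted. The sketch below merges a server entry into an existing config without touching other servers. The entry mirrors the `npx -y agentic-workflow-search-engine-mcp` command from the CLI instructions above; the bundled mcp_config_sample.json may contain additional fields, so treat this entry as an assumption rather than a copy of that file. The function names are illustrative.

```python
import json
from pathlib import Path

SERVER_NAME = "aw-se-mcp"
# Assumed entry, derived from the CLI command shown earlier in this README.
SERVER_ENTRY = {"command": "npx", "args": ["-y", "agentic-workflow-search-engine-mcp"]}


def add_server(config: dict) -> dict:
    """Return a copy of config with the server registered under mcpServers."""
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    servers[SERVER_NAME] = SERVER_ENTRY
    updated["mcpServers"] = servers
    return updated


def update_config_file(path: Path) -> None:
    """Read the client config (if any), add the server, and write it back."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.write_text(json.dumps(add_server(config), indent=2), encoding="utf-8")
```

Back up the config file before running `update_config_file` on it, and restart the client afterwards so it reloads the server list.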

Exposed Tool & Argument Specification

Once configured, the AI will gain access to the search_and_extract tool.

Tool Name: `search_and_extract`

Performs a keyword search, fetches each result page with Playwright, cleans the content into Markdown, and uses a local Ollama model to extract information that matches the search intent.

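Parameters:

| Parameter | Type | Required | Default | Description |
| :-- | :-: | :-: | :-: | :-- |
| keywords | string | Yes | - | The keywords to search for (e.g., "quantum computing commercialization 2026"). |
| intent | string | Yes | - | The concrete goal of the search: which information to select and how to organize it (e.g., "Extract each company's roadmap and planned commercialization dates"). |
| limit | number | No | 5 | Maximum number of pages from the search results to actually crawl. |
| final_summary | boolean | No | false | Whether to generate a final report that synthesizes all pages once the per-page extraction has finished. |
| mode | string | No | "web" | Search mode: "web" (standard web crawling) or "academic" (paper search via the official arXiv and PubMed APIs). |
| deep_dive | string | No | "auto" | Behavior of the autonomous two-stage search: "auto" (the AI evaluates the primary results and runs a deep-dive search automatically) or "none" (no deep dive; only recommended search keywords are suggested). |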
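As an illustration, an MCP client invokes the tool with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request and applies the documented defaults and allowed values. The helper `build_tool_call` is hypothetical, and MCP clients normally construct this message for you.

```python
DEFAULTS = {"limit": 5, "final_summary": False, "mode": "web", "deep_dive": "auto"}
ALLOWED = {"mode": {"web", "academic"}, "deep_dive": {"auto", "none"}}


def build_tool_call(keywords: str, intent: str, request_id: int = 1, **options) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request for search_and_extract."""
    unknown = set(options) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # Required fields first, then defaults, then caller overrides.
    arguments = {"keywords": keywords, "intent": intent, **DEFAULTS, **options}
    for name, allowed in ALLOWED.items():
        if arguments[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "search_and_extract", "arguments": arguments},
    }
```

For example, `build_tool_call("quantum computing roadmap", "Extract timeline", mode="academic")` yields a request that searches arXiv and PubMed with the default limit of 5.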

Customizing Configuration (config.json)

The tool uses the settings in the bundled config.json as its defaults. Edit them to match your environment. If the file does not exist (e.g., on the first run via npx), a default configuration file is generated automatically.

Standalone CLI Usage

You can also run the tool as a standalone command-line program directly from your terminal, without an MCP client, and have it output a research report.

# Executing via npx with flags (Recommended)
npx aw-se --keywords "quantum computing roadmap" --intent "Extract timeline and major players" --limit 3 --mode web --deep-dive auto --final-summary

# Executing via npx with positional arguments (Simplified)
# npx aw-se <keywords> [intent] [limit] [final_summary]
npx aw-se "quantum computing roadmap" "Extract timeline" 3 true
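When driving the CLI from a script, building the argument list programmatically avoids shell-quoting mistakes. The sketch below uses only the flags shown in the example above; the helper name `build_cli_args` is illustrative.

```python
import subprocess


def build_cli_args(
    keywords: str,
    intent: str,
    limit: int = 5,
    mode: str = "web",
    deep_dive: str = "auto",
    final_summary: bool = False,
) -> list[str]:
    """Build the argv list for `npx aw-se` using the flag-based syntax."""
    args = [
        "npx", "aw-se",
        "--keywords", keywords,
        "--intent", intent,
        "--limit", str(limit),
        "--mode", mode,
        "--deep-dive", deep_dive,
    ]
    if final_summary:
        args.append("--final-summary")  # boolean flag, present only when enabled
    return args


def run_search(**kwargs) -> None:
    """Run the CLI and raise if it exits with a non-zero status."""
    subprocess.run(build_cli_args(**kwargs), check=True)
```

Passing the arguments as a list means keywords containing spaces or quotes reach the CLI unchanged.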

Output Artifacts Directory Structure

Once execution completes, research materials are saved in the artifacts/ folder:

artifacts/
└── [Keywords]/
    ├── index.csv                   # Dynamic index mapping crawled URLs to local files
    ├── summary.md                  # Comprehensive global synthesis (if final_summary=true)
    ├── page1_[PageTitle].md        # Crawled page content converted to sanitized Markdown
    └── page1_[PageTitle].json      # Structured JSON data containing the intent-refined facts
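A downstream script can read these artifacts back. The sketch below loads index.csv and every per-page JSON file from one keyword folder. It assumes index.csv starts with a header row (the exact column names are not documented here, so rows are returned keyed by whatever header the file contains) and that page files follow the `page<N>_<title>.json` pattern shown above. The function name `load_artifacts` is illustrative.

```python
import csv
import json
from pathlib import Path


def load_artifacts(keyword_dir: Path) -> dict:
    """Load the index rows and per-page JSON extracts from one artifacts folder."""
    index_path = keyword_dir / "index.csv"
    with index_path.open(newline="", encoding="utf-8") as handle:
        # Keys come from the CSV header row; a header row is an assumption.
        rows = list(csv.DictReader(handle))
    pages = {
        path.name: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(keyword_dir.glob("page*_*.json"))
    }
    return {"index": rows, "pages": pages}
```

Calling `load_artifacts(Path("artifacts") / "<your keywords>")` returns the index rows and a mapping from file name to the extracted JSON.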