japan-travel-mcp

v1.0.1

Published

a month ago

Open Japanese tourism dataset + MCP server — 13,394 tourist spots × 17 languages, ~20,000 accommodations, 690 officially-designated cultural records. All 47 prefectures, all from public sources. Free, no API key.

kjsunada

mcp model-context-protocol japan travel tourism ai-agents open-data

Japan Travel MCP

Open Japanese tourism dataset + Model Context Protocol (MCP) server — 13,394 tourist spots × 17 languages, about 20,000 accommodations, 690 officially-designated cultural records (MAFF GI, METI traditional crafts, Japan Heritage, UNESCO ICH). Free, no API key, no account. All data is downloadable as JSON/JSONL from Hugging Face.

What you can do in 60 seconds

1. Use it from any MCP client (Claude Desktop, Cursor, Windsurf):

{ "mcpServers": { "japan-travel": { "command": "npx", "args": ["-y", "japan-travel-mcp"] } } }

Then ask the model something like "What's special about Tottori, in Russian?" The answer draws on a tourism description that is available in 17 languages and was generated from factual information referencing official Japanese public sources.

2. Or grab the raw dataset for your own pipeline:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="open-travel/japan-travel-mcp-data", repo_type="dataset")

See also: 🤗 Dataset card

Quick start (Claude Desktop)

{
  "mcpServers": {
    "japan-travel": {
      "command": "npx",
      "args": ["-y", "japan-travel-mcp"]
    }
  }
}

On first run, the server downloads ~685 MB of travel data from huggingface.co/datasets/open-travel/japan-travel-mcp-data to ~/.japan-travel-mcp/data/ (override the cache location with the JAPAN_TRAVEL_MCP_CACHE env var). Subsequent runs use the local cache.
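If a script of your own wants to read the cache the server has already populated, the lookup described above can be mirrored in a few lines. This is a minimal sketch rather than the server's code: `resolve_cache_dir` is a hypothetical helper, and it assumes that the environment variable, when set and non-empty, replaces the default path entirely.

```python
import os
from pathlib import Path

# Default cache location documented above.
DEFAULT_CACHE = Path(os.path.expanduser("~")) / ".japan-travel-mcp" / "data"


def resolve_cache_dir(env=None) -> Path:
    """Return the cache directory, honouring JAPAN_TRAVEL_MCP_CACHE.

    Hypothetical helper: assumes the env var, when set and non-empty,
    fully replaces the default path.
    """
    env = os.environ if env is None else env
    override = env.get("JAPAN_TRAVEL_MCP_CACHE", "").strip()
    return Path(override).expanduser() if override else DEFAULT_CACHE


print(resolve_cache_dir())
```

Passing `env` explicitly keeps the helper easy to test without touching the real environment.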

Remote / HTTP transport (hosted MCP)

Besides stdio, the server ships a Streamable-HTTP transport (src/index_http.ts) for always-on hosts — Hugging Face Spaces, Cloudflare, or any server where web / SaaS MCP clients connect over HTTP instead of spawning a local process.

npm run build && node dist/src/index_http.js
# POST /mcp      — Streamable-HTTP MCP endpoint
# GET  /healthz  — liveness probe
# GET  /         — landing page

It listens on PORT (default 7860, the HF Spaces convention) and honours the same JAPAN_TRAVEL_MCP_CACHE / HF_TOKEN env vars as the stdio entrypoint. The transport is stateless — a fresh MCP server is created per request — so concurrent clients don't share state.
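To see what a client sends to the `/mcp` endpoint, the sketch below builds a JSON-RPC 2.0 `initialize` request and wraps it in a POST using only the standard library. The base URL, client name, and protocol version string are assumptions for illustration; whether this stateless server expects `initialize` before other calls is not stated here, so treat this as a starting point rather than a reference client.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:7860"  # assumption: server running locally on the default PORT


def jsonrpc(method: str, params: Optional[dict] = None, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request object, the message format MCP uses."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message


def build_post(message: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Wrap a JSON-RPC message in a POST to the /mcp endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/mcp",
        data=json.dumps(message).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Streamable HTTP servers may answer with plain JSON or an SSE stream.
            "Accept": "application/json, text/event-stream",
        },
    )


initialize = jsonrpc(
    "initialize",
    {
        "protocolVersion": "2025-03-26",  # assumption: adjust to what the server negotiates
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.0.1"},  # hypothetical client
    },
)
request = build_post(initialize)
print(request.get_method(), request.full_url)

# Nothing is sent until you call, for example:
#   urllib.request.urlopen(request, timeout=10)
```

In practice an MCP client library handles this framing for you; the sketch is only meant to make the wire format visible.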

Get the raw data directly

If you'd rather work with the JSON / JSONL files yourself (e.g. fine-tune a model, embed everything into a vector store, run analytics), download from the dataset repo:

from huggingface_hub import snapshot_download
local_dir = snapshot_download(
    repo_id="open-travel/japan-travel-mcp-data",
    repo_type="dataset",
)

# or via plain git
git clone https://huggingface.co/datasets/open-travel/japan-travel-mcp-data

All 47 prefectures and all 1,938 entities (1,741 municipalities + 197 designated-city wards) are covered in parallel — no prioritization by population, fame, or tourism volume. Tokyo and Kyoto are here too — but the point of this dataset is everywhere else.

Why this exists

Japan has incredible destinations, rich local culture, and tourism information published across thousands of municipal websites — almost none of it accessible to AI agents.

I've spent years in Japan's travel industry. I know how much the world is missing. This project exists because I believe Japan deserves to be better represented in the AI era.

Not a business. A contribution.

— KJ Sunada, founder of KabuK Style, building products for Japan's travel industry since 2019.

Why not just use Wikipedia / Google Places / JNTO?

| | japan-travel-mcp | Wikipedia | Google Places | JNTO open data |
|:---|:---:|:---:|:---:|:---:|
| Coverage of all 47 prefectures × 1,938 entities | ✅ | partial | ✅ | partial |
| 17-language tourism descriptions (200-300 chars) | ✅ 13K spots | sparse | name only | EN + a few |
| Official designation registries (MAFF GI etc.) | ✅ 690 records | partial | ❌ | partial |
| MCP server out of the box | ✅ | ❌ | ❌ | ❌ |
| Open dataset (CC BY 4.0) | ✅ | ✅ (CC BY-SA) | ❌ | varies |
| Free / no API key / no quota | ✅ | ✅ | ❌ (paid) | ❌ (key) |
| Refreshed daily on a 30-day rolling cycle | ✅ | community | ✅ (proprietary) | varies |

The point isn't that Wikipedia or Google Places are wrong — they cover Tokyo and Kyoto fine. The point is that AI agents need a single, structured, multilingual, license-clear source for everywhere else in Japan, and that didn't exist until now.

What's inside

Tools (MCP)

| Tool | Description |
|------|-------------|
| search_area | Search across prefectures, municipalities, and 41,000+ Wikidata attractions by name or keyword |
| search_semantic | Vector search over the multilingual-e5 embedding index: semantic similarity, language-agnostic |
| search_hybrid | BM25 lexical + vector + RRF fusion; the preferred general-purpose retriever |
| get_spots | Tourist spots by prefecture or municipality (combines municipal scrape + Wikidata) |
| get_hotels | About 20,000 accommodations (Wikidata + OpenStreetMap merged); filter by area or lat/lng/radius |
| get_transport | Spot coordinates, prefecture, municipality, and the official URL where access is documented |
| get_events | Festivals registered in Wikidata for a given prefecture, with optional month filter (live SPARQL, in-memory cache) |
| get_multilingual | Tourist-spot names in EN / ZH / KO (lightweight name lookup) |
| get_description | SIGNATURE TOOL. 200-300 character tourism descriptions in 17 languages for 13,000+ spots, generated for global tourist consumption |
| get_local_specialty | Regional specialties (food + crafts) by prefecture, drawn from official designation systems: MAFF Geographical Indications (172 items) + METI-designated Traditional Crafts / Dentō Kōgeihin (231 items) |
| get_local_food | Regional cuisine: MAFF GI food rows plus tourism-association pages tagged as local-cuisine / specialty-dish / regional-food, deduplicated and prefecture-scoped |
| get_festivals | Festivals (matsuri, Shinto rites, annual rituals): UNESCO ICH + Important Intangible Cultural Properties + scraped municipal and tourism-association festival pages, with Schema.org Event metadata where present |
| get_traditional_arts | Intangible cultural assets from the Agency for Cultural Affairs: Important Intangible Cultural Properties + Folk (125 items) + UNESCO ICH inscriptions for Japan (58 items) |
| get_japan_heritage | All 104 Japan Heritage (Nihon Isan) stories from the Agency for Cultural Affairs, with theme / era / prefecture filters |
| get_dmo | 観光庁 (Japan Tourism Agency) registered + candidate Destination Management Organizations (DMOs) |
| get_entity_full | Full denormalised card for a single Wikidata entity, joined across every data layer |
| get_entities_bulk | Batch variant of get_entity_full: many QIDs resolved in one call |
| plan_feasibility_check | Sanity-check a multi-stop itinerary against the dataset (distance / travel-time / opening hours) |
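The search_hybrid row mentions RRF (reciprocal rank fusion) for combining the lexical and vector result lists. As a rough illustration of the technique, and not the server's actual implementation, the sketch below fuses two ranked lists of IDs. The constant `k = 60` is the value commonly used in the literature and is an assumption here, as are the sample QIDs.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists with reciprocal rank fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["Q1", "Q2", "Q3"]    # hypothetical lexical results
vector_hits = ["Q3", "Q1", "Q4"]  # hypothetical embedding results
print(rrf_fuse([bm25_hits, vector_hits]))  # ['Q1', 'Q3', 'Q2', 'Q4']
```

Because RRF only looks at ranks, it needs no score normalisation between BM25 and cosine similarity, which is a large part of why it is a popular default for hybrid retrieval.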

Signature tool: get_description

This MCP exposes 17-language tourism descriptions (English, Japanese, Chinese, Korean, French, Spanish, German, Italian, Portuguese, Russian, Thai, Vietnamese, Indonesian, Malay, Arabic, Hindi, Tagalog) for over 13,000 Japanese tourist spots. Each description is 150-350 characters in the target language, generated by Claude Sonnet 4.6 from three inputs: factual information extracted and structured by the project, Wikidata-derived structured data, and a project-wide canonical glossary (e.g. 神社 → Shrine, 寺 → Temple, modified Hepburn romanisation for place-names) that keeps terminology consistent across the 17 languages. These descriptions are not source-page translations or reproductions.

This dataset does not exist anywhere else. Pre-built, queryable in one call, no API key required.

Data layers

Layer 1: Municipal tourism pages           — all 1,938 entities (incl. designated-city wards)
Layer 2: Tourism-association portals       — all 47 prefectures + municipal/regional bodies
                                             (city-hall + tourism-org URLs scraped in parallel,
                                             see docs/decisions/0001)
Layer 3: Hotel & ryokan master list        — built from Wikidata + OSM (see below)
Layer 4: Wikidata attractions              — 41,404 ja-anchored entities
Layer 5: 17-language translation layer     — Wikipedia sitelinks + Sonnet 4.6 batch
Layer 6: Official designation systems      — MAFF Geographical Indications,
         METI Traditional Crafts (Dentō Kōgeihin), Japan Heritage (Nihon Isan),
         Important Intangible Cultural Properties, UNESCO ICH (Japan)

The municipal-page layer extracts factual information and any embedded Schema.org Event / Place JSON-LD blocks, so festival dates, venues, and place metadata feed get_festivals / get_spots without depending on freeform descriptions alone. Source-page prose, photos, and editorial descriptions are not redistributed in the dataset. See docs/decisions/0001-multi-source-tourism-data.md for the rationale.
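To make the JSON-LD step concrete, here is a minimal standard-library sketch of pulling Schema.org Event / Place blocks out of a page. It is not the project's extractor (that lives in scrapers/lib/extractor.ts); the class and function names are hypothetical, the sample HTML is invented, and the sketch ignores `@graph` containers for brevity.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than fail the whole page


def extract_events_and_places(html):
    """Return JSON-LD objects whose @type includes Event or Place."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    wanted = {"Event", "Place"}
    results = []
    for block in parser.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if wanted & {t for t in types if isinstance(t, str)}:
                results.append(item)
    return results


sample = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Event",
 "name": "Sample Matsuri", "startDate": "2026-08-01"}
</script>
<script type="application/ld+json">{"@type": "WebSite", "name": "City portal"}</script>
</head><body></body></html>
"""
found = extract_events_and_places(sample)
print([item["name"] for item in found])
```

Only the structured fields (name, dates, venue) are kept in this approach, which matches the policy above of not redistributing source-page prose.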

Official designation sources (`data/r3/`)

This MCP only surfaces what authorities have officially designated — no editorial or AI-curated picks. Each record carries the source URL and authority so provenance is verifiable.

| Source | Records | Authority | Refresh |
|:---|---:|:---|:---|
| Geographical Indications (GI) | 172 | Ministry of Agriculture, Forestry and Fisheries (MAFF) | Mon (weekly) |
| Traditional Crafts (Dentō Kōgeihin) | 231 | Ministry of Economy, Trade and Industry (METI) | Tue (weekly) |
| Japan Heritage (Nihon Isan) | 104 | Agency for Cultural Affairs | Wed (weekly) |
| Important Intangible Cultural Properties + Folk | 125 | Agency for Cultural Affairs (mirrored via Wikidata) | Thu (weekly) |
| UNESCO Intangible Cultural Heritage (Japan inscriptions) | 58 | UNESCO (mirrored via Wikidata) | Thu (weekly) |
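The weekly rotation in the table can be expressed as a small lookup. The sketch below transcribes the weekdays and record counts from the table; the function name is hypothetical, and the real scheduler (scrapers/r3_refresh.ts) may be organised differently. It also assumes the weekday is evaluated in the timezone the cron runs in.

```python
from datetime import date

# Weekday -> sources due that day, transcribed from the table (Monday == 0).
ROTATION = {
    0: ["Geographical Indications (GI)"],
    1: ["Traditional Crafts (Dentō Kōgeihin)"],
    2: ["Japan Heritage (Nihon Isan)"],
    3: [
        "Important Intangible Cultural Properties + Folk",
        "UNESCO Intangible Cultural Heritage (Japan inscriptions)",
    ],
}

# Record counts per source, also from the table.
RECORDS = {
    "Geographical Indications (GI)": 172,
    "Traditional Crafts (Dentō Kōgeihin)": 231,
    "Japan Heritage (Nihon Isan)": 104,
    "Important Intangible Cultural Properties + Folk": 125,
    "UNESCO Intangible Cultural Heritage (Japan inscriptions)": 58,
}


def sources_due(day: date) -> list:
    """Return the designation sources scheduled for the given day."""
    return ROTATION.get(day.weekday(), [])


print(sources_due(date(2026, 4, 27)))  # a Monday
print(sum(RECORDS.values()))           # 690
```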

All 690 designation records are translated to the same 17 languages by an incremental Sonnet 4.6 batch (see scrapers/translate/translate_r3.ts).

How the hotel master list is built

No single source covers all of Japan's accommodations. We merge two public open-data sources and resolve duplicates by location and name.

Wikidata + OpenStreetMap → entity matching → master.json

Sources used:

Wikidata — accommodation entities tagged in Japan (CC0). Multilingual labels.
OpenStreetMap — tourism=hotel|hostel|guest_house|motel and tourism=apartment nodes/ways inside Japan (ODbL). OSM-derived fields in each record carry the ODbL license; the project-created compilation around them is CC BY 4.0.

Matching logic: Two records are considered the same property if they fall within 100 meters of each other AND share a sufficiently similar name (accounting for kanji / kana / romaji variations). Singleton records (only one source) are kept with confidence: "singleton"; merged clusters get confidence: "confirmed".

Uncertain matches (similar names, slightly different positions, or one-sided metadata) are written to data/hotels/review/ — open for community resolution.
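The matching rule (within 100 meters AND a similar enough name) can be sketched as follows. This is an illustration of the idea, not the code in scrapers/matcher/: the similarity threshold of 0.8, the record shape, and the use of `difflib` are all assumptions, and the sketch only normalises width and case, so it does not bridge kanji / kana / romaji variants the way the real matcher is described as doing.

```python
import math
import unicodedata
from difflib import SequenceMatcher

MAX_DISTANCE_M = 100       # stated in the matching logic above
MIN_NAME_SIMILARITY = 0.8  # assumption: the real threshold is not documented here


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    radius = 6_371_000  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def normalise(name):
    """Fold full-width / half-width forms and case; drop spaces."""
    return unicodedata.normalize("NFKC", name).casefold().replace(" ", "")


def same_property(a, b):
    """True if two records are close together and similarly named."""
    distance = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
    similarity = SequenceMatcher(None, normalise(a["name"]), normalise(b["name"])).ratio()
    return distance <= MAX_DISTANCE_M and similarity >= MIN_NAME_SIMILARITY


# Hypothetical records for illustration only.
osm = {"name": "Hotel Sakura", "lat": 35.0000, "lon": 135.0000}
wikidata = {"name": "ＨＯＴＥＬ　ＳＡＫＵＲＡ", "lat": 35.0002, "lon": 135.0001}
print(same_property(osm, wikidata))
```

Pairs that pass one test but narrowly fail the other are the natural candidates for the review queue described above.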

This pipeline is fully open source. The matching engine is in scrapers/matcher/. Each uncertain match is written as one file in data/hotels/review/<id>.json, holding the candidate records and why they were flagged:

{
  "id": "0120570838cb",
  "confidence": "likely",
  "match_reasons": ["distance 29m", "name similarity 0.83 (likely)"],
  "candidates": [ /* the OSM / Wikidata records being compared */ ]
}

To resolve one, inspect the candidates, decide whether they're the same property, and open a PR that records your call on the file — add a top-level "decision": "merge" | "split" (and an optional "rationale") so it can be folded back into the matcher. Imperfect matches are PRs waiting to happen.
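Recording a call on a review file amounts to adding one or two top-level keys. The helper below is a hypothetical convenience, not part of the repo; it uses the field names and allowed values given above, and the rationale text is invented. The candidates list is left empty here for brevity.

```python
import json

VALID_DECISIONS = {"merge", "split"}


def record_decision(review, decision, rationale=None):
    """Return a copy of a review record with a top-level decision added."""
    if decision not in VALID_DECISIONS:
        raise ValueError(f"decision must be one of {sorted(VALID_DECISIONS)}, got {decision!r}")
    updated = dict(review)  # shallow copy so the input record is not mutated
    updated["decision"] = decision
    if rationale:
        updated["rationale"] = rationale
    return updated


review = {
    "id": "0120570838cb",
    "confidence": "likely",
    "match_reasons": ["distance 29m", "name similarity 0.83 (likely)"],
    "candidates": [],  # omitted in this sketch
}
resolved = record_decision(review, "merge", "Same street address on both records")
print(json.dumps(resolved, ensure_ascii=False, indent=2))
```

Writing the result back with `ensure_ascii=False` keeps Japanese names readable in the diff.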

Roadmap. Adding the prefectural ryokan business-license registries is on the wishlist. Taken together, those registries are the authoritative record for the whole country, but each prefecture publishes its list in a different format and we haven't unified the parsing yet. PRs welcome.

A note on data collection

I built this because Japan's tourism information —
created to reach the world — is nearly invisible to AI agents.
That gap seemed worth fixing.

Here's how I think about robots.txt:
I read it on every domain I crawl. I respect clear intent —
private paths, member areas, anything not meant to be public.
But when a municipality publishes tourism content
to attract visitors from around the world,
I don't think blocking AI agents serves that intent.
I think it contradicts it.

You may disagree. That's a fair conversation to have.

What I commit to:

Each domain refreshed at most once every ~30 days (rolling cycle)
Steady-state: 5-second minimum interval between requests to the same domain — slower than Googlebot, by design
Initial bootstrap may run faster so that the first build completes in hours: down to one request every 2 seconds per domain, and never faster than that
Static caching only — source sites are never hit at query time
48-hour response to any removal request (open an issue)

— KJ Sunada

Data freshness

We aim to keep every record fresh within 30 days.
That's the freshness target — not a server-load mitigation.
We are not a continuous crawler. Tourism information changes slowly; 30 days is enough.

Two refresh tracks (both run by the same daily GitHub Actions cron at 03:00 JST):

| Track | Items | Cycle | Per-day work |
|:---|:---|:---|:---|
| Municipal tourism pages | 1,938 entities | rolling 30 days | ~70 / day |
| Official designation sources | 5 sources | rolling 7 days | 1–2 sources / day |

Each domain is hit at most once per cycle. The designation sources update infrequently (annual / quarterly), so the 7-day rotation keeps every record fresh well within their real upstream cadence.
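One simple way to get a rolling cycle like this is to refresh the stalest entities first, capped at one cycle-share per day. The sketch below shows that idea under stated assumptions: it is not the project's pickStaleMunicipalities, the state shape (ID mapped to last-refresh date) is invented, and the even split gives 65 per day for 1,938 entities, which is in the same range as the ~70 / day figure in the table.

```python
import math
from datetime import date

CYCLE_DAYS = 30


def daily_quota(total, cycle_days=CYCLE_DAYS):
    """Entities to refresh per day so that all are covered within one cycle."""
    return math.ceil(total / cycle_days)


def pick_stale(last_scraped, cycle_days=CYCLE_DAYS):
    """Pick today's batch: never-scraped entities first, then oldest first.

    last_scraped maps entity ID -> date of last refresh (None if never).
    """
    quota = daily_quota(len(last_scraped), cycle_days)

    def staleness(item):
        entity_id, scraped = item
        # None sorts before any real date; ID breaks ties deterministically.
        return (scraped is not None, scraped or date.min, entity_id)

    ordered = sorted(last_scraped.items(), key=staleness)
    return [entity_id for entity_id, _ in ordered[:quota]]


print(daily_quota(1938))  # 65

# Hypothetical state with a 2-day cycle to keep the example small.
state = {
    "a": date(2026, 4, 1),
    "b": None,
    "c": date(2026, 3, 1),
    "d": date(2026, 4, 20),
}
print(pick_stale(state, cycle_days=2))  # ['b', 'c']
```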

Initial dataset: bootstrapped in a single run (a few hours, 2-second per-domain interval).
Steady-state schedule: daily cron, 5-second per-domain interval.
Last updated: see data/metadata.json
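The per-domain intervals above (5 seconds in steady state, 2 seconds during bootstrap) boil down to remembering when each host was last hit. The class below is a minimal sketch of such a throttle, not the project's scraper code; the class name is hypothetical, and the clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time
from urllib.parse import urlsplit


class DomainThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval_s=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last_hit = {}  # host -> time the most recent request was allowed

    def wait(self, url):
        """Block until the host of `url` may be hit again; return seconds slept."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        last = self._last_hit.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval_s - (now - last))
        if delay:
            self._sleep(delay)
        self._last_hit[host] = now + delay
        return delay


# Demo with a fake clock so nothing actually sleeps.
fake_now = [100.0]
throttle = DomainThrottle(5.0, clock=lambda: fake_now[0], sleep=lambda s: None)
print(throttle.wait("https://city.example.jp/kanko/"))   # first hit: no delay
print(throttle.wait("https://city.example.jp/matsuri/"))  # same host: must wait
print(throttle.wait("https://other.example.jp/"))         # different host: no delay
```

Call `wait(url)` immediately before each request; different hosts never block one another, so a crawl over many municipalities stays fast overall while each site sees a slow, steady trickle.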

Repository structure

japan-travel-mcp/                          # this repo — code + lightweight metadata only
├── README.md
├── CONTRIBUTING.md
├── DATA_POLICY.md
├── src/
│   ├── index.ts                           # MCP server (18 tools)
│   └── lib/hf_data.ts                     # HF dataset bootstrap (first-run download)
├── data/                                  # only what readers / contributors need to see
│   ├── _logs/                             # daily scrape run summaries (transparency)
│   ├── _state/
│   │   ├── scrape_state.json              # live operational state (Actions updates)
│   │   ├── translation_batch.json         # historical Anthropic batch IDs
│   │   └── r3_translation_batch.json
│   ├── hotels/review/                     # unresolved matches — PRs welcome
│   ├── knowledge/taxonomies/              # regions + eras (lightweight reference)
│   └── metadata.json                      # source list
├── scrapers/                              # producers of the dataset
│   ├── daily.ts                           # municipal-page rolling refresh
│   ├── r3_refresh.ts                      # official-designation 7-day rotation
│   ├── municipal/, hotel/, sources/, matcher/
│   ├── translate/                         # Sonnet 4.6 batch translators
│   └── hf/                                # uploads data to the HF dataset
└── .github/workflows/
    └── scrape.yml                         # daily 03:00 JST cron

Bulk runtime data lives separately on Hugging Face: huggingface.co/datasets/open-travel/japan-travel-mcp-data

data/translations/    # 17-language names + 200-300 char descriptions
data/prefectures/     # 47 prefectures of municipal-scrape + Wikidata spots
data/hotels/master    # ~20,000 unified hotels & ryokan
data/hotels/raw/      # OSM + Wikidata pre-merge sources
data/glossary/        # canonical terms used at translation time
data/_state/wikidata_attractions.json + 3 muni files
data/r3/              # MAFF GI, METI crafts, Japan Heritage, Bunka-cho intangible records, UNESCO ICH

The MCP server downloads these on first run (cached in ~/.japan-travel-mcp/data/).

Multilingual data and translation pipeline

The project ships 17 languages as a first-class feature, not an afterthought.

Coverage matrix (2026-04-27):

| Layer | What | Coverage | Source |
|:---|:---|:---|:---|
| Attraction names | Canonical entity name in 17 languages | 13,961 entities × 17 langs (237,337 pairs) | Wikipedia sitelinks + Sonnet 4.6 batch |
| Attraction descriptions | 200-300 char tourism description in 17 languages | 13,394 entities × 17 langs (227,698 descriptions) | Sonnet 4.6 batch, glossary-grounded |
| Wikipedia-anchored names (raw) | Sitelinks-only subset | 41,404 entities, sparse cross-language | Wikidata SPARQL sitelinks |
| Designation-source translations | Names + descriptions for every officially-designated record | 690 records × 17 langs (100% coverage) | Sonnet 4.6 batch from official designation records and source references |

17 supported languages: English (en), Japanese (ja), Chinese (zh), Korean (ko), French (fr), Spanish (es), German (de), Italian (it), Portuguese (pt), Russian (ru), Thai (th), Vietnamese (vi), Indonesian (id), Malay (ms), Arabic (ar), Hindi (hi), Tagalog (tl).

Translation source hierarchy (highest authority first):

1. Wikipedia article titles (via Wikidata sitelinks): human-curated, peer-reviewed, used directly when available.
2. Project canonical glossary (data/glossary/seed_canonical.json + mlit_canonical.json): house style for suffix mappings (e.g. 神社 → Shrine, 寺 → Temple, 城 → Castle), modified Hepburn romanisation, proper-noun handling. Loaded as a cached system prompt for AI consistency.
3. Claude Sonnet 4.6 batch translation: fills gaps that Wikipedia doesn't cover, constrained by the canonical glossary above. 50% batch-API discount; all 13,961 names + 13,394 descriptions + 690 designation records generated for ~$111 total.
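Read as a fallback chain, the hierarchy above can be sketched as below. This is a simplification under loud assumptions: in the project the glossary constrains the batch model through its system prompt rather than acting as a separate resolver, and the entity shape, the `romanise` and `batch_translate` callables, and the function name are all hypothetical. Only the three suffix mappings come from the text.

```python
# Suffix mappings quoted in the glossary description above.
GLOSSARY_SUFFIXES = {"神社": "Shrine", "寺": "Temple", "城": "Castle"}


def resolve_english_name(entity, romanise, batch_translate):
    """Walk the hierarchy top-down and return (name, source).

    entity: {"ja": str, "sitelinks": {lang: title}}  (hypothetical shape)
    romanise: callable turning a Japanese stem into romaji (hypothetical)
    batch_translate: callable standing in for the batch model (hypothetical)
    """
    # 1. A Wikipedia title wins when one exists.
    sitelink = entity.get("sitelinks", {}).get("en")
    if sitelink:
        return sitelink, "wikipedia"

    # 2. Otherwise apply a glossary suffix rule, longest suffix first.
    ja = entity["ja"]
    for suffix in sorted(GLOSSARY_SUFFIXES, key=len, reverse=True):
        if ja.endswith(suffix) and len(ja) > len(suffix):
            stem = ja[: -len(suffix)]
            return f"{romanise(stem)} {GLOSSARY_SUFFIXES[suffix]}", "glossary"

    # 3. Fall back to batch translation for everything else.
    return batch_translate(ja), "batch"


# Stubs for the demo; real romanisation and translation are out of scope here.
demo_romaji = {"出雲": "Izumo"}
print(resolve_english_name(
    {"ja": "出雲神社", "sitelinks": {}},
    romanise=demo_romaji.__getitem__,
    batch_translate=lambda text: f"[translated] {text}",
))
```

Returning the source alongside the name lets consumers filter by provenance, which is the same idea behind the confidence field mentioned under Contributing translations.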

Output: published as translations/multilingual_complete.jsonl (names), translations/descriptions_complete.jsonl (attraction descriptions), and r3/translations/r3_translations.jsonl (designation-source records) on the HF dataset, all in JSONL.
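Since all three outputs are JSONL, they can be streamed one record at a time without loading a whole file into memory. The reader below is a generic sketch; the record fields are not documented here, so the example makes no assumption about them, and the exact location of the files inside the downloaded snapshot should be checked against the dataset card.

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Yield one parsed object per non-blank line."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number}: {exc}") from exc


def read_jsonl(path):
    """Stream records from a JSONL file on disk."""
    with Path(path).open(encoding="utf-8") as handle:
        yield from parse_jsonl(handle)


# Self-contained demo on in-memory lines (invented records).
demo_lines = ['{"id": "Q1", "lang": "en"}\n', "\n", '{"id": "Q1", "lang": "ja"}\n']
for record in parse_jsonl(demo_lines):
    print(record)

# With a real snapshot (path layout is an assumption, verify locally):
#   for record in read_jsonl(Path(local_dir) / "translations" / "descriptions_complete.jsonl"):
#       ...
```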

English-first principle: While the source data is Japanese, this dataset is designed for global consumption. Field ordering, default tool responses, and documentation prioritize English. Other languages are first-class but always listed after English.

Contributing translations

We welcome PRs improving:

Translation quality (especially for low-confidence entries — see confidence field)
The canonical glossary (seed_canonical.json)
Coverage for entities currently lacking a Japanese Wikipedia anchor
Alternative translation pipelines (DeepL, Gemini, etc.) — document the source so consumers can choose by provenance.

Tests

The project ships a Vitest suite covering the pure helpers and the Hugging Face bootstrap. Tests run fully offline — the HF download path is exercised against a stubbed fetch, so no network access (and no HF_TOKEN) is required.

npm test              # run once
npm run test:watch    # watch mode
npm run test:coverage # v8 coverage report (text + html)

What's covered today:

src/lib/hf_data.ts — HF bootstrap (cold cache, idempotency, missing-only refetch, auth header, 401/404/500 paths, local checkout fallback)
scrapers/lib/canonical.ts — URL canonicalisation
scrapers/lib/spot_filter.ts — spot quality filter
scrapers/lib/extractor.ts — HTML → structured page data
scrapers/lib/prefecture_match.ts — prefecture / municipality name matching (sync paths)
scrapers/lib/discover.ts — isTourismLike keyword detection
scrapers/lib/state.ts — pickStaleMunicipalities rotation logic

Test files live under tests/, fixtures under tests/fixtures/. Tests are type-checked alongside the rest of the project — npm run typecheck:all covers src/, scrapers/, and tests/.

Quality report

Beyond unit tests, npm run quality:report audits the contents of the dataset: a per-prefecture coverage matrix (spots / hotels / festivals / local food / heritage) plus a per-spot quality score (description, body paragraphs, address, coordinates, Schema.org metadata, image) banded low / medium / high. Outputs land in data/_logs/quality_report_*.md and are the baseline we measure the multi-source sprint against.
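A per-spot score of this kind can be as simple as counting which of the six signals are present and banding the total. The sketch below illustrates that shape only: the field names and the band thresholds are assumptions, and the real report may weight signals differently.

```python
# The six signals listed above; the key names are hypothetical.
SIGNALS = (
    "description",
    "body_paragraphs",
    "address",
    "coordinates",
    "schema_org",
    "image",
)


def quality_band(spot):
    """Return (score, band), where score counts the signals present."""
    score = sum(1 for key in SIGNALS if spot.get(key))
    if score >= 5:    # assumption: 5-6 signals counts as high
        band = "high"
    elif score >= 3:  # assumption: 3-4 signals counts as medium
        band = "medium"
    else:
        band = "low"
    return score, band


spot = {"description": "Coastal dunes.", "address": "Tottori", "coordinates": (35.54, 134.23)}
print(quality_band(spot))  # (3, 'medium')
```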

Contributing

PRs are not just welcome — they're the whole point.
See CONTRIBUTING.md.

If your PR touches a data fetcher (anything under scrapers/ or DATA_SOURCES.md), one rule applies: the PR must update DATA_SOURCES.md and npm run validate:data-sources must pass (CI green) before merge. That's the single rule, enforced by CI. No exceptions.

License

Data: CC BY 4.0
Code: MIT

Attribution: Japan Travel MCP by KJ Sunada