@icjia/filecap
v1.7.33
File inventory CLI for accessibility audit scoping.
filecap walks a directory tree, introspects each file (PDFs, DOCX, XLSX), and produces a structured NDJSON inventory suitable for accessibility remediation scoping. The primary use case is generating per-server inventories of file stores (Strapi /uploads directories, general file servers) to hand to remediation vendors so they can produce a defensible, fixed-price quote on ADA Title II / WCAG 2.1 AA remediation work.
Are you a...
- Manager running an organization that has to comply with accessibility law? → Manager TL;DR
- Developer evaluating this for technical fit? → Developer TL;DR
- Accessibility vendor or auditor receiving an inventory? → Vendor / auditor TL;DR
- Just curious about what problem this solves? → Curious-onlooker TL;DR
TL;DR for managers
You run a website. Like most websites, it hosts hundreds or thousands of uploaded documents — PDFs of meeting minutes, Word documents of policies, image attachments, spreadsheets. Federal accessibility law (ADA Title II / WCAG 2.1 AA) requires those files to be accessible to people with disabilities.
To budget the remediation work, you need to know what's actually there: file counts by type, which PDFs are scanned images (may need OCR — often substantially more expensive than tagging born-digital PDFs), which Word docs lack heading structure, which tables are missing header rows, and so on.
filecap produces this inventory automatically. It walks your website's /uploads folder, parses every file, and writes a spreadsheet (CSV) plus an interactive HTML report with one row per file and detailed accessibility-relevant metadata. You hand the spreadsheet to a remediation vendor; they give you a fixed-price quote with confidence.
The included audit-remote.sh script automates the entire workflow against any server you have SSH access to. Auditors run one command, answer a few prompts, and get a vendor-ready deliverable. Works on macOS, Linux, and Windows (via WSL2). Free; open source.
Three things you, as a manager, get out of this:
- A precise count of files that may need remediation, with composition (not just wc -l).
- A spreadsheet you can email to bid-out vendors without explanation.
- Repeatability — re-run quarterly, see what changed.
As of 1.2.0, you can also publish the latest snapshot to a URL that your whole team can bookmark — one command bundles everything and deploys to Netlify. See Publishing a fleet snapshot.
Current shape of the fleet rollup (v1.7.x): the page reads like an infographic, not a spreadsheet.
- Header: the ICJIA wordmark plus a prominent "Use ICJIA's PDF audit tool" button that links to https://audit.icjia.app.
- Hero: leads with the audit count (the actionable number — e.g. "4,871 files may need accessibility audit") in big amber type, with a donut chart on the right showing the audit-share percentage and a plain-English caption like "About half may need audit."
- Site cards: one card per ICJIA site, sorted alphabetically by title, with a coloured access-method chip ("Strapi CMS / SSH required" / "GitHub repo / access required" / "Server / SSH required") so a remediator can see at a glance what credentials each site needs, and a "Last audit: " caption under the download button so staff can tell whether their downloaded CSV is current. The whole card is one click target; the per-card "Download spreadsheet" button still works independently.
- Technical details: a disclosure on each card expands to a labeled mini-grid (Website / IP / Hostname / Path / URL) with copy-to-clipboard buttons on every row.
- By file type: the section now drills down — click "PDFs" and you get a full detail page listing every PDF across the fleet, plus a CSV download with just those rows; same for Word documents, Excel, PowerPoint, images, text, archives, web files, and other.
- Per-site detail pages: each mirrors the index hero, surfaces a "How to access this site's files" panel (with the SSH-key / GitHub-access requirement plus "Contact IDS at ICJIA to request access"), has copy-to-clipboard buttons on every meta-grid row, and a sticky bar at the top with Back / Audit-a-PDF / Download-CSV / Last-audit-date.
- Wording and staff-fill columns: all manager-facing strings use "may need" rather than prescriptive "needs," and every CSV the bundle emits — per-site, master, and by-file-type — carries two staff-fill columns (Delete?, defaulting to "No", plus free-text Notes) so the audit team can mark which files should be removed before the next scan.
- Roadmap: as of v1.7.30 the index also has a violet "Coming soon" section at the bottom listing four in-development reference-discovery features (the Referenced + Status columns, cross-site detection, SPA-page rendering, and sitemap-validated reference URLs) — managers see the roadmap on the deployed bundle without needing to dig through the repo.
See the v1.7.x CHANGELOG entries for the version-by-version breakdown.
→ Skip to Quick start for managers for handoff instructions.
TL;DR for developers
Node.js CLI written in ESM, distributed via npm as @icjia/filecap. Walks a directory tree (concurrent-bounded), produces line-delimited JSON (NDJSON): a header line, one entry per file, a footer line. Each entry includes filesystem metadata + SHA-256 hash + format-specific introspection (pdfjs-dist for PDFs, jszip + fast-xml-parser for DOCX, exceljs for XLSX). 16-column CSV writer (14 file-descriptor columns + Delete? / Notes staff-fill columns added in v1.7.16, csvOnly so the HTML view stays at 14) + self-contained dark-mode HTML report with sortable/filterable client-side JS. Cross-server rollup with content-duplicate detection via SHA-256.
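The three-part NDJSON shape (a header line, one entry per file, a footer line) is easy to consume downstream. A minimal sketch of splitting an inventory into those parts — the field names in the test data are illustrative, not taken from the actual format spec:

```javascript
// Split an NDJSON inventory into its header record, per-file entries, and
// footer record. The three-line-kinds shape matches the README; concrete
// field names below are assumptions for illustration only.
function parseInventory(ndjson) {
  const records = ndjson
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
  return {
    header: records[0],
    entries: records.slice(1, -1),
    footer: records[records.length - 1],
  };
}
```

The full column spec lives in the NDJSON output format section below.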
Includes an MCP server (filecap mcp) exposing five tools for AI agents (Claude Desktop, Claude Code, Cursor, Windsurf, Continue): filecap_scan, filecap_rollup, filecap_report, filecap_query_inventory, filecap_web_rollup.
As of 1.2.0: filecap web-rollup bundles the most recent scan of every saved site into a static-site directory, ready for Netlify deployment (drag-and-drop, CLI, or Git-connected auto-deploy). Includes auto-generated netlify.toml, optional client-side SHA-256 password gate (--password), --no-client-gate for Netlify dashboard Site Password, and a --deploy flag that calls the Netlify CLI directly after the build.
Two distribution shapes: filecap CLI invoked directly via npx, plus standalone bash scripts (audit-remote.sh, audit-fleet.sh) auditors curl from GitHub raw URLs. The bash scripts handle SSH preflight, rsync mirroring (for older Ubuntu servers that can't run Node 20+), and post-scan path rewriting so the resulting CSV reflects source-server paths regardless of where filecap actually ran.
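The post-scan path rewrite is a simple prefix swap: the scan ran against a local rsync mirror, but the deliverable should show source-server paths. A sketch of the idea — the function and argument names are hypothetical, not the script's actual internals:

```javascript
// Rewrite a path scanned on a local rsync mirror so it reflects the
// source server's filesystem. Names here are illustrative only.
function rewritePath(scannedPath, localMirrorRoot, remoteRoot) {
  if (!scannedPath.startsWith(localMirrorRoot)) return scannedPath;
  return remoteRoot + scannedPath.slice(localMirrorRoot.length);
}
```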
ESM-only. Node 20+ required. 30 test files; 434 tests via vitest. Source under src/; entrypoint bin/filecap.js. License: MIT.
v1.7.x architecture summary.
- renderCard and generateIndexHtml (exported from src/web/index-page.js) build the fleet index — alphabetically-sorted <article class="site-card"> elements, each with the dp-hero pattern (nickname → big full name → two-up tiles → CSS-only conic-gradient donut → expanded tech-details with copy buttons → "Last audit: …" caption).
- writeHtml (in src/report/html.js) renders the per-site detail page using the same dp-hero block — accepting accessKind for the access-method panel and emitting both the <table class="row-marker-table"> legend and the file table with table-layout: fixed + <colgroup> + per-<th> resize handles.
- CSV_COLUMNS (in src/report/csv.js) is the single source of truth for column layout — entries flagged csvOnly: true (the deleteFlag and notes rows) are filtered out of the HTML view via CSV_COLUMNS.filter(c => !c.csvOnly), so the web table stays at 14 columns while CSVs ship at 16.
- deriveAccessKind(site) and TYPE_BUCKETS (both exported from src/commands/web-rollup.js) classify sites into strapi / github / server and into per-file-type buckets, respectively; the latter drives one CSV+HTML pair per non-empty bucket (audit-pdfs.csv + audit-pdfs.html, etc.) — each bucket page reuses writeHtml with a consolidated header so the file table reads "across the fleet, filtered to PDFs" out of the box.
- Click handling on the index card uses pointer-events: none on non-interactive descendants, with pointer-events: auto re-enabled on the action buttons, tech-details summary, and copy-to-clipboard buttons — much cleaner than the v1.7.1 z-index attempt that only let clicks land on padding gaps.
- Pure CSS / vanilla JS throughout — no chart library, no preprocessor; the only inline JS is the column-resize, tab-pan, and clipboard-copy handlers embedded in the report HTML.
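The csvOnly split can be shown in a few lines. Only the names CSV_COLUMNS and csvOnly come from the real code; the column entries in this sketch are illustrative stand-ins:

```javascript
// Single source of truth for column layout. The real CSV_COLUMNS carries
// all 16 columns; this sketch uses four illustrative entries.
const CSV_COLUMNS = [
  { key: "filename", header: "Filename" },
  { key: "sha256", header: "SHA-256 hash" },
  { key: "deleteFlag", header: "Delete?", csvOnly: true },
  { key: "notes", header: "Notes", csvOnly: true },
];

// The CSV writer uses every column; the HTML view filters out the
// staff-fill columns, so the web table stays narrower than the CSV.
const htmlColumns = CSV_COLUMNS.filter((c) => !c.csvOnly);
```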
→ Skip to Quick start for installation and basic usage.
TL;DR for vendors and auditors
You receive an audit-file-list.csv (16 columns, one row per file) with everything needed to scope and quote a remediation engagement:
- Identification: server name, website nickname, server IP, public URL (position 4 since v1.7.2 — front and centre so you don't have to scroll), date published, source folder on server, file location, full path, filename, extension, category.
- Filesystem metadata: size in bytes, SHA-256 content hash (Excel-text-formula-wrapped so it doesn't auto-convert to scientific notation), duplicate-of reference.
- Staff-fill columns (CSV-only, since v1.7.16): Delete? (defaults to "No" — staff marks "Yes" to flag a file for removal before the next audit) and Notes (free-text). These columns don't appear in the HTML view; the web is informational, the CSV is the actionable artefact.
Format-specific introspection columns (PDF page count, image-only/OCR, DOCX heading coverage, XLSX sheet count, etc.) were dropped from the CSV in v1.4.0/1.4.1 — remediators open the file in Adobe Acrobat / Word / Excel and read those properties directly. The full introspection is still carried in the underlying NDJSON inventory if your tooling wants it (the MCP server exposes a filecap_query_inventory tool for that).
The "Server IP" and "Full file path on server" columns identify exactly where each file lives — you ssh into the server and download the file directly. The CSV is optionally accompanied by an audit-file-list.html rendering of the same data with a sortable, searchable browser-based interface (without the staff-fill columns, by design).
As of 1.2.0, the auditor can also publish the fleet snapshot to a Netlify URL for a shared web-based view — useful for review meetings where you navigate by clicking rather than filtering a spreadsheet.
Zero account creation; the inventory is a vendor-neutral structured file you can ingest into your own tooling.
→ Skip to Report workflow for the full output spec.
TL;DR for the curious
filecap was originally built at ICJIA (the Illinois Criminal Justice Information Authority) to inventory the document files on our agency's public-facing websites — PDFs of meeting agendas, annual reports, statutes, etc. Federal accessibility law requires those files to be usable with screen readers, keyboard navigation, and other assistive technology, but figuring out exactly which files need which kind of work, across multiple servers, was a manual job that took weeks.
The tool is general-purpose. Any organization that hosts public-facing document repositories — government agencies, schools, libraries, nonprofits, businesses — can use it to scope their accessibility work. The output is a spreadsheet a remediation vendor can quote against, line by line.
The complexity in filecap exists because "is this PDF accessible?" is a much harder question than "does this file exist?" Answering it requires actually opening every file and inspecting its internal structure — see the next section for why this matters.
→ See the project page on GitHub: https://github.com/ICJIA/filecap-cli
"All I want is a file count for the remediators, all right? That's it. Just do it." — why filecap is more than wc -l
Imagine asking a remediation vendor for a quote. They say "I need to see the files first." You forward them a list of filenames and sizes. They reply: "Great — but how many are scanned PDFs vs born-digital? How many Word docs lack heading structure? How many tables are missing header rows? Without that detail, my quote will be the worst-case price for every single file."
That's why filecap exists. A simple find . -type f gives you filenames and sizes — but a vendor can't price accurately against that. They'll either give you a worst-case quote (you overpay), or insist on inspecting every file themselves (the audit takes weeks instead of hours).
filecap is built around one question: what does a remediation vendor need to know, per file, to give a defensible fixed-price quote? Every "complexity" in this tool answers a specific vendor question:
| Vendor question | What filecap captures |
|---|---|
| Is this PDF a scan (needs OCR — often substantially more expensive)? | isImageOnly, hasTextLayer, textLayerCoverage |
| Is this PDF already partly accessible? | hasTags, documentLanguage |
| Does this PDF need special handling? | encrypted, hasFormFields, hasSignatures |
| Is this Word doc structured for screen readers? | hasHeadings |
| Are tables marked up for accessibility? | tableCount, tablesHaveHeaders |
| Do images have alt text? | imageCount, altTextCoverage |
| Are hyperlinks descriptive? | vagueLinkCount (counts "click here", "read more", etc.) |
| Are spreadsheets navigable for screen readers? | xlsxSheetCount |
| Do the same files appear on multiple servers? | sha256 content hash + duplicateOf cross-server linking |
| Are filenames human-readable? | filename heuristic flags |
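The cross-server duplicate linking in the table above works roughly like this sketch. The markDuplicates helper and the entry shape are illustrative; only the sha256 hash and duplicateOf reference correspond to real inventory concepts:

```javascript
import { createHash } from "node:crypto";

// Hash each file's content; later occurrences of the same content point
// back at the first path seen. Illustrative sketch, not filecap's code.
function markDuplicates(files) {
  const firstSeen = new Map(); // sha256 hex -> first path with that content
  return files.map((f) => {
    const sha256 = createHash("sha256").update(f.content).digest("hex");
    const duplicateOf = firstSeen.has(sha256) ? firstSeen.get(sha256) : null;
    if (!firstSeen.has(sha256)) firstSeen.set(sha256, f.path);
    return { path: f.path, sha256, duplicateOf };
  });
}
```

Because the hash is over content, the linking works across servers and across renamed copies of the same file.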
The cost of NOT having this information is often substantially greater than the cost of running filecap. Scanned PDFs typically cost vendors substantially more to remediate than born-digital ones, because OCR + tagging is an order of magnitude more work than tagging alone. If your inventory has 100 PDFs and 30 of them are scanned, knowing that distinction affects the vendor quote materially.
filecap takes a few seconds per file to extract this metadata — and produces a spreadsheet a vendor can price line by line. That's the whole game.
So: yes, "just count the files" is a one-liner. But the count alone won't help you budget for compliance. The detail is the point.
Security audit
filecap is open source and tries to be transparent about its security posture.
The full audit findings and mitigations are in docs/security/audit-2026-05-10.md
(initial 1.3.0 baseline audit) and docs/security/audit-2026-05-11.md (re-audit
of every release through 1.6.5, covering the bearer-token store, git-clone
audit script, master/duplicates CSV exposure surface, deploy-time review, and
inline-JS additions). The summary below is for managers and auditors.
What we protect
- Auditor credentials. SSH keys and any FILECAP_AUDIT_TOKEN env var never appear in any output, log, or transcript.
- Bearer tokens (1.3.3+). JWT bearer tokens for sites whose public URL requires auth (intranet in the ICJIA fleet) live in ~/.filecap/secrets.json (mode 0600) or a FILECAP_BEARER_TOKEN_<SERVER_NAME> env var. The token is fed to curl via stdin (--header @-), never argv, so it does not appear in ps aux. secrets.json is never bundled, never exported via the saved-sites menu, never sent to a remediator.
- Shell injection. Every variable interpolated into SSH remote-command strings is quoted via printf '%q' to prevent command injection from malicious site configs.
- rsync symlink escape. The --no-links flag prevents a compromised remote server from using symlinks to copy files outside the intended uploads directory.
- Script integrity. The audit script verifies its own SHA-256 against the GitHub main branch on every run (--no-version-check to skip).
- Publish hygiene. The published npm package uses npm pack + explicit-tarball publish with 2FA-required publishes.
- Network transit. HTTPS for the audit-remote.sh download (raw.githubusercontent.com), npm package install, and Netlify deployment.
- Bundle privacy. Netlify's server-side Site Password (Pro plan) gates every file in the bundle including the master CSV (verified HTTP 401 on both the index and audit-file-list-master.csv for the production deployment).
- Output directory. ~/filecap-audits/<server-name>/ is created with mode 700 (user-only readable).
- Configuration files. ~/.filecap/config.json (autoDeploy + deploySite) and ~/.filecap/secrets.json (bearer tokens) are schema-validated on load via Zod (strict mode rejects unknown fields, catches typos). Both files mode-0600.
- MCP scan path restriction. Set FILECAP_MCP_ALLOWED_PATHS (colon-separated absolute paths) to restrict which directories an AI agent can scan.
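The "strict mode rejects unknown fields" behaviour can be sketched without the library: the real code uses Zod schemas, and this hand-rolled check only illustrates why rejecting unknown keys catches config typos early:

```javascript
// Hand-rolled illustration of strict-mode validation: any key not in the
// allowlist is an error. filecap's real validation uses Zod, not this.
function validateConfigStrict(config, allowedKeys) {
  const unknown = Object.keys(config).filter((k) => !allowedKeys.includes(k));
  if (unknown.length > 0) {
    throw new Error(`Unknown config field(s): ${unknown.join(", ")}`);
  }
  return config;
}
```

A typo like autoDepoy fails loudly at load time instead of being silently ignored.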
What we don't protect (residual risk)
- The optional client-side password gate (--password flag) is for "ward off the curious" only. The SHA-256 hash is unsalted and can be cracked offline with no rate limiting. Anyone with view-source can read all content. Do not use this gate for content you would not share publicly if the password were guessed. Use Netlify Site Password for actual enforcement.
- A compromised remote server could serve malicious PDFs that exploit pdfjs-dist parsing bugs (we depend on the upstream parser being patched). More rigorous isolation (sandbox/container) is future work.
- A stolen ~/.filecap/sites.json reveals server hostnames and remote paths but no credentials (SSH keys are never stored there). File mode is 600.
- The Netlify bundle URL is not secret. robots.txt blocks search-engine indexing, but the URL could leak via browser history or link sharing. Netlify Site Password provides the recommended protection.
- The initial curl download of audit-remote.sh is not verifiable at fetch time. The self-version-check detects post-download tampering, but not initial-fetch tampering. For maximum verifiability, download from a specific commit SHA URL rather than main.
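To make the "cracked offline with no rate limiting" point concrete: anyone with view-source has the hash and can test guesses at full local speed. A sketch of the attacker's dictionary loop (not code from filecap):

```javascript
import { createHash } from "node:crypto";

// Offline dictionary attack against an unsalted SHA-256 password hash.
// Nothing throttles this loop: each guess is one local hash computation.
function crackUnsaltedSha256(targetHex, guesses) {
  for (const guess of guesses) {
    const h = createHash("sha256").update(guess).digest("hex");
    if (h === targetHex) return guess;
  }
  return null;
}
```

A salted, slow KDF would blunt this; that is exactly why the README points you at Netlify's server-side Site Password for real enforcement.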
Live deployment posture (1.5.6)
The ICJIA fleet snapshot at https://icjia-fleet-audit.netlify.app was reviewed for deployment-specific risks after the initial deploy:
| Check | Status |
|---|---|
| TLS | ✓ HTTP/2 over TLS 1.3 (Netlify managed certificate, auto-renewed) |
| HSTS | ✓ strict-transport-security set by Netlify edge |
| robots.txt | ✓ User-agent: * + Disallow: / — blocks every path for every compliant crawler |
| X-Robots-Tag | ✓ noindex, nofollow on all HTML pages |
| CSV serving | ✓ Content-Disposition: attachment + Cache-Control: max-age=3600; Netlify Site Password gates these too (returns 401 to unauthenticated requests) |
| Site Password gate | ✓ Netlify Pro Site Password set via dashboard — server-side enforcement covers every file (verified HTTP 401 on /, /audit-file-list-master.csv, and a per-site report) |
| Security headers (all paths) | ✓ X-Frame-Options: DENY, X-Content-Type-Options: nosniff, Referrer-Policy: no-referrer |
| Deploy previews | Netlify deploy preview URLs inherit the site password setting by default; no separate exposure |
| Search engine indexing | Belt-and-suspenders: robots.txt blocks crawlers; X-Robots-Tag blocks indexing of any page that gets crawled anyway; the URL pattern is non-discoverable (no inbound links from public sites); password gate blocks content delivery regardless |
| Bundle URL secrecy | URL is not secret; could leak via browser history or link sharing. Site Password is the real protection. |
The deployment review surfaced no new findings beyond the 1.3.0 audit's residual-risk list. The Netlify Pro Site Password upgrade (compared to the 1.3.0 client-side gate) closes FC-2026-005 (unsalted-SHA-256 cracking risk) and FC-2026-014 (publicly-guessable bundle URL) — both were "documented" findings now mitigated by the server-side gate.
Audit findings summary (1.3.0 baseline)
Findings below come from the 1.3.0 red/blue team audit. Versions 1.3.1 through 1.5.6 added features (bearer-token storage, master CSV, duplicates section, infographic hero, etc.) but did not change the core security posture of the original components. A full re-audit is not scheduled; see "Changes since 1.3.0" below for what's new and how each was reviewed.
| ID | Severity | Finding | Status |
|---|---|---|---|
| FC-2026-001 | Critical | Shell injection via REMOTE_PATH in SSH scan commands | Fixed in 1.3.0 |
| FC-2026-002 | Critical | Shell injection via path in SSH test/find/du commands | Fixed in 1.3.0 |
| FC-2026-003 | Moderate | rsync follows remote symlinks (symlink escape) | Fixed in 1.3.0 |
| FC-2026-004 | Moderate | MCP server has no scan-path allowlist | Fixed in 1.3.0 |
| FC-2026-005 | Moderate | Unsalted SHA-256 password gate (cracking risk underdocumented) | Fixed in 1.3.0 (docs) |
| FC-2026-006 | Moderate | sitesFile path not validated (info leakage via error messages) | Fixed in 1.3.0 |
| FC-2026-007 | Moderate | sites.json not schema-validated on load | Fixed in 1.3.0 |
| FC-2026-008 | Moderate | HTML XSS coverage verification and regression tests | Fixed in 1.3.0 |
| FC-2026-009 | Low | Initial curl download not verifiable at fetch time | Documented |
| FC-2026-010 | Low | npx --yes accepts any latest version (supply-chain) | Documented |
| FC-2026-011 | Low | Audit output directory permissions not enforced | Fixed in 1.3.0 |
| FC-2026-012 | Low | pdfjs-dist parsing-attack surface | Accepted (mitigated by isEvalSupported:false) |
| FC-2026-013 | Low | jszip/exceljs zip-slip surface | Verified safe (in-memory only); documented |
| FC-2026-014 | Low | Netlify bundle URL publicly guessable | Documented |
| FC-2026-015 | Low | CSP header missing from netlify.toml | Deferred (inline scripts require unsafe-inline) |
| FC-2026-016 | Note | Client-side gate is not real security (by design) | Documented |
| FC-2026-017 | Note | Inventory NDJSON contains server metadata | Accepted (required for vendor work-order) |
| FC-2026-018 | Moderate | audit-static.sh exposed FILECAP_GITHUB_TOKEN via argv to git clone / git remote set-url | Fixed in 1.6.6 (token now passed via GIT_CONFIG_* env vars, not URL) |
| FC-2026-019 | Note | Master CSV + duplicates CSV in bundle increase data-exposure surface | Accepted (mitigated by Netlify Pro Site Password) |
| FC-2026-020 | Note | ~/.filecap/secrets.json readable by same-UID processes | Accepted (standard user-account trust model; env-var override available for 1Password CLI users) |
| FC-2026-021 | Note | audit-static.sh clone dir trusts repo contents | Accepted (same as Strapi mirror; auditor only clones repos they trust) |
| FC-2026-022 | Note | New inline JS in HTML reports (1.4.0+) reviewed for XSS | No new findings — all handlers use class-list / dataset reads, no innerHTML/eval |
Changes since 1.3.0 (security-relevant)
| Version | Change | Security implication | Mitigation |
|---|---|---|---|
| 1.3.1 | audit-fleet.sh auto-reads ~/.filecap/sites.json | No new surface — same data the saved-sites menu already exposed | sites.json mode-0600, schema-validated; bundle workflow safe for sharing |
| 1.3.2 | ~/.filecap/config.json for webRollup.autoDeploy | New file at ~/.filecap/ | Schema-validated, mode-0600; contains a Netlify site name (not a secret) |
| 1.3.3 | Bearer-token support via ~/.filecap/secrets.json | New credential at rest | Mode-0600, never bundled/exported; env-var override for users who prefer 1Password CLI / direnv; token fed to curl via stdin (--header @-), never argv |
| 1.4.0 / 1.4.1 | Trimmed CSV/HTML to 14 columns; click-and-drag pan JS | No new surface — drag-pan is pointer-events only, no remote requests | XSS test suite (FC-2026-008) regression-covers the new render path |
| 1.5.0 | Cross-server duplicates section; audit-file-list-master.csv in bundle | Adds data-exposure surface (master CSV is a single ~7 MB file with every path on every server) | Mitigated by Netlify Site Password gate at deployment time (verified HTTP 401 on the master CSV) |
| 1.5.1 | audit-file-duplicates.csv (per-occurrence) in bundle | Same data-exposure surface, smaller file | Same mitigation |
| 1.5.2–1.5.6 | Visual / UX changes (table styling, infographic hero, total in heading) | No new security surface | n/a |
How to report a security issue
Email the audit administrator or open a private GitHub Security Advisory at
https://github.com/ICJIA/filecap-cli/security/advisories/new.
Do not open a public GitHub issue for security bugs.
Reports are acknowledged within 5 business days.
How to verify the audit yourself
cat docs/security/audit-2026-05-10.md # initial 1.3.0 audit
cat docs/security/audit-2026-05-11.md # re-audit covering 1.3.1 - 1.6.5
npm audit
npx vitest run test/report-html.test.js
npx vitest run test/mcp-tools.test.js
npx vitest run test/web-rollup.test.js
Table of contents
- Are you a...
- "All I want is a file count for the remediators..."
- Status
- Quick start
- Quick start for managers
- CLI reference
- Multi-server workflow
- NDJSON output format
- What gets introspected
- Filename flags
- Rollup workflow
- Report workflow
- MCP server
- For auditors: self-contained audit scripts
- Publishing a fleet snapshot
- What filecap does not do
- Troubleshooting
- License
- Related tools
Status
v1.7.x shipped. The full inventory pipeline (scan → rollup → report → web-rollup → deploy) is end-to-end functional, with the v1.7.x manager-friendly visual redesign live:
- infographic-style site cards on the fleet index (alphabetically sorted by title), with a matching hero pattern on per-site detail pages;
- big amber audit-count headlines with a CSS-only donut chart and plain-English captions ("Two-thirds may need audit");
- whole-card click → detail page, two-axis touch-friendly table scrolling, and click-and-drag resizable detail-page columns;
- copy-to-clipboard buttons throughout (per-card tech-details + per-site meta-grid);
- per-file-type detail pages and CSV downloads (audit-pdfs.html/audit-pdfs.csv, audit-docx.*, etc.);
- staff-fill Delete? + Notes columns on every CSV (v1.7.16);
- a prominent "Use ICJIA's PDF audit tool" button linking to audit.icjia.app in the navbar of every page, plus the ICJIA wordmark in the index navbar;
- a "Last audit: …" caption under every CSV download;
- a redesigned "Files that appear on more than one server" duplicates section that explains in plain English that duplicates are normal and not a webmaster error.
The bundle still includes cross-server duplicates detection (1.5.0), a master CSV combining every file from every server (1.5.0), and a per-occurrence duplicates CSV for pivoting in Excel (1.5.1). All artefacts are deployable to Netlify with one command via the webRollup.autoDeploy config flag (1.3.2), with bearer-token support for sites whose public URL requires JWT auth (1.3.3, mode-0600 ~/.filecap/secrets.json).
| Phase | Version | Status | Deliverable |
|---|---|---|---|
| 1 | v0.1.0 | shipped | Core scan — recursive walk, hashing, NDJSON output |
| 2 | v0.2.0 | shipped | PDF introspection (image-only, tags, signatures, language) |
| 3 | v0.3.0 | shipped | Office introspection (DOCX, XLSX, legacy flag) |
| 4 | v0.4.0 | shipped | Filename flagging |
| 5 | v0.5.0 | shipped | Multi-server rollup |
| 6 | v0.6.0 | shipped | CSV reporter and summary artifacts |
| 7 | v1.0.0 | shipped | MCP server entry point |
| 8 | v1.0.1 | shipped | MCP client docs (Claude Desktop, Claude Code, Cursor, Windsurf, Continue) |
| 9 | v1.0.2 | shipped | Audit automation scripts, HTML report, enhanced metadata, auditor-readable output |
| 10 | v1.0.3 | shipped | Self-version-check, timestamped runs, --site-name flag, README overhaul |
| 11 | v1.1.0 | shipped | Column-set slim, audit.icjia.app integration removed |
| 12 | v1.2.0 | shipped | filecap web-rollup — static-site bundle with Netlify amenities; filecap_web_rollup MCP tool |
| 13 | v1.3.0 | shipped | Red/blue team security audit (17 findings, all Critical and Moderate fixed) |
| 14 | v1.3.x | shipped | Auto-detected sites.json for fleet script; opt-in ~/.filecap/config.json webRollup.autoDeploy; bearer-token support (~/.filecap/secrets.json) |
| 15 | v1.4.x | shipped | CSV/HTML deliverable trimmed to 14 columns; click-and-drag horizontal pan on every table |
| 16 | v1.5.x | shipped | Cross-server duplicates with action explainer; master CSV + duplicates CSV in bundle; infographic hero; table-styling consistency; "Back to fleet index" navigation on per-site detail pages; footer links to GitHub + CHANGELOG |
| 17 | v1.6.0 | shipped | type: "git" site mode — audit self-contained static-site (Nuxt) repos by shallow-cloning + scanning the repo's /public/ folder. Mixed strapi + git fleets in one bundle. |
| 18 | v1.7.x | shipped | Manager-friendly visual redesign: optional siteFullName field in sites.json plumbed end-to-end; 2-col infographic card grid with big two-up tiles, CSS-only donut, plain-English captions, file-type chips, clickable cards with hover elevation; matching dp-hero pattern on per-site detail pages; "Public URL" promoted to column 4; two-axis touch-pannable tables; resizable detail-page columns (drag right edge of any <th>); big visual duplicates section with always-visible plain-English explainer |
| 19 | v1.7.6 | shipped | Access-method chip on every index card + matching "How to access this site's files" panel on every per-site detail page; auto-classifies each site into strapi / github / server from existing sites.json fields via new exported deriveAccessKind(site) helper; color-coded (cyan/violet/amber) with WCAG AA contrast; both surfaces close on "Contact IDS at ICJIA to request access." with OpenSSH-key or GitHub-org-access copy as appropriate |
| 20 | v1.7.7 | shipped | Whole-card click fix on index page — switched from broken z-index stretched-link to a pointer-events: none cascade with re-enables on action buttons + tech-details summary; copy-to-clipboard buttons on five rows of the detail-page meta-grid (IP, hostname, scanned path, scanned at, public URL) with green "Copied" affordance, navigator.clipboard.writeText + execCommand fallback |
| 21 | v1.7.8 | shipped | Index-card "Technical details" disclosure now shows a five-row mini-grid (Website, IP, Hostname, Path, URL) with a copy-to-clipboard button on every row; URL row keeps a clickable <a target="_blank"> alongside the copy button. Plus a sweep through all manager-facing strings to soften prescriptive "needs/need …" to "may need …" (the bucket phrases on cards + detail dp-hero, the audit-share tile/donut labels, the by-file-type column headings, the duplicates explainer, the row-color legend, the audit-summary.txt text deliverable, the README.txt template) — filecap describes what the data suggests, the audit team decides what to do. |
| 22 | v1.7.9 | shipped | Donut percentage centring fix — text-align: center on .site-card .donut .pct so the percentage + caption sit visually centred inside the donut hole regardless of caption length. (Pre-v1.7.9 it was accidentally OK because "need audit" and "67%" were near-equal widths; v1.7.8's longer "may need audit" exposed that the .pct box was a left-aligned column.) |
| 23 | v1.7.10 | shipped | Donut grown from 130 × 130 px to 180 × 180 px (and ::after inset 14 → 22) so "MAY NEED AUDIT" comfortably fits inside the inner hole with ~10 px of clearance on each side from the colored ring; percentage glyph upsized from 1.5 em → 1.7 em for proportional balance. |
| 24 | v1.7.11 | shipped | Per-site detail page's row-marker legend redesigned as a proper 3-column <table> (Marker / What it means / What to do about it) with <thead> labels, row dividers, and a @media (max-width: 700px) stacked fallback — replaces the pre-v1.7.11 flex-paragraph layout that wrapped mid-clause across five lines. |
| 25 | v1.7.12 | shipped | Image-only PDF row tint now actually visible — hidden CSS specificity bug present since v1.0.2 where tbody tr:nth-child(even/odd) (0,1,2) outranked tr.image-only (0,1,1), so only the first cell rendered the tint. Fixed via tbody tr.image-only td (0,1,3) and a colour bump from luminance-twin #111000 to clearly amber #3a2c08 (≥ 8:1 contrast on #e5e5e5 text). |
| 26 | v1.7.13 | shipped | Index hero redesigned around the audit count rather than the total — pre-v1.7.13 the hero led with the 14k+ total-files number which managers misread as "the audit scope," and the new hero leads with the actionable count in 105 px amber with the total in a secondary context line. Two-column infographic: big number on the left, 200 px donut + plain-English phrase on the right. Stacks under 720 px. |
| 27 | v1.7.14 | shipped | Per-file-type detail pages with CSV downloads — every non-empty bucket in the index's "By file type" table emits both audit-<slug>.csv (filtered master, every row tagged with its source server) and audit-<slug>.html (per-site-style detail page reusing the dp-hero pattern). New exported TYPE_BUCKETS constant is the single source of truth (used by both the writer in web-rollup.js and the index renderer); each bucket has keys (so legacy-office/office-legacy synonyms merge), side, label, slug. |
| 28 | v1.7.15 | shipped | Three small index-page changes: cards alphabetized by siteFullName via localeCompare(..., { sensitivity: "base" }); ARI Summit cards renamed across all four years to "ARI All Sites Summit YYYY" (was "Adult Redeploy …"); ICJIA wordmark added to the index navbar (~13 kB inline SVG with currentColor fills so dark navbar gets white and print mode gets black without forking markup). |
| 29 | v1.7.16 | shipped | Three CSV/workflow changes: every CSV (per-site, master, by-type) gains two staff-fill columns — Delete? (default "No") and Notes (default "") — flagged csvOnly so the HTML view stays at 14 columns; visible PDF audit-tool button (links to audit.icjia.app, opens new tab) in the index navbar and per-site sticky bar; "Last audit: " caption beneath every CSV download button. |
| 30 | v1.7.17 | shipped | Cross-server duplicates section made info-only — pulled the audit-file-duplicates.csv download button (file still emitted server-side, available by direct URL); replaced with a "For information only" callout listing three concrete reasons duplicate removal is trickier than removing a unique file (N-times search surface, "wrong copy" risk, asymmetric references). |
| 31 | v1.7.18 | shipped | Plain-English explainer beneath the "N-times the search surface" reason — unpacks N and Big O notation for non-technical readers, contrasts O(N) with O(1), and closes with a concrete "5–15 minutes per copy" budget for the manager. |
| 32 | v1.7.19 | shipped | Duplicates table filter (Remediable only / Reference only / All) defaulting to Remediable; "Some duplicates are intentional and required" callout covering meeting agendas posted on both a specialty site and the ICJIA main site for Open Meetings Act compliance. |
| 33 | v1.7.20 | shipped | Git-type entries link to GitHub /blob/ URL instead of publicUrlBase + path — the static-site Netlify deploys have an SPA _redirects catch-all that returns the homepage HTML for unmatched paths, so deployed-URL links looked correct but went nowhere; GitHub source is the reliable destination. Duplicates hero numbers + counting-note now dynamically follow the active filter so the headline always corresponds to the current view. |
| 34 | v1.7.21 | shipped | "For AI models" section added between the master CSV and the duplicates section. Two new read-only companion files: audit-fleet.ndjson (consolidated NDJSON with full introspection — PDF page count, image-only flag, DOCX heading coverage, alt-text coverage, XLSX sheet count, all the fields the CSV strips for vendor readability) and audit-fleet-context.md (narrative with summary stats, schema doc, sample LLM prompts). Framed deliberately as optional + forward-looking — state-agency AI policy is still evolving, the section explains why the files are there and that the CSVs remain the actionable artefact. |
| 35 | v1.7.22 | shipped | Big "Cross-Server Duplicates" section banner with 72 × 5 px amber-gradient accent bar above the existing duplicates hero — eye now sees a clear "new section starts here" break after the For-AI-models block. |
| 36 | v1.7.23 | shipped | Mirror "Section · Fleet snapshot" banner at the top of the page (blue gradient — distinguishes from duplicates' amber) so the page reads as two symmetric major sections. Prominent green "Zero PII" reassurance banner with two side-by-side IN / NOT-IN lists (filenames + file metadata + format-specific structure on the IN side; SSNs / DOB / driver's licenses / names / addresses / phone / email / case-file content / personnel records / credentials on the NOT-IN side) plus an Intranet-specific footnote. |
| 37 | v1.7.24 | shipped | Duplicates explainer compressed from five colored callouts (historical context + "not an error" + "intentional" + exact/variant cards + false-positives caveat) into one cohesive block — three tight paragraphs + the exact/variant kind-cards + a collapsed <details> for the false-positives caveat. Same content, ~50 % less visual noise. Navbar audit-tool button label: "Use ICJIA's PDF audit tool" → "Try ICJIA's PDF audit tool" (softer suggestion, less prescriptive). |
| 38 | v1.7.25 | shipped | PII banner relocated from top of page to immediately above the "Websites in this audit" site grid (audit numbers + donut hero get the above-the-fold position); headline spelled out as "Zero Personally Identifying Information (PII) in this audit" so non-technical readers don't misread the acronym as "PILL"; banner vertically tightened ~30 % (smaller padding, icon, font sizes — no content cuts). |
| 39 | v1.7.26 – v1.7.29 | shipped | Sticky-bar polish on per-site detail pages (button alignment + smaller fonts), ICJIA Accessibility FAQs navbar button paired with the audit-tool button, dynamic site-count in the top-section lede, empty-default Delete? column (CSV can't carry validation dropdowns), green gradient on the detail-page CSV download button (distinct from the blue "navigate elsewhere" register), and label tightening (ICJIA PDF Audit Tool, no possessive). See CHANGELOG entries v1.7.26 – v1.7.29 for the per-version breakdown. |
| 40 | v1.7.30 | shipped | New violet-accented "Coming soon" section at the bottom of the fleet index — eyebrow + clamped headline + lede + accent bar, mirroring the existing fleet / duplicates banner anatomy. Lists four reference-discovery items currently in development on a side branch: the Referenced + Status columns ("where is this file linked from?" / Active or Orphan verdict), cross-site reference detection, SPA-page rendering (for sites where the curl crawler sees only an empty shell), and sitemap-validated reference URLs so clicked references never 404. Visual register: violet (#d2a8ff → #8957e5) — a third color identity beyond the existing blue (current state) and amber (warning) so the eye instantly registers the section as "upcoming." Links out to the CHANGELOG so managers can track progress between releases. |
| — | 1.8.0 alpha (in branch) | in progress | Reference discovery: per-site Referenced + Status columns (Active / Orphan candidate / Discovery N/A), sitemap-driven HTML crawl with polite throttling, automatic Strapi GraphQL fallback for SPA sites (auto-detected; supports v3 + v4 schemas with typed-media-relation discovery), cross-site reference resolver, coverage banner explaining what was actually scanned. Work-in-progress; see the Coming-soon section on the deployed fleet index. |
| — | vNext | deferred | Headless rendering for SPA sites where Strapi GraphQL fallback isn't sufficient (custom client-rendered tables like ARI's resources page); strapi-aware mode (separate package); content-type sanity check on URL preflight; filecap process-deletions <csv> to read staff-edited CSV Delete?-Yes rows and remove the matching files on each source server. |
## Production deployment
The ICJIA fleet snapshot is deployed at:
https://icjia-fleet-audit.netlify.app
The site is password-protected (Netlify Pro Site Password — server-side enforcement, gates every file including the CSVs). The current password is held by ICJIA's IDS (Innovation and Digital Services) team — request access by emailing IDS at ICJIA. The password is rotated periodically; if a previously-shared password stops working, ask IDS for the current one.
Deploy mechanics: filecap web-rollup automatically pushes to this Netlify site whenever webRollup.autoDeploy: true is set in ~/.filecap/config.json (with deploySite: "icjia-fleet-audit"). To force a fresh deploy after a new audit, run ./examples/audit-fleet.sh && filecap web-rollup. No --deploy flag needed.
"Wait — if it's password-protected, why can I still 'view source' on the gate page?"
This is a common observation, and the short answer is: what you're viewing the source of is Netlify's challenge page, not the underlying fleet rollup. Until you authenticate, the actual inventory content (site names, file paths, public URLs, totals, the per-site CSVs, the master CSV — everything you'd consider sensitive) is never sent to your browser at all.
You can verify this for yourself in three seconds:
```bash
curl -i https://icjia-fleet-audit.netlify.app/
```

Returns HTTP/2 401 and roughly 3.5 KB of body. That body is Netlify's password-challenge HTML — a `<form>`, some Netlify-managed CSS, and a brand stripe. Grep it for anything from our fleet:
```bash
curl -sS https://icjia-fleet-audit.netlify.app/ | grep -iE "dvfr|icjia|illinois|\.pdf|\.csv"
# (no matches — the challenge page contains zero references to our data)
```

And try to fetch a specific inventory file directly without authenticating:
```bash
curl -i https://icjia-fleet-audit.netlify.app/audit-file-list-master.csv
# HTTP/2 401 — even when you ask for a specific path, you get the challenge page
```

The gate is enforced at Netlify's edge (server-side), not by JavaScript in your browser. There is no "underlying source" to peek at on the gate page because no underlying content has been served. Once you enter the password, Netlify sets an auth cookie and proxies the real content; before that, every URL returns the same 3.5 KB challenge page.
For the genuinely paranoid: we have a documented fallback design (Option B in the project's internal-security notes) using a Netlify Edge Function that serves our own gate page from this repo's source, so the gate HTML is auditable in-tree rather than coming from Netlify's template. We haven't implemented it because the current setup demonstrably leaks nothing; the edge-function path is reserved for the day someone formally requests it. If you have that need, file a GitHub issue and we'll prioritise it.
## Quick start
```bash
npx --yes @icjia/filecap scan /var/strapi/uploads
# writes filecap-<hostname>.ndjson in cwd
```

The output is line-delimited JSON: one header line, one line per file, one footer line.
## Quick start for managers
If you're handing this off to an auditor or accessibility coordinator, copy the block below verbatim. They have everything they need.
For the auditor (single server):
1. Use macOS, Linux, or Windows with WSL2/Ubuntu (see Windows: the situation below). On Windows, run everything inside WSL2 — never PowerShell, Command Prompt, Git Bash, or PuTTY.
2. Install Node.js 20+ (https://nodejs.org).
3. Generate an OpenSSH key with `ssh-keygen -t ed25519` (skip if you already have one) and have your server admin authorize it on the target server. See Setting up SSH access for the full flow.
4. Run these three commands:

   ```bash
   curl -O https://raw.githubusercontent.com/ICJIA/filecap-cli/main/examples/audit-remote.sh
   chmod +x audit-remote.sh
   ./audit-remote.sh
   ```

5. Answer the prompts (SSH user, server IP, path to uploads, optional website nickname).
6. The deliverable is at `~/filecap-audits/<server-name>/latest/report/`. Open `audit-file-list.csv` (Excel/Numbers/Sheets) or `audit-file-list.html` (any browser).
7. Email the entire `report/` folder to your remediation vendor.
For the auditor (multiple servers / fleet, with a sites.json bundle):
If you've been handed a `sites.json` file along with these instructions, it lists every site in the audit — you don't have to type any server details.

1. Same prerequisites as above (macOS/Linux/WSL2-Ubuntu, Node 20+, OpenSSH key authorized on every target server).
2. Drop the `sites.json` you received into `~/.filecap/`:

   ```bash
   mkdir -p ~/.filecap
   mv /path/to/sites.json ~/.filecap/
   ```

3. Download both scripts and run:

   ```bash
   curl -O https://raw.githubusercontent.com/ICJIA/filecap-cli/main/examples/audit-fleet.sh
   curl -O https://raw.githubusercontent.com/ICJIA/filecap-cli/main/examples/audit-remote.sh
   chmod +x audit-fleet.sh audit-remote.sh
   ./audit-fleet.sh
   ```

4. The fleet deliverable is at `~/filecap-audits/_fleet/latest/`. Email the whole folder (or the `consolidated-report/` subfolder) to your remediation vendor.
## CLI reference
### `filecap scan <directory>`
| Flag | Default | Description |
|---|---|---|
| -o, --output <path> | filecap-<hostname>.ndjson | Output path (use - for stdout) |
| -s, --server-name <name> | os.hostname() | Override server identifier in metadata |
| --server-ip <ip> | auto-detected | Override server IP (defaults to first non-loopback IPv4) |
| --site-name <name> | (none) | Optional website nickname (e.g., DVFR, i2i). Used as a human-friendly identifier alongside --server-name. |
| --public-url-base <url> | (none) | Base URL where files are publicly served. Adds a clickable Public URL column to CSV and HTML reports. |
| --no-hash | (off) | Skip SHA-256 hashing (much faster, but no dedup) |
| --no-introspect | (off) | Skip PDF/Office introspection (filesystem stats only) |
| --max-introspect-mb <n> | 200 | Skip introspection for files larger than this |
| --include-ext <list> | (all) | Comma-separated extensions to include |
| --exclude-ext <list> | (none) | Comma-separated extensions to exclude |
| --concurrency <n> | 4 | Parallel introspection/hashing workers |
| --progress | (off) | Emit progress to stderr |
| --quiet | (off) | Suppress non-error output |
Exit codes. 0 success, 1 argument or runtime error, 2 directory not readable, 3 partial completion.
### `filecap rollup <files...>`
Merge multiple per-server NDJSONs into a consolidated inventory.
| Flag | Default | Description |
|---|---|---|
| -o, --output <path> | consolidated.ndjson | Output path |
| --strict | (off) | Fail on schema mismatch or missing footer in any input (default: warn and skip) |
### `filecap report <inventory>`
Generate vendor handoff package (CSV + summary + flagged lists) from an inventory NDJSON (single-instance or consolidated).
| Flag | Default | Description |
|---|---|---|
| -o, --output <dir> | ./filecap-report-<ts>/ | Output directory |
| --html | (off) | Also write a self-contained sortable dark-mode HTML report (audit-file-list.html) |
### `filecap web-rollup`
Bundle the most recent scans of every saved site into a static-site directory ready for Netlify or any static host.
| Flag | Default | Description |
|---|---|---|
| -o, --output <dir> | ~/filecap-audits/_web-rollup/<ts>/ | Output directory |
| --password <pw> | (none) | Embed SHA-256 of this password in a client-side gate on every page |
| --no-client-gate | (off) | Skip the client-side gate JS. Use with Netlify dashboard Site Password for server-side enforcement. |
| --deploy | (off) | After building, run netlify deploy --prod automatically. Requires Netlify CLI installed and logged in. |
| --deploy-site <site-id> | (none) | Pass --site <id> to netlify deploy (for non-linked sites) |
| --title <title> | "filecap fleet audit snapshot" | Title shown on the index page |
| --include-site <name...> | (all sites) | Only bundle these site nicknames |
| --exclude-site <name...> | (none excluded) | Skip these site nicknames |
| --sites-file <path> | ~/.filecap/sites.json | Override saved-sites JSON path |
When --no-client-gate is passed without --password, the bundle is open by design. When both are passed, --password is ignored (a warning is printed) — the bundle has no embedded gate and Netlify Site Password provides the protection.
### `filecap mcp`
Starts an stdio MCP server for use with AI agent clients (Claude Desktop, Claude Code, Cursor, etc.). No flags — configuration is handled by the client.
## Multi-server workflow
When scanning multiple servers from a single coordinator with SSH access:
```bash
ssh deploy@strapi-prod-01 "npx --yes @icjia/filecap scan /var/strapi/uploads -o -" > ./inventories/strapi-prod-01.ndjson
```

The `-o -` flag writes NDJSON to stdout, which SSH transports back. Compute (walk, hash, introspection) happens on the remote; only the inventory output crosses the network.
A sample bash orchestrator is in examples/multi-scan.sh.
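For a coordinator written in Node rather than bash, the per-server command string can be assembled the same way. A minimal sketch — the `scanCommand` helper is hypothetical, not part of filecap; the shipped orchestrator is the bash script mentioned above:

```javascript
// Hypothetical helper: build the remote-scan command for one server,
// mirroring the documented ssh one-liner. Compute stays on the remote;
// only the NDJSON inventory comes back over the SSH channel.
function scanCommand(user, host, uploadsPath) {
  return (
    `ssh ${user}@${host} ` +
    `"npx --yes @icjia/filecap scan ${uploadsPath} -o -"` +
    ` > ./inventories/${host}.ndjson`
  );
}
```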
## NDJSON output format
Line-delimited JSON. First line: header (scan metadata). Last line: footer (summary stats). Lines in between: one per file.
Example header:
```json
{
  "schemaVersion": 1,
  "kind": "filecap-inventory-header",
  "metadata": {
    "siteName": "DVFR",
    "serverName": "dvfr-strapi-prod",
    "hostname": "dvfr-strapi-prod",
    "serverIp": "192.241.146.85",
    "scannedPath": "/var/strapi/uploads",
    "publicUrlBase": "https://dvfr.icjia-api.cloud/uploads",
    "scannedAt": "2026-05-09T14:23:11.000Z",
    "filecapVersion": "1.2.0",
    "nodeVersion": "20.19.0",
    "options": { "introspect": true, "hash": true, "maxIntrospectMb": 200, "concurrency": 4 }
  }
}
```

`siteName` and `publicUrlBase` are optional. Omitting them is valid. Old inventories without them continue to validate.
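Because the format is line-delimited, a consumer can load an inventory with a few lines of Node. A minimal sketch — the `parseInventory` helper is hypothetical (not part of filecap), and the shape follows the documented layout of first line header, last line footer:

```javascript
// Split a filecap NDJSON inventory into its three documented parts:
// first line = header, last line = footer, everything between = file entries.
function parseInventory(ndjsonText) {
  const lines = ndjsonText.trim().split("\n").map((line) => JSON.parse(line));
  return {
    header: lines[0],
    entries: lines.slice(1, -1),
    footer: lines[lines.length - 1],
  };
}
```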
As of 1.7.0: sites.json entries also accept an optional siteFullName — a verbose, human-readable name like "Domestic Violence Fatality Review" alongside the short nickname "DVFR" in siteName. The full name is rendered as the card title on the fleet index and the <h1> on the per-site detail page; sites without siteFullName keep using siteName as the title. siteFullName lives in sites.json, not in the inventory header — it's a per-publication choice, not a per-scan property.
Example file entry (PDF):
```json
{
  "path": "2024/reports/annual-report.pdf",
  "absolutePath": "/var/strapi/uploads/2024/reports/annual-report.pdf",
  "filename": "annual-report.pdf",
  "extension": "pdf",
  "category": "pdf",
  "remediable": true,
  "sizeBytes": 4827193,
  "modifiedAt": "2024-03-12T09:14:22.000Z",
  "sha256": "e3b0c44...",
  "flags": [],
  "introspection": {
    "kind": "pdf",
    "pageCount": 48,
    "hasTextLayer": true,
    "textLayerCoverage": 1.0,
    "isImageOnly": false,
    "hasTags": false,
    "hasFormFields": false,
    "hasSignatures": false,
    "encrypted": false,
    "documentLanguage": "en-US"
  }
}
```

Example file entry (DOCX):
```json
{
  "path": "2024/policies/handbook.docx",
  "absolutePath": "/var/strapi/uploads/2024/policies/handbook.docx",
  "filename": "handbook.docx",
  "extension": "docx",
  "category": "office-document",
  "remediable": true,
  "sizeBytes": 152340,
  "modifiedAt": "2024-06-15T13:00:00.000Z",
  "sha256": "a1b2c3d4...",
  "flags": [],
  "introspection": {
    "kind": "docx",
    "hasHeadings": true,
    "imageCount": 5,
    "altTextCoverage": 0.8,
    "tableCount": 3,
    "tablesHaveHeaders": true,
    "vagueLinkCount": 2,
    "documentLanguage": "en-US"
  }
}
```

Example file entry (XLSX):
```json
{
  "path": "2024/data/budget.xlsx",
  "absolutePath": "/var/strapi/uploads/2024/data/budget.xlsx",
  "filename": "budget.xlsx",
  "extension": "xlsx",
  "category": "spreadsheet",
  "remediable": true,
  "sizeBytes": 48720,
  "modifiedAt": "2024-04-01T09:00:00.000Z",
  "sha256": "f9e8d7c6...",
  "flags": [],
  "introspection": {
    "kind": "xlsx",
    "sheetCount": 4
  }
}
```

Example file entry (legacy .doc):
```json
{
  "path": "archive/2010-memo.doc",
  "filename": "2010-memo.doc",
  "extension": "doc",
  "category": "office-document",
  "remediable": true,
  "introspection": {
    "kind": "office-legacy",
    "format": "doc"
  }
}
```

The presence of `kind: "office-legacy"` is itself the signal: this file needs manual review with Office or an upgrade to a modern format before remediation.
## What gets introspected

### PDF
| Field | What it tells you |
|---|---|
| pageCount, hasTextLayer, textLayerCoverage, isImageOnly | Text vs. scanned content |
| hasTags | PDF structure tags (most important PDF a11y feature) |
| hasFormFields, hasSignatures | Specialized remediation requirements |
| encrypted | Whether the file is password-protected |
| documentLanguage | Declared language (WCAG 3.1.1) |
### DOCX
| Field | What it tells you |
|---|---|
| hasHeadings | Document uses Word heading styles (essential for screen-reader navigation) |
| imageCount, altTextCoverage | Number of images and what fraction have alt text |
| tableCount, tablesHaveHeaders | Table count and whether any table has marked header rows |
| vagueLinkCount | Links using ambiguous text ("click here", "read more") |
| documentLanguage | Declared language (WCAG 3.1.1) |
### XLSX
| Field | What it tells you |
|---|---|
| sheetCount | Total number of sheets |
### Legacy .doc/.ppt/.xls
Flagged by extension only — kind: "office-legacy" with the specific format. These binary formats need Office or specialized tools to inspect.
When introspection fails (corrupt file, unsupported variant, parse exception), the introspection field is omitted from the entry. The file row still appears with full filesystem stats.
Files larger than --max-introspect-mb (default 200) skip introspection regardless of type.
## Filename flags (Phase 4)
Every entry's flags[] array is populated with applicable filename-heuristic flags. These drive the flagged_filenames.txt artifact in every report:
| Flag | When applied |
|---|---|
| scanned-name-pattern | Filename matches scanner / photo / default-output naming: Scan_001.pdf, IMG_4567.jpg, Document1.docx, Untitled-1.pdf, 12345.tiff, DOC001.pdf, FAX-2024-04-12.pdf, Microsoft Word - draft.pdf, etc. Strong signal that the file is an unprocessed export from a scanner, phone camera, or default save-as. |
| filename-has-spaces | Basename contains whitespace. URL-encoded spaces (%20) are a common source of CMS friction and copy-paste bugs. |
| filename-non-ascii | Basename contains characters outside the printable ASCII range (e.g., résumé.pdf, 文件.docx). Web-server URL handling and some legacy systems still mishandle these. |
| filename-long | Basename exceeds 200 characters. Long names cause filesystem truncation and URL length issues. |
Flags are emitted as a sorted array. The flags column was removed from the CSV and HTML report in v1.1.0 — it is now used only to populate flagged_filenames.txt. A file with no triggered flags has flags: [] (empty array).
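Illustratively, the four flag categories can be approximated with a few regexes. A sketch only — filecap's real scanned-name patterns are broader than this; the regex below covers just the examples listed in the table:

```javascript
// Approximate the four documented filename-heuristic flags.
// The scanned-name regex is illustrative, not filecap's actual pattern.
function filenameFlags(basename) {
  const flags = [];
  const scannedLike =
    /^(Scan_\d+|IMG_\d+|Document\d+|Untitled-\d+|DOC\d+|FAX-[\d-]+|\d+)\.\w+$/i;
  if (scannedLike.test(basename)) flags.push("scanned-name-pattern");
  if (/\s/.test(basename)) flags.push("filename-has-spaces");
  if (/[^\x20-\x7e]/.test(basename)) flags.push("filename-non-ascii");
  if (basename.length > 200) flags.push("filename-long");
  return flags.sort(); // flags are emitted as a sorted array
}
```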
## Rollup workflow (Phase 5)
After scanning N servers, merge the per-server NDJSONs into a consolidated inventory:
```bash
filecap rollup ./inventories/*.ndjson -o consolidated.ndjson
```

The consolidated NDJSON has the same line-delimited structure as a single-instance inventory but with three differences:
- **Header.** `kind: "filecap-consolidated-header"` and `metadata.sources` is an array with one entry per source inventory (each carrying the original server identity, scan options, and stats).
- **Entries.** Each entry gains `serverName: string` (which source it came from) and `duplicateOf: {serverName, path} | null`. Content-duplicates (identical SHA-256 across servers) get `duplicateOf` set to the canonical copy. The canonical entry has `duplicateOf: null`.
- **Footer.** `kind: "filecap-consolidated-footer"` with cross-instance stats: `totalUniqueHashes`, `totalDuplicateGroups`, `bytesSavedIfDeduped` (bytes that could be reclaimed by deleting non-canonical duplicates).
Why one row per physical copy? Each duplicate entry in the consolidated CSV represents real disk space someone has to decide to keep or delete. The duplicateOf link tells the consumer "this is the same content as <serverName>:<path>" so a vendor can group by hash for de-dup analysis OR filter to canonicals only for remediation work. Both views are one query away.
Canonical-pick rule. When two or more entries share a SHA-256, the canonical is the one with the oldest modifiedAt. Ties are broken alphabetically by serverName. The canonical entry has duplicateOf: null; all others have duplicateOf: {serverName, path} pointing at it.
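The rule is mechanical enough to sketch directly. A hypothetical `markDuplicates` helper, not filecap's internal code — it leans on the fact that ISO-8601 timestamps sort correctly as plain strings:

```javascript
// Apply the documented canonical-pick rule: within each SHA-256 group,
// canonical = oldest modifiedAt, ties broken alphabetically by serverName.
function markDuplicates(entries) {
  const bySha = new Map();
  for (const e of entries) {
    if (!bySha.has(e.sha256)) bySha.set(e.sha256, []);
    bySha.get(e.sha256).push(e);
  }
  for (const group of bySha.values()) {
    const [canonical] = [...group].sort(
      (a, b) =>
        a.modifiedAt.localeCompare(b.modifiedAt) || // oldest first
        a.serverName.localeCompare(b.serverName)    // alphabetical tie-break
    );
    for (const e of group) {
      e.duplicateOf =
        e === canonical
          ? null
          : { serverName: canonical.serverName, path: canonical.path };
    }
  }
  return entries;
}
```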
Example consolidated entry (canonical):
```json
{
  "path": "2024/case-001.pdf",
  "filename": "case-001.pdf",
  "extension": "pdf",
  "category": "pdf",
  "remediable": true,
  "sizeBytes": 4827193,
  "modifiedAt": "2024-03-12T09:14:22.000Z",
  "sha256": "e3b0c44...",
  "flags": [],
  "serverName": "strapi-prod-01",
  "duplicateOf": null
}
```

Example consolidated entry (duplicate):
```json
{
  "path": "archive/case-001-copy.pdf",
  "filename": "case-001-copy.pdf",
  "extension": "pdf",
  "category": "pdf",
  "remediable": true,
  "sizeBytes": 4827193,
  "modifiedAt": "2024-08-01T12:30:00.000Z",
  "sha256": "e3b0c44...",
  "flags": [],
  "serverName": "strapi-prod-02",
  "duplicateOf": { "serverName": "strapi-prod-01", "path": "2024/case-001.pdf" }
}
```

## Report workflow (Phase 6)
Generate the vendor handoff package from an inventory NDJSON (single-instance or consolidated):
```bash
filecap report consolidated.ndjson -o ./report-2026-Q2/
filecap report consolidated.ndjson -o ./report-2026-Q2/ --html
```

Output directory contents:
| File | Purpose |
|---|---|
| audit-file-list.csv | One row per file, 16 columns (14 file-descriptor + 2 staff-fill — the work-order vendors actually consume + the staff-edit columns added in v1.7.16). Human-readable column headers. Filterable in Excel, Smartsheet, etc. |
| audit-file-list.html | (Only when --html is passed.) Self-contained interactive dark-mode page — same data, sortable columns, full-text search, category filter chips, no external dependencies. audit-remote.sh always passes --html unless AUDIT_HTML=0 is set. |
| audit-summary.txt | Manager-friendly top-line numbers: file counts by category, total bytes, image-only PDF count, remediable count, heading coverage, alt-text coverage, and "What this means" observation bullets. |
| README.txt | Plain-text guide to all files in this folder. Start here if you're not sure which file to open. |
| largest_files.txt | Top 50 files by size (helps schedule the biggest remediation work) |
| flagged_filenames.txt | Files whose flags[] includes scanned-name-pattern or another filename anti-pattern flag |
| duplicate_hashes.txt | Content-duplicate groups (entries sharing a SHA-256) — useful for de-dup analysis |
| pdf_image_only.txt | PDFs with isImageOnly: true — the headline cost driver in PDF remediation |
The CSV is pure inventory — there are NO vendor-fill columns (the Delete? and Notes columns are for ICJIA staff, not the vendor). Vendors return remediated files; ICJIA re-scans and uses a future filecap diff command to detect changes.
CSV column order (16 columns; positions 1–14 stable since v1.4.1 with v1.7.2 moving Public URL to position 4; positions 15–16 added in v1.7.16):
Server, Website, Server IP, Public URL, Date published, Source folder on server, File location (relative to source folder), Full file path on server, File name, File extension, File type, Size (bytes), Content hash (SHA-256), Duplicate of, Delete?, Notes
The deliverable focuses on the fields a remediator needs to find and price each file (filename, path, server, type, size, duplicate marker, public URL). Format-specific introspection columns (PDF page count, image-only/OCR, DOCX heading coverage, XLSX sheet count, etc.) were dropped in v1.4.0 / v1.4.1 — remediators open the file in Adobe Acrobat / Word / Excel and read those properties directly from the file. The full introspection remains in the underlying NDJSON inventory for MCP queries and custom reports.
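For example, a custom report that pulls the image-only PDFs (the headline cost driver) back out of the NDJSON could look like this — a sketch with a hypothetical helper name, operating on the entry shape shown earlier:

```javascript
// Filter inventory entries down to scanned (image-only) PDFs.
// Entries whose introspection failed or was skipped are excluded,
// since they carry no introspection object at all.
function imageOnlyPdfs(entries) {
  return entries.filter(
    (e) =>
      e.category === "pdf" &&
      e.introspection &&
      e.introspection.isImageOnly === true
  );
}
```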
The last two columns are CSV-only and meant for staff input. Delete? defaults to No for every row; staff types Yes to flag a file for removal before the next audit. Notes is empty by default; free-text for whatever context the staff member wants to leave on a row (why it should stay, why it should go, who owns it, etc.). When the edited CSV comes back, the audit lead removes the Yes-marked files on each source server, reads through the notes, and re-runs the audit. These two columns don't appear in the HTML table view — the web view is informational, the CSV is the actionable artefact. CSV is plain text so the "dropdown" feel for Delete? needs Excel/Google-Sheets data-validation set by the staff member (right-click → Data validation → List of items → No, Yes); without it, staff types Yes or No directly into the cell.
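Until a process-deletions command exists (it's on the deferred roadmap), extracting the staff-marked rows is a manual step. A naive sketch — the helper is hypothetical, and it assumes no commas inside field values; use a real CSV parser for anything beyond illustration:

```javascript
// Pull the server paths of rows staff marked Delete? = Yes from the
// returned 16-column CSV. Naive split — no quoted-comma handling.
function rowsMarkedForDeletion(csvText) {
  const [headerLine, ...rows] = csvText.trim().split("\n");
  const headers = headerLine.split(",");
  const deleteCol = headers.indexOf("Delete?");
  const pathCol = headers.indexOf("Full file path on server");
  return rows
    .map((row) => row.split(","))
    .filter((cols) => cols[deleteCol].trim().toLowerCase() === "yes")
    .map((cols) => cols[pathCol]);
}
```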
Column headers are human-facing labels (not raw field names). Empty cells indicate the field doesn't apply to this file's type.
Inputs. filecap report accepts BOTH a single-instance NDJSON (from filecap scan) and a consolidated NDJSON (from filecap rollup). Both input shapes produce the same 16-column CSV.
## MCP server (Phase 7)
filecap mcp starts an stdio MCP server that exposes five tools AI agents can call during conversational audits:
| Tool | What it does |
|---|---|
| filecap_scan | Walk a directory, produce an NDJSON inventory at the specified path |
| filecap_rollup | Merge multiple per-server NDJSONs into a consolidated inventory |
| filecap_report | Generate vendor handoff package (CSV + summary + flagged lists) |
| filecap_query_inventory | Filter/sort entries in an existing NDJSON by size, extension, flags, isImageOnly, etc. |
| filecap_web_rollup | Bundle the most recent scans of every saved site into a static-site directory |
### Always-latest config (recommended)
Pin to @latest (or omit the version tag entirely) so the host re-checks the npm registry each time it spawns the MCP process. This guarantees you pick up new tool definitions and bug fixes without touching your config file:
"args": ["--yes", "@icjia/filecap@latest", "mcp"]All client snippets below use this form.
### Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json on macOS, or %APPDATA%\Claude\claude_desktop_config.json on Windows. Restart Claude Desktop after saving.
```json
{
  "mcpServers": {
    "filecap": {
      "command": "npx",
      "args": ["--yes", "@icjia/filecap@latest", "mcp"]
    }
  }
}
```

### Claude Code
.claude/mcp.json in your project root for project-scoped access, or ~/.claude/mcp.json for user-global access:
```json
{
  "mcpServers": {
    "filecap": {
      "command": "npx",
      "args": ["--yes", "@icjia/filecap@latest", "mcp"]
    }
  }
}
```

### Cursor
~/.cursor/mcp.json (also configurable in-app at Settings → Features → MCP):
```json
{
  "mcpServers": {
    "filecap": {
      "command": "npx",
      "args": ["--yes", "@icjia/filecap@latest", "mcp"]
    }
  }
}
```

### Windsurf (Codeium)
~/.codeium/windsurf/mcp_config.json:
```json
{
  "mcpServers": {
    "filecap": {
      "command": "npx",
      "args": ["--yes", "@icjia/filecap@latest", "mcp"]
    }
  }
}
```

### Continue
~/.continue/config.json for user-global access, or .continue/config.json in your project root for project-scoped access. Continue uses a different
