@cesarandreslopez/occ

v0.8.11

Office Cloc and Count — document metrics, structure extraction, content inspection, and code exploration for real repositories

Experimental: All features in OCC are currently experimental. This project cannot be considered stable software yet. APIs, output formats, and command interfaces may change between minor versions.

What is this?

OCC started as a way to make office documents visible in the same workflows that already work well for code metrics tools like scc and cloc. It has since grown into a multi-purpose CLI that can:

scan office documents for word/page/sheet/slide metrics
extract document heading structure for navigation and RAG-style use cases
inspect documents (occ doc inspect), spreadsheets (occ sheet inspect), and presentations (occ slide inspect) for metadata, risk flags, and content previews
extract structured table content from documents (occ table inspect)
analyze workspaces for combined code, document, and structure metrics (occ workspace analyze) and cross-document references (occ workspace documents)
quickly describe whether a directory is a code, office, documentation, data, or mixed project (occ describe, occ workspace describe)
summarize code metrics through scc
explore JavaScript, TypeScript, Vue Single-File Components, and Python repositories with symbol search, call analysis, dependency inspection, and inheritance queries (occ code)

Features

Office document metrics — words, pages, paragraphs, slides, sheets, rows, cells
Seven formats supported — DOCX, XLSX, PPTX, PDF, ODT, ODS, ODP
Document structure extraction — --structure parses heading hierarchy into a navigable tree with dotted section codes (1, 1.1, 1.2, ...)
Document inspection via occ doc inspect — metadata, risk flags, content stats, heading structure, and content preview for DOCX and ODT
Spreadsheet inspection via occ sheet inspect — workbook properties, hidden sheets, names, formulas, links, comments, schema preview, and token estimates for XLSX
Presentation inspection via occ slide inspect — metadata, risk flags, per-slide inventory, and content preview for PPTX and ODP
Table extraction via occ table inspect — structured table content from DOCX, XLSX, PPTX, ODT, and ODP with auto-detected headers, sample row limits, and merged cell support
Code metrics via scc — auto-detects code files and integrates scc output
Code exploration via occ code — JS/TS, Vue SFC, and Python-first symbol lookup, content search, callers/callees, dependency categories, inheritance, module coupling, and ambiguity-aware chains
Repository map / pack via occ code map / occ code pack — token-budgeted, PageRank-ranked repo map (top symbol signatures per file) or content pack, with a pluggable heuristic or BPE (o200k_base / cl100k_base) tokenizer for exact budgets, and opt-in query/path focus (--query, --focus-path, --focus-depth) to bias the map toward a specific task
Workspace analysis via occ workspace — fast directory description plus combined code, document, and structure analysis with versioned JSON contracts, per-document summaries, and cross-reference detection
Multiple output modes — grouped by type, per-file breakdown, or JSON
CI-friendly — ASCII-only, no-color mode for pipelines
Flexible filtering — include/exclude extensions, exclude directories, .gitignore-aware
Progress bar — with ETA for large scans
Zero config — auto-downloads scc binary on install, works out of the box

Quick Start

Global install:

npm i -g @cesarandreslopez/occ
occ

No-install usage:

npx @cesarandreslopez/occ docs/ reports/

From source:

git clone https://github.com/cesarandreslopez/occ.git && cd occ
npm install
npm run build
npm test
npm start

Usage

# Scan current directory
occ

# Scan specific directories
occ docs/ reports/

# Per-file breakdown
occ --by-file docs/

# JSON output
occ --format json docs/

# Extract document structure (heading hierarchy)
occ --structure docs/

# Structure as JSON
occ --structure --format json docs/

# Inspect a document for metadata, risk flags, and content preview
occ doc inspect report.docx
occ doc inspect report.docx --format json

# Inspect an XLSX workbook before reading its contents deeply
occ sheet inspect finance.xlsx
occ sheet inspect finance.xlsx --format json --sample-rows 3 --max-columns 12

# Inspect a presentation for slide inventory and content preview
occ slide inspect deck.pptx
occ slide inspect deck.pptx --format json --slide 3

# Extract structured table data from documents
occ table inspect report.docx --format json
occ table inspect finance.xlsx --table 1 --sample-rows 10

# Explore JS/TS and Python code
occ code find name UserService --path .
occ code analyze callers createUser --path .
occ code analyze deps src/deps --path .
occ code analyze chain ambiguousCaller duplicate --path .

# Module coupling metrics
occ code analyze coupling src/code --path .

# Dump full codebase index as JSON
occ code index --path . --format json

# Token-budgeted, PageRank-ranked repo map (signatures) and pack (content)
occ code map --path . --map-tokens 4096
occ code pack --path . --map-tokens 8192 --tokenizer o200k_base

# Workspace-level analysis (code + documents + structures)
occ workspace analyze --format json

# Quick directory classification
occ describe .
occ workspace describe --format json

# Document summaries with cross-references
occ workspace documents --format json

# Only specific formats
occ --include-ext pdf,docx docs/

# Skip code analysis
occ --no-code docs/

# CI-friendly (ASCII, no color)
occ --ci docs/

Example Output

-- Documents ---------------------------------------------------------------
  Format    Files    Words    Pages                  Details      Size
----------------------------------------------------------------------------
  Word         12   34,210      137              1,203 paras    1.2 MB
  PDF           8   22,540       64                             4.5 MB
  Excel         3                                12 sheets      890 KB
----------------------------------------------------------------------------
  Total        23   56,750      201              1,203 paras    6.5 MB

-- Code (via scc) ----------------------------------------------------------
  Language    Files    Lines   Blanks  Comments     Code
----------------------------------------------------------------------------
  JavaScript     15     2340      180       320     1840
  Python          8     1200       90       150      960
----------------------------------------------------------------------------
  Total          23     3540      270       470     2800

Scanned 23 documents (56,750 words, 201 pages) in 120ms

Structure Output (`--structure`)

-- Structure: report.docx --------------------------------------------------
1   Executive Summary
  1.1   Background ......................................... p.1
  1.2   Key Findings ....................................... p.1-2
2   Methodology
  2.1   Data Collection .................................... p.3
  2.2   Analysis Framework ................................. p.4
    2.2.1   Quantitative Methods ........................... p.4
    2.2.2   Qualitative Methods ............................ p.5
3   Results ................................................ p.6-8
4   Conclusions ............................................ p.9

4 sections, 10 nodes, max depth 3
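
The dotted section codes shown above can be derived from heading levels alone. The following is a minimal sketch of that numbering, not OCC's implementation: the `Heading` shape and the `assignSectionCodes` name are hypothetical, and treating a skipped level as section 1 is an assumption.

```typescript
// Hypothetical input shape: headings in document order.
interface Heading {
  level: number; // 1 = top-level heading, 2 = subsection, ...
  title: string;
}

// Assign dotted codes (1, 1.1, 1.2, 2, ...) by keeping one counter per depth.
function assignSectionCodes(headings: Heading[]): string[] {
  const counters: number[] = [];
  return headings.map(({ level }) => {
    const depth = Math.max(1, Math.floor(level));
    // Assumption: a skipped intermediate level is numbered as section 1.
    while (counters.length < depth - 1) counters.push(1);
    if (counters.length < depth) counters.push(0);
    // Returning to a shallower level discards the deeper counters.
    counters.length = depth;
    counters[depth - 1] += 1;
    return counters.join(".");
  });
}

// The outline of report.docx from the sample output, reduced to heading levels.
const outlineLevels = [1, 2, 2, 1, 2, 2, 3, 3, 1, 1];
const outlineCodes = assignSectionCodes(
  outlineLevels.map((level) => ({ level, title: "" })),
);
console.log(outlineCodes.join(" ")); // 1 1.1 1.2 2 2.1 2.2 2.2.1 2.2.2 3 4
```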

Supported Formats

| Format | Extension | Metrics | Structure |
|--------|-----------|---------|-----------|
| Word | .docx | words, pages*, paragraphs | Yes |
| PDF | .pdf | words, pages | Yes (with page mapping) |
| Excel | .xlsx | sheets, rows, cells | — |
| PowerPoint | .pptx | words, slides | Yes (slide headers) |
| ODT | .odt | words, pages*, paragraphs | Yes (best-effort) |
| ODS | .ods | sheets, rows, cells | — |
| ODP | .odp | words, slides | Yes (slide headers) |

* Pages for Word/ODT are estimated at 250 words/page.
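
The estimate is simple arithmetic. As a minimal sketch (the `estimatePages` name is hypothetical, and rounding up is an assumption, although it agrees with the sample output where 34,210 words are reported as 137 pages):

```typescript
// 250 words per page comes from the footnote above.
const WORDS_PER_PAGE = 250;

// Hypothetical helper: estimate a page count from a word count.
// Rounding up is an assumption, not confirmed OCC behavior.
function estimatePages(words: number): number {
  if (words <= 0) return 0;
  return Math.ceil(words / WORDS_PER_PAGE);
}

console.log(estimatePages(34_210)); // 137
```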

CLI Flags

| Flag | Description | Default |
|------|-------------|---------|
| --by-file / -f | Row per file | grouped by type |
| --format <type> | tabular or json | tabular |
| --structure | Extract and display document heading hierarchy | off |
| --include-ext <exts> | Comma-separated extensions | all supported |
| --exclude-ext <exts> | Comma-separated to skip | none |
| --exclude-dir <dirs> | Directories to skip | node_modules,.git |
| --ignore-pattern <pattern> | Gitignore-style pattern to ignore (repeatable) | none |
| --no-gitignore | Disable .gitignore respect | enabled |
| --sort <col> | Sort by: files, name, words, size | files |
| --output <file> / -o | Write to file | stdout |
| --ci | ASCII-only, no color | off |
| --large-file-limit <mb> | Skip files over this size | 50 |
| --no-code | Skip scc code analysis | off |
| --show-confidence | Show confidence levels for each metric | off |

Code Exploration

occ code adds on-demand code exploration without changing the existing document-scan workflow. It builds an in-memory repository graph for each command and does not require a database, daemon, or background indexer.

The first-class support path is JavaScript, TypeScript, Vue Single-File Components, and Python. Other languages may be discovered and partially parsed, but the current resolver, fixtures, and output contracts are intentionally optimized around JS/TS, Vue SFC, and Python behavior.

# Exact symbol lookup
occ code find name Greeter --path test/fixtures/code-explore

# Substring search
occ code find pattern service --path .

# Full-text content search
occ code find content normalize_name --path .

# Outgoing and incoming call analysis
occ code analyze calls bootstrap --path test/fixtures/code-explore
occ code analyze callers createUser --path test/fixtures/code-explore

# Dependency and inheritance inspection
occ code analyze deps src/service --path test/fixtures/code-explore
occ code analyze tree UserService --path test/fixtures/code-explore

# Module coupling analysis
occ code analyze coupling src/code --path test/fixtures/code-explore

# Ambiguity-aware chain analysis
occ code analyze chain ambiguousCaller duplicate --path test/fixtures/code-explore

# Token-budgeted, importance-ranked repo map and pack
occ code map --path . --map-tokens 4096
occ code pack --path . --map-tokens 8192 --tokenizer o200k_base

# Focus the map on a task — by query and/or path (with graph neighbors)
occ code map --path . --query "workspace context injection" --focus-path src/services/memory --map-tokens 4096

Highlights of the current code exploration behavior:

Full index export via occ code index — dump the complete graph (files, symbols, edges, language capabilities) as JSON or a summary line
Repository map and pack via occ code map / occ code pack — PageRank-ranked, token-budgeted output (symbol signatures or file content) that greedily admits the highest-ranked files within the budget, shrinking content or shedding low-rank symbols rather than dropping a file that nearly fits, with a pluggable heuristic or BPE tokenizer (--tokenizer o200k_base|cl100k_base)
Focused maps via --query, --focus-path (repeatable), and --focus-depth — opt-in, additive relevance scoring (query + path + graph neighbors) blended with the global rank so task-relevant files surface within the budget; omitting the flags keeps the global PageRank overview unchanged
Exact, pattern, type, and content search over the repository graph
Call analysis with explicit resolved, ambiguous, and unresolved states
Receiver-aware method resolution for this, super, self, and cls
Dependency analysis grouped into local, external, and unresolved imports
Module coupling analysis with afferent/efferent coupling, instability, and key classes
Chain analysis that reports when a path is blocked by ambiguity instead of silently returning nothing
Shared CLI ergonomics with --path, --format, --output, --exclude-dir, and .gitignore support
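
To make the coupling terms concrete, here is a minimal sketch of afferent coupling (Ca), efferent coupling (Ce), and instability using the standard definition I = Ce / (Ca + Ce). The `ImportEdge` shape and `computeCoupling` name are hypothetical, and whether OCC computes the metric exactly this way is an assumption.

```typescript
// Hypothetical edge shape: module "from" imports module "to".
type ImportEdge = { from: string; to: string };

interface Coupling {
  afferent: number; // Ca: distinct modules that depend on this one
  efferent: number; // Ce: distinct modules this one depends on
  instability: number; // Ce / (Ca + Ce); 0 when the module has no couplings
}

function computeCoupling(edges: ImportEdge[]): Map<string, Coupling> {
  const incoming = new Map<string, Set<string>>();
  const outgoing = new Map<string, Set<string>>();
  const touch = (map: Map<string, Set<string>>, key: string): Set<string> => {
    let set = map.get(key);
    if (!set) {
      set = new Set<string>();
      map.set(key, set);
    }
    return set;
  };

  for (const { from, to } of edges) {
    if (from === to) continue; // ignore self-imports
    touch(outgoing, from).add(to);
    touch(incoming, to).add(from);
    touch(outgoing, to); // register the target even if it imports nothing
  }

  const result = new Map<string, Coupling>();
  for (const name of outgoing.keys()) {
    const afferent = incoming.get(name)?.size ?? 0;
    const efferent = outgoing.get(name)?.size ?? 0;
    const total = afferent + efferent;
    result.set(name, {
      afferent,
      efferent,
      instability: total === 0 ? 0 : efferent / total,
    });
  }
  return result;
}
```

A module that only imports others scores 1 (unstable, easy to change); a module that is only imported scores 0 (stable, costly to change).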

All occ code commands support --format tabular|json. Most symbol-targeted commands also support --file for disambiguation, and JSON output includes repository metadata, query metadata, results, repository stats, and per-language capability flags.
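
The budgeted map behavior described in the highlights (admit the highest-ranked files first, and shed low-priority symbols instead of dropping a file that nearly fits) can be sketched as follows. This illustrates the idea only and is not OCC's implementation: the `RankedFile` shape, the `buildBudgetedMap` name, and the 4-characters-per-token heuristic are all assumptions.

```typescript
// Hypothetical input: one entry per file with a precomputed importance score.
interface RankedFile {
  path: string;
  rank: number; // higher is more important
  signatures: string[]; // symbol signatures, most important first
}

// Assumed heuristic tokenizer: roughly 4 characters per token.
const countTokens = (text: string): number => Math.ceil(text.length / 4);

function buildBudgetedMap(files: RankedFile[], budget: number): Map<string, string[]> {
  const admitted = new Map<string, string[]>();
  let remaining = budget;
  const byRank = [...files].sort((a, b) => b.rank - a.rank);

  for (const file of byRank) {
    let cost = countTokens(file.path);
    if (cost > remaining) continue; // not even the path fits
    const kept: string[] = [];
    for (const signature of file.signatures) {
      const signatureCost = countTokens(signature);
      // Shed the remaining lower-priority signatures rather than the whole file.
      if (cost + signatureCost > remaining) break;
      kept.push(signature);
      cost += signatureCost;
    }
    if (kept.length === 0) continue; // nothing useful fits for this file
    admitted.set(file.path, kept);
    remaining -= cost;
  }
  return admitted;
}
```

Swapping `countTokens` for a BPE tokenizer would give exact budgets at the cost of speed, which is the trade-off the `--tokenizer` option exposes.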

Programmatic Usage

The code exploration module is available as a library via subpath exports:

import { buildCodebaseIndex } from '@cesarandreslopez/occ/code/build';
import { discoverCodeFiles } from '@cesarandreslopez/occ/code/discover';
import { findByName, analyzeCalls } from '@cesarandreslopez/occ/code/query';
import type { CodebaseIndex, CodeNode } from '@cesarandreslopez/occ/code/types';

const index = await buildCodebaseIndex({ repoRoot: './my-repo' });
const results = findByName(index, 'UserService');

For a stateful session that caches the index across queries:

import { createCodeQuerySession } from '@cesarandreslopez/occ/code/session';

const session = await createCodeQuerySession({ repoRoot: './my-repo' });
session.findByName('UserService');
session.analyzeCalls('bootstrap');
session.chunk({ maxChunkWords: 200 });
await session.refresh(); // rebuild index when files change

For persistent caching across sessions with automatic freshness checks:

import { openCodeIndexStore } from '@cesarandreslopez/occ/code/store';

const store = openCodeIndexStore({
  repoRoot: './my-repo',
  cacheDir: '.occ-cache',
});

// First call builds + caches; subsequent calls load from cache
const session = await store.getSession({ strategy: 'prefer-cache' });
session.findByName('UserService');

// Check freshness via file manifests before returning cache
await store.getSession({ strategy: 'ensure-fresh' });

// Force a full rebuild
await store.refresh();

Or use the unified facade for all OCC APIs from a single import:

import { createOcc } from '@cesarandreslopez/occ';

const occ = createOcc();
const session = await occ.code.createSession({ repoRoot: './my-repo' });
const description = await occ.workspace.describe('./my-project');
const analysis = await occ.workspace.analyze('./my-project', { includeCode: true });
const doc = await occ.doc.inspect('report.docx', {});

For large or untrusted repositories, preview the size first, then build the index in an isolated subprocess with a budget. Pair with the slim variant when the consumer only needs the graph (no source content):

import { createOcc, CodeIndexBudgetExceededError } from '@cesarandreslopez/occ';

const occ = createOcc();

const preview = await occ.code.previewSize({ repoRoot: './my-repo', maxFiles: 5_000, maxBytes: 32 * 1024 * 1024 });
if (preview.exceedsBudget) {
  console.warn('Repo exceeds preview budget:', preview.exceedsBudget);
}

try {
  const slim = await occ.code.buildIndexIsolated({
    repoRoot: './my-repo',
    contentMode: 'excerpt',
    maxFiles: 5_000,
    maxBytes: 32 * 1024 * 1024,
    onBudgetExceeded: 'truncate',
    slim: true,
  });
  if (slim.truncated) console.warn('Index truncated:', slim.truncated);
} catch (error) {
  if (error instanceof CodeIndexBudgetExceededError) {
    console.error('Build budget exceeded:', error.budget);
  } else {
    throw error;
  }
}

For workspace-level analysis:

import { analyzeWorkspace } from '@cesarandreslopez/occ/workspace/analyze';
import { describeWorkspace } from '@cesarandreslopez/occ/workspace/describe';
import { inspectWorkspaceDocumentSet } from '@cesarandreslopez/occ/workspace/documents';

const description = await describeWorkspace('./my-project');
const analysis = await analyzeWorkspace('./my-project', { includeCode: true });
const docs = await inspectWorkspaceDocumentSet('./my-project', { maxFiles: 20 });

For a single end-to-end workspace bundle (description + analysis + documents + code preview + slim code index + outline + cross-references in one call):

import { createOcc } from '@cesarandreslopez/occ';

const occ = createOcc();
const bundle = await occ.workspace.bundle('./my-project', {
  includeDocumentChunks: true,
  maxCodeFiles: 5_000,
  maxCodeBytes: 32 * 1024 * 1024,
});

console.log(bundle.outline);                  // root → projects → modules → documents → sections
console.log(bundle.codeDocumentReferences);   // per-symbol matches inside markdown content
console.log(bundle.documentChunks?.length);   // heading-aware token-budgeted document chunks

For heading-aware token-budgeted document chunking (RAG-friendly):

import { chunkDocument } from '@cesarandreslopez/occ/doc/chunk';

const chunks = await chunkDocument('./report.docx', { maxTokens: 800, overlapTokens: 80 });
for (const chunk of chunks) {
  console.log(chunk.chunkId, chunk.headingPath.join(' › '), chunk.tokenEstimate);
}

For Mermaid diagrams and module summaries from a code index:

import { createOcc } from '@cesarandreslopez/occ';

const occ = createOcc();
const session = await occ.code.createSession({ repoRoot: './my-repo' });
const summary = session.summarizeModule('src/code', { maxFunctions: 15 });
const importGraph = session.toMermaid('import-graph', 'src/code');
const classDiagram = session.toMermaid('class-hierarchy', 'src/code');
const callGraph = session.toMermaid('call-graph', 'bootstrap');

Available subpath exports:

| Import path | Description |
|-------------|-------------|
| @cesarandreslopez/occ/code/build | buildCodebaseIndex — graph construction (with optional maxFiles/maxBytes/onBudgetExceeded budget controls and CodeIndexBudgetExceededError); also parseCodeFiles (bounded-concurrency parse pool) + assembleCodebaseIndex for staged indexing |
| @cesarandreslopez/occ/code/types | TypeScript types (CodebaseIndex, CodeNode, CodeEdge, IndexTruncation, etc.) |
| @cesarandreslopez/occ/code/query | Query functions (findByName, analyzeCalls, analyzeDeps, etc.) |
| @cesarandreslopez/occ/code/discover | discoverCodeFiles — file discovery |
| @cesarandreslopez/occ/code/preview | previewCodebaseSize — cheap size + per-language estimate without parsing, with budget pressure reporting |
| @cesarandreslopez/occ/code/isolated | buildCodebaseIndexIsolated — runs buildCodebaseIndex in a forked subprocess, optionally returning a slim index |
| @cesarandreslopez/occ/code/incremental | computeManifestDiff, classifyChangedFiles, findResolutionImpactedFiles, spliceIndexInputs — primitives behind incremental update()/ensureFresh() (manifest diff + resolution-impact + graph splice) |
| @cesarandreslopez/occ/code/slim | slimifyIndex, parseSlimIndex, validateSlimIndex — content-free CodebaseIndexSlim for graph-only consumers |
| @cesarandreslopez/occ/code/chunk | chunkCodebase, chunkFromIndex — semantic code chunking (word- or token-based via maxTokens/countTokens) |
| @cesarandreslopez/occ/code/session | createCodeQuerySession — stateful code query session |
| @cesarandreslopez/occ/code/store | openCodeIndexStore, openChunkCodeIndexStore — persistent index store with cache strategies (the chunk variant pins contentMode: 'full'; update(changedFiles?) does a true incremental refresh — manifest diff + targeted re-parse + graph splice) |
| @cesarandreslopez/occ/code/cache | Index caching utilities (legacy — prefer ./code/store) |
| @cesarandreslopez/occ/doc/inspect | inspectDocument — document metadata and content extraction |
| @cesarandreslopez/occ/doc/types | Document inspection types |
| @cesarandreslopez/occ/doc/discover | Document file discovery (now includes prose formats by default; opt into data files via includeDataFiles; new discoverDocumentSet returns { documents, skipped }) |
| @cesarandreslopez/occ/doc/batch | Batch document inspection |
| @cesarandreslopez/occ/doc/chunk | chunkDocument / chunkDocumentFromMarkdown — heading-aware, token-budgeted document chunker for DOCX/PDF/PPTX/XLSX/ODT/ODS/ODP/MD/MDX/TXT/RST/AsciiDoc (the FromMarkdown variant chunks pre-converted markdown without re-parsing) |
| @cesarandreslopez/occ/doc/entities | Entity and keyword extraction |
| @cesarandreslopez/occ/doc/references | Cross-reference detection |
| @cesarandreslopez/occ/workspace/analyze | analyzeWorkspace — workspace-level analysis |
| @cesarandreslopez/occ/workspace/describe | describeWorkspace — fast directory classification |
| @cesarandreslopez/occ/workspace/documents | inspectWorkspaceDocumentSet — document summaries and cross-references |
| @cesarandreslopez/occ/workspace/types | Workspace analysis types |
| @cesarandreslopez/occ/workspace/prepare | prepareWorkspaceContext — combined code indexing + document inspection |
| @cesarandreslopez/occ/workspace/prepare-types | Workspace preparation types (WorkspacePrepareOptions, WorkspacePreparedContext, etc.) |
| @cesarandreslopez/occ/workspace/bundle | bundleWorkspace — single-call versioned WorkspaceBundle (description + analysis + documents + slim code index + outline + cross-refs + optional document chunks) |
| @cesarandreslopez/occ/markdown/convert | documentToMarkdown — document-to-markdown conversion (now reads MD/MDX/TXT/RST/AsciiDoc/YAML/JSON/TOML directly) |
| @cesarandreslopez/occ/structure/extract | extractFromMarkdown — heading tree extraction |
| @cesarandreslopez/occ/structure/types | Structure types and helpers |
| @cesarandreslopez/occ/sheet/inspect | inspectWorkbook — XLSX workbook inspection |
| @cesarandreslopez/occ/sheet/types | Sheet inspection types |
| @cesarandreslopez/occ/slide/inspect | inspectPresentation — presentation inspection |
| @cesarandreslopez/occ/table/inspect | Table extraction from documents |
| @cesarandreslopez/occ/table/types | Table extraction types |
| @cesarandreslopez/occ/health | health() — liveness probe with version + capability flags |
| @cesarandreslopez/occ/errors | OccAbortError, OCC_ABORTED, isOccAbortError — typed abort error that survives instanceof across forked subprocesses |
| @cesarandreslopez/occ/types | Shared types (ConfidenceLevel, ParseResult, ParserOutput, etc.) |
| @cesarandreslopez/occ/tokens | Token estimation utilities |
| @cesarandreslopez/occ/progress-event | Progress event types (ProgressPhase, ProgressEvent with optional scope/currentPath/bytesProcessed/totalBytes/startedAt/elapsedMs) |
| @cesarandreslopez/occ/stats | Stats types (StatsRow, AggregateResult) and aggregate() |

TypeScript ships with OCC as a direct dependency, so the code exploration module works after a normal install. You only need a separate TypeScript setup if your own project uses tsc.

Document Inspection

occ doc inspect extracts metadata, risk flags, content stats, heading structure, and a content preview from DOCX and ODT documents.

# Document overview with content preview
occ doc inspect report.docx

# Machine-readable payload
occ doc inspect report.docx --format json

# More paragraphs in the preview
occ doc inspect report.docx --sample-paragraphs 10

Current document inspection surfaces:

Document properties — title, author, dates, keywords
Risk flags — comments, tracked changes, hyperlinks, embedded objects, macros, tables, encryption
Content stats — words, pages, paragraphs, characters, tables, images
Heading structure — tree with section codes and depth
Content preview — first N paragraphs with heading detection
Token estimates — preview and full-document token estimates

Spreadsheet Inspection

occ sheet inspect is a lightweight XLSX preflight command aimed at both humans and agents. It helps answer "is this workbook worth reading in depth?" before spending tokens serializing cells or opening the file in Excel.

# Workbook-level summary + per-sheet schema/sample preview
occ sheet inspect finance.xlsx

# Machine-readable inspection payload
occ sheet inspect finance.xlsx --format json

# Narrow to one sheet and reduce preview width
occ sheet inspect finance.xlsx --sheet Revenue --sample-rows 3 --max-columns 8

Current XLSX inspection highlights:

Workbook metadata — file size, workbook properties, custom properties, workbook-scoped names
Sheet inventory — visible / hidden / very hidden sheets, used ranges, cell counts, formula/comment/link counts
Schema preview — detected header row, inferred column types, coverage ratios, example values
Lightweight sampling — small row previews designed for preflight rather than full extraction
Token estimates — sample and full-sheet token estimates to guide downstream agent reads
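
As an illustration of what a schema preview involves, the sketch below infers a column type and a coverage ratio from sampled cell values. It is a simplified stand-in rather than OCC's logic: the type names, the `summarizeColumn` function, and the definition of coverage as the share of non-empty cells are assumptions.

```typescript
type Cell = string | number | boolean | Date | null | undefined;
type ColumnType = "empty" | "number" | "boolean" | "date" | "string" | "mixed";

interface ColumnSummary {
  type: ColumnType;
  coverage: number; // assumed definition: share of non-empty cells, 0 to 1
}

function cellType(cell: Cell): ColumnType {
  if (cell === null || cell === undefined || cell === "") return "empty";
  if (cell instanceof Date) return "date";
  if (typeof cell === "number") return "number";
  if (typeof cell === "boolean") return "boolean";
  return "string";
}

// A column gets a single type only when every non-empty cell agrees.
function summarizeColumn(values: Cell[]): ColumnSummary {
  const kinds = values.map(cellType).filter((kind) => kind !== "empty");
  if (kinds.length === 0) return { type: "empty", coverage: 0 };
  const distinct = new Set(kinds);
  const type: ColumnType = distinct.size === 1 ? kinds[0] : "mixed";
  return { type, coverage: kinds.length / values.length };
}

console.log(summarizeColumn([1200, 950, null, 1310])); // { type: 'number', coverage: 0.75 }
```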

Presentation Inspection

occ slide inspect provides presentation metadata, risk flags, per-slide inventory, and content previews for PPTX and ODP files.

# Presentation overview with slide preview
occ slide inspect deck.pptx

# Machine-readable payload
occ slide inspect deck.pptx --format json

# Inspect a specific slide
occ slide inspect deck.pptx --slide 3

Current presentation inspection surfaces:

Presentation properties — title, author, dates
Risk flags — comments, speaker notes, hyperlinks, embedded media, animations, macros, charts, tables
Slide inventory — per-slide title, word count, notes, images, tables, charts
Content preview — text preview for sample slides
Token estimates — preview and full-presentation token estimates

Table Extraction

occ table inspect extracts structured table content from DOCX, XLSX, PPTX, ODT, and ODP documents. For AI agents, this is the primary way to read financial summaries, comparison matrices, and data tables without parsing raw XML.

# Extract all tables as JSON
occ table inspect report.docx --format json

# Tabular preview of table content
occ table inspect finance.xlsx

# Extract a specific table
occ table inspect finance.xlsx --table 1

# Limit sample rows
occ table inspect report.docx --sample-rows 5

Current table extraction highlights:

Multi-format support — DOCX (via mammoth HTML), XLSX (via SheetJS), PPTX (from slide XML), ODT and ODP (from content.xml)
Auto-detected headers — first row is treated as headers when values are unique strings
Merged cell support — colspan and rowspan are preserved in the output
Sample row limits — configurable maximum rows per table (default: 20)
Table filtering — extract a specific table by index with --table N
Token estimates — per-table and total token estimates
PDF graceful degradation — returns empty tables with an informative note instead of unreliable heuristic output
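
The header rule above ("unique strings") is easy to express in code. The following is a minimal sketch of that check; the `looksLikeHeaderRow` name is hypothetical, and trimming whitespace and rejecting blank cells are assumptions that go beyond what the rule states.

```typescript
type CellValue = string | number | boolean | null;

// Treat the first row as headers only when every cell is a distinct string.
function looksLikeHeaderRow(row: CellValue[]): boolean {
  if (row.length === 0) return false;
  const seen = new Set<string>();
  for (const cell of row) {
    if (typeof cell !== "string") return false; // numbers, booleans, nulls are data
    const label = cell.trim();
    // Assumption: blank or repeated labels disqualify the row.
    if (label === "" || seen.has(label)) return false;
    seen.add(label);
  }
  return true;
}

console.log(looksLikeHeaderRow(["Region", "Q1", "Q2"])); // true
console.log(looksLikeHeaderRow(["North", 120, 135])); // false
```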

Workspace Analysis

occ workspace provides fast directory description plus combined analysis of code, documents, and structures — useful for AI agents that need a complete workspace overview.

# Quick workspace description from fast signals
occ workspace describe
occ workspace describe --format json

# Convenience alias for one or more directories
occ describe ./repo-a ./repo-b --quiet

# Full workspace analysis (code + documents + structures)
occ workspace analyze --format json

# Skip code analysis
occ workspace analyze --no-code --format json

# Document summaries with cross-reference detection
occ workspace documents --format json

# Limit documents and include markdown content
occ workspace documents --max-files 20 --include-markdown --format json

occ workspace describe returns a schemaVersion: 1 JSON envelope containing a classification summary, file composition, manifest/framework signals, nested project summaries, and recommended next OCC commands. It uses fast file and manifest signals only: it does not parse document contents or run scc. occ workspace analyze returns code metrics (via scc), document aggregates, heading structures, skipped files, and errors. occ workspace documents returns per-document summaries with cross-references (filename mentions, hyperlinks, citations) and unresolved mentions detected across the document set.
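
Because the envelope is versioned, a consumer can check `schemaVersion` before reading anything else. The sketch below shows one defensive way to do that. Only the `schemaVersion: 1` field comes from the description above; the `VersionedEnvelope` type and `parseDescribeOutput` function are hypothetical, and the remaining fields are deliberately left untyped.

```typescript
// Only schemaVersion is taken from the docs; every other field is left unknown.
interface VersionedEnvelope {
  schemaVersion: number;
  [key: string]: unknown;
}

const SUPPORTED_SCHEMA_VERSION = 1;

// Parse JSON captured from `occ workspace describe --format json` and
// refuse to continue if the contract version is not the one we understand.
function parseDescribeOutput(json: string): VersionedEnvelope {
  const parsed: unknown = JSON.parse(json);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Expected a JSON object");
  }
  const version = (parsed as Record<string, unknown>).schemaVersion;
  if (version !== SUPPORTED_SCHEMA_VERSION) {
    throw new Error(`Unsupported schemaVersion: ${String(version)}`);
  }
  return parsed as VersionedEnvelope;
}
```

Failing fast on an unknown version matters here because the project is experimental and output formats may change between minor versions.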

For combined code indexing and document inspection with progress tracking:

import { prepareWorkspaceContext } from '@cesarandreslopez/occ/workspace/prepare';

const context = await prepareWorkspaceContext('./my-project', {
  includeCode: true,
  includeDocuments: true,
  executionMode: 'auto', // 'auto' | 'inline' | 'subprocess'
}, (event) => {
  console.log(`[${event.scope}] ${event.stage}: ${event.completed}/${event.total}`);
});

// context.code?.index  — full CodebaseIndex
// context.documents    — WorkspaceDocumentSet with cross-references
// context.elapsedMs    — total wall time
// context.errors       — collected errors from both phases

Documentation

Full documentation is available at cesarandreslopez.github.io/occ.

Why OCC?

Tools like scc, cloc, and tokei give you instant visibility into codebases — lines, languages, complexity. But most projects also contain Word documents, PDFs, spreadsheets, and presentations that are invisible to these tools. OCC fills that gap.

For Humans

Project audits — instantly see how much documentation lives alongside your code: total word counts, page counts, spreadsheet sizes, and presentation lengths
Tracking documentation growth — run OCC in CI to monitor how documentation scales over time, catch bloat early, or enforce minimums
Onboarding — new team members get a quick sense of a project's documentation footprint before diving in
Migration planning — when moving to a new platform, know exactly what you're dealing with across hundreds of files and formats

For AI Agents

Context budgeting — LLMs have finite context windows. OCC's word and page counts let agents estimate how much of a document set they can ingest before hitting token limits
Prioritization — an agent deciding which documents to read can use OCC's JSON output to rank files by size, word count, or type, focusing on the most relevant content first
RAG chunk mapping — --structure --format json outputs heading trees with character offsets, enabling chunk-to-section mapping, scoped retrieval, and citation paths in RAG pipelines
Document triage — occ doc inspect --format json surfaces risk flags, content stats, structure, and token estimates before an agent reads the full document
Spreadsheet triage — occ sheet inspect --format json exposes sheet visibility, formulas, links, comments, schema hints, and token estimates before an agent expands workbook contents
Presentation triage — occ slide inspect --format json provides slide inventory, risk flags, and content previews for quick assessment
Table extraction — occ table inspect --format json extracts structured table data (headers, rows, cells) from documents, giving agents direct access to tabular content without parsing raw XML
Repository mapping — agents exploring an unfamiliar codebase can combine occ --format json for document inventory with occ code ... --format json for symbol and relationship data
Pipeline integration — JSON output pipes directly into agent toolchains for automated document analysis, summarization, or compliance checking
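
To illustrate the chunk-to-section mapping mentioned under RAG chunk mapping, the sketch below looks up the section that contains a given character offset. It assumes the heading tree has already been flattened into a list sorted by start offset; the `SectionSpan` field names and the `findSection` function are hypothetical and do not reflect the actual JSON schema.

```typescript
// Hypothetical flattened view of a heading tree, sorted by start offset.
interface SectionSpan {
  code: string; // dotted section code, e.g. "2.1"
  title: string;
  start: number; // character offset where the section begins
}

// Binary search for the last section that starts at or before the offset.
function findSection(sections: SectionSpan[], offset: number): SectionSpan | undefined {
  let low = 0;
  let high = sections.length - 1;
  let found: SectionSpan | undefined;
  while (low <= high) {
    const mid = (low + high) >> 1;
    if (sections[mid].start <= offset) {
      found = sections[mid];
      low = mid + 1;
    } else {
      high = mid - 1;
    }
  }
  return found;
}

const spans: SectionSpan[] = [
  { code: "1", title: "Executive Summary", start: 0 },
  { code: "1.1", title: "Background", start: 120 },
  { code: "2", title: "Methodology", start: 900 },
];
console.log(findSection(spans, 500)?.code); // 1.1
```

A retrieval pipeline could attach the returned code and title to each chunk as a citation path.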

How It Works

OCC is written in TypeScript. It uses fast-glob for file discovery, dispatches to format-specific parsers (mammoth for DOCX, pdf-parse for PDF, SheetJS for XLSX, JSZip + officeparser for PPTX/ODF), aggregates metrics, and renders output via cli-table3. For code metrics, it shells out to a vendored scc binary (auto-downloaded during npm install, with a fallback to an scc found on PATH).

For structure extraction (--structure), documents are first converted to markdown (mammoth + turndown for DOCX, pdf-parse with page markers for PDF), then headers are extracted and assembled into a tree with dotted section codes.
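
The markdown-to-tree step can be sketched as follows. This is a simplified illustration, not the code behind `extractFromMarkdown`: it handles only ATX-style headings (`#` through `######`), skips fenced code blocks, and the `HeadingNode` shape is an assumption.

```typescript
interface HeadingNode {
  level: number;
  title: string;
  children: HeadingNode[];
}

// Build a nested heading tree from markdown text using a stack of open ancestors.
function extractHeadingTree(markdown: string): HeadingNode[] {
  const roots: HeadingNode[] = [];
  const stack: HeadingNode[] = [];
  let inFence = false;

  for (const line of markdown.split(/\r?\n/)) {
    // Toggle on fence markers so "# comment" lines inside code are ignored.
    if (/^\s*(`{3}|~{3})/.test(line)) {
      inFence = !inFence;
      continue;
    }
    if (inFence) continue;

    const match = /^(#{1,6})\s+(.+?)\s*$/.exec(line);
    if (!match) continue;

    const node: HeadingNode = { level: match[1].length, title: match[2], children: [] };
    // Close every open heading at the same or a deeper level.
    while (stack.length > 0 && stack[stack.length - 1].level >= node.level) {
      stack.pop();
    }
    const parent = stack[stack.length - 1];
    (parent ? parent.children : roots).push(node);
    stack.push(node);
  }
  return roots;
}
```

Walking the resulting tree depth-first while counting siblings yields the dotted section codes.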

For occ code, OCC builds an in-memory code graph on demand. JavaScript and TypeScript are parsed with the TypeScript compiler API, Vue Single-File Components are split with @vue/compiler-sfc and forwarded into the same TS pipeline, Python uses a language-specific parser, and the query engine resolves symbols, imports, calls, inheritance, ambiguities, and dependency categories without a persistent database.
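
The repo map feature ranks files with PageRank over this kind of graph. For readers unfamiliar with the algorithm, here is a minimal power-iteration sketch over a file import graph. It is a textbook version, not OCC's ranking code: the graph shape, the damping factor of 0.85, and the fixed iteration count are assumptions.

```typescript
// graph maps each file to the files it imports (hypothetical shape).
function pageRank(
  graph: Map<string, string[]>,
  damping = 0.85,
  iterations = 50,
): Map<string, number> {
  const nodes = new Set<string>(graph.keys());
  for (const targets of graph.values()) {
    for (const target of targets) nodes.add(target);
  }
  const n = nodes.size;
  let rank = new Map<string, number>();
  for (const node of nodes) rank.set(node, 1 / n);

  for (let i = 0; i < iterations; i++) {
    const next = new Map<string, number>();
    for (const node of nodes) next.set(node, (1 - damping) / n);

    let dangling = 0; // rank held by files with no outgoing imports
    for (const node of nodes) {
      const targets = graph.get(node) ?? [];
      const current = rank.get(node) ?? 0;
      if (targets.length === 0) {
        dangling += current;
        continue;
      }
      const share = (damping * current) / targets.length;
      for (const target of targets) {
        next.set(target, (next.get(target) ?? 0) + share);
      }
    }
    // Spread dangling rank evenly so the scores keep summing to 1.
    const bonus = (damping * dangling) / n;
    for (const node of nodes) next.set(node, (next.get(node) ?? 0) + bonus);
    rank = next;
  }
  return rank;
}
```

Files that many other files import accumulate rank, which is why widely used modules tend to surface first in a budgeted map.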

Contributing

Contributions are welcome! See CONTRIBUTING.md for setup instructions and guidelines.

License

MIT