# analyze-qna
CLI to analyze InstructLab `qna.yaml` files and report token counts, structure quality, and formatting lint. This tool is intended to help authors ensure their datasets pass InstructLab validations before running `ilab taxonomy diff`.
## Features

- Token counting: uses `tiktoken` (OpenAI).
- Rules enforced:
  - Context ~500 tokens (default warn if outside 300–500)
  - Each Q/A pair ~250 tokens (default warn if outside 200–300)
  - Context + all pairs ≤ 750 tokens (error if over)
  - 1–3 Q/A pairs per context (3 is optimal; extras ignored with a warning)
  - 5–15 sections required (warning otherwise)
  - Optional checks: Q and A text present in context; context appears in source doc (when provided)
- LLM-powered analysis (NEW):
  - Grounding checks: validates that answers are actually based on the provided context
  - Q&A suggestions: automatically generates additional Q&A pairs when fewer than 3 exist
  - Source verification: fetches documents from git repos and validates context accuracy
  - Supports Ollama and OpenAI-compatible endpoints (including vLLM)
- Configurable thresholds: via `--config` or CLI flags (see below)
- Stronger source matching: normalized substring + line-based fraction matching
- Directory crawl: `--taxonomy-root` crawls a tree and analyzes files named `qna.yaml`
- Readable report: pretty table output via `tabulate` with per-pair breakout
- Agent mode: `--ai` emits structured JSON for programmatic use
- YAML lint: `--yaml-lint` checks trailing whitespace, missing final newline, tabs/mixed indentation, CRLF endings, and duplicate keys
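For orientation, here is a minimal sketch of the knowledge `qna.yaml` shape these rules apply to. Field names follow the upstream InstructLab v3 knowledge schema (linked in the Schema validation section below); all values are placeholders:

```yaml
version: 3
domain: example-domain          # placeholder
created_by: your-github-handle  # placeholder
seed_examples:
  # 5–15 seed examples are expected (see rules above); one is shown here.
  - context: |
      A self-contained passage of roughly 300–500 tokens, copied from the
      source document, that the questions below can be answered from.
    questions_and_answers:
      - question: A question answerable from the context above?
        answer: An answer whose wording appears in the context.
      - question: A second question about the same passage?
        answer: A second grounded answer.
      - question: A third question (3 pairs is optimal)?
        answer: A third grounded answer.
document_outline: Short outline of the source document
document:
  repo: https://github.com/your-org/your-docs  # placeholder repo
  commit: <commit-sha>
  patterns:
    - docs/*.md
```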
## Installation

Local development:

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

After publishing (or via `npm link`), use `npx`:

```bash
npx analyze-qna --help
```
## Usage

### Direct Python

```bash
python src/analyze_qna.py --file path/to/qna.yaml
python src/analyze_qna.py --file path/to/qna.yaml --ai
python src/analyze_qna.py --file path/to/qna.yaml --ai --source-doc path/to/source.txt
python src/analyze_qna.py --taxonomy-root path/to/taxonomy
python src/analyze_qna.py --taxonomy-root path/to/taxonomy --ai
python src/analyze_qna.py --file path/to/qna.yaml --yaml-lint
python src/analyze_qna.py --taxonomy-root path/to/taxonomy --yaml-lint
python src/analyze_qna.py --data-dir path/to/dir  # deprecated
```

### Via npx (after publishing or via npm link)

```bash
npx analyze-qna --file path/to/qna.yaml
npx analyze-qna --taxonomy-root path/to/taxonomy
npx analyze-qna --file path/to/qna.yaml --yaml-lint
```
## Configuration

### LLM Configuration (New!)

analyze-qna now supports LLM-powered features, including grounding checks and Q&A suggestions.

Interactive setup:

```bash
npx analyze-qna config init
```

- Creates a config at `~/.config/analyze-qna/config.yaml`
- Supports Ollama and OpenAI-compatible endpoints
- Configures features like grounding checks and Q&A suggestions
- Max tokens limit: up to 32,000 for larger models
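For a sense of what that file covers, here is a hypothetical sketch. The exact keys and defaults are whatever your installed version of `config init` writes, so treat this as an illustration, not documentation:

```yaml
# Hypothetical sketch only; actual keys/defaults are written by `config init`.
endpoint:
  type: ollama                      # or an OpenAI-compatible endpoint (e.g., vLLM)
  base_url: http://localhost:11434  # assumed local Ollama default
  model: llama3.1                   # assumed model name
  api_key: ""                       # needed for OpenAI-compatible endpoints
  max_tokens: 32000                 # the setup wizard allows up to 32,000
features:
  grounding_checks: true            # validate answers against the context
  qna_suggestions: true             # suggest pairs when fewer than 3 exist
  source_verification: true         # fetch docs from git repos and verify context
```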
Validate configuration:

```bash
npx analyze-qna config validate [config-file]
```

- Tests LLM connectivity and model availability
- Verifies API keys and endpoints
- Shows enabled features and thresholds
Use LLM features:

- After running `config init`, LLM features are automatically enabled (no flags needed)
- Use an alternate config: `npx analyze-qna --file qna.yaml --llm-config /path/to/config.yaml`
- Set a default via environment variable: `export ANALYZE_QNA_CONFIG=/path/to/config.yaml`
- Temporarily disable LLM: `npx analyze-qna --file qna.yaml --no-llm`
When LLM features are enabled, the tool will:
- Show "[LLM]" indicators in output where LLM analysis is used
- Add grounding validation results to the "A in Ctx" column
- Display Q&A suggestions after each example that needs them
- Automatically fetch and validate source documents from git repositories
### Analysis Thresholds

You can provide a JSON config or override values via CLI.

JSON file (example `config.json`):

```json
{
  "context_min": 320,
  "context_max": 520,
  "pair_min": 180,
  "pair_max": 320,
  "section_max": 800,
  "examples_min": 5,
  "examples_max": 15,
  "line_match_min_length": 30,
  "line_match_fraction_min": 0.85
}
```

CLI flags:

```bash
--config config.json
--context-range 320,520
--pair-range 180,320
--examples-range 5,15
--section-max 800
--line-match-min-length 30
--line-match-fraction-min 0.85
```
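The two `line_match_*` keys control the "normalized substring + line-based fraction" source matching mentioned under Features. A minimal sketch of how such a check can work, to make the thresholds concrete (this illustrates the idea, not the tool's actual implementation):

```python
# Illustrative sketch only, not the tool's actual code.
# Idea: a context "matches" the source when enough of its sufficiently long,
# whitespace-normalized lines appear verbatim in the normalized source text.

def normalize(text: str) -> str:
    """Collapse all runs of whitespace into single spaces."""
    return " ".join(text.split())

def context_matches_source(
    context: str,
    source: str,
    min_length: int = 30,        # line_match_min_length
    fraction_min: float = 0.85,  # line_match_fraction_min
) -> bool:
    source_norm = normalize(source)
    # Only lines long enough to be meaningful count toward the fraction.
    candidates = [normalize(line) for line in context.splitlines()]
    candidates = [line for line in candidates if len(line) >= min_length]
    if not candidates:
        # Fall back to a plain normalized substring check.
        return normalize(context) in source_norm
    matched = sum(1 for line in candidates if line in source_norm)
    return matched / len(candidates) >= fraction_min
```

Raising `line_match_fraction_min` toward 1.0 demands a near-verbatim context; lowering `line_match_min_length` makes short lines count, which can increase false matches.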
## Examples

- Analyze a single file: `analyze-qna --file ./datasets/foo/qna.yaml`
- Get agent-friendly JSON output: `analyze-qna --file ./datasets/foo/qna.yaml --ai`
- Verify context against the original source document: `analyze-qna --file ./datasets/foo/qna.yaml --ai --source-doc ./datasets/foo/source.txt`
- Override thresholds on the fly: `analyze-qna --taxonomy-root ./datasets --context-range 350,550 --pair-range 180,320 --section-max 800`
Output
Human-readable table per file with per-pair breakout (Q/A tokens, totals, and whether they appear in context). A Notes section lists warnings (extra pairs ignored, out-of-range pairs, missing Q/A in context, context not matching source document).
When --yaml-lint is enabled, a YAML Lint section lists any formatting issues (trailing whitespace, missing final newline, CRLF, tabs/mixed indentation, duplicate keys).
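For illustration, a deliberately malformed snippet of the kind those checks would flag (hypothetical input; the report's exact wording may differ):

```yaml
version: 3   
# ^ trailing whitespace after the value on the line above
seed_examples:
	- context: "..."
# ^ tab used for indentation (tabs/mixed indentation)
domain: storage
domain: compute
# ^ duplicate key; CRLF endings or a missing final newline would also be flagged
```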
## Schema validation (InstructLab v3)

- Validates knowledge QnA files against a bundled InstructLab v3 JSON Schema when the file path contains `/knowledge/` (e.g., when analyzing a taxonomy tree).
- Human mode prints a "Schema Validation" section with the failing path and a short hint from the schema when available.
- AI mode adds a `schema` block to the JSON output with `validated_against` and detailed errors (sketched below).
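An illustrative shape for that block. Only `validated_against` and an error list are documented above; the field names inside each error entry are an assumption:

```json
{
  "schema": {
    "validated_against": "src/instructlab/schema/v3/knowledge.json",
    "errors": [
      {
        "path": "seed_examples/0/questions_and_answers",
        "message": "hypothetical example: fewer items than the schema requires"
      }
    ]
  }
}
```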
Bundled schemas (offline):

- `src/instructlab/schema/v3/knowledge.json`

Upstream references (InstructLab schemas v3 and taxonomy layout):

- https://github.com/instructlab/schema/tree/main/src/instructlab/schema/v3
- https://github.com/instructlab/taxonomy
Notes:

- Requires `jsonschema` (already in `requirements.txt`); the Node wrapper installs it automatically.
- If `schema.validated_against` is `null`, validation was skipped (non-knowledge path or schema not found).
- Currently validates knowledge QnA. Other dataset types (e.g., compositional, foundational) receive lint checks; schema validation for those types can be added later.
## Contributing

Contributions are welcome! Please read the guidelines in CONTRIBUTING.md and open an issue or pull request on GitHub.

- Repo: https://github.com/rdwj/analyze-qna
- Issues: https://github.com/rdwj/analyze-qna/issues
- Maintainer: Wes Jackson ([email protected])
## Development

Create and activate a venv, then install requirements:

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Run locally via the Node wrapper:

```bash
node bin/analyze-qna.js --file path/to/qna.yaml
```

Linting/type hints: optional stubs included in `requirements.txt` (`types-PyYAML`, `types-tabulate`).
## Publishing

Ensure the executable bit is set on the Node bin:

```bash
chmod +x bin/analyze-qna.js
git add --chmod=+x bin/analyze-qna.js
```

Bump the version and do a dry run:

```bash
npm version patch
FILE=$(npm pack --silent) && echo $FILE && tar -tf $FILE | cat
```

Log in and publish:

```bash
npm login
npm publish
```

Post-publish test:

```bash
npx analyze-qna --help
```
## License

MIT. See LICENSE for details.
## Acknowledgement

This utility is designed for use with InstructLab `qna.yaml` datasets and aims to mirror important validations to reduce failures during `ilab taxonomy diff`. InstructLab is an open-source project; please consult its documentation for canonical requirements and behavior.
