# analyze-qna
CLI to analyze InstructLab `qna.yaml` files and report token counts, structure quality, and formatting lint. This tool is intended to help authors ensure their datasets pass InstructLab validations before running `ilab taxonomy diff`.
## Features

- Token counting: uses `tiktoken` (OpenAI).
- Rules enforced:
  - Context ~500 tokens (default warn if outside 300–500)
  - Each Q/A pair ~250 tokens (default warn if outside 200–300)
  - Context + all pairs ≤ 750 tokens (error if over)
  - 1–3 Q/A pairs per context (3 is optimal; extras ignored with a warning)
  - 5–15 sections required (warning otherwise)
  - Optional checks: Q and A text present in context; context appears in source doc (when provided)
- LLM-powered analysis (NEW):
  - Grounding checks: validates that answers are actually based on the provided context
  - Q&A suggestions: automatically generates additional Q&A pairs when fewer than 3 exist
  - Source verification: fetches documents from git repos and validates context accuracy
  - Supports Ollama and OpenAI-compatible endpoints (including vLLM)
- Configurable thresholds: via `--config` or CLI flags (see below)
- Stronger source matching: normalized substring + line-based fraction matching
- Directory crawl: `--taxonomy-root` crawls a tree and analyzes files named `qna.yaml`
- Readable report: pretty table output via `tabulate` with per-pair breakout
- Agent mode: `--ai` emits structured JSON for programmatic use
- YAML lint: `--yaml-lint` checks trailing whitespace, missing final newline, tabs/mixed indentation, CRLF endings, and duplicate keys
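For orientation, here is a minimal sketch of the knowledge `qna.yaml` shape these rules apply to. Field names follow the upstream InstructLab v3 knowledge schema (linked in the Schema validation section below); all values are placeholders:

```yaml
version: 3
domain: example-domain          # placeholder
created_by: your-github-handle  # placeholder
seed_examples:
  # 5–15 seed examples are expected (see rules above); one is shown here.
  - context: |
      A self-contained passage of roughly 300–500 tokens, copied from the
      source document, that the questions below can be answered from.
    questions_and_answers:
      - question: A question answerable from the context above?
        answer: An answer whose wording appears in the context.
      - question: A second question about the same passage?
        answer: A second grounded answer.
      - question: A third question (3 pairs is optimal)?
        answer: A third grounded answer.
document_outline: Short outline of the source document
document:
  repo: https://github.com/your-org/your-docs  # placeholder repo
  commit: <commit-sha>
  patterns:
    - docs/*.md
```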
## Installation

Local development:

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

After publishing (or via `npm link`), use `npx`:

```bash
npx analyze-qna --help
```
## Usage

### Direct Python

```bash
python src/analyze_qna.py --file path/to/qna.yaml
python src/analyze_qna.py --file path/to/qna.yaml --ai
python src/analyze_qna.py --file path/to/qna.yaml --ai --source-doc path/to/source.txt
python src/analyze_qna.py --taxonomy-root path/to/taxonomy
python src/analyze_qna.py --taxonomy-root path/to/taxonomy --ai
python src/analyze_qna.py --file path/to/qna.yaml --yaml-lint
python src/analyze_qna.py --taxonomy-root path/to/taxonomy --yaml-lint
python src/analyze_qna.py --data-dir path/to/dir  # deprecated
```

### Via npx (after publishing or via npm link)

```bash
npx analyze-qna --file path/to/qna.yaml
npx analyze-qna --taxonomy-root path/to/taxonomy
npx analyze-qna --file path/to/qna.yaml --yaml-lint
```
## Configuration

### LLM Configuration (New!)

analyze-qna now supports LLM-powered features, including grounding checks and Q&A suggestions.

Interactive setup:

```bash
npx analyze-qna config init
```

- Creates a config at `~/.config/analyze-qna/config.yaml`
- Supports Ollama and OpenAI-compatible endpoints
- Configures features like grounding checks and Q&A suggestions
- Max tokens limit: up to 32,000 for larger models
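For a sense of what that file covers, here is a hypothetical sketch. The exact keys and defaults are whatever your installed version of `config init` writes, so treat this as an illustration, not documentation:

```yaml
# Hypothetical sketch only; actual keys/defaults are written by `config init`.
endpoint:
  type: ollama                      # or an OpenAI-compatible endpoint (e.g., vLLM)
  base_url: http://localhost:11434  # assumed local Ollama default
  model: llama3.1                   # assumed model name
  api_key: ""                       # needed for OpenAI-compatible endpoints
  max_tokens: 32000                 # the setup wizard allows up to 32,000
features:
  grounding_checks: true            # validate answers against the context
  qna_suggestions: true             # suggest pairs when fewer than 3 exist
  source_verification: true         # fetch docs from git repos and verify context
```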
Validate configuration:

```bash
npx analyze-qna config validate [config-file]
```

- Tests LLM connectivity and model availability
- Verifies API keys and endpoints
- Shows enabled features and thresholds
Use LLM features:

- After running `config init`, LLM features are automatically enabled (no flags needed)
- Use an alternate config: `npx analyze-qna --file qna.yaml --llm-config /path/to/config.yaml`
- Set a default via environment variable: `export ANALYZE_QNA_CONFIG=/path/to/config.yaml`
- Temporarily disable LLM: `npx analyze-qna --file qna.yaml --no-llm`
When LLM features are enabled, the tool will:
- Show "[LLM]" indicators in output where LLM analysis is used
- Add grounding validation results to the "A in Ctx" column
- Display Q&A suggestions after each example that needs them
- Automatically fetch and validate source documents from git repositories
### Analysis Thresholds

You can provide a JSON config or override values via CLI.

JSON file (example `config.json`):

```json
{
  "context_min": 320,
  "context_max": 520,
  "pair_min": 180,
  "pair_max": 320,
  "section_max": 800,
  "examples_min": 5,
  "examples_max": 15,
  "line_match_min_length": 30,
  "line_match_fraction_min": 0.85
}
```

CLI flags:

```bash
--config config.json
--context-range 320,520
--pair-range 180,320
--examples-range 5,15
--section-max 800
--line-match-min-length 30
--line-match-fraction-min 0.85
```
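The two `line_match_*` keys control the "normalized substring + line-based fraction" source matching mentioned under Features. A minimal sketch of how such a check can work, to make the thresholds concrete (this illustrates the idea, not the tool's actual implementation):

```python
# Illustrative sketch only, not the tool's actual code.
# Idea: a context "matches" the source when enough of its sufficiently long,
# whitespace-normalized lines appear verbatim in the normalized source text.

def normalize(text: str) -> str:
    """Collapse all runs of whitespace into single spaces."""
    return " ".join(text.split())

def context_matches_source(
    context: str,
    source: str,
    min_length: int = 30,        # line_match_min_length
    fraction_min: float = 0.85,  # line_match_fraction_min
) -> bool:
    source_norm = normalize(source)
    # Only lines long enough to be meaningful count toward the fraction.
    candidates = [normalize(line) for line in context.splitlines()]
    candidates = [line for line in candidates if len(line) >= min_length]
    if not candidates:
        # Fall back to a plain normalized substring check.
        return normalize(context) in source_norm
    matched = sum(1 for line in candidates if line in source_norm)
    return matched / len(candidates) >= fraction_min
```

Raising `line_match_fraction_min` toward 1.0 demands a near-verbatim context; lowering `line_match_min_length` makes short lines count, which can increase false matches.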
## Examples

- Analyze a single file: `analyze-qna --file ./datasets/foo/qna.yaml`
- Get agent-friendly JSON output: `analyze-qna --file ./datasets/foo/qna.yaml --ai`
- Verify context against the original source document: `analyze-qna --file ./datasets/foo/qna.yaml --ai --source-doc ./datasets/foo/source.txt`
- Override thresholds on the fly: `analyze-qna --taxonomy-root ./datasets --context-range 350,550 --pair-range 180,320 --section-max 800`
Output
Human-readable table per file with per-pair breakout (Q/A tokens, totals, and whether they appear in context). A Notes section lists warnings (extra pairs ignored, out-of-range pairs, missing Q/A in context, context not matching source document).
When --yaml-lint is enabled, a YAML Lint section lists any formatting issues (trailing whitespace, missing final newline, CRLF, tabs/mixed indentation, duplicate keys).
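For illustration, a deliberately malformed snippet of the kind those checks would flag (hypothetical input; the report's exact wording may differ):

```yaml
version: 3   
# ^ trailing whitespace after the value on the line above
seed_examples:
	- context: "..."
# ^ tab used for indentation (tabs/mixed indentation)
domain: storage
domain: compute
# ^ duplicate key; CRLF endings or a missing final newline would also be flagged
```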
## Schema validation (InstructLab v3)

- Validates knowledge QnA files against a bundled InstructLab v3 JSON Schema when the file path contains `/knowledge/` (e.g., when analyzing a taxonomy tree).
- Human mode prints a "Schema Validation" section with the failing path and a short hint from the schema when available.
- AI mode adds a `schema` block to the JSON output with `validated_against` and detailed errors (sketched below).
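An illustrative shape for that block. Only `validated_against` and an error list are documented above; the field names inside each error entry are an assumption:

```json
{
  "schema": {
    "validated_against": "src/instructlab/schema/v3/knowledge.json",
    "errors": [
      {
        "path": "seed_examples/0/questions_and_answers",
        "message": "hypothetical example: fewer items than the schema requires"
      }
    ]
  }
}
```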
Bundled schemas (offline):

- `src/instructlab/schema/v3/knowledge.json`

Upstream references (InstructLab schemas v3 and taxonomy layout):

- https://github.com/instructlab/schema/tree/main/src/instructlab/schema/v3
- https://github.com/instructlab/taxonomy
Notes:

- Requires `jsonschema` (already in `requirements.txt`); the Node wrapper installs it automatically.
- If `schema.validated_against` is `null`, validation was skipped (non-knowledge path or schema not found).
- Currently validates knowledge QnA. Other dataset types (e.g., compositional, foundational) receive lint checks; schema validation for those types can be added later.
## Contributing

Contributions are welcome! Please read the guidelines in CONTRIBUTING.md and open an issue or pull request on GitHub.

- Repo: https://github.com/rdwj/analyze-qna
- Issues: https://github.com/rdwj/analyze-qna/issues
- Maintainer: Wes Jackson ([email protected])
## Development

Create and activate a venv, then install requirements:

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Run locally via the Node wrapper:

```bash
node bin/analyze-qna.js --file path/to/qna.yaml
```

Linting/type hints: optional stubs included in `requirements.txt` (`types-PyYAML`, `types-tabulate`).
## Publishing

Ensure the executable bit is set on the Node bin:

```bash
chmod +x bin/analyze-qna.js
git add --chmod=+x bin/analyze-qna.js
```

Bump the version and do a dry run:

```bash
npm version patch
FILE=$(npm pack --silent) && echo $FILE && tar -tf $FILE | cat
```

Log in and publish:

```bash
npm login
npm publish
```

Post-publish test:

```bash
npx analyze-qna --help
```
## License

MIT. See LICENSE for details.
## Acknowledgement

This utility is designed for use with InstructLab `qna.yaml` datasets and aims to mirror important validations to reduce failures during `ilab taxonomy diff`. InstructLab is an open-source project; please consult its documentation for canonical requirements and behavior.
