@macive/lesson-plan-parser
v1.0.0
Robust, flexible DOCX lesson plan parser that converts structured lesson plans into clean, database-ready JSON.
Lesson Plan Parser
A robust, flexible TypeScript Node.js application that parses DOCX lesson plan documents into clean, structured JSON suitable for database seeding.
Features
- DOCX-to-JSON conversion using mammoth for reliable text extraction
- HTML intermediate parsing to preserve table structures (Core Competencies, Values, PCIs)
- Flexible field extraction handles formatting variations across different curriculum files
- Auto-fix engine repairs common DOCX extraction errors (concatenated lines, missing newlines, malformed headers)
- Multi-lesson splitting accurately splits documents into individual lesson blocks
- Fully typed with TypeScript interfaces for all parsed entities
- CLI and programmatic API: use it from the command line or import it into your own code
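As a rough illustration of the multi-lesson splitting feature, the sketch below cuts extracted text into lesson blocks at each "WEEK n: LESSON m" header. It is not the package's actual implementation: the function name splitLessonBlocks, the header regex, and the minLength parameter (loosely modelled on the --min-length CLI option) are all assumptions made for this example.

```typescript
// Hypothetical helper for illustration only; the real parser may work differently.
// Assumes every lesson starts with a header such as "WEEK 1: LESSON 1"
// or, for combined lessons, "WEEK 6: LESSON 1 - 2".
function splitLessonBlocks(text: string, minLength = 0): string[] {
  const header = /WEEK\s*\d+\s*:\s*LESSON\s*\d+(?:\s*-\s*\d+)?/gi;

  // Record the offset of every header. Working with offsets instead of
  // lines means headers glued onto the previous line are still found.
  const starts: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = header.exec(text)) !== null) {
    starts.push(match.index);
  }

  // Each block runs from its header to the next header (or the end of the text).
  const blocks = starts.map((start, i) => {
    const end = i + 1 < starts.length ? starts[i + 1] : text.length;
    return text.slice(start, end).trim();
  });

  // Drop fragments shorter than the requested minimum length.
  return blocks.filter((block) => block.length >= minLength);
}
```

Text that appears before the first header is discarded in this sketch, which may or may not match how the package treats document-level metadata.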
Installation
npm install
Usage
CLI
# Parse single file
npx ts-node src/cli.ts path/to/lesson-plans.docx -o ./output
# Parse multiple files
npx ts-node src/cli.ts plan1.docx plan2.docx plan3.docx -o ./json-output
# With options
npx ts-node src/cli.ts plans.docx -o ./output --raw --pretty
CLI Options:
- -o, --output <dir>: Output directory (default: ./output)
- -r, --raw: Include raw extracted text in JSON
- -p, --pretty: Pretty-print JSON output
- -m, --min-length <n>: Minimum lesson block length filter
Programmatic API
import { parseDocxFile, parseDocxFiles } from "./index";
// Single file
const result = await parseDocxFile("path/to/plans.docx");
console.log(result.document?.lessons);
// Multiple files
const results = await parseDocxFiles([
{ filePath: "plan1.docx", outputPath: "out1.json" },
{ filePath: "plan2.docx", outputPath: "out2.json" },
]);
JSON Output Structure
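As a reading aid, the JSON shape documented in this section could be typed roughly as follows. The interface names (TimedSection, LessonStep, Lesson, LessonPlanDocument) and the helper lessonLabelsByWeek are assumptions made for this sketch; the package's own interfaces in src/types/ may be named and structured differently.

```typescript
// Illustrative types inferred from the sample JSON in this section.
// All names here are hypothetical, not the package's exported types.
interface TimedSection {
  duration: string;
  content: string[];
}

interface LessonStep {
  stepNumber: number;
  title: string;
  content: string[];
}

interface Lesson {
  weekLessonLabel: string;
  weekNumber: number;
  lessonNumber: string;
  isCombinedLesson: boolean;
  strand: string;
  subStrand: string;
  learningOutcomes: { preamble: string; items: string[] };
  keyInquiryQuestions: string[];
  coreCompetencies: string[];
  values: string[];
  pcis: string[];
  learningResources: string[];
  organizationOfLearning: {
    introduction: TimedSection;
    lessonDevelopment: { duration: string; steps: LessonStep[] };
    conclusion: TimedSection;
  };
  extendedActivities: string[];
  teacherSelfEvaluation: string;
}

interface LessonPlanDocument {
  metadata: {
    sourceFile: string;
    term: number;
    level: string;
    learningArea: string;
    grade: string;
  };
  lessons: Lesson[];
}

// Group lesson labels by week number, e.g. to build a term overview
// before seeding a database. Accepts any object with a compatible
// lessons array, so a full LessonPlanDocument can be passed directly.
function lessonLabelsByWeek(
  doc: { lessons: Array<Pick<Lesson, "weekNumber" | "weekLessonLabel">> }
): Record<number, string[]> {
  const byWeek: Record<number, string[]> = {};
  for (const lesson of doc.lessons) {
    const labels = byWeek[lesson.weekNumber] ?? [];
    labels.push(lesson.weekLessonLabel);
    byWeek[lesson.weekNumber] = labels;
  }
  return byWeek;
}
```

A parsed JSON file could be passed to lessonLabelsByWeek after JSON.parse, provided its contents follow the structure shown in this section.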
{
"metadata": {
"sourceFile": "string",
"term": 1,
"level": "GRADE 8",
"learningArea": "AGRICULTURE AND NUTRITION",
"grade": "Grade 8"
},
"lessons": [
{
"weekLessonLabel": "WEEK 1: LESSON 1",
"weekNumber": 1,
"lessonNumber": "1",
"isCombinedLesson": false,
"strand": "Hygiene Practices",
"subStrand": "Cleaning practices",
"learningOutcomes": {
"preamble": "By the end of the lesson, the learner should be able to:",
"items": ["Identify appropriate procedures...", "Explain the routine...", "Appreciate a clean..."]
},
"keyInquiryQuestions": ["How can we...?"],
"coreCompetencies": ["Learning to learn", "Digital literacy"],
"values": ["Integrity", "Responsibility"],
"pcis": ["Health promotion", "Safety"],
"learningResources": ["Agriculture and Nutrition grade 8..."],
"organizationOfLearning": {
"introduction": { "duration": "5 minutes", "content": ["Review..."] },
"lessonDevelopment": {
"duration": "30 minutes",
"steps": [
{ "stepNumber": 1, "title": "Daily Cleaning Practices", "content": ["Discuss..."] }
]
},
"conclusion": { "duration": "5 minutes", "content": ["Summarize..."] }
},
"extendedActivities": ["Conducting a kitchen cleaning..."],
"teacherSelfEvaluation": ""
}
]
}
Supported Document Variations
The parser is designed to be resilient against different formatting styles:
- Header tables with varying column orders and empty cells
- Spelling variations: "Organization" vs "Organisation", "Sub Strand" vs "Sub-Strand"
- Key Inquiry Questions with optional (s) suffix
- Numbered lists with various prefixes (1., 1.Define, 1**.**)
- Combined lessons like WEEK 6: LESSON 1 - 2
- Missing or empty teacher self-evaluation sections
- Tables vs. vertical lists for competencies, values, and PCIs
- Malformed first lessons with concatenated fields
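Some of the variations above can be absorbed with tolerant pattern matching. The sketch below is one possible approach, not the package's own code: parseWeekLessonLabel, canonicalFieldName, the regexes, and the "1-2" representation of a combined lesson number are all assumptions made for illustration.

```typescript
// Hypothetical helpers; names, regexes, and output conventions are assumptions.
interface WeekLesson {
  weekNumber: number;
  lessonNumber: string; // "1", or "1-2" for a combined lesson (assumed format)
  isCombinedLesson: boolean;
}

// Parse labels such as "WEEK 1: LESSON 1" or "WEEK 6: LESSON 1 - 2".
// Returns null when the text contains no recognizable label.
function parseWeekLessonLabel(label: string): WeekLesson | null {
  const match = /WEEK\s*(\d+)\s*:\s*LESSON\s*(\d+)(?:\s*-\s*(\d+))?/i.exec(label);
  if (match === null) return null;
  const week = match[1] ?? "";
  const first = match[2] ?? "";
  const last = match[3]; // undefined unless a "- n" range is present
  return {
    weekNumber: Number(week),
    lessonNumber: last !== undefined ? `${first}-${last}` : first,
    isCombinedLesson: last !== undefined,
  };
}

// Map spelling variants of a field heading onto one canonical key, so that
// "Sub-Strand:" and "Sub Strand" compare as equal.
function canonicalFieldName(heading: string): string {
  return heading
    .toLowerCase()
    .trim()
    .replace(/\(s\)/g, "s") // "question(s)" becomes "questions"
    .replace(/organisation/g, "organization")
    .replace(/[-\s]+/g, " ") // hyphens and runs of whitespace become one space
    .replace(/[:.]+$/, ""); // drop trailing colons or full stops
}
```

Normalizing headings to a canonical key before lookup keeps the field-matching logic in one place, so each new spelling variant needs only one extra replacement rule.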
Project Structure
src/
types/ # TypeScript interfaces
extractors/ # DOCX text extraction
parsers/ # Core parsing engine
utils/ # Text normalization utilities
index.ts # Public API
cli.ts # CLI entry point
Building
npm run build
Compiled JavaScript will be in the dist/ directory.
License
MIT
