erp-datagen

v0.4.3

Published

2 months ago

Generate realistic synthetic procurement data for SAP ECC, D365, and JDE — for testing, AI training, and demos

kundan1975

sap erp procurement synthetic-data jde d365 test-data ai-training

Generate realistic synthetic procurement data for SAP ECC, Microsoft D365, and JDE — for testing, AI training, and demos.

The problem

Developers and data engineers building on ERP systems constantly need realistic test data, but real production data cannot be shared. Existing tools do not model the structure of the procurement domain: duplicate vendor names, multi-currency POs, three-way match scenarios, credit memos, invoice reversals, and the messy edge cases that actually matter.

This tool does.

Supported ERP systems

| ERP | Tables generated |
|-----|-----------------|
| SAP ECC | LFA1, LFB1, LFM1, EKKO, EKPO, MKPF, MSEG, RBKP, RSEG, BSAK, BSIK, EKBE, BSET, BKPF, BSEG, COEP, REGUH, REGUP, MARA, MAKT, ESSR, ESLL (22 tables) |
| D365 F&O | VendTable, PurchTable, PurchLine, VendPackingSlipJour, VendInvoiceJour, VendInvoiceTrans, VendTrans (7 tables) |
| JDE E1 | F0101, F4301, F4311, F43121, F0411, F0413 (6 tables) |

Requirements

Node.js >= 18

Install

# Run without installing
npx erp-datagen --help

# Or install globally
npm install -g erp-datagen

# Or clone and run locally
git clone https://github.com/kundanshar-cell/erp-datagen.git
cd erp-datagen
npm install

Two ways to use it

1. Generate a single table

Use this when you need one specific table in isolation — vendors, PO headers, invoices, etc.

npx erp-datagen generate --erp=<erp> --entity=<entity> [options]

| Option | Description | Default |
|---|---|---|
| --erp | ERP system: sap-ecc, jde, d365 | required |
| --entity | Table to generate (see list below) | required |
| --rows | Number of rows | 100 |
| --output | Format: csv, json, jsonl | csv |
| --file | Write to file instead of stdout | stdout |
| --missing-rate | Proportion of optional fields left blank (0–1) | 0 |
| --seed | Random seed for reproducible output | none |

Examples:

# 500 SAP vendors as CSV
npx erp-datagen generate --erp=sap-ecc --entity=vendors --rows=500

# 200 SAP PO headers as JSON, written to file
npx erp-datagen generate --erp=sap-ecc --entity=po-headers --rows=200 --output=json --file=./ekko.json

# JDE PO lines with 20% missing fields (simulates messy source data)
npx erp-datagen generate --erp=jde --entity=po-lines --rows=100 --missing-rate=0.2

# Reproducible output — same seed always produces same data
npx erp-datagen generate --erp=sap-ecc --entity=vendors --rows=100 --seed=42
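
When the generator is driven from a test harness, it can help to assemble the command line from validated options instead of concatenating strings. The sketch below uses only the flags documented in the option table above; the helper name `build_generate_args` is hypothetical and not part of erp-datagen.

```python
from typing import List, Optional

# Values taken from the option table above.
ERPS = {"sap-ecc", "jde", "d365"}
FORMATS = {"csv", "json", "jsonl"}


def build_generate_args(
    erp: str,
    entity: str,
    rows: int = 100,
    output: str = "csv",
    file: Optional[str] = None,
    missing_rate: float = 0.0,
    seed: Optional[int] = None,
) -> List[str]:
    """Build an argv list for `npx erp-datagen generate` (hypothetical helper)."""
    if erp not in ERPS:
        raise ValueError(f"unknown ERP: {erp!r}")
    if output not in FORMATS:
        raise ValueError(f"unknown output format: {output!r}")
    if not 0 <= missing_rate <= 1:
        raise ValueError("missing_rate must be between 0 and 1")

    args = [
        "npx", "erp-datagen", "generate",
        f"--erp={erp}", f"--entity={entity}",
        f"--rows={rows}", f"--output={output}",
    ]
    # Optional flags are only added when they differ from the documented defaults.
    if file is not None:
        args.append(f"--file={file}")
    if missing_rate:
        args.append(f"--missing-rate={missing_rate}")
    if seed is not None:
        args.append(f"--seed={seed}")
    return args


args = build_generate_args("sap-ecc", "po-headers", rows=200,
                           output="json", file="./ekko.json")
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run(args, check=True)`, which avoids shell quoting problems in file paths.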

Supported entities per ERP

| ERP | --entity value | Table |
|-----|-----------------|-------|
| sap-ecc | vendors | LFA1 |
| sap-ecc | po-headers | EKKO |
| sap-ecc | po-lines | EKPO |
| sap-ecc | gr-headers | MKPF |
| sap-ecc | gr-lines | MSEG |
| sap-ecc | invoice-headers | RBKP |
| sap-ecc | invoice-lines | RSEG |
| jde | vendors | F0101 |
| jde | po-headers | F4301 |
| jde | po-lines | F4311 |
| jde | gr-lines | F43121 |
| jde | invoices | F0411 |
| d365 | vendors | VendTable |
| d365 | po-headers | PurchTable |
| d365 | po-lines | PurchLine |
| d365 | gr-headers | VendPackingSlipJour |
| d365 | invoice-headers | VendInvoiceJour |
| d365 | invoice-lines | VendInvoiceTrans |
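
Entity names differ between ERPs (JDE has `invoices` and no `gr-headers`, for example), so a script that loops over ERPs may want to check a combination before calling the CLI. The lookup below is a transcription of the table above into a Python dictionary; `table_for` is a hypothetical helper, not something the package exports.

```python
# Transcribed from the "Supported entities per ERP" table.
ENTITY_TABLES = {
    "sap-ecc": {
        "vendors": "LFA1", "po-headers": "EKKO", "po-lines": "EKPO",
        "gr-headers": "MKPF", "gr-lines": "MSEG",
        "invoice-headers": "RBKP", "invoice-lines": "RSEG",
    },
    "jde": {
        "vendors": "F0101", "po-headers": "F4301", "po-lines": "F4311",
        "gr-lines": "F43121", "invoices": "F0411",
    },
    "d365": {
        "vendors": "VendTable", "po-headers": "PurchTable",
        "po-lines": "PurchLine", "gr-headers": "VendPackingSlipJour",
        "invoice-headers": "VendInvoiceJour",
        "invoice-lines": "VendInvoiceTrans",
    },
}


def table_for(erp: str, entity: str) -> str:
    """Return the ERP table behind an --entity value, or raise ValueError."""
    try:
        return ENTITY_TABLES[erp][entity]
    except KeyError:
        valid = ", ".join(sorted(ENTITY_TABLES.get(erp, {}))) or "none"
        raise ValueError(
            f"no entity {entity!r} for ERP {erp!r}; valid entities: {valid}"
        ) from None


print(table_for("jde", "invoices"))
print(table_for("d365", "gr-headers"))
```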

2. Generate a full linked dataset (scenarios)

Scenarios generate all tables for an ERP linked by real document keys — vendor → PO → goods receipt → invoice → payment — in one command. One file per table, written to an output directory.

npx erp-datagen scenario --erp=<erp> --name=<scenario> [options]

| Option | Description | Default |
|---|---|---|
| --erp | ERP system: sap-ecc, jde, d365 | required |
| --name | Scenario name (see below) | required |
| --rows | Approximate number of PO lines to anchor the dataset | 100 |
| --output | Format: csv, json, jsonl | csv |
| --output-dir | Directory to write files | ./output |
| --missing-rate | Proportion of optional fields left blank (0–1) | 0 |
| --seed | Random seed for reproducible output | none |

Scenarios

full-p2p

Generates the complete procure-to-pay chain for any ERP. All tables are linked by real document keys — the same referential integrity you would find in a live system.

# SAP ECC — 1000 PO lines, all 22 tables, CSV
npx erp-datagen scenario --erp=sap-ecc --name=full-p2p --rows=1000 --output-dir=./output

# JDE — full P2P as JSON
npx erp-datagen scenario --erp=jde --name=full-p2p --rows=500 --output=json --output-dir=./output/jde

# D365 — messy data with 30% missing fields
npx erp-datagen scenario --erp=d365 --name=full-p2p --rows=1000 --missing-rate=0.3 --output-dir=./output/d365
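
The PO, goods receipt, and invoice chain that full-p2p produces is the input a three-way match runs over. As an illustration of what such a check looks like for a single PO line, here is a minimal sketch. The field names (`po_qty`, `gr_qty`, and so on), the issue strings, and the 2% price tolerance are all assumptions made for the example; they are not the tool's column names or defaults.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass
class LineFacts:
    """Quantities and unit prices for one PO line (hypothetical field names)."""
    po_qty: Decimal
    po_price: Decimal
    gr_qty: Decimal      # total quantity received so far
    inv_qty: Decimal     # total quantity invoiced so far
    inv_price: Decimal


def three_way_match(line: LineFacts,
                    price_tolerance: Decimal = Decimal("0.02")) -> List[str]:
    """Return a list of mismatches; an empty list means the line matches."""
    issues = []
    if line.gr_qty < line.po_qty:
        issues.append("partial delivery")
    if line.inv_qty > line.gr_qty:
        issues.append("invoiced more than received")
    # Relative price variance of the invoice against the PO price.
    if line.po_price and abs(line.inv_price - line.po_price) / line.po_price > price_tolerance:
        issues.append("price variance")
    return issues


line = LineFacts(po_qty=Decimal("100"), po_price=Decimal("10.00"),
                 gr_qty=Decimal("60"), inv_qty=Decimal("60"),
                 inv_price=Decimal("10.50"))
print(three_way_match(line))
```

`Decimal` is used rather than `float` so that price comparisons are not affected by binary rounding.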

What gets generated (SAP ECC, --rows=1000; row counts are approximate):

LFA1_vendors.csv              ~100 rows   Vendor master
LFB1_vendor_company.csv       ~100 rows   Vendor per company code
LFM1_vendor_purchasing.csv    ~100 rows   Vendor purchasing data
EKKO_po_headers.csv           ~200 rows   Purchase order headers
EKPO_po_lines.csv             ~1000 rows  Purchase order lines
MKPF_gr_headers.csv           ~140 rows   Goods receipt headers
MSEG_gr_lines.csv             ~520 rows   Goods receipt lines
RBKP_invoice_headers.csv      ~155 rows   Invoice headers
RSEG_invoice_lines.csv        ~570 rows   Invoice lines
BSAK_cleared_items.csv        ~108 rows   Cleared AP items (paid)
BSIK_open_items.csv           ~47 rows    Open AP items (unpaid)
EKBE_po_history.csv           ~690 rows   PO history (GR + invoice events)
BSET_tax_lines.csv            ~120 rows   Tax document lines
BKPF_fi_headers.csv           ~400 rows   FI document headers
BSEG_fi_lines.csv             ~900 rows   FI document lines
COEP_cost_lines.csv           ~280 rows   CO cost elements
REGUH_payment_runs.csv        ~80 rows    Payment run headers
REGUP_payment_items.csv       ~108 rows   Payment run items
MARA_material_master.csv      ~300 rows   Material master
MAKT_material_desc.csv        ~300 rows   Material descriptions
ESSR_service_sheets.csv       ~40 rows    Service entry sheets
ESLL_service_lines.csv        ~80 rows    Service line items
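
A quick way to confirm that the files hang together is to check that every foreign key in a child table exists in its parent. The sketch below does this for vendors, PO headers, and PO lines. `LIFNR` (vendor number) and `EBELN` (purchasing document number) are the standard SAP field names for these keys; that the generated files use them as column headers is an assumption here, and the sample rows are made up, with one orphan planted on purpose.

```python
from typing import Dict, Iterable, List


def missing_references(child_rows: Iterable[Dict[str, str]],
                       parent_rows: Iterable[Dict[str, str]],
                       key: str) -> List[str]:
    """Return the sorted key values in child_rows that no parent row carries."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows if row[key] not in parent_keys})


# Made-up sample rows standing in for LFA1, EKKO and EKPO.
lfa1 = [{"LIFNR": "0000100001"}, {"LIFNR": "0000100002"}]
ekko = [
    {"EBELN": "4500000001", "LIFNR": "0000100001"},
    {"EBELN": "4500000002", "LIFNR": "0000100002"},
]
ekpo = [
    {"EBELN": "4500000001", "EBELP": "00010"},
    {"EBELN": "4500000002", "EBELP": "00010"},
    {"EBELN": "4500000009", "EBELP": "00010"},  # planted orphan
]

print(missing_references(ekko, lfa1, "LIFNR"))  # []
print(missing_references(ekpo, ekko, "EBELN"))  # ['4500000009']
```

For real output, the row lists can be loaded with `csv.DictReader` from the corresponding files in the output directory.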

spend-cube (SAP ECC only)

Generates the same 22 SAP ECC tables as full-p2p, but with two additions designed for spend analytics training:

- Every invoice and GR row carries a SCENARIO label, so downstream models can learn to classify spend types
- Six company codes with deliberate spend profiles: each company has a fixed PO vs non-PO ratio to represent different procurement maturity levels

# SAP ECC spend cube — 500 rows
npx erp-datagen scenario --erp=sap-ecc --name=spend-cube --rows=500 --output=json --output-dir=./output/spend

Company spend profiles:

| Company | Country | Currency | PO% | Non-PO% | Story |
|---------|---------|----------|-----|---------|-------|
| 1000 | Germany | EUR | 85% | 15% | Mature, SAP-native procurement |
| GB01 | UK | GBP | 35% | 65% | Maverick spend, the problem company |
| 2000 | USA | USD | 50% | 50% | Transitioning, partial compliance |
| US01 | USA | USD | 60% | 40% | Compliance improving |
| IN01 | India | INR | 80% | 20% | High PO discipline |
| 3000 | Europe | EUR | 75% | 25% | Regional shared services centre |
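
One way to sanity-check a generated spend cube against these profiles is to compute the observed PO share per company from the SCENARIO label. The sketch below makes several assumptions: the company code column is called `BUKRS` (the standard SAP field name), any label that does not start with `NON_PO` counts as PO-backed, and shares are measured by row count rather than by amount. The README does not say which basis the percentages use, so treat the comparison as indicative only.

```python
from collections import defaultdict
from typing import Dict, Iterable

# PO% column from the company profile table above.
EXPECTED_PO_SHARE = {
    "1000": 0.85, "GB01": 0.35, "2000": 0.50,
    "US01": 0.60, "IN01": 0.80, "3000": 0.75,
}


def po_share_by_company(rows: Iterable[Dict[str, str]],
                        company_field: str = "BUKRS",
                        label_field: str = "SCENARIO") -> Dict[str, float]:
    """Fraction of rows per company whose label is not a NON_PO_* label."""
    totals = defaultdict(int)
    po_backed = defaultdict(int)
    for row in rows:
        company = row[company_field]
        totals[company] += 1
        if not row[label_field].startswith("NON_PO"):
            po_backed[company] += 1
    return {company: po_backed[company] / totals[company] for company in totals}


# Made-up sample rows.
rows = (
    [{"BUKRS": "1000", "SCENARIO": "PO_NORMAL"}] * 3
    + [{"BUKRS": "1000", "SCENARIO": "NON_PO_STANDARD"}]
    + [{"BUKRS": "GB01", "SCENARIO": "PO_SERVICE"}]
    + [{"BUKRS": "GB01", "SCENARIO": "NON_PO_STANDARD"}] * 3
)
shares = po_share_by_company(rows)
for company, share in sorted(shares.items()):
    print(f"{company}: observed {share:.0%}, expected {EXPECTED_PO_SHARE[company]:.0%}")
```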

SCENARIO labels on every row:

| SCENARIO | What it represents |
|---|---|
| PO_NORMAL | Standard PO → GR → Invoice → Payment |
| PO_SERVICE | Service PO: no goods receipt, ESSR/ESLL instead |
| PO_FRAMEWORK | Framework order drawdown |
| PO_CONSIGNMENT | Consignment settlement |
| NON_PO_STANDARD | Non-PO invoice: rent, utilities, subscriptions |
| NON_PO_CREDIT | Credit memo against a non-PO invoice |
| CREDIT_MEMO | KG: price dispute, quality claim, returns |
| DEBIT_MEMO | KA: vendor underbilled, price corrected upward |
| INVOICE_REVERSAL | Invoice cancelled and reposted (wrong vendor or amount) |
| SUBSEQUENT_CREDIT | Price corrected down after invoice was posted |
| SUBSEQUENT_DEBIT | Price corrected up after invoice was posted |
| SPLIT_INVOICE | One PO line invoiced across two separate invoices |
| GR_REVERSAL | Movement 102: wrong delivery returned to stock |
| RETURN_TO_VENDOR | Movement 122: physical return, vendor credit expected |
| PARTIAL_GR | Goods receipt for less than the PO quantity |
| PO_LINE_CANCELLED | PO line cancelled (LOEKZ=L), committed spend removed |
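
Before training a classifier on these labels, it is worth checking the class distribution and confirming that no unexpected label has slipped in. A minimal sketch, assuming each row is a dictionary with a `SCENARIO` key (as it would be after loading JSON or JSONL output); `label_distribution` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable

# The 16 labels from the table above.
SCENARIO_LABELS = frozenset({
    "PO_NORMAL", "PO_SERVICE", "PO_FRAMEWORK", "PO_CONSIGNMENT",
    "NON_PO_STANDARD", "NON_PO_CREDIT", "CREDIT_MEMO", "DEBIT_MEMO",
    "INVOICE_REVERSAL", "SUBSEQUENT_CREDIT", "SUBSEQUENT_DEBIT",
    "SPLIT_INVOICE", "GR_REVERSAL", "RETURN_TO_VENDOR",
    "PARTIAL_GR", "PO_LINE_CANCELLED",
})


def label_distribution(rows: Iterable[Dict[str, str]],
                       label_field: str = "SCENARIO") -> Counter:
    """Count rows per SCENARIO label; reject labels outside the known set."""
    counts = Counter(row[label_field] for row in rows)
    unknown = set(counts) - SCENARIO_LABELS
    if unknown:
        raise ValueError(f"unexpected SCENARIO labels: {sorted(unknown)}")
    return counts


# Made-up sample rows.
sample = [
    {"SCENARIO": "PO_NORMAL"},
    {"SCENARIO": "PO_NORMAL"},
    {"SCENARIO": "CREDIT_MEMO"},
]
for label, count in label_distribution(sample).most_common():
    print(label, count)
```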

Edge cases included

- Duplicate vendor names with different formats (ACME Ltd, Acme Limited, ACME LIMITED), for dedup testing
- Configurable missing fields (--missing-rate) to simulate messy source data
- Multi-currency: GBP, USD, EUR, INR, SGD, JPY, with realistic exchange rates
- Multi-language vendor names (English, German, French, Japanese, Hindi)
- Realistic document numbering per ERP convention
- Three-way match (PO → GR → Invoice) with referential integrity enforced
- Partial deliveries: GR quantity less than PO quantity
- Invoice price variance against PO price
- Credit memos, debit memos, invoice reversals
- Subsequent credits and debits referencing original invoices
- GR reversals (movement 102) and returns to vendor (movement 122)
- Service lines: two-way match only (no goods receipt)
- Blocked and deletion-flagged vendors
- Cancelled PO lines
- Realistic VAT/tax amounts per country
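
The duplicate vendor names are meant to exercise deduplication logic. As a starting point, the sketch below builds a match key by lower-casing the name, stripping punctuation, and expanding a few English legal-form abbreviations, so that the three ACME variants above collapse to one key. The abbreviation map and the function name are assumptions for illustration; a real pipeline would need a much larger map and fuzzy matching on top.

```python
import re

# Small, illustrative map of English legal-form abbreviations (not exhaustive).
LEGAL_SUFFIXES = {
    "ltd": "limited",
    "inc": "incorporated",
    "corp": "corporation",
    "co": "company",
}


def vendor_match_key(name: str) -> str:
    """Normalise a vendor name into a key for exact-match deduplication."""
    # Case-fold, replace punctuation with spaces, then split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", name.casefold()).split()
    return " ".join(LEGAL_SUFFIXES.get(token, token) for token in tokens)


names = ["ACME Ltd", "Acme Limited", "ACME LIMITED"]
keys = {vendor_match_key(name) for name in names}
print(keys)  # {'acme limited'}
```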

Output formats

| Format | Flag | Use case |
|--------|------|----------|
| CSV | --output=csv | Excel, database import, BI tools |
| JSON | --output=json | APIs, application testing |
| JSONL | --output=jsonl | AI/ML training pipelines, streaming |
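
Downstream code often needs to accept whichever format was generated. The sketch below parses all three into the same list-of-dictionaries shape. It assumes that CSV output starts with a header row, that JSON output is a single array of objects, and that JSONL has one object per line; check these against real output before relying on them. The sample values are made up, and `LIFNR` and `NAME1` are the standard SAP vendor number and name fields.

```python
import csv
import io
import json
from typing import Dict, List


def parse_rows(text: str, fmt: str) -> List[Dict[str, str]]:
    """Parse generator output in csv, json or jsonl form into a list of dicts."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        return json.loads(text)
    if fmt == "jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unknown format: {fmt!r}")


csv_text = "LIFNR,NAME1\n0000100001,ACME Ltd\n"
json_text = '[{"LIFNR": "0000100001", "NAME1": "ACME Ltd"}]'
jsonl_text = '{"LIFNR": "0000100001", "NAME1": "ACME Ltd"}\n'

for fmt, text in [("csv", csv_text), ("json", json_text), ("jsonl", jsonl_text)]:
    print(fmt, parse_rows(text, fmt))
```

Note that CSV parsing yields every value as a string, while JSON and JSONL keep whatever types the generator wrote.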

SQL and Parquet formats are on the roadmap.

Roadmap

- [x] SAP ECC: 22 linked tables
- [x] JDE E1: 6 linked tables
- [x] D365 F&O: 7 linked tables
- [x] full-p2p scenario: all ERPs
- [x] spend-cube scenario: SAP ECC with 16 spend scenario labels
- [x] JSONL output for AI training pipelines
- [x] Reproducible output with --seed
- [ ] spend-cube scenario for JDE and D365
- [ ] SQL and Parquet output formats
- [ ] Oracle Fusion Procurement
- [ ] Coupa supplier and PO entities
- [ ] Web UI for no-code data generation

Who is this for

- Developers building ERP integrations who need realistic test data without touching production
- Data engineers building procurement analytics pipelines
- AI/ML teams training models on procurement data: classification, extraction, dedup
- Consultants demoing ERP tools without production data

Contributing

PRs welcome — especially for Oracle Fusion, Coupa, and Ariba schemas. See CONTRIBUTING.md for guidelines.

Author

Built by Kundan Sharma — IT & Digital Solution Architect specialising in procurement data transformation and agentic AI in enterprise supply chains.

15+ years designing and delivering digital transformation programmes across enterprises.

GitHub

If this saved you time, leave a star.

License

MIT — see LICENSE for details.
