@akhyar11/ml-v1
v2.3.0
TypeScript + Rust Native machine learning library. Matrix ops, layers (Dense, Embedding, RNN, LSTM, GRU, MultiHeadAttention, etc.), models (Sequential, Transformers), and a BPE tokenizer.
ML-V1
A TypeScript + Rust Native machine learning library — Matrix operations, neural network layers, Transformer models, and a BPE tokenizer, all in one package.
What is ML-V1?
ML-V1 is a low-to-mid-level machine learning library built with TypeScript and accelerated by a Rust native backend (via napi-rs). It gives you full control over every detail of the training loop — shapes, parameter updates, and custom architectures — without depending on a large ML framework.
Why ML-V1?
- Full manual control over training loops, tensor shapes, and parameter updates.
- A research playground for custom model architectures.
- The productivity of TypeScript combined with Rust performance on hot paths.
- Graceful fallback to pure JavaScript when the native backend is unavailable.
Features
- `Matrix` — flat `Float32Array`-backed tensor with zero-copy hot-path access via `_data`.
- Math primitives — `dotProduct`, `add`, `sub`, `sumAxis`, `clipGradients`, and more; automatically dispatched to Rust or JS.
- Layers — `Dense`, `Embedding`, `RNN`, `LSTM`, `GRU`, `SelfAttention`, `MultiHeadAttention`, `LayerNormalization`, `Dropout`, `PositionalEncoding`, `Flatten`, `Convolution`.
- Models — `Sequential`, `Transformers` (causal LM), `DimentionalityReduction`.
- BPE Tokenizer — train, incremental update, Unicode-aware pre-tokenization, encode/decode with special tokens, padding, and JSON save/load.
- Rust-accelerated ops — dot-product, activations, LayerNorm, embedding lookup, attention, and optimizer updates; auto-fallback to JS when unavailable.
- Dynamic padding trim (`trimPadding`) — reduces effective sequence length per batch, cutting attention cost from O(seqLen²) to O(effectiveSeqLen²).
Installation
```bash
npm install @akhyar11/ml-v1
```
Prerequisites for Native Acceleration
The library works out of the box with a pure JavaScript fallback. For up to 10× faster matrix operations, install the Rust toolchain so the native addon can be compiled automatically on npm install:
- Rust Toolchain — install via rustup.rs.
- C/C++ Build Tools — required by the native binding compiler (e.g. `build-essential` on Linux, Xcode CLI on macOS, MSVC on Windows).
Note: If Rust is not installed, a warning is printed and the library falls back to pure JavaScript automatically. Performance will be noticeably slower for large models.
Building from Source
If you cloned the repository or need a manual build:
```bash
# Install dependencies
npm install

# Build the native Rust addon (release mode)
npm run build:rust

# Build the TypeScript distribution
npm run build:publish
```
Rust Native Backend
The native backend is loaded by `src/math/rust_backend.ts`. You can check whether it is active at runtime:
```ts
import { isNativeAvailable } from "@akhyar11/ml-v1";

console.log("Native active:", isNativeAvailable());
```
To force JavaScript-only execution (useful for debugging or regression comparisons):
```bash
ML_DISABLE_NATIVE=1 node your-script.js
```
Quick Start
Train a simple XOR classifier in a few lines:
```ts
import { Dense, mj, Sequential } from "@akhyar11/ml-v1";

const model = new Sequential({
  layers: [
    new Dense({ units: 2, outputUnits: 4, activation: "relu", status: "input" }),
    new Dense({ units: 4, outputUnits: 1, activation: "sigmoid", status: "output", loss: "mse" }),
  ],
});
model.compile({ alpha: 0.01, optimizer: "adam", error: "mse" });

const X = [mj.matrix([[0], [0]]), mj.matrix([[0], [1]]), mj.matrix([[1], [0]]), mj.matrix([[1], [1]])];
const Y = [mj.matrix([[0]]), mj.matrix([[1]]), mj.matrix([[1]]), mj.matrix([[0]])];

const result = model.fit(X, Y, 200, {
  batchSize: 4,
  validationSplit: 0.25,
  earlyStoppingPatience: 10,
  verbose: true,
  onEpochEnd: (epoch, loss, valLoss) => {
    console.log(`epoch=${epoch} loss=${loss} valLoss=${valLoss}`);
  },
});
console.log("best", result.bestEpoch, result.bestLoss);

const pred = model.predict(mj.matrix([[1], [0]]));
pred.print();
```
A legacy callback overload is also supported for backward compatibility:
```ts
model.fit(X, Y, 200, (loss) => console.log("loss", loss));
```
Examples
Matrix & Math Operations
```ts
import { mj } from "@akhyar11/ml-v1";

const a = mj.matrix([[1, 2], [3, 4]]);
const b = mj.matrix([[5, 6], [7, 8]]);
const c = mj.dotProduct(a, b);
const d = mj.add(c, 1);
console.log(c._shape, d._shape);
```
BPE Tokenizer
```ts
import { BPETokenizer } from "@akhyar11/ml-v1";

const tokenizer = new BPETokenizer({ vocabSize: 120, minFrequency: 2 });
tokenizer.train(["hello world", "hello there"]);

const ids = tokenizer.encodeWithSpecial("hello world");
const padded = tokenizer.padSequence(ids, 12);
console.log(ids, padded, tokenizer.decode(ids));
```
Unicode and Multilingual Tokenization
ML-V1 supports custom and built-in pre-tokenizers for non-Latin text. The default is still `char` for backward compatibility; use `unicode-grapheme` or `script-aware` for multilingual corpora.
Supported modes: `char`, `unicode-grapheme`, `unicode-word`, `whitespace`, `script-aware`.
```ts
import { BPETokenizer } from "@akhyar11/ml-v1";

const tokenizer = new BPETokenizer({
  vocabSize: 1000,
  preTokenizer: "script-aware"
});

tokenizer.train([
  "hello world",
  "مرحبا بالعالم",
  "こんにちは世界",
  "你好世界",
  "ภาษาไทย",
  "한국어테스트",
  "ꦱꦺꦴꦥꦺꦴ",
  "x² + y² = z²",
  "hello ꦱꦺꦴꦥꦺꦴ 😊 你好"
]);
```
BPE alone is not enough for every writing system. Pre-tokenization is important for scripts without spaces, combining marks, emoji sequences, and mixed text. `script-aware` is a general built-in mode; for language-specific behavior, pass a custom `(text: string) => string[]` pre-tokenizer. `Intl.Segmenter` improves grapheme and word segmentation when the runtime supports it. Fallback behavior is deterministic but may be less linguistically accurate.
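As a sketch of that custom path, here is a hypothetical word-level pre-tokenizer built on the standard `Intl.Segmenter` API (a JavaScript built-in, not an ML-V1 API); it satisfies the `(text: string) => string[]` signature documented in the Tokenizer section below:
```ts
import { BPETokenizer } from "@akhyar11/ml-v1";

// Hypothetical word-level pre-tokenizer using Intl.Segmenter
// (available in Node.js 16+); matches (text: string) => string[].
const segmenter = new Intl.Segmenter("th", { granularity: "word" });
const wordPreTokenizer = (text: string): string[] =>
  Array.from(segmenter.segment(text), (seg) => seg.segment);

const tokenizer = new BPETokenizer({
  vocabSize: 1000,
  preTokenizer: wordPreTokenizer,
});
tokenizer.train(["ภาษาไทยไม่มีช่องว่างระหว่างคำ"]);
```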
Transformer Causal LM — Training
```ts
import { mj, Transformers } from "@akhyar11/ml-v1";

const model = new Transformers({ units: 64, seqLen: 8, vocabSize: 500, heads: 8, alpha: 0.001, padTokenId: 0 });
model.compile({ alpha: 0.001, optimizer: "adam", error: "softmaxCrossEntropy" });
model.train();

const x = mj.matrix([[0], [0], [10], [20], [30], [40], [50], [60]]); // shape [seqLen, 1]
const y = mj.matrix([[0], [10], [20], [30], [40], [50], [60], [0]]); // shifted targets [seqLen, 1]

const logits = model.forward(x); // shape [vocabSize, seqLen * batch]
model.backward(y);
console.log("shape", logits._shape, "loss", model.loss);
```
Transformer — Generation / Inference
```ts
import { mj, Transformers } from "@akhyar11/ml-v1";

const model = new Transformers({
  units: 64,
  seqLen: 8,
  vocabSize: 500,
  heads: 8,
  alpha: 0.001,
  padTokenId: 0,
  predictMode: "next-token",
});
model.eval();

const x = mj.matrix([[0], [0], [10], [20], [30], [40], [50], [60]]);
const nextTokenLogits = model.predict(x); // shape [vocabSize, batch]

model.setPredictMode("full-sequence");
const fullSequenceLogits = model.predict(x); // shape [vocabSize, seqLen * batch]
```
API Overview
Models
| Model | Description |
|---|---|
| Sequential | Generic layer stack (Dense, Embedding, Attention, CNN, etc.). |
| Transformers | Multi-block causal language model. Supports numBlocks >= 1, full-sequence training, and configurable predictMode ("next-token" / "full-sequence"). |
| DimentionalityReduction | Extends Sequential with an encoder/decoder split via the outputReduction layer status. |
Layers
| Layer | Description |
|---|---|
| Dense | Fully-connected layer with activation, optimizer, and loss handling. |
| Embedding | Token-ID-to-vector lookup with resize() support. |
| LayerNormalization | Per-column/token normalization. |
| Dropout | Active only during training mode. |
| PositionalEncoding | Fixed sinusoidal positional encoding. |
| MultiHeadAttention / SelfAttention | Causal attention mask with padding support. |
| RNN / LSTM / GRU | Recurrent sequence modeling with BPTT, gradient clipping, save/load, and stateful mode. returnSequences is supported; returnState is not yet supported and will throw explicitly. See the input-shape sketch after this table. |
| Flatten / Convolution | Standard CNN building blocks. |
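As a rough illustration of the recurrent input convention (a single `[features, seqLen]` sample, batched with `batchSize: 1` per Core Concepts below), here is a sketch. The exact `LSTM` constructor options are an assumption modeled on the `Dense` options shown in Quick Start; consult the Full API Reference for the real signature:
```ts
import { LSTM, mj, Sequential } from "@akhyar11/ml-v1";

// ASSUMPTION: option names mirror Dense; verify against the API reference.
const model = new Sequential({
  layers: [
    new LSTM({ units: 3, outputUnits: 4, status: "input", returnSequences: false }),
  ],
});

// One sample with shape [features, seqLen]: 3 features over 5 time steps.
// Sequential.fit() does not batch recurrent sequences yet, so use batchSize: 1.
const x = mj.matrix([
  [0.1, 0.2, 0.3, 0.4, 0.5],
  [0.0, 0.1, 0.0, 0.1, 0.0],
  [1.0, 0.9, 0.8, 0.7, 0.6],
]);
```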
Tokenizer
BPETokenizer supports:
| Method | Description |
|---|---|
| train(corpus) | Initial BPE training on a string array. |
| update(corpus) | Incremental vocabulary update without retraining from scratch (see the sketch after this table). |
| encode(text) / encodeWithSpecial(text) | Encode text to token IDs, with or without special tokens. |
| decode(ids) | Convert token IDs back to text. |
| padSequence(ids, length) | Pad or truncate a sequence to a fixed length. |
| save(path) / load(path) | Persist and restore the tokenizer as a JSON file. |
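A short sketch of incremental vocabulary growth with `update()`, using only the methods listed above; the corpus strings are illustrative:
```ts
import { BPETokenizer } from "@akhyar11/ml-v1";

const tokenizer = new BPETokenizer({ vocabSize: 300, minFrequency: 2 });
tokenizer.train(["hello world", "hello there"]); // initial BPE training

// Later, fold new text into the existing vocabulary without retraining.
tokenizer.update(["goodbye world", "goodbye there"]);
console.log(tokenizer.decode(tokenizer.encode("goodbye world")));
```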
Tokenizer options:
```ts
type PreTokenizer = (text: string) => string[];

type BuiltInPreTokenizer =
  | "char"
  | "unicode-grapheme"
  | "unicode-word"
  | "whitespace"
  | "script-aware";

type BPETokenizerOptions = {
  vocabSize?: number;
  minFrequency?: number;
  preTokenizer?: BuiltInPreTokenizer | PreTokenizer;
};
```
Built-in pre-tokenizer names are saved in tokenizer JSON files. Custom pre-tokenizer functions are not serialized; saved metadata records `"custom"`, and the same function must be passed again to `BPETokenizer.load(path, { preTokenizer })`.
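A minimal sketch of that round trip, assuming a file path of your choosing; the `BPETokenizer.load` call follows the form shown in the preceding paragraph, and the whitespace pre-tokenizer is a hypothetical example:
```ts
import { BPETokenizer } from "@akhyar11/ml-v1";

// Hypothetical custom pre-tokenizer: simple whitespace split.
const myPreTokenizer = (text: string): string[] => text.split(/\s+/).filter(Boolean);

const tokenizer = new BPETokenizer({ vocabSize: 500, preTokenizer: myPreTokenizer });
tokenizer.train(["hello world"]);
tokenizer.save("tokenizer.json"); // metadata records "custom" for the pre-tokenizer

// The function itself is not serialized, so pass it again on load.
const restored = BPETokenizer.load("tokenizer.json", { preTokenizer: myPreTokenizer });
```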
Core Concepts
- Shape convention: most layers use `[rows, cols]`; batched Transformer inputs use column-sequence layout `[seqLen, batchSize]`.
- Recurrent convention: recurrent layers expect a single sequence sample with shape `[features, seqLen]`. The generic `Sequential.fit()` does not batch recurrent sequences yet — use `batchSize: 1`.
- Sparse classification targets: use `softmaxCrossEntropy` with a dense output layer and a target of shape `[1, batch]` containing class indices (see the sketch below).
- Training / eval mode: call `model.train()` before training and `model.eval()` before inference. Layers like `Dropout` respect this flag.
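A minimal sketch of the sparse-target convention; the layer sizes are illustrative, and the `"softmax"` activation name is an assumption (the examples above only show `"relu"` and `"sigmoid"`):
```ts
import { Dense, mj, Sequential } from "@akhyar11/ml-v1";

// 3-class classifier head trained with sparse targets.
const model = new Sequential({
  layers: [
    new Dense({ units: 4, outputUnits: 8, activation: "relu", status: "input" }),
    // ASSUMPTION: "softmax" is a valid activation name here.
    new Dense({ units: 8, outputUnits: 3, activation: "softmax", status: "output", loss: "softmaxCrossEntropy" }),
  ],
});
model.compile({ alpha: 0.01, optimizer: "adam", error: "softmaxCrossEntropy" });

// Target shape [1, batch]: one class index per sample (here, batch = 3).
const y = mj.matrix([[2, 0, 1]]);
```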
Training Workflow
- Prepare data as `Matrix` inputs and targets.
- Build your model and add layers.
- Call `model.compile({ alpha, optimizer, error })`.
- Run `model.fit()` (high-level) or loop `forward()` → `backward()` manually (see the sketch below).
- Save the model and tokenizer with `save()`.
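A sketch of the manual alternative to `model.fit()`, reusing the `forward()`/`backward()` calls from the Transformer training example above and assuming, as in that example, that `backward()` applies the optimizer configured in `compile()`:
```ts
import { mj, Transformers } from "@akhyar11/ml-v1";

const model = new Transformers({ units: 64, seqLen: 8, vocabSize: 500, heads: 8, alpha: 0.001, padTokenId: 0 });
model.compile({ alpha: 0.001, optimizer: "adam", error: "softmaxCrossEntropy" });
model.train();

// One (input, shifted-target) pair, as in the Transformer training example.
const X = [mj.matrix([[0], [0], [10], [20], [30], [40], [50], [60]])];
const Y = [mj.matrix([[0], [10], [20], [30], [40], [50], [60], [0]])];

for (let epoch = 0; epoch < 50; epoch++) {
  let epochLoss = 0;
  for (let i = 0; i < X.length; i++) {
    model.forward(X[i]);  // compute logits for the sample
    model.backward(Y[i]); // backprop; updates use the compiled optimizer
    epochLoss += model.loss;
  }
  console.log(`epoch=${epoch} meanLoss=${epochLoss / X.length}`);
}
```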
Inference Workflow
- Load the model and tokenizer.
- Convert input text to token IDs and pad to `seqLen`.
- Call `model.predict()` (respects `predictMode` for `Transformers`) or `model.forward()`.
- Extract the argmax or raw logits as required by your task.
- Decode token IDs back to text for NLP tasks (see the end-to-end sketch below).
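A minimal end-to-end sketch of these steps, combining the tokenizer and Transformer APIs shown above. The argmax loop reads `_data` directly and assumes a single sequence (batch = 1), so the `[vocabSize, 1]` logits flatten to one column:
```ts
import { BPETokenizer, mj, Transformers } from "@akhyar11/ml-v1";

// 1. Build (or load) the tokenizer and model.
const tokenizer = new BPETokenizer({ vocabSize: 500 });
tokenizer.train(["hello world", "hello there"]);
const model = new Transformers({
  units: 64, seqLen: 8, vocabSize: 500, heads: 8, alpha: 0.001,
  padTokenId: 0, predictMode: "next-token",
});
model.eval();

// 2. Text -> token IDs, padded to seqLen, in column-sequence layout [seqLen, 1].
const ids = tokenizer.padSequence(tokenizer.encodeWithSpecial("hello"), 8);
const x = mj.matrix(ids.map((id) => [id]));

// 3. Predict, then take the argmax over the [vocabSize, 1] logits via _data.
const logits = model.predict(x);
let best = 0;
for (let i = 1; i < logits._data.length; i++) {
  if (logits._data[i] > logits._data[best]) best = i;
}

// 4. Decode the winning token ID back to text.
console.log("next token:", tokenizer.decode([best]));
```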
Performance Notes
- The Rust backend accelerates dot-product, activations, LayerNorm, embedding lookup, attention, and optimizer hot paths.
- `Matrix` uses `Float32Array` to minimize allocation overhead. Use `_data` directly in hot paths (see the sketch below).
- Several layers use pre-allocated output buffers to reduce garbage collection pressure.
- Dynamic padding trim (`trimPadding: true`, the default) reduces `effectiveSeqLen` per batch, cutting attention cost from O(seqLen²) to O(effectiveSeqLen²) and output projection cost from `vocabSize × seqLen × batch` to `vocabSize × effectiveSeqLen × batch`.
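A small illustration of the `_data` hot-path access pattern, assuming `_data` is the flat `Float32Array` described in Features:
```ts
import { mj } from "@akhyar11/ml-v1";

const m = mj.matrix([[1, 2], [3, 4]]);

// Zero-copy read over the flat Float32Array backing store; no per-element
// accessor calls and no intermediate allocations.
let sum = 0;
const data = m._data;
for (let i = 0; i < data.length; i++) sum += data[i];
console.log(sum); // 10
```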
Dynamic Padding Trim (v2.2.0+)
When training a Transformer on long-context sequences (e.g. `seqLen = 1024`), enable `trimPadding` to avoid paying the full quadratic attention cost on padding tokens:
```ts
import { Transformers } from "@akhyar11/ml-v1";

const model = new Transformers({
  units: 64,
  seqLen: 1024,
  vocabSize: 5000,
  heads: 8,
  numBlocks: 2,
  padTokenId: 0
});

// Right-padding (recommended for new datasets)
model.fit(trainX, trainY, 80, {
  batchSize: 8,
  trimPadding: true,
  paddingSide: "right",
  shuffle: true
});

// Left-padding (for datasets already padded on the left)
model.fit(trainX, trainY, 80, {
  batchSize: 8,
  trimPadding: true,
  paddingSide: "left",
  shuffle: true
});
```
Options:
- `trimPadding: true` (default) — enabled automatically.
- `paddingSide: "right"` (default) — trailing PAD tokens are trimmed; `positionOffset` is 0.
- `paddingSide: "left"` — leading PAD tokens are trimmed; `positionOffset` is adjusted so that positional encodings for real tokens remain unchanged.
- `trimPadding: false` — disables the feature entirely.
- Only applies to full-sequence targets with shape `[seqLen, batch]`. Legacy targets with shape `[1, batch]` are not trimmed.
Best Practices
- Use `softmaxCrossEntropy` for sparse token classification tasks.
- Keep `seqLen` consistent between your preprocessing pipeline and the model constructor.
- Set `padTokenId` in both the tokenizer and the model's `Embedding` layer.
- For `Transformers`, prepare shifted next-token targets with shape `[seqLen, batch]` and fill invalid positions with `padTokenId` (see the sketch after this list).
- Call `model.train()` before training and `model.eval()` before inference.
- For Transformer inference, use `model.predict()` as the primary entry point and set `predictMode` to `"next-token"` or `"full-sequence"` as needed.
- For stateful recurrent models, avoid `shuffle: true` and `validationSplit > 0` in the generic `Sequential.fit()` loop.
- Start debugging with `ML_DISABLE_NATIVE=1` when comparing JS vs. native behavior.
- If loss does not decrease, verify tensor shapes at every layer boundary.
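A sketch of building shifted next-token targets for one sequence, mirroring the x/y pair in the Transformer training example above (y[t] = x[t + 1], with the final position filled with `padTokenId`); `makeShiftedPair` is a hypothetical helper:
```ts
import { mj } from "@akhyar11/ml-v1";

// Hypothetical helper: build a [seqLen, 1] input/target pair from raw token IDs.
function makeShiftedPair(ids: number[], padTokenId: number) {
  const x = mj.matrix(ids.map((id) => [id]));
  const shifted = [...ids.slice(1), padTokenId]; // y[t] = x[t + 1], last = pad
  const y = mj.matrix(shifted.map((id) => [id]));
  return { x, y };
}

const { x, y } = makeShiftedPair([0, 0, 10, 20, 30, 40, 50, 60], 0);
// x rows: 0,0,10,20,30,40,50,60   y rows: 0,10,20,30,40,50,60,0
```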
Troubleshooting
| Problem | Solution |
|---|---|
| Native backend not available | Run npm run build:rust, or verify that the .node binary matches your current platform. |
| Shape mismatch in dot product | Check that dimensions satisfy [aRows × aCols] · [bRows × bCols] where aCols === bRows. |
| Loss is NaN or Inf | Reduce the learning rate alpha, verify target format, and check for out-of-range token IDs in the embedding (see the check after this table). |
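For the NaN/Inf case, a quick pre-flight check for out-of-range token IDs before they reach the `Embedding` layer; `assertIdsInRange` is a hypothetical helper, and `vocabSize` is whatever you passed to the model:
```ts
// Hypothetical check: every token ID must index a valid embedding row.
function assertIdsInRange(ids: number[], vocabSize: number): void {
  for (const id of ids) {
    if (!Number.isInteger(id) || id < 0 || id >= vocabSize) {
      throw new Error(`token id ${id} is outside [0, ${vocabSize})`);
    }
  }
}

assertIdsInRange([0, 10, 20], 500); // ok
```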
Project Structure
```
src/
  activation/  cost/  optimizer/
  matrix/  math/
  layers/  models/
  tokenizer/
  utils/
src-rust/
  src/lib.rs   ← Rust native ops (napi-rs)
test/
dataset/
docs/
```
Architecture Overview
| Module | Role |
|---|---|
| src/matrix | Core Matrix data structure (Float32Array-backed). |
| src/math | Numeric primitives + adaptive Rust/JS dispatch. |
| src/activation, src/cost, src/optimizer | Training building blocks. |
| src/layers | Neural network layer implementations. |
| src/models | High-level model compositions. |
| src/tokenizer | Text preprocessing (BPE). |
| src-rust | Native ops compiled via napi-rs. |
Benchmark & Testing
- Full test + benchmark entry point: `test/index.ts` — run with `npm test`.
- Correctness suite: `test/correctness/index.ts`.
- Synthetic benchmark suite: `test/benchmark/index.ts`.
- Recurrent model benchmarks: `test/benchmark/testFamilyRnn.test.ts`.
- Transformer mode benchmarks: `test/benchmark/testFamilyTransformers.test.ts`.
- Benchmark history: `docs/benchmark-sintetis/README.md`.
- Correctness history: `docs/correctness/README.md`.
📖 Documentation
For in-depth guides, see the official documentation:
- Overview & Philosophy — Introduction to the library design and system architecture.
- Installation & Setup — How to install and enable Rust native acceleration.
- Practical Tutorial — Step-by-step guide to building a logic bot and a generative (GPT-style) bot.
- Full API Reference — Technical documentation for Matrix, Math, Layers, Tokenizer, Optimizers, and related APIs.
Versioning
This project follows MAJOR.MINOR.PATCH semantic versioning. The current version is 2.2.8.
- MAJOR — breaking changes or major architectural shifts.
- MINOR — new backward-compatible features or improvements.
- PATCH — bug fixes, small optimizations, or minor internal changes.
Recent changelog:
| Version | Summary |
|---|---|
| 2.2.8 | Full Native Optimizer support (Adam, SGD, AdaGrad, Momentum, NAG) and Sparse Embedding native backend. |
| 2.2.7 | Unicode-aware BPE pre-tokenizers and multilingual tokenizer documentation. |
| 2.2.5 | Hot-path optimizations for training/validation, embedding lookup, and BPE tokenizer. |
| 2.2.4 | Transformers.predictMode API ergonomics, docs sync, and correctness suite refactor. |
| 2.2.3 | Training/inference hot-path optimizations and updated correctness learning snapshots. |
| 2.2.2 | Combined root suite, family model benchmarks, and correctness learning snapshots. |
| 2.2.0 | Dynamic padding trim + positional encoding offset. |
| 2.0.2 | Transformer projector optimizations with no API changes. |
Development & Contributing
```bash
npm install          # install dependencies
npm run build:rust   # compile the Rust native addon
npm test             # run the correctness suite + synthetic benchmark
```
Type-check only (no emit):
```bash
npx tsc --noEmit
```
Roadmap
- Stabilize public API entry points (currently imported directly from `src/*`).
- Add deterministic floating-point tests.
- Clean up scripts that reference non-existent project folders.
- Add dataset recipe documentation and benchmark workflow guides.
License & Credits
- License: ISC — see `package.json`.
- Native backend: `napi-rs`, `matrixmultiply`, `rayon`.
- Issues & feature requests: use the GitHub issue tracker.
- Support the project: Saweria.