@zakkster/lite-batch-buffer
v1.0.0

Pre-allocated, zero-GC interleaved vertex buffer for WebGL 1/2 and WebGPU sprite/tile/quad batchers.
One allocation for the lifetime of the renderer. No per-frame new Float32Array. No {x, y, u, v} object graphs. No garbage-collection pauses in your draw loop.
```js
import { BatchBuffer } from '@zakkster/lite-batch-buffer';

const vb = new BatchBuffer({
  maxVertices: 40_000,
  layout: [
    { name: 'pos',   type: 'f32', size: 2 },
    { name: 'uv',    type: 'f32', size: 2 },
    { name: 'color', type: 'u32', size: 1 }, // packed RGBA
  ],
});

// Hoist views + offsets ONCE, use them for the lifetime of the renderer.
const f32 = vb.f32, u32 = vb.u32;
const s = vb.strideF32;
const P = vb.offsetF32('pos'), U = vb.offsetF32('uv'), C = vb.offsetU32('color');

// Per frame:
vb.reset();
vb.ensureCapacity(quadCount * 6);
let n = vb.count; // hoist to a local for the hot loop
for (let i = 0; i < quadCount; i++) {
  const o = n * s;
  f32[o + P] = x; f32[o + P + 1] = y;
  f32[o + U] = u; f32[o + U + 1] = v;
  u32[o + C] = BatchBuffer.packRGBA(255, 255, 255, 255);
  n++;
  // ... 5 more verts for the quad ...
}
vb.count = n;
gl.bufferSubData(gl.ARRAY_BUFFER, 0, vb.u8, 0, vb.byteLength);
```

Contents
- Why · Install · Quick start
- How it works
- Case study: a Tiled tilemap renderer
- API reference
- Benchmarks
- Testing (for clients & QA)
- Running the demo
- Browser & engine compatibility
- Edge cases & guarantees
- FAQ · License
Why
JavaScript graphics code has a distinctive failure mode: per-frame allocation. It looks like this, and it's what you write first:
```js
// The code you write first, and regret later
function drawFrame() {
  const verts = [];
  for (const tile of visibleTiles) {
    verts.push({ x: tile.x, y: tile.y, u: tile.u, v: tile.v, color: tile.color });
    // ... 5 more per quad
  }
  const flat = new Float32Array(verts.length * 5);
  for (let i = 0; i < verts.length; i++) { /* flatten */ }
  gl.bufferData(gl.ARRAY_BUFFER, flat, gl.STREAM_DRAW);
}
```

Each frame this produces tens of thousands of short-lived objects and a fresh ArrayBuffer. The math is cheap — the allocation is what destroys your frame budget. Major GC pauses turn smooth 60 fps into periodic 30 ms stutters.
```mermaid
flowchart LR
  subgraph N["Naive path"]
    direction TB
    N1[per-frame allocation<br/>new Array / new TypedArray]
    N2[populate objects or slots]
    N3[flatten to TypedArray]
    N4[upload]
    N5[objects + buffer<br/>become garbage]
    N1 --> N2 --> N3 --> N4 --> N5 -.->|GC pressure<br/>frame stalls| N1
  end
  subgraph B["BatchBuffer path"]
    direction TB
    B0[one allocation<br/>at renderer init]
    B1[reset count = 0]
    B2[write into views<br/>indexed stores only]
    B3[upload vb.u8<br/>zero alloc]
    B0 -.->|reused forever| B1
    B1 --> B2 --> B3 -.->|no garbage| B1
  end
```

@zakkster/lite-batch-buffer owns the pre-allocated buffer and exposes every typed-array view over it (f32, u32, u16, u8, …) so the vertex-emit code stays a plain indexed-store loop. Nothing fancy. That's the point.
What this is not
- Not a renderer. It doesn't know about WebGL contexts, shaders, or draw calls.
- Not a scene graph. No sprites, no transforms, no culling.
- Not magic. A hand-rolled Float32Array you manage yourself is ~2× faster (see benchmarks). This library trades that for layout hygiene + endian-correct color packing + capacity management in ~120 lines of code.
Install
```sh
npm i @zakkster/lite-batch-buffer
```

ESM-only. No dependencies. Ships TypeScript definitions alongside the source.

```js
import { BatchBuffer } from '@zakkster/lite-batch-buffer';
// or: import BatchBuffer from '@zakkster/lite-batch-buffer';
```

You can also drop src/index.js into your project directly — it's one file.
Quick start
```js
const vb = new BatchBuffer({
  maxVertices: 4096,
  layout: [
    { name: 'pos',   type: 'f32', size: 2 }, // vec2 position
    { name: 'uv',    type: 'f32', size: 2 }, // vec2 UV
    { name: 'color', type: 'u32', size: 1 }, // packed RGBA, 1 u32
  ],
});

// WebGL 2 VAO setup (done once):
const vbo = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, vbo);
gl.bufferData(gl.ARRAY_BUFFER, vb.arrayBuffer.byteLength, gl.DYNAMIC_DRAW);
gl.enableVertexAttribArray(0);
gl.vertexAttribPointer(0, 2, gl.FLOAT, false, vb.stride, 0);
gl.enableVertexAttribArray(1);
gl.vertexAttribPointer(1, 2, gl.FLOAT, false, vb.stride, 8);
gl.enableVertexAttribArray(2);
gl.vertexAttribPointer(2, 4, gl.UNSIGNED_BYTE, true, vb.stride, 16); // normalized

// Per frame:
function renderFrame(sprites) {
  vb.reset();
  vb.ensureCapacity(sprites.length * 6);
  const f32 = vb.f32, u32 = vb.u32;
  const s = vb.strideF32;
  const P = vb.offsetF32('pos'), U = vb.offsetF32('uv'), C = vb.offsetU32('color');
  let n = vb.count;
  for (const spr of sprites) {
    /* write 6 verts, n += 6 */
  }
  vb.count = n;
  gl.bindBuffer(gl.ARRAY_BUFFER, vbo);
  gl.bufferSubData(gl.ARRAY_BUFFER, 0, vb.u8, 0, vb.byteLength);
  gl.drawArrays(gl.TRIANGLES, 0, vb.count);
}
```

How it works
Memory layout
BatchBuffer allocates a single ArrayBuffer sized to stride × maxVertices (rounded up to a multiple of 8 bytes so every typed-array view can alias it safely). Each attribute is aligned to its own element size, and the total stride is padded to the layout's maximum alignment, so strideF32 = stride / 4 is always an exact integer.
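The alignment arithmetic is small enough to sketch inline. This is an illustration of the rule just described, not the library's source — the BYTES table and variable names are assumptions:

```javascript
// Sketch: how offsets and stride fall out of the example layout.
const BYTES = { f32: 4, i32: 4, u32: 4, i16: 2, u16: 2, i8: 1, u8: 1 };
const layout = [
  { name: 'pos',   type: 'f32', size: 2 },
  { name: 'uv',    type: 'f32', size: 2 },
  { name: 'color', type: 'u32', size: 1 },
];

let offset = 0, maxAlign = 1;
const offsets = {};
for (const a of layout) {
  const elem = BYTES[a.type];
  offset = Math.ceil(offset / elem) * elem; // align attribute to its element size
  offsets[a.name] = offset;
  offset += elem * a.size;
  maxAlign = Math.max(maxAlign, elem);
}
const stride = Math.ceil(offset / maxAlign) * maxAlign; // pad to max alignment

console.log(offsets, stride); // { pos: 0, uv: 8, color: 16 } 20
```

With this layout every attribute is naturally aligned and the 20-byte stride divides evenly by 4, which is why `count * strideF32` indexing is always exact.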
```mermaid
flowchart TB
  subgraph BB["ArrayBuffer — one allocation, reused forever"]
    direction LR
    V0["Vertex 0<br/>pos.x | pos.y | uv.x | uv.y | color<br/>4 + 4 + 4 + 4 + 4 = 20 B"]
    V1["Vertex 1<br/>20 B"]
    V2["Vertex 2"]
    V3["Vertex N−1"]
    V0 --- V1 --- V2 --- V3
  end
  F32["vb.f32 : Float32Array<br/>reads pos + uv"] -.-> BB
  U32["vb.u32 : Uint32Array<br/>reads color"] -.-> BB
  U8["vb.u8 : Uint8Array<br/>used for upload"] -.-> BB
```

All typed-array views point at the same backing buffer. A write through vb.f32[0] is immediately visible through vb.u8[0..3]. This is why the hot loop can write pos as two floats and color as a packed u32 — they're interleaved in memory exactly as the GPU expects.
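That aliasing guarantee needs no library to verify — it's how plain typed arrays over one buffer behave:

```javascript
// One buffer, two views, one write — the bytes are shared.
const buf = new ArrayBuffer(8);
const f32 = new Float32Array(buf);
const u8 = new Uint8Array(buf);

f32[0] = 1.0; // IEEE-754 single precision: bit pattern 0x3f800000
// On a little-endian host the low byte is stored first:
console.log(u8[0], u8[1], u8[2], u8[3]); // 0 0 128 63
```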
The canonical hot loop
```mermaid
sequenceDiagram
  participant App
  participant VB as BatchBuffer
  participant GL as GL / GPU
  Note over App,GL: Renderer init (once)
  App->>VB: new BatchBuffer({ maxVertices, layout })
  VB-->>App: hoist f32, u32, stride, offsets as locals
  App->>GL: bufferData(null, DYNAMIC_DRAW)
  loop Every frame
    App->>VB: reset()
    App->>VB: ensureCapacity(n)
    Note over App: let c = vb.count
    loop Per vertex
      App->>VB: f32[c*s + P] = x, ...
      App->>VB: u32[c*s + C] = rgba
      Note over App: c++
    end
    App->>VB: vb.count = c
    App->>GL: bufferSubData(0, vb.u8, 0, vb.byteLength)
    App->>GL: drawArrays(..., vb.count)
  end
```

Why hoist vb.count to a local?
The inline pattern (vb.count++ inside the loop) is 20–30% slower than hoisting the counter into a let:
| Pattern | 40k verts/frame | Notes |
|---|---|---|
| vb.count++ inline | ~0.21 ms | simpler, perfectly fine for small loops |
| Hoisted let n = vb.count | ~0.16 ms | recommended for tight inner loops |
The JITs are good at optimising indexed typed-array access but less good at property access through this. Moving count to a local lets the register allocator do its job.
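Concretely, the "write 6 verts" elided in the snippets above expands to something like the sketch below. The stride and offsets are hard-coded stand-ins for the hoisted vb locals of the 20-byte example layout, and emitQuad and its corner order are illustrative, not library API:

```javascript
// Stand-ins for the hoisted locals: stride 20 B = 5 f32 slots;
// pos at f32 slot 0, uv at slot 2, color at u32 slot 4.
const buf = new ArrayBuffer(6 * 20);
const f32 = new Float32Array(buf), u32 = new Uint32Array(buf);
const s = 5, P = 0, U = 2, C = 4;

// Emit one quad as two triangles (6 vertices); returns the advanced counter.
function emitQuad(n, x, y, w, h, u0, v0, u1, v1, rgba) {
  const corners = [
    [x, y, u0, v0], [x + w, y, u1, v0], [x, y + h, u0, v1],         // tri 1
    [x + w, y, u1, v0], [x + w, y + h, u1, v1], [x, y + h, u0, v1], // tri 2
  ];
  for (const [cx, cy, cu, cv] of corners) {
    const o = n * s;
    f32[o + P] = cx; f32[o + P + 1] = cy;
    f32[o + U] = cu; f32[o + U + 1] = cv;
    u32[o + C] = rgba;
    n++;
  }
  return n; // caller stores this back into vb.count after the batch
}

const n = emitQuad(0, 10, 20, 32, 32, 0, 0, 1, 1, 0xffffffff);
// n === 6; vertex 0 is [10, 20, 0, 0] with color 0xffffffff
```

Threading the counter through as a parameter and return value keeps it in a register for the whole loop — the same hoisting trick the table above measures.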
Case study: a Tiled tilemap renderer
We rendered the same 64 × 64 Tiled tilemap (two layers: ground + decoration ≈ 8 000 visible tiles → ~49 000 vertices per frame) three ways, on the same WebGL 2 pipeline. The only change is the vertex-emit function — the GL state, shaders, texture, and draw call are identical.
You can run this live: open example/tilemap-demo.html and toggle between modes in the sidebar.
Map format
Stock Tiled JSON schema:
```jsonc
{
  "width": 64, "height": 64,
  "tilewidth": 32, "tileheight": 32,
  "layers": [
    { "type": "tilelayer", "name": "ground",     "data": [/* 4096 gids */] },
    { "type": "tilelayer", "name": "decoration", "data": [/* 4096 gids */] }
  ],
  "tilesets": [{ "firstgid": 1, "columns": 8, "tilewidth": 32, "tileheight": 32 }]
}
```

data is a flat array of GIDs (1-indexed, 0 = empty). The demo parses this exactly as Tiled exports it.
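The GID arithmetic can be sketched with a hypothetical gidToUV helper — not part of the library or the demo; the helper name and the 256 px tileset image size are assumptions:

```javascript
// Sketch: turn a Tiled GID into tileset UVs for the quad emit.
const tileset = { firstgid: 1, columns: 8, tilewidth: 32, tileheight: 32 };
const imageW = 256, imageH = 256; // assumed 8 × 8 tile image

function gidToUV(gid) {
  if (gid === 0) return null;          // 0 = empty cell, emit nothing
  const id = gid - tileset.firstgid;   // GIDs are 1-indexed
  const col = id % tileset.columns;
  const row = Math.floor(id / tileset.columns);
  return {
    u0: (col * tileset.tilewidth) / imageW,
    v0: (row * tileset.tileheight) / imageH,
    u1: ((col + 1) * tileset.tilewidth) / imageW,
    v1: ((row + 1) * tileset.tileheight) / imageH,
  };
}

console.log(gidToUV(1)); // { u0: 0, v0: 0, u1: 0.125, v1: 0.125 }
```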
The three renderers
1. Array-of-objects (the first draft):
```js
function emitAoO(tiles) {
  const verts = [];
  for (const t of tiles) {
    verts.push({ x: t.x, y: t.y, u: t.u0, v: t.v0, color: 0xffffffff });
    // ... 5 more per quad
  }
  const buf = new ArrayBuffer(verts.length * 20);
  const f32 = new Float32Array(buf), u32 = new Uint32Array(buf);
  for (let i = 0; i < verts.length; i++) { /* flatten */ }
  gl.bufferSubData(gl.ARRAY_BUFFER, 0, new Uint8Array(buf));
}
```

2. Naive typed array (fresh per frame):
```js
function emitNaive(tiles) {
  const buf = new ArrayBuffer(tiles.length * 6 * 20);
  const f32 = new Float32Array(buf), u32 = new Uint32Array(buf);
  /* fill in loop */
  gl.bufferSubData(gl.ARRAY_BUFFER, 0, new Uint8Array(buf));
}
```

3. BatchBuffer:
```js
const vb = new BatchBuffer({ maxVertices: MAX_TILES * 6, layout: LAYOUT });
const f32 = vb.f32, u32 = vb.u32;
const s = vb.strideF32, P = vb.offsetF32('pos'), U = vb.offsetF32('uv'), C = vb.offsetU32('color');

function emitBatched(tiles) {
  vb.reset();
  vb.ensureCapacity(tiles.length * 6);
  let n = vb.count;
  for (const t of tiles) { /* write 6 verts, n += 6 */ }
  vb.count = n;
  gl.bufferSubData(gl.ARRAY_BUFFER, 0, vb.u8, 0, vb.byteLength);
}
```

Results
Measured on Node 20 / M2 class, median of 5 runs × 120 frames @ 40 000 vertices/frame. Re-run npm run bench to get numbers on your own hardware — the ratios are stable across platforms.
| # | Strategy | ms/frame | MVerts/s | Peak heap Δ | vs best |
|---|---|---:|---:|---:|---:|
| A | BatchBuffer (inline vb.count++) | 0.205 | 195 | 3.4 KB | 2.28× |
| A′ | BatchBuffer (hoisted local count) | 0.157 | 254 | 432 B | 1.75× |
| B | Plain typed-array (hoisted, reused) | 0.090 | 445 | 432 B | — |
| C | Fresh typed-array per frame | 0.484 | 82.7 | 10 KB | 5.38× |
| D | Array-of-objects + flatten | 5.040 | 7.94 | 33.6 MB | 56.1× |
Allocation pressure — log scale
```mermaid
%%{init: {"theme":"dark"}}%%
xychart-beta
  title "Peak heap growth per run (KB, log scale) — lower is better"
  x-axis ["A inline", "A' hoisted", "B plain TA", "C fresh TA", "D AoO"]
  y-axis "KB (log)" 0.1 --> 100000
  bar [3.4, 0.4, 0.4, 10, 34000]
```

D allocates 33.6 MB per frame. At 60 fps that's ~2 GB/s of garbage — the GC can't keep up, and you see stutter.
Frame budget
```mermaid
%%{init: {"theme":"dark"}}%%
xychart-beta
  title "ms per frame at 40k verts — lower is better"
  x-axis ["A inline", "A' hoisted", "B plain TA", "C fresh TA", "D AoO"]
  y-axis "ms" 0 --> 6
  bar [0.21, 0.16, 0.09, 0.48, 5.04]
```

A 60 fps frame is 16.67 ms. Both A and A′ consume ~1%; D burns 30% before any game logic runs.
When it matters
| Scenario | Verts/frame | Without @zakkster/lite-batch-buffer | With A′ |
|---|---:|---|---|
| Menu UI (20 sprites) | ~120 | irrelevant | irrelevant |
| Platformer (300 sprites) | ~1 800 | usually fine | fine |
| Tiled scroller (8k tiles) | ~49k | 5 ms + GC stutter | ~0.2 ms |
| Particle system (50k) | ~150k | 15+ ms · GC storm | ~0.5 ms |
| Bullet hell (100k) | ~300k | off the budget | ~1.0 ms |
Rule of thumb: once your per-frame vertex count passes ~10 000, the allocation profile of your emit loop matters more than its ALU cost.
API reference
new BatchBuffer({ maxVertices, layout })
| Arg | Type | Description |
|---|---|---|
| maxVertices | number | Hard cap. Sets the backing ArrayBuffer size. |
| layout | LayoutAttribute[] | Ordered list of attributes. |
LayoutAttribute
| Field | Type | Description |
|---|---|---|
| name | string | Used by offset*() lookups. |
| type | 'f32' \| 'i32' \| 'u32' \| 'i16' \| 'u16' \| 'i8' \| 'u8' | Element type. |
| size | number | Element count (2 for vec2, 1 for a packed u32 color). Must be a positive integer. |
Instance members
| Member | Type | Description |
|---|---|---|
| stride | number | Bytes per vertex (padded to max alignment). |
| strideF32, strideU32, strideU16 | number | Stride in those element units. |
| maxVertices | number | As passed. |
| count | number | Writable vertex cursor. Advance inline from the hot loop. |
| attrs | ResolvedAttribute[] | Layout with filled-in byte offsets. |
| arrayBuffer | ArrayBuffer | Single backing allocation. Never reallocated. |
| f32 i32 u32 i16 u16 i8 u8 dv | Typed-array / DataView | All view the same arrayBuffer. |
| capacity | number | Alias for maxVertices. |
| remaining | number | maxVertices − count. |
| byteLength | number | count × stride. Pass this to bufferSubData. |
Methods
| Method | Returns | Description |
|---|---|---|
| offsetBytes(name) | number | Byte offset inside a vertex. Works for any type. |
| offsetF32(name) / offsetI32 / offsetU32 | number | Typed offset. Throws on type mismatch. |
| offsetU16(name) / offsetI16 / offsetU8 / offsetI8 | number | Same, for smaller element types. |
| ensureCapacity(n) | void | Throws if count + n > maxVertices. Call at batch boundaries. |
| reset() | void | Sets count = 0. Does not zero the backing buffer. |
| viewBytes() | Uint8Array | WebGL 1 helper: subarray view of the written prefix. Allocates ~80 B. |
| static packRGBA(r, g, b, a) | number | Pack four 0–255 bytes into a u32 with platform-correct byte order. |
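The endian handling behind packRGBA can be sketched like this — an illustration of the technique, not the library's actual source:

```javascript
// Detect host byte order once, with a 2-byte probe.
const probe = new Uint16Array(new Uint8Array([0x12, 0x34]).buffer);
const littleEndian = probe[0] === 0x3412;

function packRGBA(r, g, b, a) {
  // A u32 store writes its LOW byte to the LOWEST address on LE hosts,
  // so R must sit in the low byte there (and in the high byte on BE).
  return littleEndian
    ? ((a << 24) | (b << 16) | (g << 8) | r) >>> 0
    : ((r << 24) | (g << 16) | (b << 8) | a) >>> 0;
}

// Check the memory layout: bytes land as R, G, B, A on either host.
const u32 = new Uint32Array(1);
u32[0] = packRGBA(10, 20, 30, 40);
const bytes = new Uint8Array(u32.buffer);
console.log([...bytes]); // [10, 20, 30, 40]
```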
Error conditions
| Situation | Throws |
|---|---|
| Missing maxVertices or empty layout | { maxVertices, layout } required |
| Unknown attribute type | unknown type 'X' |
| Non-positive / non-integer attribute size | size must be positive integer |
| offsetF32('x') where x is declared u32 | 'x' is u32, not f32 |
| offsetBytes('nope') | unknown attribute 'nope' |
| ensureCapacity(n) overflow | capacity exceeded (C + N > MAX) |
Benchmarks
Node CLI
```sh
node --expose-gc bench/bench.js
# or: npm run bench
```

Runs all five strategies, prints a formatted table, and writes bench/bench-results.json. --expose-gc is required for accurate heap numbers.
Browser (interactive)
Open bench/bench.html in any modern browser. It runs the same strategies live and plots ms/frame and peak heap Δ as bar charts. You can change verts/frame and frame count from the controls at the top right.
Heap numbers are only meaningful on Chromium-based browsers (they expose performance.memory). The time measurements are reliable everywhere.
Testing (for clients & QA)
Three levels of verification, depending on how deep you want to go.
1. Unit tests — "does the library do what it says?"
```sh
npm test
# or: node --expose-gc test/edge-cases.test.js
```

Runs 34 deterministic assertions covering:
| Group | What's tested |
|---|---|
| Construction + validation | missing args, zero / negative / non-integer sizes, unknown types, empty layout |
| Stride + offset arithmetic | mixed alignment (f32 + u8 + f32), u16-after-u8 padding, u8-only layouts, type-checked offsets |
| Typed-array aliasing | f32 / u32 / u8 views all read/write the same bytes |
| Capacity + lifecycle | ensureCapacity boundary conditions, reset() preserves buffer identity |
| packRGBA | endian-correct memory layout, channel clamping, edge values |
| Zero-allocation guarantee | 600 000 vertex writes → heap grows < 1 MB (requires --expose-gc) |
| Realistic round-trip | emit a quad, read bytes back, reuse buffer across 100 frames |
A clean run ends with 34 passed, 0 failed and exit code 0. Any failure prints the assertion plus the expected/actual values. Suitable for CI.
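The zero-allocation group can be reproduced in miniature. This is a sketch of the measurement technique only — the real suite's code will differ; the 1 MB threshold mirrors the table above:

```javascript
// Run with: node --expose-gc this-file.js (falls back to a no-op without the flag)
const gc = globalThis.gc ?? (() => {});

const f32 = new Float32Array(600_000); // stand-in for the pre-allocated buffer
gc();
const before = process.memoryUsage().heapUsed;

for (let i = 0; i < 600_000; i++) f32[i] = i * 0.5; // indexed stores only

gc();
const grownKB = (process.memoryUsage().heapUsed - before) / 1024;
console.log(`heap grew ~${grownKB.toFixed(1)} KB`); // should be well under 1024 KB
```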
2. Benchmark — "does it perform as claimed?"
```sh
npm run bench
```

Reproduces the five-strategy comparison in the case study. Exit code is always 0; the failing signal is the numbers. On any 2021+ machine you should see:

- A′ (hoisted BatchBuffer) within 2× of B (hand-rolled typed array)
- D (array-of-objects) at least 20× slower than B and producing tens of MB of heap growth per run
- All three "managed" strategies (A, A′, B) producing < 20 KB of heap growth per run
3. Visual smoke test — "does it render correctly?"
```sh
# Just open the file; no build step.
open example/tilemap-demo.html
```

A 64 × 64 Tiled tilemap rendered in WebGL 2. In the sidebar, toggle between BatchBuffer, Naive, and Array-of-objects modes — the rendered output must be pixel-identical across all three. If it isn't, the vertex layout is wrong somewhere; that's the QA signal.
On a mid-range 2022 laptop, moving between modes you should see roughly:
| Mode | FPS | CPU ms | Heap delta / 60 frames |
|---|---:|---:|---:|
| BatchBuffer | 60 (capped) | < 1 ms | < 10 KB |
| Naive | 58–60 | 2–3 ms | 1–3 MB |
| Array-of-objects | 30–45 (stutter) | 8–15 ms | 20–60 MB |
Heap stats only appear in Chromium browsers (which expose performance.memory).
Quick npm run reference
| Command | What it does |
|---|---|
| npm test | Run the 34-test unit suite |
| npm run bench | Run the Node benchmark, write bench/bench-results.json |
| npm run verify | npm test && npm run bench — the full CI-style check |
| npm run bench:browser | Prints the path to open in your browser |
| npm run demo | Prints the path to the Tiled demo |
Running the demo
```sh
example/tilemap-demo.html
```

No build step. No server needed if you run the whole repo over file:// — the demo uses a relative ESM import from ../src/index.js.
Controls:
| Input | Action |
|---|---|
| Left-click + drag | Pan camera |
| W A S D / arrow keys | Pan camera |
| Zoom slider | 0.5× – 4× |
| Mode buttons | Switch vertex-emit strategy live |
The procedural tileset is generated on a <canvas> at startup, so the demo is fully self-contained — no external image assets.
Browser & engine compatibility
The library itself is plain ESM and uses only standard ArrayBuffer / typed-array APIs, so it works everywhere ES2015+ works. The example demo additionally needs WebGL 2 (for VAOs and #version 300 es shaders).
| Target | Library | Demo (WebGL 2) |
|---|---|---|
| Chrome / Edge 61+ | ✅ | ✅ |
| Firefox 60+ | ✅ | ✅ |
| Safari 15+ (iOS 15+) | ✅ | ✅ |
| Node.js 18+ | ✅ | — (browser only) |
| Bun / Deno | ✅ | — |
| WebGL 1 projects | ✅ (use viewBytes()) | — |
| WebGPU projects | ✅ (use vb.arrayBuffer / vb.u8) | — |
Integration snippets
WebGL 2 — upload with zero allocations:
```js
gl.bufferSubData(gl.ARRAY_BUFFER, 0, vb.u8, 0, vb.byteLength);
```

WebGL 1 — the 3-arg bufferSubData signature requires a Uint8Array view, which allocates ~80 B:

```js
gl.bufferSubData(gl.ARRAY_BUFFER, 0, vb.viewBytes());
```

WebGPU — copy the written prefix of vb.u8 with queue.writeBuffer:

```js
queue.writeBuffer(gpuBuf, 0, vb.u8, 0, vb.byteLength);
```

Edge cases & guarantees
Behaviours the test suite pins down:

- Stride is always a multiple of 4 for any layout that contains an f32, i32, or u32. That means strideF32 and strideU32 are always exact integers — you can index with count * strideF32 safely.
- u8-only layouts work too. A maxVertices: 10, layout: [{ type: 'u8', size: 3 }] buffer has stride = 3, but the backing ArrayBuffer is padded up to 8 bytes internally so every typed-array view can still be constructed. byteLength reflects the unpadded count × stride, so your GL upload size is correct.
- reset() does not zero the backing memory. Old vertex data is still there until overwritten. This is intentional — zeroing 320 KB every frame would defeat the point. Only bytes 0..byteLength are ever uploaded.
- count is public and writable. You can set it directly (e.g. vb.count += 6 after writing a quad in one go), or pre-seed it for layered writes. The only invariant the library checks is count ≤ maxVertices (via ensureCapacity).
- All typed-array views alias the same backing buffer. A write through vb.f32[0] is immediately visible through vb.u8[0..3], vb.u16[0..1], and vb.dv.getUint32(0, …).
- packRGBA is endian-safe. On both little-endian (x86, ARM default) and big-endian platforms, the bytes land in memory as R, G, B, A — so GPU vec4 color attributes receive channels in the expected order regardless of host.
- The library throws on misuse at construction time or batch boundary, never per vertex. The hot loop does no validation. If you write past maxVertices, you'll get a typed-array out-of-bounds write, which modern engines treat as a silent no-op — no error, no data. Always pair your emit loop with a preceding ensureCapacity.
FAQ
Why a hard maxVertices cap? Why not auto-grow?
Because ArrayBuffer resize would reallocate, re-create every typed-array view, and invalidate the GL buffer you bound to it. The whole point is that none of that ever happens. Pick a number large enough for your worst frame; you pay ~N bytes of RAM, once.
How big should maxVertices be?
Enough for the largest batch you'll flush at once. For a tilemap, that's visibleTiles × 6 (two tris per tile). For a sprite batcher, it's however many sprites you flush per draw call. A 64k-vertex buffer costs 1.3 MB at a 20-byte stride — cheap.
Can I have multiple buffers? Yes. Make one per layout / draw call if you want. They're independent objects.
Why not use DataView?
You can — it's exposed as vb.dv. But typed-array indexed access is faster than DataView.setFloat32(…) on every modern engine and produces far tighter JIT output. Use DataView when you need explicit endianness for non-color data.
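For example, an explicit-endian write (shown here with a bare DataView, the same API vb.dv exposes) produces identical bytes on every host:

```javascript
const buf = new ArrayBuffer(8);
const dv = new DataView(buf);
dv.setFloat32(0, 1.0, true);        // third arg true = little-endian, explicitly
dv.setUint32(4, 0xdeadbeef, true);
const u8 = new Uint8Array(buf);
console.log([...u8]); // [0, 0, 128, 63, 239, 190, 173, 222] on any host
```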
Why packRGBA instead of four u8 writes?
One u32 store is ~4× faster than four u8 stores (one instruction vs four, plus better cache behaviour). You can still bind the attribute as 4 × UNSIGNED_BYTE normalized in GL — the shader sees vec4 either way.
What about indexed (ELEMENT_ARRAY) rendering?
Use a second BatchBuffer with { type: 'u16' (or 'u32'), size: 1 }. The library makes no assumptions about whether the buffer is used as ARRAY_BUFFER or ELEMENT_ARRAY_BUFFER.
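For the common case — quads that always share one index pattern — the index buffer can be filled once at init and never touched again. A sketch with a plain Uint16Array (a u16 BatchBuffer would hold the same values; MAX_QUADS is an assumed constant):

```javascript
// Static index buffer: 4 unique verts and 6 indices (two triangles) per quad.
const MAX_QUADS = 4096; // 4096 × 4 = 16384 verts, safely within u16 range
const indices = new Uint16Array(MAX_QUADS * 6);
for (let q = 0; q < MAX_QUADS; q++) {
  const v = q * 4;
  const i = q * 6;
  indices[i] = v;     indices[i + 1] = v + 1; indices[i + 2] = v + 2;
  indices[i + 3] = v + 2; indices[i + 4] = v + 1; indices[i + 5] = v + 3;
}
console.log([...indices.slice(0, 6)]); // [0, 1, 2, 2, 1, 3]
```

Indexed quads also cut the vertex buffer by a third (4 verts per quad instead of 6), at the cost of one extra ELEMENT_ARRAY_BUFFER upload at init.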
Does it work in a Web Worker?
Yes. ArrayBuffer + typed arrays are the core of Transferable. You can build vertex data in a worker and postMessage(vb.u8, [vb.arrayBuffer]) to transfer ownership to the main thread — but note that transfers neuter the original, so you'll need a new buffer afterwards. For zero-copy two-way sharing use SharedArrayBuffer (you'll need cross-origin isolation headers).
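The transfer semantics are easy to demonstrate without a real Worker — structuredClone (Node 17+, all modern browsers) accepts the same transfer list as postMessage:

```javascript
const buf = new ArrayBuffer(16);
const u8 = new Uint8Array(buf);
u8[0] = 42;

// Models postMessage(vb.u8, [vb.arrayBuffer]):
const received = structuredClone(u8, { transfer: [buf] });

console.log(received[0]);    // 42 — the data moved with the buffer
console.log(buf.byteLength); // 0 — the original is detached ("neutered")
```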
License
MIT © the author
