@nullplatform/tracing

v0.0.16

Published

12 days ago

Producer-side SDK for the nullplatform tracing API (graph-fact + facets wire contract)


geisbruch

pablovilas

nullplatform tracing observability telemetry sdk

Producer-side SDK for the nullplatform tracing API. Wraps the wire contract in a typed TypeScript surface so producers don't have to hand-build envelopes. Zero runtime dependencies; ESM + CJS; Node, browser, and edge.

Runnable examples (one per scenario, full coverage): examples/ — start with 01-quickstart.cjs.

Using an AI coding assistant? llms.txt is a dense, agent-optimized usage guide (rules, full API surface, and a removed-API anti-patterns table) — ships with the package.

Quick start

import { Tracer, jobRef } from '@nullplatform/tracing';

const tracer = new Tracer({
  baseURL: process.env.TRACING_URL!,
  apiKey: process.env.NULLPLATFORM_API_KEY!,
  producer: '[email protected]',
});

// Callback mode: `started` on entry, `completed` on return, `failed` on throw.
// A spec carries IDENTITY only; labels/facets are chainable setters.
await tracer.run({ trace_id: 'D-9007', run_id: 'D-9007' }, async (run) => {
  run.labels({ release: 'R-127', env: 'prod' });
  // A job is per-tenant: the ref carries the definition's own nrn (any level).
  run.instanceOf(jobRef('nullplatform', 'deploy', '42', 'organization=1'));
  await run.step({ key: 'provision' }, () => provision());
});

// Drain before shutdown.
const stats = await tracer.shutdown(); // { accepted, duplicate, rejected }

Label values may be string, number, or boolean — the SDK stringifies them on the wire, so you never write String(id). A null/undefined label value is dropped rather than recorded as "null".
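As a rough illustration of the coercion rule above, the sketch below models it with a hypothetical coerceLabels helper. This helper is not an SDK export; it only mirrors the described behavior (values are stringified, null/undefined entries are dropped).

```typescript
type LabelValue = string | number | boolean | null | undefined;

// Hypothetical stand-in for the SDK's internal label handling.
function coerceLabels(labels: Record<string, LabelValue>): Record<string, string> {
  const wire: Record<string, string> = {};
  for (const [name, value] of Object.entries(labels)) {
    if (value === null || value === undefined) continue; // dropped, never recorded as "null"
    wire[name] = String(value);
  }
  return wire;
}

const wireLabels = coerceLabels({
  'application.id': 1234, // number
  canary: false,          // boolean
  release: 'R-127',       // string
  owner: null,            // dropped
});
// wireLabels is { 'application.id': '1234', canary: 'false', release: 'R-127' }
```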

Trace & run ids

run_id defaults to a generated UUIDv7. trace_id defaults to run_id — a lone root anchors its own trace (every node has a trace). Set trace_id explicitly to thread related work into one trace: at an ingress, or from an inbound carrier (extractTraceContext). trace_id names the OPERATION, not the entity: the entity's chronicle (every run that ever touched it) is a LABEL query — ?labels.<entity>.id=<id> — never a trace_id. Carrying the entity as a label (not baked into the id) is what lets a create be traced before its id exists, and a failed insert still be found by <parent>.id + action.

A run_id names ONE execution occurrence — and ingest folds events by run_id, so if a repeatable action (an update, a retried create, a re-triggerable workflow) reuses a deterministic id, the new occurrence silently overwrites the previous run's history instead of becoming a new run. Pick the run-id form by what the operation is:

Deliberately one run — a self-looping flow resumed across re-enqueues, or an operation that genuinely cannot re-execute under the same identity → deterministic key(...parts). (Entity actions rarely qualify: creates and deletes get retried under the same entity id, just like updates.)
Repeatable, and a natural execution id exists (a queue MessageId, a CI job id, a request id) → key(...parts, executionId). This is also the cross-system rendezvous: every observer of that execution id derives the SAME run id, and a redelivery resumes the same run instead of minting a phantom sibling.
Repeatable, no execution id, purely local → mint the occurrence: key(...parts, occurrence()) (application-update-01977f3a…; the entity id is a label, never baked into the id). The token is a UUIDv7 — the same generator as every event id, and time-ordered, so the ids sort by creation time. Nobody else can derive a minted id — if another system needs it, hand it forward with injectTraceContext; never make it guess.

Tag every entity run with canonical labels (entity, action) and keep the id's action segment the SAME WORD as labels.action — labels are the exact query axis (?trace_id=application-1234&labels.action=update), the id prefix is the human-readable one. A genuine retry that knows the prior attempt's run id can declare it: retryRun.retryOf(runRef(trace, priorRunId)).
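The three run-id forms above can be compared side by side. This is a sketch under stated assumptions: sketchKey is a local stand-in for the SDK's key(), the '-' separator is inferred from the application-update-01977f3a… example, and the part values (nightly-sync, tenant-7, msg-0001, the UUIDv7-shaped token) are placeholders.

```typescript
type KeyPart = string | number | null | undefined;

// Hypothetical stand-in for key(): joins the non-empty parts.
// The '-' separator is an assumption, not confirmed by the source.
function sketchKey(...parts: KeyPart[]): string {
  return parts
    .filter((part): part is string | number => part !== null && part !== undefined && part !== '')
    .map(String)
    .join('-');
}

// 1. Deliberately one run: deterministic parts only.
const resumedFlowId = sketchKey('nightly-sync', 'tenant-7');

// 2. Repeatable, with a natural execution id (for example a queue message id).
const messageId = 'msg-0001'; // placeholder execution id
const sharedRunId = sketchKey('application', 'update', messageId);

// 3. Repeatable and purely local: end the key with a minted token.
//    In the SDK this would come from occurrence(); here it is a fixed placeholder.
const mintedToken = '01977f3a-0000-7000-8000-000000000000';
const localRunId = sketchKey('application', 'update', mintedToken);

// A missing part is dropped rather than producing a different shape of id.
const withGap = sketchKey('application', undefined, 'update');
```

In forms 2 and 3 the entity id stays out of the run id and travels as a label, and the action segment (update) is the same word the run would carry in labels.action.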

Tenancy scope (`nrn`)

By default the SDK sends no nrn and the API derives it from the caller's token (the common case). Set nrn on the run spec to scope the run — and all its steps, edges, and terminals — to a precise resource; a per-call nrn overrides for a single emit:

const run = tracer.run({ trace_id, run_id, nrn: 'organization=1:application=42' });
run.step('build');   // inherits the run's nrn
run.complete();      // inherits — or .complete({ nrn }) to override one emit

`baseURL` and `authBaseURL`

Both default to the public API, so a public consumer sets neither:

new Tracer({ apiKey, producer }); // posts to https://api.nullplatform.com/tracing/events

baseURL defaults to https://api.nullplatform.com/tracing (the SDK posts to `${baseURL}/events`); the apiKey token exchange (POST {authBaseURL}/token) defaults to https://api.nullplatform.com. Override both for a private or in-cluster deployment. The SDK never reads process.env — pass any override explicitly.
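To make the defaulting concrete, here is a small sketch of which URLs a given config resolves to. resolveEndpoints is a hypothetical helper written for this illustration, and the in-cluster hostnames are placeholders; trailing-slash handling is not covered.

```typescript
interface EndpointConfig {
  baseURL?: string;
  authBaseURL?: string;
}

const DEFAULT_BASE_URL = 'https://api.nullplatform.com/tracing';
const DEFAULT_AUTH_BASE_URL = 'https://api.nullplatform.com';

// Hypothetical helper: events go to `${baseURL}/events`,
// the apiKey exchange goes to `${authBaseURL}/token`.
function resolveEndpoints(config: EndpointConfig): { events: string; token: string } {
  const baseURL = config.baseURL ?? DEFAULT_BASE_URL;
  const authBaseURL = config.authBaseURL ?? DEFAULT_AUTH_BASE_URL;
  return { events: `${baseURL}/events`, token: `${authBaseURL}/token` };
}

// Public consumer: neither override is set.
const publicEndpoints = resolveEndpoints({});

// Private deployment: both overrides are passed explicitly (placeholder hosts).
const inClusterEndpoints = resolveEndpoints({
  baseURL: 'http://tracing.internal:8080',
  authBaseURL: 'http://auth.internal:8080',
});
```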

Run & step lifecycle: callback vs handle

A run (and each step) has two equivalent forms. Both emit started on open and exactly one terminal (completed/failed/skipped/cancelled/timed_out).

Callback — pass a function. The SDK emits completed on return, failed on throw (then rethrows), so there's no bookkeeping:

await tracer.run({ trace_id, run_id }, async (run) => {
  await run.step({ key: 'charge' }, () => charge());
});

Handle — get the object and close it yourself. For flows that span calls / ticks / processes, or when you need run.run_id before the work finishes:

const run = tracer.run({ trace_id, run_id });
try {
  const chargeStep = run.step({ key: 'charge' }); // the step handle, named apart from charge()
  await charge();                                 // the work itself
  chargeStep.complete();
  run.complete();
} catch (error) {
  run.fail(error); // cascades: any still-open child step is failed too
  throw error;
}

Which to use — the callback's only cost is one level of indentation, and that cost is proportional to how much code the closure wraps:

Callback for a short traced region or newly written block — the indentation is negligible and you get auto-terminalize for free.
Handle when wrapping a long existing method body (a closure would re-indent and balloon the diff), or when the run/step outlives a single synchronous scope.

Fail cascade. Calling fail() on a run or step finalizes any of its still-open handle-mode child steps with the same error — so a catch collapses to one run.fail(error) regardless of how many steps are open, with no per-step ?.fail() plumbing. A child you already closed wins (the single-terminal guard makes the cascade a no-op for it). complete() does not cascade: auto-completing an open child would back-date its duration and assert a success the SDK can't vouch for — close steps explicitly on the happy path. Callback-mode children self-terminalize, so they're already closed before control returns to the parent.
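The cascade and the single-terminal guard can be modeled in a few lines. SketchNode below is a hypothetical model of a handle-mode node, not the SDK's class; it only shows the rules described above (fail cascades to open children, an already-closed child wins, complete does not cascade).

```typescript
type Terminal = 'completed' | 'failed';

// Hypothetical model of a handle-mode run/step.
class SketchNode {
  readonly name: string;
  terminal: Terminal | null = null;
  error: unknown = undefined;
  private readonly children: SketchNode[] = [];

  constructor(name: string) {
    this.name = name;
  }

  step(name: string): SketchNode {
    const child = new SketchNode(name);
    this.children.push(child);
    return child;
  }

  complete(): void {
    if (this.terminal !== null) return; // single-terminal guard
    this.terminal = 'completed';        // no cascade: open children stay open
  }

  fail(error: unknown): void {
    if (this.terminal !== null) return; // a node that is already closed wins
    this.terminal = 'failed';
    this.error = error;
    for (const child of this.children) child.fail(error); // cascade to open children
  }
}

const sketchRun = new SketchNode('run');
const buildStep = sketchRun.step('build');
const deployStep = sketchRun.step('deploy');

buildStep.complete();               // closed explicitly on the happy path
sketchRun.fail(new Error('boom'));  // deploy is still open, so it is failed too
// buildStep stays 'completed'; deployStep and sketchRun are 'failed'
```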

Attaching context: labels & core facets

Runs and steps describe what happened with labels (small string key/values, queryable) and facets (structured bodies). The wire contract reserves 15 core facets under tracing.*; the SDK exposes one typed, chainable setter for each, so you never hand-write the namespace string:

const run = tracer.run({ trace_id, run_id })
  .labels({ entity: 'application', action: 'create', 'application.id': id })
  .input('create-params', { name, namespaceId })          // tracing.input descriptor
  .output('application', { id, slug, status })             // tracing.output descriptor
  .actor({ kind: 'user', id: userId, source: 'api' });

// `decision`/`retry`/`signal` are STEP-only — the type system won't let you
// call them on a run, so a mis-placed core facet can't be written.
await run.step('apply-settings', async (step) => {
  step.decision({ chosen: status, available: ['pending', 'pending_hook'],
                  expression: 'a gating hook holds it at pending_hook' });
});

The setters take their §8 body verbatim (decision is { chosen, available?, expression? }; actor is { kind, id, source? }), with no renamed fields. (input/output are the exception: there is no raw setter, and their descriptors are set only through the typed builders below.) Each setter returns the handle (chain freely), is a no-op on an empty/undefined body, and accumulates onto the terminal event.

A setter called synchronously (before the first await) also lands on the started event, so an in-flight run/step is visible with its start-time context (who started it, its inputs, its labels) rather than a bare identity. A setter called after an await lands on the terminal or, when no lifecycle emit happens within the late-enrichment window (LATE_ENRICHMENT_FLUSH_MS, 200ms), on a coalesced re-emit of the node's last status, so long-lived staging (a plan recorded mid-flight, a spanning step's refreshed counts) reaches the read model even on a handle that never terminalizes. A short operation's own terminal absorbs the flush, so there is no extra event. (started is emitted one microtask after construction to make this work; see Run & step lifecycle.)

The setters are placed by node kind: a run and step share input/output/actor/timing/externalLinks/affordances/engineStatus/dropped/error/explain/progress; decision/retry/signal are step-only; plan is on a run (override) or a job. Use your own namespace via .facet('myapp.thing', { … }) for non-core data, because the tracing.* prefix is reserved.

.progress({ current, target, unit? }) (unit: percent|count|bytes|milliseconds — a format hint, not a noun) marks a converging phase — a step whose occurrences each move a value toward a declared target (traffic 40→100%, instances 3→10). The plan-progress read keeps the step's plan slot open until current reaches target, so a completed partial occurrence advances the phase instead of concluding it.

.affordances([{ kind, …params }]) declares what a node offers a human to do — a possibility for action a UI renders as a control (switch traffic, view live logs, approve). It carries the kind (which control) plus free-form data params, never a component. Most controls need no declaration — a live-log viewer rides an .outputPointer() you already emit, an approval rides .signal(); reach for .affordances() only for a bespoke control:

await run.step('switchTraffic', async (step) => {
  step
    .outputPointer('deploy-log', logsUri)        // logs: the io pointer renders a viewer for free
    .affordances({ kind: 'traffic-control' });   // the one bespoke control (lenient: bare object ⇒ [it])
});

const run = tracer.run({ trace_id, run_id })
  .actor(userToken)                                 // a JWT string → actor + default nrn; start-time
  .input('create-params', input)                    // start-time → on `started` + terminal
  .labels({ entity: 'application', action: 'create', 'application.status': 'pending' });
const application = await persist(input);            // ← the work (the run brackets it)
run.output('application', application)               // result → terminal only
   .labels({ 'application.id': application.id, 'application.status': 'active' })  // entity id once it exists
   .complete();

Timing is automatic. Every run/step carries a tracing.timing facet the SDK fills from the operation it brackets: started_at is stamped when you open the node (tracer.run()/run.step()), ended_at when you close it. So every traced node gets start/end/duration with no effort — you only call .timing({ … }) yourself to override (per field), e.g. backfilling real historical times for an operation that already finished. Open the node at the top of the operation so started_at is accurate.

Definition nodes (job/dataset) have no lifecycle — they're a single emit that fires lazily, on first use. tracer.job(...)/tracer.dataset(...) return a chainable handle: its spec carries identity only, and you enrich it with the same typed setters (.labels(), .facet(), .schema(), and a job's typed .plan() — no 'tracing.plan' string). The node emits exactly once, the first time you reference it from an edge, await it (resolves to its ref), or call .emit() — so a handle referenced by several edges emits one node, and a handle you never use emits nothing. Once emitted it's frozen (a later setter throws rather than silently lose data). The common path needs no ceremony:

const provisionJob = tracer
  .job({ namespace: 'nullplatform', name: 'application-provision', version: '7.2' })
  .labels({ team: 'platform' })
  .plan({ steps: [{ key: 'build' }, { key: 'deploy', after: 'build' }] });

const image = tracer.dataset('image:sha256:abc');   // nothing on the wire yet

run.instanceOf(provisionJob);   // ← emits the job node here, then the edge
build.produces(image);          // ← emits the dataset node here, then the edge
deploy.consumes(image);         // ← reuses the node; just the consumes edge

await tracer.dataset('snapshot');                              // await emits + resolves to its ref
await tracer.job({ namespace, name, version }).emit();        // or emit eagerly, then ignore or await the ref
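The emit-once and frozen-after-emit rules for definition handles can be pictured with a small model. LazyDefinition is hypothetical (it is not the SDK's handle class), and the string pushed onto the array merely stands in for a node event on the wire.

```typescript
// Hypothetical model of a job/dataset definition handle.
class LazyDefinition {
  readonly id: string;
  private labelsValue: Record<string, string> = {};
  private emitted = false;
  private readonly wire: string[]; // stands in for the outgoing event queue

  constructor(id: string, wire: string[]) {
    this.id = id;
    this.wire = wire;
  }

  labels(labels: Record<string, string>): this {
    if (this.emitted) throw new Error(`definition ${this.id} is frozen after emit`);
    this.labelsValue = { ...this.labelsValue, ...labels };
    return this;
  }

  // Called on first use: an edge reference, an await, or an explicit emit.
  emit(): string {
    if (!this.emitted) {
      this.emitted = true;
      this.wire.push(`node:${this.id}:${JSON.stringify(this.labelsValue)}`);
    }
    return this.id; // the ref
  }
}

const sketchWire: string[] = [];
const imageDefinition = new LazyDefinition('image:sha256:abc', sketchWire).labels({ team: 'platform' });
// Nothing on the wire yet.
imageDefinition.emit(); // first use emits the node
imageDefinition.emit(); // later uses reuse it
// sketchWire holds exactly one node; a further .labels() call would now throw
```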

Ergonomic shorthands (sugar over the same wire shapes):

tracing.input/tracing.output are set only through six typed builders — one per kind × direction, no raw descriptor array, no kind/direction magic strings. Call them as handle methods to accumulate node io — .input/.output(name, value) (inline), .inputRef/.outputRef(name, source, externalId) (ref), .inputPointer/.outputPointer(name, uri, { size_bytes?, content_type? }) (pointer — large data referenced by URI) — or import the same six as standalone functions and hand one to produces/consumes to declare the edge's io once (records the node descriptor AND derives the edge binding — see below).
.decision({ chosen }) accepts a bare string for chosen (normalized to an array).
.plan({ groups?, steps }) — each step is a plain typed object { key, title?, description?, after?, sla?, optional?, group? } (no element builder; a plan step is inert data, so a step() factory would be a redundant second way). A step's after accepts a bare string or an array (after: 'build' ≡ after: ['build']), and sla.after_lifecycle is typed to the Status enum so a misspelled state won't compile. GROUPS — the macro phases a host renders as stages/stations — are declared once in groups: [{ key, title?, group? }] (nesting via the parent-group group ref) and each step references its group by leaf key (group: 'ship'); the SDK derives the full wire path (outermost → innermost segments) and stamps the titles, so a title exists in exactly one place. With no groups block, a bare key or key-path array forms an untitled group; group-less steps render flat. .plan() validates client-side before send: a duplicate step key, an after referencing an undeclared step, a duplicate/undeclared/cyclic group, or a bad key throws (not a silent ingest drop).
Every edge method accepts a handle, a ref, or a definition handle (job/dataset, emitted on first use), and produces/consumes also a bare dataset id: run.triggeredBy(otherRun), run.instanceOf(deployJob), run.produces('image:sha256:…'), step.compensates(otherStep). The causal/job verbs (triggeredBy, retryOf, continues, correlates, instanceOf) live on both grades — a step records step-grain causality (downstream.triggeredBy(batchStep)) exactly like a run; only compensates is keyed-to-keyed (the wire rule). To describe the payload flowing over a produces/consumes edge, pass an io builder as the second arg — step.produces(image, outputPointer('image', ref, { content_type })), step.consumes(raw, inputPointer('raw-blob', uri, { size_bytes })) — which records the step's io descriptor AND derives the edge's tracing.binding ({ name, content_type?, size_bytes? }) in one call. Direction is enforced by the type: produces takes an output*, consumes an input*.
node.run('sub-id', cb?) opens a named child run under any node — the containment edge.parent is auto-emitted, the child shares the trace and inherits the parent's current nrn, and it starts a NEW scope for its own keyed steps. Use it when the nested work is a unit of its own (a step spawning a sub-workflow: deployStep.run('provision-x', cb)); use step('key') for a keyed stage of the SAME unit. Same callback/handle dual form and fail-cascade adoption as step().
tracer.run('id', cb?), run.step('charge', cb?), node.run('sub-id', cb?) and tracer.dataset('id') all accept a bare single id string (one unambiguous arg). .externalLinks() and .affordances() each accept a single object or a bare array (normalized to the wire bare-array form on emit, like .decision({ chosen })). A job takes a named identity spec, tracer.job({ namespace, name, version }), never three positional strings, so the call reads unambiguously.
Producer-named ids fail fast: a character outside the identifier charset ([A-Za-z0-9_.-]; the ~ delimiter and locator punctuation like / : @ = are out), an empty string, or an over-cap id (run_id > 1024, trace_id > 256, step key > 256) throws AT THE CALL SITE, never as an async ingest drop far from the bug.
Cross-process refs compose to any depth: a sub-step's ref is stepRef(trace, deriveChildId(runId, 'build', 0, 0), 'push'), because the parent argument accepts a derived id.
key('application', id) builds a stable id string. Use it anywhere one is needed (trace_id/run_id in a spec, or a runRef/stepRef argument), so a key is never hand-interpolated. null/undefined/empty parts are dropped, so a missing part can't fork the trace. For a repeatable action's run id, end the key with a per-occurrence token: a shared execution id, or a minted occurrence() (see "Trace & run ids" for when to use which form).
.actor() accepts a bearer-JWT string or an ActorFacetInput ({ kind, id, source? }). A JWT is decoded internally (no public decoder): the actor id comes from nullplatform's cognito:groups @nullplatform/user (falling back to standard sub), and @nullplatform/organization → the run's default nrn = organization=N (inherited by every step/edge, overridable by an explicit spec nrn); a non-JWT string (an api key) omits the actor. The object form is the §8 actor body and sets the actor only — no organization field (it's nullplatform-specific; for raw ids that need an org scope, set the spec nrn).
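Two of the client-side .plan() checks listed above (a duplicate step key, and an after that references an undeclared step) are easy to picture in code. validatePlanSteps is a hypothetical function written for this sketch; the SDK's own validation also covers groups and key format, which are left out here, and its error messages may differ.

```typescript
interface PlanStep {
  key: string;
  after?: string | string[];
}

// Hypothetical check covering two of the rules described above.
function validatePlanSteps(steps: PlanStep[]): void {
  const keys = new Set<string>();
  for (const step of steps) {
    if (keys.has(step.key)) throw new Error(`duplicate step key: ${step.key}`);
    keys.add(step.key);
  }
  for (const step of steps) {
    // after: 'build' is treated the same as after: ['build']
    const after =
      step.after === undefined ? [] : Array.isArray(step.after) ? step.after : [step.after];
    for (const dependency of after) {
      if (!keys.has(dependency)) {
        throw new Error(`step ${step.key} is after an undeclared step: ${dependency}`);
      }
    }
  }
}

// Passes: keys are unique and every `after` names a declared step.
validatePlanSteps([{ key: 'build' }, { key: 'deploy', after: 'build' }]);
```

A plan such as [{ key: 'build' }, { key: 'build' }] or [{ key: 'deploy', after: 'test' }] would throw in this sketch before anything is sent.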

Label values may be string | number | boolean and are coerced for you.
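The fail-fast rules for producer-named ids in the shorthands list (charset, non-empty, length caps) can be sketched as a call-site check. assertValidId is hypothetical, the error messages are illustrative, and counting the caps in characters rather than bytes is an assumption.

```typescript
const ID_CHARSET = /^[A-Za-z0-9_.-]+$/;
const MAX_LENGTH = { run_id: 1024, trace_id: 256, step_key: 256 } as const;

// Hypothetical call-site validation for run ids, trace ids, and step keys.
function assertValidId(kind: keyof typeof MAX_LENGTH, id: string): string {
  if (id.length === 0) throw new Error(`${kind} must not be empty`);
  if (id.length > MAX_LENGTH[kind]) {
    throw new Error(`${kind} is longer than ${MAX_LENGTH[kind]}`);
  }
  if (!ID_CHARSET.test(id)) {
    throw new Error(`${kind} contains a character outside [A-Za-z0-9_.-]`);
  }
  return id;
}

const validRunId = assertValidId('run_id', 'D-9007'); // returns 'D-9007'
// assertValidId('step_key', 'build/push') would throw: '/' is outside the charset
```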

Coverage checklist: is the entity fully observable?

Setting identity and a few labels makes an operation findable. Use this checklist as a Definition of Done to confirm the entity is fully observable — that anyone (or any AI) can answer the questions that matter about it. For each row, check that a reader can answer it; if not, emit what the Emit column names. A skipped row is a silent gap: the code runs and the tests pass, but the entity is only half-traced.

| # | A reader must be able to… | Emit | Confirm with |
|---|---|---|---|
| 1 | Chronicle: everything that happened to entity X | entity + action + <entity>.id on every run (plus each <parent>.id up front) | ?labels.<entity>.id=<id> returns its whole life |
| 2 | Every operation, not just create | one run per operation including updates (create / update / delete / provision / async), keyed by action | ?labels.<entity>.id=<id>&labels.action=<verb> for every verb |
| 3 | Failures, even before an id exists | open the run at the top (before validation/insert) and fail(err) on throw | ?labels.<parent>.id=<id>&labels.action=create&status=failed finds it |
| 4 | Progress of a multi-step operation | a .plan() (or instanceOf a job that has one) and a .step() per real task | GET /runs/:id/progress and /runs/:id/steps |
| 5 | Lineage: what an operation produced or consumed | produces / consumes for every artifact that flows downstream (a repo, an image, a dataset), keyed by its canonical address | GET /runs/:id/lineage connects producer → consumer |
| 6 | Read it, for a human and an AI | .explain({ title, what }) (the story) + the io builders (structured/queryable payload) + .actor() + an affordance for any bespoke UI control | a run shows a title, a one-line story, its inputs/outputs, and who acted |

The two gaps this most often catches:

Untraced updates (row 2): you traced create and delete, but a plain update opens no run — so "what changed, and did the last update fail?" is unanswerable. Trace every operation, keyed by action.
An artifact with no lineage edge (row 5): the operation creates a repository, an image, a bucket, or a record and emits no produces — so /lineage is empty and nothing downstream connects to it. Every artifact that flows onward is a produces (or consumes) keyed by its canonical id.

Authentication

Provide exactly one of apiKey or getToken.

Recommended — apiKey. Pass your nullplatform API key and the SDK handles the token lifecycle for you: it exchanges the key for a short-lived JWT at POST {authBaseURL}/token, caches it in memory, and refreshes it before expiry (and once reactively on a 401). Concurrent emits that all need a token collapse into a single exchange.

const tracer = new Tracer({
  baseURL: process.env.TRACING_URL!,
  apiKey: process.env.NULLPLATFORM_API_KEY!,
  // authBaseURL defaults to https://api.nullplatform.com
  producer: '[email protected]',
});

The SDK reads no environment variables — pass everything explicitly (the example above reads process.env in your own code and hands the value to the config; the SDK never touches the environment itself).
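The apiKey token lifecycle described above (cache in memory, refresh before expiry, collapse concurrent requests into one exchange) follows a common single-flight pattern. The sketch below is not the SDK's implementation: TokenCache, the injected exchange function, and the 30-second refresh margin are all assumptions made for illustration, and the reactive refresh on a 401 is left out.

```typescript
interface IssuedToken {
  token: string;
  expiresAt: number; // epoch milliseconds
}

// Hypothetical cache; `exchange` stands in for POST {authBaseURL}/token.
class TokenCache {
  private cached: IssuedToken | null = null;
  private inflight: Promise<string> | null = null;
  private readonly exchange: () => Promise<IssuedToken>;
  private readonly refreshMarginMs: number;

  constructor(exchange: () => Promise<IssuedToken>, refreshMarginMs = 30_000) {
    this.exchange = exchange;
    this.refreshMarginMs = refreshMarginMs;
  }

  getToken(now: number = Date.now()): Promise<string> {
    // Reuse the cached token until it is close to expiry.
    if (this.cached !== null && this.cached.expiresAt - this.refreshMarginMs > now) {
      return Promise.resolve(this.cached.token);
    }
    // Single flight: concurrent callers share one pending exchange.
    if (this.inflight === null) {
      this.inflight = this.exchange().then(
        (issued) => {
          this.cached = issued;
          this.inflight = null;
          return issued.token;
        },
        (error: unknown) => {
          this.inflight = null; // let the next caller try again
          throw error;
        },
      );
    }
    return this.inflight;
  }
}

let exchangeCalls = 0;
const tokenCache = new TokenCache(() => {
  exchangeCalls += 1;
  return Promise.resolve({ token: 'jwt-1', expiresAt: Date.now() + 3_600_000 });
});

// Three emits ask for a token at the same time; only one exchange is started.
const pendingTokens = [tokenCache.getToken(), tokenCache.getToken(), tokenCache.getToken()];
```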

Escape hatch — getToken. If you already manage tokens externally (e.g. a pre-fetched JWT injected into a Lambda), supply a callback instead. It is called per request, so cache the token yourself and only refresh when it's about to expire — the SDK does not cache the result, and a 401 is treated as deterministic (no automatic re-exchange).

const tracer = new Tracer({
  baseURL: process.env.TRACING_URL!,
  getToken: () => process.env.TRACING_TOKEN!,
  producer: '[email protected]',
});

A token exchange that fails transiently (network error or 5xx from the auth endpoint) is retried alongside the event POST; a 4xx (bad/expired apiKey) is deterministic and surfaces as a TransportError — via the rejected promise in { sync: true } mode, or the 'drop' event in the buffered path.

Delivery model

Default — async-buffered. Each emit enqueues and returns immediately with the assigned id. A background worker flushes the queue every flushIntervalMs (default 1000ms) or whenever it reaches flushBatchSize events (default 100).

Opt-in — synchronous. Pass { sync: true } in the per-call options to bypass the queue and await the POST directly:

run.complete({ sync: true });

Errors throw to the caller (not surfaced via the 'drop' event).

Idempotency

Every emit assigns a UUIDv7 id if the producer didn't provide one. The API deduplicates on id (ON CONFLICT (event_id) DO NOTHING), so re-emitting the same event after a crash is safe.

Producers MAY supply a deterministic id from business data for idempotent retry-after-crash:

run.step({ key: 'charge' }, () => charge(), { id: makeId('charge', attempt) });
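Why a deterministic id makes a retry safe can be shown with an in-memory stand-in for the dedupe. SketchIngest and chargeEventId are hypothetical, and the id format is only an example; the source does not specify what shape a producer-supplied id must take.

```typescript
type IngestOutcome = 'accepted' | 'duplicate';

// Hypothetical in-memory stand-in for the API's dedupe on event id.
class SketchIngest {
  private readonly seen = new Set<string>();

  ingest(eventId: string): IngestOutcome {
    if (this.seen.has(eventId)) return 'duplicate'; // same effect as ON CONFLICT DO NOTHING
    this.seen.add(eventId);
    return 'accepted';
  }
}

// A deterministic id built from business data, so a retry after a crash
// re-sends the same id instead of minting a fresh one.
function chargeEventId(orderId: string, attempt: number): string {
  return `charge-${orderId}-${attempt}`;
}

const sketchIngest = new SketchIngest();
const firstSend = sketchIngest.ingest(chargeEventId('order-17', 1));        // accepted
const resendAfterCrash = sketchIngest.ingest(chargeEventId('order-17', 1)); // duplicate, no double count
const nextAttempt = sketchIngest.ingest(chargeEventId('order-17', 2));      // accepted as a new event
```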

Error handling

tracer.on('drop', (envelope, reason) => {
  // log, alert, persist to disk, etc.
});

Drop events fire for:

Events that exhaust retries (5xx / network failures).
Events deterministically rejected by the server (4xx).
Events dropped due to queue overflow (maxQueueSize).

Per-event reject also propagates to the awaited promise of a { sync: true } emit (e.g. await run.complete({ sync: true })), so callers can handle failures inline instead of via the 'drop' event if they prefer.

Listener errors are swallowed — a buggy on('drop') handler cannot prevent other listeners or the queue from making progress.

400 / 401 / 403 are NOT retried (deterministic). 5xx and network failures ARE retried with exponential backoff + jitter.
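The retry rule above can be sketched as a classifier plus a delay function. Both functions are hypothetical: the source only says "exponential backoff + jitter", so the full-jitter style, the 200ms base, and the 30s cap are assumptions chosen for the example.

```typescript
// Hypothetical classification: `status` is undefined for a network failure.
function isRetryable(status: number | undefined): boolean {
  if (status === undefined) return true;  // network failure: retry
  return status >= 500 && status <= 599;  // 5xx: retry; 400/401/403 are deterministic
}

// Exponential backoff with full jitter (assumed style, base, and cap).
function backoffDelayMs(
  attempt: number,
  random: () => number = Math.random,
  baseMs = 200,
  capMs = 30_000,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// With a fixed random source the growth and the cap are easy to see.
const half = () => 0.5;
const firstDelay = backoffDelayMs(0, half);   // 100
const fourthDelay = backoffDelayMs(3, half);  // 800
const cappedDelay = backoffDelayMs(20, half); // 15000
```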

Queue overflow

The queue is bounded by maxQueueSize (default 10000). When the queue is at capacity and a new event arrives, the oldest queued event is dropped (with the 'drop' event fired). Use a 'drop' listener to track.
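The drop-oldest policy is simple to model. BoundedQueue below is a hypothetical sketch rather than the SDK's queue, and the 'queue_overflow' reason string is invented for the example.

```typescript
// Hypothetical bounded buffer: when full, the oldest queued event is dropped.
class BoundedQueue<T> {
  private readonly items: T[] = [];
  private readonly maxSize: number;
  private readonly onDrop: (item: T, reason: string) => void;

  constructor(maxSize: number, onDrop: (item: T, reason: string) => void) {
    this.maxSize = maxSize;
    this.onDrop = onDrop;
  }

  enqueue(item: T): void {
    if (this.items.length >= this.maxSize) {
      const oldest = this.items.shift();
      if (oldest !== undefined) this.onDrop(oldest, 'queue_overflow');
    }
    this.items.push(item);
  }

  // Take up to `batchSize` events for one flush.
  takeBatch(batchSize: number): T[] {
    return this.items.splice(0, batchSize);
  }
}

const droppedEvents: string[] = [];
const sketchQueue = new BoundedQueue<string>(2, (item) => droppedEvents.push(item));

sketchQueue.enqueue('e1');
sketchQueue.enqueue('e2');
sketchQueue.enqueue('e3'); // queue is full, so 'e1' is dropped and reported
const nextBatch = sketchQueue.takeBatch(100); // ['e2', 'e3']
```

Dropping the oldest event favors recent state; a 'drop' listener is the place to count or persist what was lost.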

Flush / shutdown

const flushStats = await tracer.flush();       // { accepted, duplicate, rejected }
const shutdownStats = await tracer.shutdown(); // drain + reject further emits

Both return per-outcome counts. shutdown() is idempotent; after shutdown, further emits reject.

On Node, an enabled tracer auto-drains on SIGTERM/SIGINT/beforeExit so a process that just emits and exits doesn't lose buffered events. The hooks are removed again by shutdown(). Opt out with shutdownHooks: false if your app manages its own exit ordering, and install hooks yourself with installNodeShutdownHooks(tracer, { signals }) for custom signals. The hooks are a no-op off Node (browser/edge).

Wire contract version

This SDK targets v1 of the nullplatform tracing wire contract (the envelope shape). Adding a new event type to the wire spec is a major SDK version bump.

Releasing

The wire contract is vendored under src/events/ and bundled into the build, so the published package has zero runtime dependencies. Releases are a manual version bump plus a GitHub Release:

Bump version in package.json (the version that ships is whatever is at the released ref).
Publish a GitHub Release. The publish workflow builds + tests the SDK and publishes it to npm via trusted publishing (OIDC — no long-lived token).

First publish only: a Trusted Publisher can be configured on npm only after the package exists, so the very first publish is manual (npm login && npm publish -w @nullplatform/tracing --access public); configure the Trusted Publisher afterwards so every subsequent GitHub Release publishes automatically.
