dicom-curate
v0.40.2
Organize and de-identify DICOM header data
Organize and de-identify DICOM header values and file hierarchies based on a provided configuration object.
⚠️ Disclaimer
This project is currently in a pre-1.0.0 state. APIs and behavior may change at any time without notice.
You're welcome to open issues, but please only do so if you're also willing to contribute a pull request.
Why
This provides an open configuration language and a ready-to-use library for modifying DICOM headers for the purpose of de-identification and organization.
The library can be used in a toolkit-agnostic way, because it provides access to functionality to modify decoded DICOM headers in "DICOM json" format.
Usage
Consuming Dicom-Curate
The build output includes:
- An ESM build, generated by esbuild with proper CommonJS dependency handling
- A UMD and a minified UMD build, generated by Rollup
They can be consumed as follows:
| File | Used by | How to import / include |
| ------------------------- | -------------------------------- | ---------------------------------------------------------- |
| dist/esm/index.js | Modern bundlers, ESM-aware tools | import ... from 'dicom-curate' |
| dicom-curate.umd.js | CommonJS, Node.js | require('dicom-curate') or require('dicom-curate/umd') |
| dicom-curate.umd.min.js | Browsers via CDN or <script> | <script src=".../dicom-curate.umd.min.js"></script> |
The unminified UMD build (/umd) is primarily intended for demos and debugging.
Examples
Converting a nested input folder structure containing DICOM files to a cleaned output folder destination (note: this uses a browser API only supported in Chrome and Edge browsers):
import { curateMany, OrganizeOptions } from 'dicom-curate'
const options: OrganizeOptions = {
inputType: 'directory',
inputDirectory, // input folder directory handle
outputDirectory, // output folder directory handle
curationSpec, // DICOM curation specification
columnMapping, // csv file handle to add csv-based mapping
}
// Read input, map headers, write to well-structured output.
curateMany(options, onProgressCallback)
Alternatively, a list of Files is accepted:
const options: OrganizeOptions = {
inputType: 'files',
inputFiles, // list of `File` objects
outputDirectory, // output folder directory handle
curationSpec, // DICOM curation specification
columnMappings, // csv file handle to add csv-based mapping
}
If outputDirectory is omitted, output Blobs will be passed to the onProgressCallback function instead.
In the Node.js environment, there are no directory handles. Instead, you may pass directory paths:
const options: OrganizeOptions = {
inputType: 'path',
inputDirectory, // input folder directory path, e.g. "/home/user/files"
outputDirectory, // output folder directory path, e.g. "/home/user/outputs"
curationSpec, // DICOM curation specification
columnMapping, // csv file handle to add csv-based mapping
}
It is also possible to save curated files to an HTTP endpoint. Provide a base URL, optionally with additional HTTP headers, and files will be uploaded using a PUT request.
const options: OrganizeOptions = {
inputType: 'path',
inputDirectory, // input folder directory path, e.g. "/home/user/files"
outputEndpoint: {
url: 'http://example.com/base-url',
headers: {
Authorization: 'Bearer xxx',
},
},
curationSpec, // DICOM curation specification
columnMapping, // csv file handle to add csv-based mapping
}
The same can be done on the input as well:
const options: OrganizeOptions = {
inputType: 'http',
inputUrls: ['http://example.com/file1.dcm', 'http://example.com/file2.dcm'],
headers: {
Authorization: 'Bearer xxx',
},
// other options
}
It is also possible to use S3-compatible buckets as input or output locations.
Consult OrganizeOptions for further details. Please note that this feature is only
available if you have the @aws-sdk/client-s3 package installed.
Custom uploader
For use cases that require resumable, chunked, or otherwise non-standard uploads, you can supply your own uploader via outputUploader. This is mutually exclusive with outputEndpoint.
The uploader runs on the main thread. When using curateMany(), the mapped file's ReadableStream is transferred directly from the worker to the main thread (zero-copy) and handed as-is to the uploader.
import { curateMany, OrganizeOptions, TCustomUploader } from 'dicom-curate'
const myUploader: TCustomUploader = {
async upload({ key, stream, size, contentType, headers, signal }) {
// key – output path relative to the root (URL-encoded path segments)
// stream – ReadableStream<Uint8Array> of the mapped DICOM bytes; consume once
// size – total byte length of the file
// headers – derived metadata headers (X-File-*, x-source-file-hash, …)
// signal – AbortSignal forwarded from OrganizeOptions.signal
const response = await fetch(`https://my-server.example.com/upload/${key}`, {
method: 'PUT',
body: stream,
headers: { 'Content-Length': String(size), ...headers },
signal,
// @ts-expect-error – duplex required by some runtimes for streaming bodies
duplex: 'half',
})
const etag = response.headers.get('ETag') ?? undefined
return { etag }
},
}
const options: OrganizeOptions = {
inputType: 'path',
inputDirectory: '/home/user/files',
outputUploader: myUploader,
curationSpec,
}
await curateMany(options, onProgressCallback)
The value returned by upload() is passed back through the CurateResult for each file, so you can store server-assigned identifiers (e.g. ETags, upload IDs) alongside the normal curation results.
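For chunked or resumable uploads, the single-use stream handed to the uploader usually has to be cut into fixed-size pieces first. The following is a minimal sketch of that step only; `forEachChunk` and its callback signature are hypothetical names for illustration and are not part of dicom-curate.

```typescript
// Hypothetical helper (not part of dicom-curate): reads a stream exactly once
// and hands fixed-size chunks to a callback. The final chunk may be shorter.
async function forEachChunk(
  stream: ReadableStream<Uint8Array>,
  chunkSize: number,
  handleChunk: (chunk: Uint8Array, index: number) => Promise<void>,
): Promise<number> {
  if (chunkSize <= 0) throw new RangeError('chunkSize must be positive')
  const reader = stream.getReader()
  let buffer = new Uint8Array(0)
  let index = 0
  for (;;) {
    const { done, value } = await reader.read()
    if (value) {
      // Append the newly read bytes to the leftover from earlier reads.
      const merged = new Uint8Array(buffer.byteLength + value.byteLength)
      merged.set(buffer, 0)
      merged.set(value, buffer.byteLength)
      buffer = merged
    }
    // Emit every complete chunk that is currently buffered.
    while (buffer.byteLength >= chunkSize) {
      await handleChunk(buffer.slice(0, chunkSize), index++)
      buffer = buffer.slice(chunkSize)
    }
    if (done) break
  }
  // Flush the remainder, if any.
  if (buffer.byteLength > 0) await handleChunk(buffer, index++)
  return index // number of chunks emitted
}
```

Inside an upload() implementation, the callback could issue one request per chunk and record the returned part identifiers. Because the stream can only be consumed once, any retry logic has to work on the buffered chunk rather than on the stream itself.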
Matching S3 ETags across uploaders
When uploading to S3, you can set uploadPartSize on the output S3
options to control the ETag S3 assigns to the written object:
const options: OrganizeOptions = {
// other options are skipped
outputEndpoint: {
bucketName: 'my-bucket',
region: 'us-east-1',
// Bodies <= 5 MB: single PUT, S3 returns a plain-MD5 ETag.
// Bodies > 5 MB: multipart, S3 returns a composite "<md5>-<N>" ETag.
uploadPartSize: 5 * 1024 * 1024,
},
}
This matches the ETag convention produced by any S3 client that uses
@aws-sdk/lib-storage at the same partSize, making cross-bucket
"equal bytes ⇒ equal ETag" comparisons well-defined.
When uploadPartSize is omitted, all uploads go through a single PUT
regardless of body size and S3 always returns a plain-MD5 ETag.
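To compare a local file against an object already in a bucket, the expected ETag can be predicted from the bytes and the part size. The sketch below follows the convention described above, together with the commonly observed multipart rule (MD5 over the concatenated binary MD5 digests of the parts, followed by the part count). It assumes objects without KMS or customer-key encryption, and `expectedS3ETag` is a hypothetical helper name, not a dicom-curate export.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical helper (not part of dicom-curate): predicts the ETag S3 is
// commonly observed to assign for a body uploaded with the given part size.
function expectedS3ETag(body: Uint8Array, partSize?: number): string {
  const md5 = (data: Uint8Array) => createHash('md5').update(data).digest()

  // No part size, or the body fits in one part: single PUT, plain MD5.
  if (partSize === undefined || body.byteLength <= partSize) {
    return md5(body).toString('hex')
  }

  // Multipart: MD5 each part, then MD5 the concatenated binary digests
  // and append "-<number of parts>".
  const partDigests: Uint8Array[] = []
  for (let offset = 0; offset < body.byteLength; offset += partSize) {
    partDigests.push(md5(body.subarray(offset, offset + partSize)))
  }
  const combined = md5(Buffer.concat(partDigests)).toString('hex')
  return `${combined}-${partDigests.length}`
}
```

S3 returns the ETag wrapped in double quotes, so strip those before comparing against the value computed here.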
This library can now automatically skip writing (or uploading) mapped files if the provided
"previous" input file attributes match the record you pass in the fileInfoIndex property:
const options: OrganizeOptions = {
// other options are skipped
fileInfoIndex: {
// Last observed file size + mtime are provided for this file
// The file will be skipped if these attributes haven't changed
'input_file1.dcm': {
// File size when this input was last processed
size: 123456,
// Last modification time of the file when it was last processed
mtime: '2025-11-12T17:56:13.419Z',
},
// Last observed hash is provided for this input file.
// The hash algorithm used is determined by hashMethod.
// If the input file has the same hash as specified here, it will be skipped as unchanged.
'input_file2.dcm': {
preMappedHash: 'd8e8fca2dc0f896fd7cb4cb0031ba249',
},
// The postMappedHash is the hash of the processed file.
// It is possible to provide the postMappedHash either under the input file path
// or under the output file path.
// In either case, if only postMappedHash is provided for a file, the file has to be
// processed first and only then can it be determined as unchanged and not written or uploaded,
// so the optimization is not as great as when some of the above properties are provided.
//
// To avoid collisions with input file names, key representing output (post-mapped) file names
// need to be prefixed with OUTPUT_FILE_PREFIX.
[OUTPUT_FILE_PREFIX + 'output_file3.dcm']: {
// post-mapped hash
postMappedHash: '126a8a51b9d1bbd07fddc65819a542c3',
},
},
hashMethod: 'md5',
}
You can also call curateOne directly and receive a promise with the mapped blob:
import { curateOne, extractColumnMappings } from 'dicom-curate'
// Data prep responsibility for optional table is with caller
const columnMappings = extractColumnMappings([
{ subjectID: 'SubjectID1', blindedID: 'BlindedID1' },
{ subjectID: 'SubjectID2', blindedID: 'BlindedID2' },
])
curateOne({
fileInfo, // path, name, size, kind, blob
mappingOptions: { curationSpec, columnMappings },
})
To use a custom uploader with curateOne, set outputTarget.custom: true and provide an uploader function directly on the call arguments:
import { curateOne } from 'dicom-curate'
const result = await curateOne({
fileInfo,
mappingOptions: { curationSpec },
outputTarget: { custom: true },
uploader: async ({ key, stream, size, headers, signal }) => {
const response = await fetch(`https://my-server.example.com/upload/${key}`, {
method: 'PUT',
body: stream,
headers: { 'Content-Length': String(size), ...headers },
signal,
// @ts-expect-error – duplex required by some runtimes for streaming bodies
duplex: 'half',
})
return { etag: response.headers.get('ETag') ?? undefined }
},
})
An example DICOM curation function:
import type { TCurationSpecification } from 'dicom-curate'
/*
* Curation specification for batch-curating DICOM files.
*/
export function sampleBatchCurationSpecification(): TCurationSpecification {
const hostProps = {
protocolNumber: 'Sample_Protocol_Number',
activityProviderName: 'Sample_CRO',
centerSubjectId: /^[A-Z]{2}\d{2}-\d{3}$/,
timepointNames: ['Visit 1', 'Visit 2', 'Visit 3'],
// Folder "scan": the trial-specific/provider-assigned series name
scanNames: ['3DT1 Sagittal', 'PET-Abdomen'],
}
return {
// Review the required input folder structure (all DICOM files need minimally this folder depth)
// This configuration depends on correct centerSubjectId, timepoint, scan folder names.
inputPathPattern:
'protocolNumber/activityProvider/centerSubjectId/timepoint/scan',
additionalData: {
// collect from a csv file. A client can use regex to validate the input.
type: 'load',
collect: {
CURR_ID: hostProps.centerSubjectId,
StudyDescription: hostProps.timepointNames,
MAPPED_ID: /BLIND_\d+/,
},
// With this, can refer to mappings as parser.getMapping('blindedId')
mapping: {
// Using the CSV
blindedId: {
value: (parser) => parser.getDicom('PatientID'),
lookup: (row) => row['CURR_ID'],
replace: (row) => row['MAPPED_ID'],
},
},
},
version: '3.0',
hostProps,
// This specifies the standardized DICOM de-identification
dicomPS315EOptions: {
cleanDescriptorsOption: true,
cleanDescriptorsExceptions: ['SeriesDescription'],
retainLongitudinalTemporalInformationOptions: 'Full',
retainPatientCharacteristicsOption: [
'PatientWeight',
'PatientSize',
'PatientAge',
'PatientSex',
'SelectorASValue',
],
retainDeviceIdentityOption: true,
retainUIDsOption: 'Hashed',
retainSafePrivateOption: 'Quarantine',
retainInstitutionIdentityOption: true,
},
modifyDicomHeader(parser) {
const scan = parser.getFilePathComp('scan')
const centerSubjectId = parser.getFilePathComp('centerSubjectId')
return {
// Align the PatientID DICOM header with the centerSubjectId folder name.
PatientID: centerSubjectId,
// This example maps PatientIDs based on the mapping CSV file.
// PatientID: parser.getMapping('blindedId'),
PatientName: centerSubjectId,
// Align the StudyDescription DICOM header with the timepoint folder name.
StudyDescription: parser.getFilePathComp('timepoint'),
// The party responsible for assigning a standard ClinicalTrialSeriesDescription
ClinicalTrialCoordinatingCenterName: hostProps.activityProviderName,
// Align the ClinicalTrialSeriesDescription DICOM header with the scan folder name.
ClinicalTrialSeriesDescription: scan,
}
},
outputFilePathComponents(parser) {
const scan = parser.getFilePathComp('scan')
const centerSubjectId = parser.getFilePathComp('centerSubjectId')
return [
parser.getFilePathComp('protocolNumber'),
parser.getFilePathComp('activityProvider'),
centerSubjectId,
parser.getFilePathComp('timepoint'),
scan + '=' + parser.getDicom('SeriesNumber'),
parser.getFilePathComp(parser.FILEBASENAME) + '.dcm',
]
},
// This section defines the validation rules for the input DICOMs.
// The processing continues on errors, but errors will have to be fixed
// or reviewed between the parties.
errors(parser) {
return [
// File path
[
'Invalid study folder name',
parser.getFilePathComp('protocolNumber') !== hostProps.protocolNumber,
],
// DICOM header
['Missing Modality', parser.missingDicom('Modality')],
['Missing SOP Class UID', parser.missingDicom('SOPClassUID')],
]
},
}
}
Excluding files with preExclude and postExclude
The curation specification supports two optional exclusion functions that let you skip files at different stages of processing. Both return true to exclude the file; returning false (or omitting the function entirely) lets the file through.
preExclude — skip before mapping
preExclude receives a parser with access to the original, unmapped DICOM tags. Return true to skip the file entirely — it will not be mapped, written, or uploaded.
export function myCurationSpec(): TCurationSpecification {
return {
// ... other fields ...
// Exclude files whose PatientID doesn't match the expected study format.
preExclude(parser) {
return !/^AB\d{2}-\d{3}$/.test(parser.getDicom('PatientID'))
},
}
}
postExclude — skip after mapping
postExclude receives a parser whose getDicom() returns de-identified tag values (PS315E de-identification has already run at this point), and exposes the computed output path as parser.outputFilePath. Return true to skip writing or uploading the mapped file.
Note: parser.getFilePathComp() still returns input path components inside postExclude, the same as in preExclude. Only parser.outputFilePath reflects the post-mapping location, as a full string.
export function myCurationSpec(): TCurationSpecification {
return {
// ... other fields ...
// Exclude structured reports and files routed to an 'exclude' output folder.
postExclude(parser) {
if (parser.getDicom('Modality') === 'SR') return true
if (parser.outputFilePath.includes('/exclude/')) return true
return false
},
}
}
Behaviour notes
- Exclusions are re-evaluated on every run. When a preExclude or postExclude is configured, the "unchanged source bytes" short-circuit is disabled so an exclusion added in a later run takes effect even if the file itself didn't change.
- Composition across multiple specs is OR. When composeSpecs merges specs that each define preExclude / postExclude, the composed function excludes a file if any spec's function returns true. Evaluation short-circuits on the first true.
- Exceptions are fail-safe. If an exclusion function throws, the file is treated as included and the error message is appended to mapResults.errors.
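The OR-composition and fail-safe rules above can be illustrated with a small standalone sketch. It only models the described semantics; `ExcludeFn`, `composeExcludes`, and `safeExclude` are hypothetical names, and this is not how composeSpecs is implemented internally.

```typescript
// Hypothetical type for illustration; the library's real parser exposes more.
type ExcludeFn<P> = (parser: P) => boolean

// OR-composition: a file is excluded if any function returns true.
// Array.prototype.some stops at the first true, so later functions are skipped.
function composeExcludes<P>(fns: ExcludeFn<P>[]): ExcludeFn<P> {
  return (parser) => fns.some((fn) => fn(parser))
}

// Fail-safe wrapper: if the exclusion function throws, the file is treated
// as included (false) and the error message is recorded.
function safeExclude<P>(fn: ExcludeFn<P>, parser: P, errors: string[]): boolean {
  try {
    return fn(parser)
  } catch (err) {
    errors.push(err instanceof Error ? err.message : String(err))
    return false
  }
}

// Usage with a stand-in parser object that only knows a Modality value.
const fakeParser = { getDicom: (_keyword: string) => 'SR' }
const excludeReports = composeExcludes<typeof fakeParser>([
  (p) => p.getDicom('Modality') === 'SR',
  (p) => p.getDicom('Modality') === 'PR',
])
const collectedErrors: string[] = []
const isExcluded = safeExclude(excludeReports, fakeParser, collectedErrors)
```

Here isExcluded is true because the first function matches, and the second function is never called.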
Result shape
When a file is excluded, curateOne / curateMany still returns a result object for it. The excluded field indicates which function rejected it:
// 'pre' — excluded by preExclude (file was never mapped)
// 'post' — excluded by postExclude (file was mapped but not written)
result.excluded // => 'pre' | 'post' | undefined
Testing
Vitest is split into three projects in vitest.config.ts:
| Command | Project | Location | Purpose |
|---------|---------|----------|---------|
| pnpm test | unit | src/**/*.test.ts | Unit and integration tests co-located with source |
| pnpm test:e2e | e2e | e2e/ | Pipeline smoke tests (runs build:esm first; uses dist/) |
| pnpm test:conformance | conformance | conformance/ | dciodvfy regression (synthetic; optional public/local via env) |
Other scripts:
| Script | Purpose |
|--------|---------|
| pnpm test:coverage | Unit project with coverage |
| pnpm update:conformance-baselines | Regenerate committed dciodvfy baseline JSON |
Conformance tests require the external dciodvfy binary from dicom3tools. See conformance/README.md for install, CI behaviour, baseline refresh, and optional RUN_PUBLIC_CONFORMANCE / CONFORMANCE_LOCAL_* env vars.
Shared test helpers: testutils/ (minimal DICOM files, worker mocks). Fixture generation and public-case fetch: devDependency dicom-synth.
DICOM Conformance Notes
dicom-curate
- does not use an Encrypted Attributes Sequence
- does not anonymize burnt-in information or modify PixelData
- populates the PatientIdentityRemoved attribute with YES
- populates the LongitudinalTemporalInformationModified attribute per DICOM PS3.15E
- populates the DeidentificationMethod attribute with information about this README
- populates the DeidentificationMethodCodeSequence with the CID7050 codes of provided options, per PS3.15E
- keeps only the following in File Meta Information: 'FileMetaInformationVersion', 'ImplementationClassUID', 'ImplementationVersionName', 'MediaStorageSOPClassUID', and sets 'TransferSyntaxUID' to 'Explicit VR Little Endian' and 'MediaStorageSOPInstanceUID' to the correct SOP Instance UID.
- cleans sequences ('SQ') by recursively applying the de-identification rules to each Dataset in each Item of the Sequence.
- uses an allow-list approach, by removing everything not defined in PS3.06 or handled in PS3.15E1.1.
- identifies and removes additional ID attributes beyond PS3.15E1.1 by parsing PS3.06 and finding all attributes ending in "ID(s)" (but not "UID(s)") that are not defined in PS3.15E. This ID list is defined in "src/config/dicom/retainAdditionalIds.ts", and a few of them are manually annotated to be retained if the "retain device identifier option" is activated.
- keeps the 'EncapsulatedDocument' attribute if modality is "DOC", unless overridden
- keeps the 'VerifyingObserverSequence' if modality is SR, unless overridden
- allows the users to describe all cleaning configurations in the curationSpec file
- implements the following PS3.15E options:
- 'retainDeviceIdentityOption': Keeps the attributes marked as 'K' and performs the default action on all other attributes
- 'cleanDescriptorsOption' by removing all description and comment Attributes except those comment attributes explicitly listed in the cleanDescriptorsExceptions list.
- 'retainLongitudinalTemporalInformationOptions': this considers all temporal attributes (DA, TM, DT), as described as a possible approach in PS3.15E. Possible values are 'Full' (keep all temporal info intact), 'Off' (remove all temporal attributes or add defaults per PS3.15E), or 'Offset' (move all temporal attributes by a duration; an ISO-8601 compliant duration dateOffset parameter must be passed).
- 'retainDeviceIdentityOption': true or false. If true, overrides retainLongitudinalTemporalInformationOptions for the respective attributes to keep.
- 'retainUIDsOption': 'On', 'Hashed'.
- If 'On', maintain all UIDs.
- If 'Hashed', creates a new UID using a decentrally repeatable, hash-based method.
- maintains referential integrity even if de-identifying data in separate, or decentralized, batches
- use if the risk of re-identifying by UID is not bigger than the risk of re-identifying by PixelData
- do not use if you want to specifically protect UIDs from an auxiliary knowledge attack, e.g. an attacker that knows possible input UIDs
- For compatibility, the 'Off' option is now treated the same way as 'Hashed'.
- There are more instance UIDs in part PS3.06 than described in PS3.15E for protection, therefore this option identifies the following uids for protection: 1. All instance UIDs per PS3.15E, 2. Any additional UIDs with a value not well-known in DICOM, per table PS3.06A (Registry of DICOM Unique Identifiers). This protects instance UIDs but also private class UIDs, which is intentional.
- 'retainSafePrivateOption': 'Quarantine' or 'Off'. If 'Quarantine', keeps all private tags but creates a quarantine log for manual review
- 'retainInstitutionIdentityOption': true or false
- does not currently clean structured content
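The "decentrally repeatable, hash-based" idea behind the 'Hashed' UID option can be illustrated as follows. This is a sketch of one possible scheme, not the method dicom-curate uses: it assumes SHA-256, truncation to 128 bits, and the UUID-style "2.25." root, and `hashUid` is a hypothetical helper name.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical helper, NOT dicom-curate's implementation: derives a
// replacement UID deterministically from the original UID.
function hashUid(originalUid: string): string {
  // SHA-256 the original UID and keep the first 128 bits (32 hex characters).
  const hex = createHash('sha256').update(originalUid).digest('hex').slice(0, 32)
  // Write the 128-bit value in decimal under the "2.25." root. The result is
  // at most 5 + 39 = 44 characters, within DICOM's 64-character UID limit.
  return '2.25.' + BigInt('0x' + hex).toString(10)
}

// The same input always yields the same output, on any machine.
const mappedStudyUid = hashUid('1.2.840.113619.2.55.3.1234')
```

Because the mapping depends only on the input UID, two sites that de-identify different series of the same study independently still produce matching StudyInstanceUIDs. The same property is why an attacker who knows candidate input UIDs can hash them and look for matches, as noted in the list above.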
