dicom-curate

v0.20.2

Published

3 days ago

Organize and de-identify DICOM header data


crkaz

bebbi

dicom-curate

Organize and de-identify DICOM header values and file hierarchies based on a provided configuration object.

⚠️ Disclaimer

This project is currently in a pre-1.0.0 state. APIs and behavior may change at any time without notice.

You're welcome to open issues, but please only do so if you're also willing to contribute a pull request.

Why

This provides an open configuration language and a ready-to-use library for modifying DICOM headers for the purpose of de-identification and organization.

The library is toolkit-agnostic: it exposes functions that modify decoded DICOM headers in "DICOM JSON" format, so it does not depend on any particular DICOM parser.

Usage

Consuming Dicom-Curate

The build output includes:

An ESM build, generated by esbuild with proper CommonJS dependency handling
A UMD and a minified UMD build, generated by Rollup

They can be consumed as follows:

| File | Used by | How to import / include |
| --- | --- | --- |
| dist/esm/index.js | Modern bundlers, ESM-aware tools | import ... from 'dicom-curate' |
| dicom-curate.umd.js | CommonJS, Node.js | require('dicom-curate') or require('dicom-curate/umd') |
| dicom-curate.umd.min.js | Browsers via CDN or <script> | <script src=".../dicom-curate.umd.min.js"></script> |

The unminified UMD build (/umd) is primarily intended for demos and debugging.

Examples

Converting a nested input folder structure containing DICOM files to a cleaned output folder destination (note: this uses a browser API only supported in Chrome and Edge browsers):

import { curateMany, OrganizeOptions } from 'dicom-curate'

const options: OrganizeOptions = {
  inputType: 'directory',
  inputDirectory, // input folder directory handle
  outputDirectory, // output folder directory handle
  curationSpec, // DICOM curation specification
  columnMapping, // csv file handle to add csv-based mapping
}

// Read input, map headers, write to well-structured output.
curateMany(options, onProgressCallback)

Alternatively, a list of Files is accepted:

const options: OrganizeOptions = {
  inputType: 'files',
  inputFiles, // list of `File` objects
  outputDirectory, // output folder directory handle
  curationSpec, // DICOM curation specification
  columnMappings, // csv file handle to add csv-based mapping
}

If outputDirectory is omitted, output Blobs will be passed to the onProgressCallback function instead.

In the Node.js environment, there are no directory handles. Instead, you may pass directory paths:

const options: OrganizeOptions = {
  inputType: 'path',
  inputDirectory, // input folder directory path, e.g. "/home/user/files"
  outputDirectory, // output folder directory path, e.g. "/home/user/outputs"
  curationSpec, // DICOM curation specification
  columnMapping, // csv file handle to add csv-based mapping
}

It is also possible to save curated files to an HTTP endpoint. Provide a base URL, optionally with additional HTTP headers, and files will be uploaded using a PUT request.

const options: OrganizeOptions = {
  inputType: 'path',
  inputDirectory, // input folder directory path, e.g. "/home/user/files"
  outputEndpoint: {
    url: 'http://example.com/base-url',
    headers: {
      Authorization: 'Bearer xxx',
    },
  },
  curationSpec, // DICOM curation specification
  columnMapping, // csv file handle to add csv-based mapping
}

The same can be done on the input as well:

const options: OrganizeOptions = {
  inputType: 'http',
  inputUrls: ['http://example.com/file1.dcm', 'http://example.com/file2.dcm'],
  headers: {
    Authorization: 'Bearer xxx',
  },
  // other options
}
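
For the output case, a receiving server needs to know what requests to expect. The sketch below shows one plausible way a PUT request could be assembled from a base URL, optional headers, and the output path components. The helper buildPutRequest, its types, and the rule of appending the URL-encoded path to the base URL are assumptions made for illustration; they are not taken from dicom-curate, so check the library source for the exact URL layout.

```typescript
// Hypothetical shapes; only `url` and `headers` appear in the README example.
interface OutputEndpoint {
  url: string
  headers?: Record<string, string>
}

interface PutRequest {
  method: 'PUT'
  url: string
  headers: Record<string, string>
}

// Assumption: the output path is appended to the base URL, one
// URL-encoded segment per path component.
function buildPutRequest(
  endpoint: OutputEndpoint,
  pathComponents: string[],
): PutRequest {
  const base = endpoint.url.replace(/\/+$/, '') // drop trailing slashes
  const path = pathComponents.map((part) => encodeURIComponent(part)).join('/')
  return {
    method: 'PUT',
    url: `${base}/${path}`,
    headers: { ...(endpoint.headers ?? {}) },
  }
}

const request = buildPutRequest(
  {
    url: 'http://example.com/base-url',
    headers: { Authorization: 'Bearer xxx' },
  },
  ['AB12-123', 'Visit 1', 'img001.dcm'],
)
console.log(request.url)
// http://example.com/base-url/AB12-123/Visit%201/img001.dcm
```

Such a descriptor could then be passed to fetch(request.url, { method: request.method, headers: request.headers, body: blob }).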

It is also possible to use S3-compatible buckets as input or output locations. Consult OrganizeOptions for further details. Please note that this feature is only available if you have the @aws-sdk/client-s3 package installed.

The library can automatically skip writing (or uploading) a mapped file when the file's current attributes match the previously recorded attributes that you pass for it in the fileInfoIndex property:

const options: OrganizeOptions = {
  // other options are skipped
  fileInfoIndex: {
    // Last observed file size + mtime are provided for this file
    // The file will be skipped if these attributes haven't changed
    'input_file1.dcm': {
      // File size when this input was last processed
      size: 123456,
      // Last modification time of the file when it was last processed
      mtime: '2025-11-12T17:56:13.419Z',
    },
    // Last observed hash is provided for this input file.
    // The hash algorithm used is determined by hashMethod.
    // If the input file has the same hash as specified here, it will be skipped as unchanged.
    'input_file2.dcm': {
      preMappedHash: 'd8e8fca2dc0f896fd7cb4cb0031ba249',
    },
    // The postMappedHash is the hash of the processed file.
    // It is possible to provide the postMappedHash either under the input file path
    // or under the output file path.
    // In either case, if only postMappedHash is provided for a file, the file has to be
    // processed first and only then can it be determined as unchanged and not written or uploaded,
    // so the optimization is not as great as when some of the above properties are provided.
    //
    // To avoid collisions with input file names, keys representing output (post-mapped) file names
    // need to be prefixed with OUTPUT_FILE_PREFIX.
    [OUTPUT_FILE_PREFIX + 'output_file3.dcm']: {
      // post-mapped hash
      postMappedHash: '126a8a51b9d1bbd07fddc65819a542c3',
    },
  },
  hashMethod: 'md5',
}
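
The decision described in the comments above can be summarized as a small function. This is only a sketch: decideSkip, its types, and the precedence between size/mtime and the hashes are hypothetical, and the library's actual logic may differ.

```typescript
// Hypothetical record shapes mirroring the fileInfoIndex example above.
interface PreviousFileInfo {
  size?: number
  mtime?: string // ISO-8601 timestamp
  preMappedHash?: string
  postMappedHash?: string
}

interface CurrentFileInfo {
  size: number
  mtime: string // ISO-8601 timestamp
  hash?: string // hash of the input bytes, computed with hashMethod
}

type SkipDecision = 'skip' | 'process' | 'process-then-compare'

function decideSkip(
  current: CurrentFileInfo,
  previous?: PreviousFileInfo,
): SkipDecision {
  // No record for this file: it has to be processed.
  if (!previous) return 'process'

  // Cheapest check: size and modification time, no file read needed.
  if (previous.size !== undefined && previous.mtime !== undefined) {
    const unchanged =
      previous.size === current.size &&
      Date.parse(previous.mtime) === Date.parse(current.mtime)
    return unchanged ? 'skip' : 'process'
  }

  // Next: the hash of the input bytes, which requires reading the file.
  if (previous.preMappedHash !== undefined) {
    return current.hash === previous.preMappedHash ? 'skip' : 'process'
  }

  // Only a post-mapped hash is known: the file must be mapped first,
  // and only the write or upload can be skipped afterwards.
  if (previous.postMappedHash !== undefined) return 'process-then-compare'

  return 'process'
}

const decision = decideSkip(
  { size: 123456, mtime: '2025-11-12T17:56:13.419Z' },
  { size: 123456, mtime: '2025-11-12T17:56:13.419Z' },
)
console.log(decision) // skip
```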

You can also call curateOne directly and receive a promise with the mapped blob:

import { curateOne, extractColumnMappings } from 'dicom-curate'

// Data prep responsibility for optional table is with caller
const columnMappings = extractColumnMappings([
  { subjectID: 'SubjectID1', blindedID: 'BlindedID1' },
  { subjectID: 'SubjectID2', blindedID: 'BlindedID2' },
])

curateOne({
  fileInfo, // path, name, size, kind, blob
  mappingOptions: { curationSpec, columnMappings },
})
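
The rows passed to extractColumnMappings above are plain objects keyed by column name. As a minimal sketch of how such a table can translate one identifier into another, the snippet below finds the row whose lookup column matches a value and returns its replacement column. The helper mapValue is hypothetical and uses a simple linear search; it is not part of dicom-curate and says nothing about how the library stores or indexes the table internally.

```typescript
type Row = Record<string, string>

// Find the first row whose lookup column equals `value` and return
// the replacement column of that row, or undefined if nothing matches.
function mapValue(
  rows: Row[],
  value: string,
  lookup: (row: Row) => string | undefined,
  replace: (row: Row) => string | undefined,
): string | undefined {
  const match = rows.find((row) => lookup(row) === value)
  return match ? replace(match) : undefined
}

const rows: Row[] = [
  { subjectID: 'SubjectID1', blindedID: 'BlindedID1' },
  { subjectID: 'SubjectID2', blindedID: 'BlindedID2' },
]

const blinded = mapValue(
  rows,
  'SubjectID2',
  (row) => row['subjectID'],
  (row) => row['blindedID'],
)
console.log(blinded) // BlindedID2
```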

An example DICOM curation function:

import type { TCurationSpecification } from 'dicom-curate'

/*
 * Curation specification for batch-curating DICOM files.
 */
export function sampleBatchCurationSpecification(): TCurationSpecification {
  const hostProps = {
    protocolNumber: 'Sample_Protocol_Number',
    activityProviderName: 'Sample_CRO',
    centerSubjectId: /^[A-Z]{2}\d{2}-\d{3}$/,
    timepointNames: ['Visit 1', 'Visit 2', 'Visit 3'],
    // Folder "scan": the trial-specific/provider-assigned series name
    scanNames: ['3DT1 Sagittal', 'PET-Abdomen'],
  }

  return {
    // Review the required input folder structure (all DICOM files need minimally this folder depth)
    // This configuration depends on correct centerSubjectId, timepoint, scan folder names.
    inputPathPattern:
      'protocolNumber/activityProvider/centerSubjectId/timepoint/scan',

    additionalData: {
      // collect from a csv file. A client can use regex to validate the input.
      type: 'load',
      collect: {
        CURR_ID: hostProps.centerSubjectId,
        StudyDescription: hostProps.timepointNames,
        MAPPED_ID: /BLIND_\d+/,
      },
      // With this, you can refer to mappings as parser.getMapping('blindedId')
      mapping: {
        // Using the CSV
        blindedId: {
          value: (parser) => parser.getDicom('PatientID'),
          lookup: (row) => row['CURR_ID'],
          replace: (row) => row['MAPPED_ID'],
        },
      },
    },

    version: '3.0',
    hostProps,

    // This specifies the standardized DICOM de-identification
    dicomPS315EOptions: {
      cleanDescriptorsOption: true,
      cleanDescriptorsExceptions: ['SeriesDescription'],
      retainLongitudinalTemporalInformationOptions: 'Full',
      retainPatientCharacteristicsOption: [
        'PatientWeight',
        'PatientSize',
        'PatientAge',
        'PatientSex',
        'SelectorASValue',
      ],
      retainDeviceIdentityOption: true,
      retainUIDsOption: 'Hashed',
      retainSafePrivateOption: 'Quarantine',
      retainInstitutionIdentityOption: true,
    },

    modifyDicomHeader(parser) {
      const scan = parser.getFilePathComp('scan')
      const centerSubjectId = parser.getFilePathComp('centerSubjectId')

      return {
        // Align the PatientID DICOM header with the centerSubjectId folder name.
        PatientID: centerSubjectId,
        // This example maps PatientIDs based on the mapping CSV file.
        // PatientID: parser.getMapping('blindedId'),
        PatientName: centerSubjectId,
        // Align the StudyDescription DICOM header with the timepoint folder name.
        StudyDescription: parser.getFilePathComp('timepoint'),
        // The party responsible for assigning a standard ClinicalTrialSeriesDescription
        ClinicalTrialCoordinatingCenterName: hostProps.activityProviderName,
        // Align the ClinicalTrialSeriesDescription DICOM header with the scan folder name.
        ClinicalTrialSeriesDescription: scan,
      }
    },

    outputFilePathComponents(parser) {
      const scan = parser.getFilePathComp('scan')
      const centerSubjectId = parser.getFilePathComp('centerSubjectId')

      return [
        parser.getFilePathComp('protocolNumber'),
        parser.getFilePathComp('activityProvider'),
        centerSubjectId,
        parser.getFilePathComp('timepoint'),
        scan + '=' + parser.getDicom('SeriesNumber'),
        parser.getFilePathComp(parser.FILEBASENAME) + '.dcm',
      ]
    },

    // This section defines the validation rules for the input DICOMs.
    // The processing continues on errors, but errors will have to be fixed
    // or reviewed between the parties.
    errors(parser) {
      return [
        // File path
        [
          'Invalid study folder name',
          parser.getFilePathComp('protocolNumber') !== hostProps.protocolNumber,
        ],
        // DICOM header
        ['Missing Modality', parser.missingDicom('Modality')],
        ['Missing SOP Class UID', parser.missingDicom('SOPClassUID')],
      ]
    },
  }
}
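
The inputPathPattern in the specification above names each folder level so that parser.getFilePathComp can return it by name. A minimal sketch of that idea follows, assuming the pattern is aligned with the start of a '/'-separated relative path and that the file base name is the file name without its extension. parseFilePath and fileBaseName are hypothetical names used for illustration, not library API.

```typescript
interface ParsedPath {
  components: Record<string, string>
  fileBaseName: string
}

function parseFilePath(pattern: string, filePath: string): ParsedPath {
  const names = pattern.split('/')
  const segments = filePath.split('/').filter((s) => s.length > 0)

  // The last segment is the file itself; the rest are folders.
  const fileName = segments.pop()
  if (fileName === undefined || segments.length < names.length) {
    throw new Error(`Path "${filePath}" is shallower than pattern "${pattern}"`)
  }

  // Assign each named pattern level to the folder at the same depth.
  const components: Record<string, string> = {}
  names.forEach((name, i) => {
    components[name] = segments[i] as string
  })

  return { components, fileBaseName: fileName.replace(/\.[^.]*$/, '') }
}

const parsed = parseFilePath(
  'protocolNumber/activityProvider/centerSubjectId/timepoint/scan',
  'Sample_Protocol_Number/Sample_CRO/AB12-123/Visit 1/PET-Abdomen/img001.dcm',
)
console.log(parsed.components['centerSubjectId']) // AB12-123
console.log(parsed.fileBaseName) // img001
```

A client could validate the extracted values against the hostProps entries, for example by testing parsed.components['centerSubjectId'] against the centerSubjectId regular expression.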

DICOM Conformance Notes

dicom-curate

does not use an Encrypted Attributes Sequence
does not anonymize burnt-in information or modify PixelData
populates the PatientIdentityRemoved attribute with YES
populates the LongitudinalTemporalInformationModified attribute per DICOM PS3.15E
populates the DeidentificationMethod attribute with information about this README
populates the DeidentificationMethodCodeSequence with the CID7050 codes of provided options, per PS3.15E
keeps only the following in File Meta Information: 'FileMetaInformationVersion', 'ImplementationClassUID', 'ImplementationVersionName', 'MediaStorageSOPClassUID'. It also sets 'TransferSyntaxUID' to 'Explicit VR Little Endian' and 'MediaStorageSOPInstanceUID' to the correct SOP instance UID.
cleans sequences ('SQ') by recursively applying the de-identification rules to each Dataset in each Item of the Sequence.
uses an allow-list approach, by removing everything not defined in PS3.06 or handled in PS3.15E1.1.
identifies and removes additional ID attributes beyond PS3.15E1.1 by parsing PS3.06 and finding all attributes whose names end in "ID(s)" (but not "UID(s)") and that are not defined in PS3.15E. This ID list is defined in "src/config/dicom/retainAdditionalIds.ts", and a few of its entries are manually annotated to be retained if the "retain device identifier option" is activated.
keeps the 'EncapsulatedDocument' attribute if modality is "DOC", unless overridden
keeps the 'VerifyingObserverSequence' if modality is SR, unless overridden
allows the users to describe all cleaning configurations in the curationSpec file
implements the following PS3.15E options:
- 'retainDeviceIdentityOption': Keeps the attributes marked as 'K' and performs the default action on all other attributes
- 'cleanDescriptorsOption': removes all description and comment attributes except those explicitly listed in the cleanDescriptorsExceptions list.
- 'retainLongitudinalTemporalInformationOptions': this considers all temporal attributes (DA, TM, DT), as described as a possible approach in PS3.15E. Possible values are 'Full' (keep all temporal info intact), 'Off' (remove all temporal attributes or add defaults per PS3.15E), or 'Offset' (shift all temporal attributes by a duration; an ISO-8601 compliant duration must be passed in the dateOffset parameter).
- 'retainDeviceIdentityOption': true or false. If true, overrides retainLongitudinalTemporalInformationOptions for the respective attributes to keep.
- 'retainUIDsOption': 'On', 'Hashed'.
  - If 'On', maintain all UIDs.
  - If 'Hashed', creates a new UID using a decentrally repeatable, hash-based method.
    - maintains referential integrity even if de-identifying data in separate, or decentralized, batches
    - use if the risk of re-identifying by UID is not bigger than the risk of re-identifying by PixelData
    - do not use if you want to specifically protect UIDs from an auxiliary knowledge attack, e.g. an attacker that knows possible input UIDs
  - For compatibility, the 'Off' option is now treated the same way as 'Hashed'.
  - PS3.06 contains more instance UIDs than PS3.15E lists for protection. This option therefore protects the following UIDs: 1. All instance UIDs per PS3.15E, 2. Any additional UIDs with a value not well-known in DICOM, per table PS3.06A (Registry of DICOM Unique Identifiers). This protects instance UIDs but also private class UIDs, which is intentional.
- 'retainSafePrivateOption': 'Quarantine' or 'Off'. If 'Quarantine', keeps all private tags but creates a quarantine log for manual review
- 'retainInstitutionIdentityOption': true or false
does not currently clean structured content
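
To illustrate the 'Offset' value of retainLongitudinalTemporalInformationOptions listed above, here is a minimal sketch of shifting a DICOM DA value (YYYYMMDD) by an ISO-8601 duration. It makes several assumptions: only the date designators Y, M, W and D are handled, a leading minus sign is accepted for shifting backwards, and TM and DT values are ignored. The function names are hypothetical and the library's own arithmetic, for instance at month ends, may differ.

```typescript
interface DateDuration {
  sign: 1 | -1
  years: number
  months: number
  days: number
}

// Parse a date-only subset of ISO-8601 durations, e.g. "P30D", "P1Y2M", "-P1W".
function parseDateDuration(iso: string): DateDuration {
  const m = /^(-)?P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)W)?(?:(\d+)D)?$/.exec(iso)
  if (!m || !/\d/.test(iso)) throw new Error(`Unsupported duration: ${iso}`)
  const num = (s: string | undefined): number => (s ? Number(s) : 0)
  return {
    sign: m[1] ? -1 : 1,
    years: num(m[2]),
    months: num(m[3]),
    days: num(m[4]) * 7 + num(m[5]), // weeks are converted to days
  }
}

// Shift a DICOM DA value by the given duration, using UTC calendar arithmetic.
function offsetDicomDate(da: string, isoDuration: string): string {
  const m = /^(\d{4})(\d{2})(\d{2})$/.exec(da)
  if (!m) throw new Error(`Not a DICOM DA value: ${da}`)
  const d = parseDateDuration(isoDuration)
  const shifted = new Date(
    Date.UTC(
      Number(m[1]) + d.sign * d.years,
      Number(m[2]) - 1 + d.sign * d.months, // JavaScript months are 0-based
      Number(m[3]) + d.sign * d.days, // overflow rolls into the next month
    ),
  )
  const pad = (n: number, width: number): string =>
    ('0000' + String(n)).slice(-width)
  return (
    pad(shifted.getUTCFullYear(), 4) +
    pad(shifted.getUTCMonth() + 1, 2) +
    pad(shifted.getUTCDate(), 2)
  )
}

console.log(offsetDicomDate('20240115', 'P30D')) // 20240214
console.log(offsetDicomDate('20240301', '-P1D')) // 20240229
```

Applying the same duration to every date of a subject keeps the intervals between studies intact, which is the point of an offset as opposed to removing the dates.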
