npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

@bioscript/seq-format

v0.1.0

Published

Bioinformatics file format converters and parsers (GenBank, EMBL, GFF/GTF, BED, VCF, SAM/BAM)

Downloads

53

Readme

@bioscript/seq-format

Comprehensive bioinformatics file format converters and parsers for GenBank, EMBL, GFF/GTF, BED, VCF, SAM, CIGAR, and Newick formats.

Features

GenBank Support - Parse and convert GenBank ↔ FASTA with feature extraction
EMBL Support - Parse EMBL format to FASTA
GFF/GTF Support - Parse and write GFF3 and GTF annotations
BED Support - Handle BED3, BED6, and BED12 formats
VCF Support - Parse and write variant call files
SAM Support - Parse SAM alignment files with flag decoding
CIGAR Utilities - Parse, analyze, and visualize CIGAR strings
Newick Trees - Parse and manipulate phylogenetic trees
🚀 High Performance - Optimized parsers with O(n) complexity
📦 Zero Dependencies - Pure TypeScript, no external dependencies
🔒 Type Safe - Full TypeScript support with strict types
Well Tested - 239 tests with 95.86% coverage

Installation

npm install @bioscript/seq-format

Quick Start

import { parseGenBank, genBankToFasta } from '@bioscript/seq-format';
import { readFileSync } from 'fs';

// Parse GenBank file
const gbText = readFileSync('sequence.gb', 'utf-8');
const gbRecord = parseGenBank(gbText);
console.log(gbRecord.locus, gbRecord.sequence.length);

// Convert to FASTA
const fastaRecords = genBankToFasta(gbRecord);
console.log(`>${fastaRecords[0].id} ${fastaRecords[0].description}`);
console.log(fastaRecords[0].sequence);

API Documentation

GenBank Format

parseGenBank(text: string): GenBankRecord

Parse GenBank format text into structured record.

const gbRecord = parseGenBank(genBankText);
console.log(gbRecord.locus);
console.log(gbRecord.features.length);
console.log(gbRecord.sequence);

genBankToFasta(record: GenBankRecord, includeFeatures?: boolean): FastaRecord[]

Convert GenBank record to FASTA format. Optionally extract CDS features as separate sequences.

// Main sequence only
const fastaRecords = genBankToFasta(gbRecord);

// Include CDS features
const allFeatures = genBankToFasta(gbRecord, true);

fastaToGenBank(record: FastaRecord, options?): string

Convert FASTA record to minimal GenBank format.

const fasta = { id: 'seq1', description: 'My sequence', sequence: 'ATGC' };
const genbank = fastaToGenBank(fasta, { organism: 'E. coli' });

EMBL Format

parseEMBL(text: string): EMBLRecord

Parse EMBL format text into structured record.

const emblRecord = parseEMBL(emblText);
console.log(emblRecord.id, emblRecord.accession);

emblToFasta(record: EMBLRecord): FastaRecord

Convert EMBL record to FASTA format.

const fasta = emblToFasta(emblRecord);
console.log(`>${fasta.id} ${fasta.description}\n${fasta.sequence}`);

GFF/GTF Format

parseGFFLine(line: string, version?: 'gff3' | 'gtf'): GFFRecord

Parse single GFF3 or GTF line.

const line = 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tID=gene1;Name=DDX11L1';
const record = parseGFFLine(line, 'gff3');
console.log(record.seqid, record.start, record.end);
console.log(record.attributes.ID);

formatGFFLine(record: GFFRecord, version?: 'gff3' | 'gtf'): string

Format GFF record as GFF3 or GTF line.

const line = formatGFFLine(record, 'gff3');

parseGFF(text: string, version?: 'gff3' | 'gtf'): GFFRecord[]

Parse complete GFF3/GTF file.

const records = parseGFF(gffText, 'gff3');
console.log(`Parsed ${records.length} annotations`);

formatGFF(records: GFFRecord[], version?: 'gff3' | 'gtf', includeHeader?: boolean): string

Format array of GFF records as complete file.

const gffText = formatGFF(records, 'gff3', true);

BED Format

parseBEDLine(line: string): BEDRecord

Parse single BED line (auto-detects BED3, BED6, BED12).

const bed6 = parseBEDLine('chr1\t1000\t2000\tfeature1\t500\t+');
console.log(bed6.chrom, bed6.chromStart, bed6.name);

formatBEDLine(record: BEDRecord): string

Format BED record as line (output format determined by available fields).

const line = formatBEDLine(record);

parseBED(text: string): BEDRecord[]

Parse complete BED file.

const records = parseBED(bedText);
console.log(`Parsed ${records.length} regions`);

formatBED(records: BEDRecord[], includeHeader?: boolean, trackName?: string): string

Format array of BED records as complete file.

const bedText = formatBED(records, true, 'myRegions');

VCF Format

parseVCFHeader(text: string): VCFHeader

Parse VCF header section.

const header = parseVCFHeader(vcfText);
console.log(header.fileformat, header.samples);

parseVCFLine(line: string, samples?: string[]): VCFRecord

Parse single VCF data line.

const record = parseVCFLine(line, header.samples);
console.log(record.chrom, record.pos, record.ref, record.alt);

parseVCF(text: string): { header: VCFHeader; records: VCFRecord[] }

Parse complete VCF file.

const { header, records } = parseVCF(vcfText);
console.log(`Parsed ${records.length} variants`);

formatVCFLine(record: VCFRecord): string

Format VCF record as line.

const line = formatVCFLine(record);

formatVCF(header: VCFHeader, records: VCFRecord[]): string

Format VCF header and records as complete file.

const vcfText = formatVCF(header, records);

SAM Format

decodeSAMFlags(flag: number): SAMFlags

Decode SAM bitwise flags into components.

const flags = decodeSAMFlags(163);
console.log(flags.paired, flags.properPair, flags.reverse);
// true, true, true

encodeSAMFlags(flags: Partial): number

Encode SAM flag components into bitwise integer.

const flag = encodeSAMFlags({
  paired: true,
  properPair: true,
  reverse: true,
  first: true
});
console.log(flag); // 163

parseSAMHeader(text: string): SAMHeader

Parse SAM header section.

const header = parseSAMHeader(samText);
console.log(header.version, header.references.length);

parseSAMLine(line: string): SAMRecord

Parse single SAM alignment line.

const record = parseSAMLine(line);
console.log(record.qname, record.pos, record.cigar);
console.log(record.flags.paired, record.flags.reverse);

parseSAM(text: string): { header: SAMHeader; records: SAMRecord[] }

Parse complete SAM file.

const { header, records } = parseSAM(samText);
console.log(`Parsed ${records.length} alignments`);

formatSAMLine(record: SAMRecord): string

Format SAM record as line.

const line = formatSAMLine(record);

formatSAM(header: SAMHeader, records: SAMRecord[]): string

Format SAM header and records as complete file.

const samText = formatSAM(header, records);

CIGAR Utilities

CIGAR (Compact Idiosyncratic Gapped Alignment Report) strings represent alignment operations in SAM/BAM files.

parseCIGAR(cigar: string): CigarOp[]

Parse CIGAR string into array of operations.

const ops = parseCIGAR('8M2I4M1D3M');
// [
//   { length: 8, operation: 'M' },
//   { length: 2, operation: 'I' },
//   { length: 4, operation: 'M' },
//   { length: 1, operation: 'D' },
//   { length: 3, operation: 'M' }
// ]

Supported operations:

  • M - Match/Mismatch
  • I - Insertion to reference
  • D - Deletion from reference
  • N - Skipped region (intron)
  • S - Soft clipping (clipped bases present in SEQ)
  • H - Hard clipping (clipped bases not present in SEQ)
  • P - Padding (silent deletion)
  • = - Sequence match
  • X - Sequence mismatch

formatCIGAR(operations: CigarOp[]): string

Format CIGAR operations back to string.

const ops = [
  { length: 8, operation: 'M' },
  { length: 2, operation: 'I' }
];
const cigar = formatCIGAR(ops); // '8M2I'

getCIGARStats(operations: CigarOp[]): CigarStats

Calculate alignment statistics from CIGAR operations.

const ops = parseCIGAR('8M2I4M1D3M');
const stats = getCIGARStats(ops);
console.log(stats);
// {
//   alignedLength: 18,
//   matches: 15,
//   mismatches: 0,
//   insertions: 2,
//   deletions: 1,
//   softClipped: 0,
//   hardClipped: 0,
//   referenceLength: 16,
//   queryLength: 17
// }

cigarToAlignedSequence(cigar: string, querySeq: string): { query: string; reference: string }

Generate aligned sequences with gaps from CIGAR and query sequence.

const aligned = cigarToAlignedSequence('3M2I2M1D2M', 'ACGTACGTA');
console.log(aligned.query);      // 'ACGTACG-TA'
console.log(aligned.reference);  // 'ACG--CGNTA'

validateCIGAR(cigar: string): boolean

Validate CIGAR string format.

validateCIGAR('8M2I4M1D3M'); // true
validateCIGAR('8M2Q4M');     // false (invalid operation Q)
validateCIGAR('*');          // true (unmapped)

Newick Tree Format

Newick format is standard for representing phylogenetic trees.

parseNewick(newick: string): NewickTree

Parse Newick format string into tree structure.

const tree = parseNewick('((A:0.1,B:0.2):0.3,C:0.4);');
console.log(tree.leafCount);   // 3
console.log(tree.maxDepth);    // 2
console.log(tree.hasLengths);  // true

Tree structure:

interface NewickTree {
  root: NewickNode;
  leafCount: number;
  maxDepth: number;
  hasLengths: boolean;
}

interface NewickNode {
  name?: string;
  length?: number;
  children?: NewickNode[];
}

formatNewick(tree: NewickTree, options?): string

Format tree structure back to Newick string.

const tree = parseNewick('((A:0.1,B:0.2):0.3,C:0.4);');
const newick = formatNewick(tree); // '((A:0.100000,B:0.200000):0.300000,C:0.400000);'

// Without branch lengths
const simple = formatNewick(tree, { includeLengths: false }); // '((A,B),C);'

// Custom precision
const short = formatNewick(tree, { precision: 2 }); // '((A:0.10,B:0.20):0.30,C:0.40);'

countLeaves(node: NewickNode): number

Count leaf nodes in tree.

const tree = parseNewick('((A,B),(C,D));');
const count = countLeaves(tree.root); // 4

calculateDepth(node: NewickNode): number

Calculate maximum tree depth (leaves are at depth 0).

const tree = parseNewick('(((A,B),C),D);');
const depth = calculateDepth(tree.root); // 3

getLeafNames(node: NewickNode): string[]

Get all leaf names from tree.

const tree = parseNewick('((A:0.1,B:0.2):0.3,C:0.4);');
const leaves = getLeafNames(tree.root); // ['A', 'B', 'C']

getTotalLength(node: NewickNode): number

Calculate total tree length (sum of all branch lengths).

const tree = parseNewick('((A:0.1,B:0.2):0.3,C:0.4);');
const totalLength = getTotalLength(tree.root); // 1.0

validateNewick(newick: string): boolean

Validate Newick format string.

validateNewick('((A,B),C);');  // true
validateNewick('((A,B),C)');   // false (missing semicolon)
validateNewick('((A,B);');     // false (unmatched parentheses)

Real-World Examples

Extract CDS Sequences from GenBank

import { parseGenBank, genBankToFasta } from '@bioscript/seq-format';
import { readFileSync, writeFileSync } from 'fs';

const gbText = readFileSync('genome.gb', 'utf-8');
const gbRecord = parseGenBank(gbText);

// Extract all CDS features
const fastaRecords = genBankToFasta(gbRecord, true);
const cdsRecords = fastaRecords.slice(1); // Skip main sequence

// Write to FASTA file
const fastaText = cdsRecords
  .map(r => `>${r.id} ${r.description}\n${r.sequence}`)
  .join('\n');

writeFileSync('cds_sequences.fasta', fastaText);

Filter VCF by Quality

import { parseVCF, formatVCF } from '@bioscript/seq-format';

const { header, records } = parseVCF(vcfText);

// Filter high-quality variants (QUAL > 30)
const highQualityVariants = records.filter(r => r.qual && r.qual > 30);

// Write filtered VCF
const filteredVcf = formatVCF(header, highQualityVariants);
writeFileSync('high_quality.vcf', filteredVcf);

Convert GFF3 to GTF

import { parseGFF, formatGFF } from '@bioscript/seq-format';

const gff3Records = parseGFF(gff3Text, 'gff3');

// Convert to GTF format
const gtfText = formatGFF(gff3Records, 'gtf', true);
writeFileSync('annotations.gtf', gtfText);

Analyze SAM Alignment Flags

import { parseSAM } from '@bioscript/seq-format';

const { header, records } = parseSAM(samText);

// Count paired-end reads
const pairedReads = records.filter(r => r.flags?.paired).length;
console.log(`Paired reads: ${pairedReads} / ${records.length}`);

// Count unmapped reads
const unmappedReads = records.filter(r => r.flags?.unmapped).length;
console.log(`Unmapped: ${unmappedReads} (${(unmappedReads / records.length * 100).toFixed(1)}%)`);

// Count reverse strand alignments
const reverseReads = records.filter(r => r.flags?.reverse).length;
console.log(`Reverse strand: ${reverseReads}`);

Extract Gene Regions from BED

import { parseBED } from '@bioscript/seq-format';

const bedRecords = parseBED(bedText);

// Filter genes on chromosome 1
const chr1Genes = bedRecords.filter(r => r.chrom === 'chr1');

// Find large genes (>10kb)
const largeGenes = bedRecords.filter(r => 
  (r.chromEnd - r.chromStart) > 10000
);

console.log(`Large genes: ${largeGenes.length}`);
largeGenes.forEach(gene => {
  const size = gene.chromEnd - gene.chromStart;
  console.log(`${gene.name}: ${size}bp on ${gene.strand}`);
});

Analyze Alignment Quality with CIGAR

import { parseSAM, parseCIGAR, getCIGARStats } from '@bioscript/seq-format';

const { records } = parseSAM(samText);

// Analyze alignment quality
records.forEach(record => {
  if (record.cigar && record.cigar !== '*') {
    const ops = parseCIGAR(record.cigar);
    const stats = getCIGARStats(ops);
    
    const matchRate = stats.matches / stats.alignedLength;
    const indelRate = (stats.insertions + stats.deletions) / stats.alignedLength;
    
    console.log(`${record.qname}:`);
    console.log(`  Match rate: ${(matchRate * 100).toFixed(1)}%`);
    console.log(`  Indel rate: ${(indelRate * 100).toFixed(1)}%`);
    console.log(`  Reference span: ${stats.referenceLength}bp`);
  }
});

Visualize CIGAR Alignment

import { cigarToAlignedSequence } from '@bioscript/seq-format';

const cigar = '5M2I3M1D4M';
const query = 'ACGTACCGTACGT';

const aligned = cigarToAlignedSequence(cigar, query);

console.log('Query:    ', aligned.query);
console.log('Reference:', aligned.reference);
// Query:     ACGTACCGTACGT
// Reference: ACGTA--CGT-ACGT

Build and Analyze Phylogenetic Tree

import { parseNewick, getLeafNames, calculateDepth, getTotalLength } from '@bioscript/seq-format';

// Parse tree with branch lengths
const tree = parseNewick('((Human:0.2,Chimp:0.15):0.1,(Mouse:0.3,Rat:0.25):0.15);');

console.log(`Species: ${tree.leafCount}`);
console.log(`Tree depth: ${tree.maxDepth}`);
console.log(`Total branch length: ${getTotalLength(tree.root).toFixed(2)}`);

// Get all species names
const species = getLeafNames(tree.root);
console.log(`Species list: ${species.join(', ')}`);
// Species list: Human, Chimp, Mouse, Rat

Convert Tree Formats

import { parseNewick, formatNewick } from '@bioscript/seq-format';

// Parse tree with high precision
const tree = parseNewick('(A:0.123456,B:0.987654);');

// Format with different precisions
const fullPrecision = formatNewick(tree, { precision: 6 });
const lowPrecision = formatNewick(tree, { precision: 2 });
const noLengths = formatNewick(tree, { includeLengths: false });

console.log(fullPrecision); // '(A:0.123456,B:0.987654);'
console.log(lowPrecision);  // '(A:0.12,B:0.99);'
console.log(noLengths);     // '(A,B);'

Performance

  • GenBank parsing: 1-10KB files in <1ms, 100KB-1MB genomes in 5-50ms
  • GFF/GTF parsing: 10K lines in ~50ms, 100K lines in ~500ms
  • BED parsing: 10K regions in ~30ms, 100K regions in ~300ms
  • VCF parsing: 10K variants in ~100ms, 100K variants in ~1s
  • SAM parsing: 10K alignments in ~150ms, 100K alignments in ~1.5s
  • Memory: O(n) where n is input size, no streaming (use seq-stream for large files)

All parsers use O(n) algorithms with efficient string processing.

Format Support

Fully Supported

  • ✅ GenBank (parse, write, convert to/from FASTA)
  • ✅ EMBL (parse, convert to FASTA)
  • ✅ GFF3 (parse, write)
  • ✅ GTF (parse, write)
  • ✅ BED3/BED6/BED12 (parse, write)
  • ✅ VCF (parse, write)
  • ✅ SAM (parse, write, flag decoding)
  • ✅ CIGAR (parse, format, analyze, validate)
  • ✅ Newick (parse, format, tree analysis)

Limitations

  • BAM: Binary format not supported (use samtools for BAM ↔ SAM conversion)
  • Streaming: Not optimized for multi-GB files (see @bioscript/seq-stream for streaming)
  • GenBank features: Simplified location parsing (join/complement partially supported)

TypeScript Types

All parsers return fully typed objects:

interface GenBankRecord {
  locus: string;
  definition: string;
  accession: string;
  version: string;
  keywords: string;
  source: string;
  organism: string;
  references: string[];
  features: GenBankFeature[];
  sequence: string;
}

interface GFFRecord {
  seqid: string;
  source: string;
  type: string;
  start: number;
  end: number;
  score: number | null;
  strand: '+' | '-' | '.' | '?';
  phase: 0 | 1 | 2 | null;
  attributes: Record<string, string | string[]>;
}

// ... and more (see types.ts for complete definitions)

Error Handling

All parsers throw descriptive errors:

try {
  const record = parseGenBank(invalidText);
} catch (error) {
  console.error(error.message);
  // "Invalid GenBank format: missing LOCUS"
}

License

MIT © 2026 Mykyta Forofontov