cgparse
v1.1.2
Converts sequence files (GenBank, EMBL, FASTA) and feature files (GFF3, GTF, BED, CSV) to JSON and CGView JSON
CGParse.js
CGParse.js is a lightweight JavaScript library for parsing biological sequence and feature files (GenBank, EMBL, FASTA, GFF3, BED, etc.). It converts these files into CGView-compatible JSON, making them ready for visualization with CGView.js.
Table of Contents
- Installation
- Quick Start
- Intermediate Sequence and Feature JSON formats
- Logger
- Live Test Page
- Development
- Resources
Installation
CGParse.js has no dependencies. Install it from npm or load it directly via jsDelivr.
Installing from npm
npm install cgparse
# or
yarn add cgparse
Installing as a script from jsDelivr
<script src="https://cdn.jsdelivr.net/npm/cgparse/dist/cgparse.min.js"></script>
<!-- CGParse will be available as a global variable -->
Quick Start
Create CGView JSON from a sequence file (e.g. GenBank, EMBL, FASTA)
import * as CGParse from 'cgparse'
// Create a string for the sequence file.
// The sequence string can be in GenBank, EMBL or FASTA format.
// Here we show a FASTA example for brevity.
const seqString = `>Sequence_Example
atgtccgacccaccggtcgaaaaaacaagtccagagaagacagagggctcttcatccggcagttttcgcgtcccactcgaatccgacaagcttggtgacccggatttcaagccatgcgtt
gcacaaatcacaatggagagaaacgcagtgttcgaggacaacttcctcgatcggagacagagtgctcgcgccgtaatcgagtattgctttgaggacgaaatgcaaaacttggttgaaggg
cgacctgctgtttcagaagaaccagttgttcccatacgattccgacgcccaccaccgtccggacctgctcacgacgtctttggcgacgccatgaacgaaattttccagaaattaatgatg
aaaggtcaatgcgcagacttctgccactggatggcttattggctaacaaaggaacaagatgatgcgaatgatggattttttggcaatattcgctataatccagatgtctatgtcacggaa
ggcacaacagaaaccaaaaaggcgttcgtcgacagcatgtggccgactgctcagcgaattcttctgaaatccgtccggaacagcacgattttacgcacaaagtggactggaatccacgtg
tcagcggatcagttgaaggggcaacgcccgaagcaagaagatagattcgtagcttatccgaatagtcagtatatgaatcggacgcaggatcccgtcgcccttctcggtgtgttcgatggt
catggcggacacgagtgctcacaatacgcggcctctcacttctgggaggcatggctggagactcgacaaactagcgacggtgatgagctccagaatcagctgaagaagtcacttgagttg
ttggatcaacgattgacagtcagaagtgtgaaggaatactggaagggtggcactacggcggcgtgttgcgctatcgataaggagaacaaaacgatggcgttcgcgtggttgggcgattca
ccaggatacgtcatgaacaacatggaattccgcaaagtga`;
// Create a new CGView builder for the sequence
const cgvBuilder = new CGParse.CGViewBuilder(seqString);
// Convert to CGView JSON
const cgviewJSON = cgvBuilder.toJSON();
// The JSON can then be loaded into a previously created CGView.js instance
cgv.io.loadJSON(cgviewJSON);
- Details on the CGView JSON format
- Tutorial on using CGParse.js with CGView.js
Create CGView Features from a feature file (e.g. CSV, GFF3, GTF, BED)
import * as CGParse from 'cgparse'
// Create a string for the feature file (e.g. GFF3)
const gff3String = `##gff-version 3
chr1 . gene 1000 2000 . + . ID=gene1;Name=myGene`;
// Create a new feature builder for the features
const featureBuilder = new CGParse.FeatureBuilder(gff3String);
// Convert to an array of CGView features
const features = featureBuilder.toJSON();
// The features can then be added to a previously created CGView instance
cgv.addFeatures(features);
CGViewBuilder Options
const builder = new CGParse.CGViewBuilder(seqString, {
// CGView configuration (see below for example)
config: configJSON,
// The order of preference for the naming of a feature [Default shown]
// The name for a feature will be taken from the first feature qualifier
// in the list that has a value.
// TODO: CONFIRM
nameKeys: ['gene', 'locus_tag', 'product', 'note', 'db_xref'],
// Feature types to exclude (see below for details on including/excluding features)
excludeFeatures: ['gene', 'source', 'exon'],
// Qualifier types to exclude (see below for details on including/excluding features)
excludeQualifiers: ['translation'],
});
Including/Excluding Feature Types and Qualifiers
When building from a GenBank or EMBL file, you can choose which feature types (e.g. CDS, gene, rRNA) and qualifiers (e.g. product, note, locus_tag) to include or exclude. By default, all features and all of their qualifiers are included.
// Include all features and their qualifiers [Default]
const builder = new CGParse.CGViewBuilder(seqString, {
includeFeatures: true, // [Default]
includeQualifiers: true // [Default]
})
// Include no features
const builder = new CGParse.CGViewBuilder(seqString, {
includeFeatures: false,
includeQualifiers: false // [Not required, since there will not be any features]
})
// Include only specific features and their qualifiers
const builder = new CGParse.CGViewBuilder(seqString, {
includeFeatures: ['CDS', 'rRNA'],
includeQualifiers: ['product', 'note', 'locus_tag']
})
// Exclude a subset of features and their qualifiers
// Recommended settings for bacterial genomes
const builder = new CGParse.CGViewBuilder(seqString, {
excludeFeatures: ['source', 'gene', 'exon'],
excludeQualifiers: ['translation']
})
For a list of qualifiers and feature types, see the following resources:
- Qualifiers List from INSDC
- Qualifiers Local List
- Feature Types List from INSDC
- Feature Types Local List
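The include/exclude rules described in this section can be illustrated with a small standalone sketch. It does not use CGParse and is not the library's implementation: `isSelected` and `filterFeatures` are hypothetical helpers, and two behaviours are assumptions made for the example (matching is case-sensitive, and an exclude list wins when a name appears in both lists).

```javascript
// Hypothetical helper (not part of CGParse): decide whether a name passes
// an include/exclude rule.
//   include: true (everything), false (nothing), or an array (only these names)
//   exclude: array of names to drop (assumed to take precedence over include)
function isSelected(name, include = true, exclude = []) {
  if (include === false) return false;
  if (Array.isArray(include) && !include.includes(name)) return false;
  return !exclude.includes(name);
}

// Hypothetical helper: apply feature-type and qualifier rules to a list of
// features shaped like { type, qualifiers }.
function filterFeatures(features, options = {}) {
  const {
    includeFeatures = true,
    excludeFeatures = [],
    includeQualifiers = true,
    excludeQualifiers = [],
  } = options;

  return features
    .filter((feature) => isSelected(feature.type, includeFeatures, excludeFeatures))
    .map((feature) => ({
      ...feature,
      // Rebuild the qualifiers object with only the selected keys
      qualifiers: Object.fromEntries(
        Object.entries(feature.qualifiers || {}).filter(([key]) =>
          isSelected(key, includeQualifiers, excludeQualifiers)
        )
      ),
    }));
}

// Made-up sample features for illustration only
const sample = [
  { type: 'source', qualifiers: { organism: 'Example organism' } },
  { type: 'gene', qualifiers: { gene: 'abcA' } },
  { type: 'CDS', qualifiers: { gene: 'abcA', product: 'kinase', translation: 'MSD' } },
];

// Same options as the "recommended settings for bacterial genomes" above
const bacterial = filterFeatures(sample, {
  excludeFeatures: ['source', 'gene', 'exon'],
  excludeQualifiers: ['translation'],
});
console.log(bacterial.map((feature) => feature.type));
console.log(bacterial[0].qualifiers);
```

With these sample features, only the CDS survives the bacterial settings, and its `translation` qualifier is removed while `gene` and `product` remain.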
CGViewBuilder Config
A config object can be provided to CGViewBuilder. The config is a JSON object whose options are added to the CGView JSON. It can contain any settings available for the following components: settings, backbone, ruler, dividers, annotation, sequence, legend, track, captions.
// Example Config
let configJSON = {
"settings": {
"backgroundColor": "white",
"showShading": true,
"arrowHeadLength": 0.3
},
"ruler": {
"font": "sans-serif, plain, 10",
"color": "black"
},
"legend": {
"position": "top-right",
"defaultFont": "sans-serif, plain, 14",
"items": [
{
"name": "CDS",
"swatchColor": "rgba(0,0,153,0.5)",
"decoration": "arrow"
}
]
},
"tracks": [
{
"name": "GC Content",
"thicknessRatio": 2,
"position": "inside",
"dataType": "plot",
"dataMethod": "sequence",
"dataKeys": "gc-content"
},
{
"name": "GC Skew",
"thicknessRatio": 2,
"position": "inside",
"dataType": "plot",
"dataMethod": "sequence",
"dataKeys": "gc-skew"
}
],
captions: [
{
// For name you can provide static text or use special keywords to get dynamic text
name: "my map", // Shows 'my map' as the caption
// name: "DEFINITION", // Shows the sequence definition (e.g. Escherichia coli str. K-12 substr. MG1655, complete genome.)
// name: "ID", // Shows the sequence accession/version (e.g. NC_000913.3)
position: "bottom-center",
},
]
}
Intermediate Sequence and Feature JSON formats
Internally, CGViewBuilder and FeatureBuilder first convert the input file to an intermediate JSON format that retains most of the data from the input file. This intermediate format can also be used by other projects.
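As one illustration of such reuse, the sketch below flattens intermediate sequence records into one row per feature. It relies only on the record shape shown under Records Output (`seqID`, `features`, `qualifiers`, `locations`) and assumes qualifier values are plain strings. `pickName` and `summarizeFeatures` are hypothetical helpers, not CGParse functions, and the naming rule mirrors the `nameKeys` description rather than the library's actual code.

```javascript
// Hypothetical helper (not part of CGParse): return the value of the first
// qualifier in `nameKeys` that is present and non-empty.
function pickName(qualifiers, nameKeys) {
  for (const key of nameKeys) {
    const value = qualifiers[key];
    if (value !== undefined && value !== null && value !== '') {
      return String(value);
    }
  }
  return undefined; // no preferred qualifier had a value
}

// Hypothetical helper: reduce intermediate records to one row per feature.
function summarizeFeatures(records, nameKeys) {
  const rows = [];
  for (const record of records) {
    for (const feature of record.features || []) {
      rows.push({
        seqID: record.seqID,
        type: feature.type,
        name: pickName(feature.qualifiers || {}, nameKeys),
        start: feature.start,
        stop: feature.stop,
        segments: (feature.locations || []).length,
      });
    }
  }
  return rows;
}

// Abbreviated record in the shape of the intermediate sequence JSON
// (values taken from the AF177870.1 example, with most fields omitted)
const records = [
  {
    seqID: 'AF177870.1',
    features: [
      {
        type: 'source',
        start: 1,
        stop: 3123,
        locations: [[1, 3123]],
        qualifiers: { organism: 'Caenorhabditis brenneri', db_xref: 'taxon:135651' },
      },
      {
        type: 'CDS',
        start: 265,
        stop: 2855,
        locations: [[265, 402], [673, 781]], // truncated to two segments here
        qualifiers: { gene: 'fem-2', product: 'putative PP2C protein phosphatase FEM-2' },
      },
    ],
  },
];

const rows = summarizeFeatures(records, ['gene', 'locus_tag', 'product', 'note', 'db_xref']);
console.log(rows);
```

For the abbreviated record, the source feature is named from `db_xref` because none of the earlier keys has a value, while the CDS is named from `gene`.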
Sequence Files (GenBank, EMBL, FASTA)
import * as CGParse from 'cgparse';
// Parse sequence file (e.g. GenBank Accession AF177870.1)
// Showing truncated GenBank text here (full GenBank text below)
const genbankText = `LOCUS AF177870 3123 bp DNA...`;
// Parse GenBank
const seqFile = new CGParse.SequenceFile(genbankText);
// Summary
seqFile.summary;
// {
// inputType: 'genbank',
// sequenceType: 'dna',
// sequenceCount: 1,
// featureCount: 4,
// totalLength: 3123,
// status: 'success'
// }
// Array of parsed records as JSON
seqFile.records;
Records Output
[
{
"inputType": "genbank",
"name": "AF177870",
"seqID": "AF177870.1",
"definition": "Caenorhabditis sp. CB5161 putative PP2C protein phosphatase FEM-2",
"length": 3123,
"topology": "linear",
"type": "dna",
"comments": "",
"sequence": "gaacgcgaatgcctctctctctttcgatgggtatgccaattgtccacattcactcgtgttgcctcctctttgccaacacgcaagacaccagaaacgcgtcaaccaaagagaaaaagacgccgacaacgggcagcactcgcgagagacaaaggttatcgcgttgtgttattatacattcgcatccgggtcaactttagtccgttgaacatgcttcttgaaaacctagttctcttaaaataacgttttagaagttttggtcttcagatgtctgattcgctaaatcatccatcgagttctacggtgcatgcagatgatggattcgagccaccaacatctccggaagacaacaacaaaaaaccgtctttagaacaaattaaacaggaaagagaagcgttgtttacggttagttacctattagctgcaagttttgaaaaagcggaatctgtaaaaagcggaatctgtaaaaaaaacatctaaggaataattctgaaaagaaaaagtttctaaatgttaatcggaatccaatttttatgaaattatttaaaaaaaaactaaaattagtttctaaaaaatttttctaaagtaattggaccatgtgaaggtacacccacttgttccaatatgccatatctaactgtaaaataatttgattctcatgagaatatttttcaggatctattcgcagatcgtcgacgaagcgctcgttctgtgattgaagaagctttccaaaacgaactcatgagtgctgaaccagtccagccaaacgtgccgaatccacattgtgagttggaaatttttatttgataaccaagagaaaaaaagttctacctttttttcaaaaacctttccaaaaatgattccatctgatataggattaagaaaaatattttccgaaatctctgcttttcagcgattcccattcgtttccgtcatcaaccagttgctggacctgctcatgatgttttcggagacgcggtgcattcaatttttcaaaaaataatgtccaggtatacactatttttgcatatttttcttgccaaatttggtcaaaaaccgtagtacaacccaaaaagtttcttcatttcagaggagtgaacgcggattatagtcattggatgtcatattggatcgcgttgggaatcgacaaaaaaacacaaatgaactatcatatgaaaccgttttgcaaagatacttatgcaactgaaggctccttaggtaggttagtcttttctaggcacagaagagtgagaaaattctaaatttctgagcagtctgctttttgttttccttgagtttttacttaaagctcttaaaagaaatctaggcgtgaagttcgagccttgtaccataccacaacagcattccaaatgttacagaagcgaaacaaacatttactgataaaatcaggtcagctgttgaggaaattatctggaagtccgctgaatattgtgatattcttagcgagaagtggacaggaattcatgtgtcggccgaccaactgaaaggtcaaagaaataagcaagaagatcgttttgtggcttatccaaatggacaatacatgaatcgtggacaggttagtgcgaatcggggactcaagatttactgaaatagtgaagagaaaacaaaagaaaactatattttcaaaaaaaatgagaactctaataaacagaatgaaaaacattcaaagctacagtagtatttccagctggagtttccagagccaaaaaaatgcgagtattactgtagttttgaaattggtttctcactttacgtacgattttttgatttttttttcagactcttcatatgaaaaaaaatcatgttttctcctttacaagatttttttgatctcaaaacatttccagagtgacatttcacttcttgcggtgttcgatgggcatggcggacacgagtgctctcaatatgcagctgctcatttctgggaagcatggtccgatgctcaacatcatcattcacaagatatgaa
acttgacgaactcctagaaaaggctctagaaacattggacgaaagaatgacagtcagaagtgttcgagaatcttggaaaggtggaaccactgctgtctgctgtgctgttgatttgaacactaatcaaatcgcatttgcctggcttggagattcaccagggtaatcaatttttttttagtttttggaactttacgtcccgaaaaattattcctttatcacctaattcctacagtaacccaagctccgaattaaataaagttaaagcgtggtatacacataaaaataagaaaaaattgttcatgaaatccatttttccagttacatcatgtcaaacttggagttccgcaaattcactactgaacactccccgtctgacccggaggaatgtcgacgagtcgaagaagtcggtggccagatttttgtgatcggtggtgagctccgtgtgaatggagtactcaacctgacgcgagcactaggagacgtacctggaagaccaatgatatccaacaaacctgataccttactgaagacgatcgaacctgcggattatcttgttttgttggcctgtgacgggatttctgacgtcttcaacactagtgatttgtacaatttggttcaggcttttgtcaatgaatatgacgtagaaggtatcaaactgatcgtttttcacatcacaaaattcttgaattttccagattatcacgaacttgcacgctacatttgcaatcaagcagtttcagctggaagtgctgacaatgtgacagtagttataggtttcctccgtccaccagaagacgtttggcgtgtaatgaaaacagactcggatgatgaagagagcgagctcgaggaagaagatgacaatgaatagtttattgcaagttttccaaaacttttccaatttccctgggtattgattagcatccatatcttacggcgattatatcaattgtaacattatttctgtttctccccccacctctcaaattttcaaatgaccctttttcttttcgtctacctgtatcgttttccattcatctccccccctccactgtggtatatcattttgtcattagaaagtattattttgattttcattggcagtagaagacaacaggatacagaagaggttttcacag",
"features": [
{
"type": "source",
"strand": 1,
"locationText": "1..3123",
"locations": [[1,3123]],
"start": 1,
"stop": 3123,
"qualifiers": {
"organism": "Caenorhabditis brenneri",
"mol_type": "genomic DNA",
"strain": "CB5161",
"db_xref": "taxon:135651"
},
"name": "taxon:135651"
},
{
"type": "gene",
"strand": 1,
"locationText": "<265..>2855",
"locations": [[265,2855]],
"start": 265,
"stop": 2855,
"qualifiers": {
"gene": "fem-2"
},
"name": "fem-2"
},
{
"type": "mRNA",
"strand": 1,
"locationText": "join(<265..402,673..781,911..1007,1088..1215,1377..1573,1866..2146,2306..2634,2683..>2855)",
"locations": [[265,402],[673,781],[911,1007],[1088,1215],[1377,1573],[1866,2146],[2306,2634],[2683,2855]],
"start": 265,
"stop": 2855,
"qualifiers": {
"gene": "fem-2",
"product": "putative FEM-2 protein phosphatase type 2C"
},
"name": "fem-2"
},
{
"type": "CDS",
"strand": 1,
"locationText": "join(265..402,673..781,911..1007,1088..1215,1377..1573,1866..2146,2306..2634,2683..2855)",
"locations": [[265,402],[673,781],[911,1007],[1088,1215],[1377,1573],[1866,2146],[2306,2634],[2683,2855]],
"start": 265,
"stop": 2855,
"qualifiers": {
"gene": "fem-2",
"note": "possible sex-determining protein",
"codon_start": "1",
"product": "putative PP2C protein phosphatase FEM-2",
"protein_id": "AAF04557.1",
"translation": "MSDSLNHPSSSTVHADDGFEPPTSPEDNNKKPSLEQIKQEREALFTDLFADRRRSARSVIEEAFQNELMSAEPVQPNVPNPHSIPIRFRHQPVAGPAHDVFGDAVHSIFQKIMSRGVNADYSHWMSYWIALGIDKKTQMNYHMKPFCKDTYATEGSLEAKQTFTDKIRSAVEEIIWKSAEYCDILSEKWTGIHVSADQLKGQRNKQEDRFVAYPNGQYMNRGQSDISLLAVFDGHGGHECSQYAAAHFWEAWSDAQHHHSQDMKLDELLEKALETLDERMTVRSVRESWKGGTTAVCCAVDLNTNQIAFAWLGDSPGYIMSNLEFRKFTTEHSPSDPEECRRVEEVGGQIFVIGGELRVNGVLNLTRALGDVPGRPMISNKPDTLLKTIEPADYLVLLACDGISDVFNTSDLYNLVQAFVNEYDVEDYHELARYICNQAVSAGSADNVTVVIGFLRPPEDVWRVMKTDSDDEESELEEEDDNE"
},
"name": "fem-2"
}
]
}
]
// The sequence file can be directly converted to CGView JSON
const cgvJSON = seqFile.toCGViewJSON();
// Or passed to the builder
const builder = new CGParse.CGViewBuilder(seqFile);
const cgvJSON = builder.toJSON();
Feature Files (GFF3, BED, GTF)
import * as CGParse from 'cgparse';
const gff3Text = `##gff-version 3
chr1 . gene 1000 2000 . + . ID=gene1;Name=myGene`;
const featureFile = new CGParse.FeatureFile(gff3Text);
// Summary
featureFile.summary;
// {
// inputFormat: "gff3",
// featureCount: 1,
// status: "success"
// }
// Array of parsed features as JSON
featureFile.records;
Records Output
[
{
"contig": "chr1",
"source": ".",
"type": "gene",
"start": 1000,
"stop": 2000,
"score": ".",
"strand": "+",
"phase": ".",
"attributes": {
"ID": "gene1",
"Name": "myGene"
},
"qualifiers": {},
"valid": true,
"name": "myGene"
}
]
// The feature file can be directly converted to a CGView features array
const cgvFeatures = featureFile.toCGViewFeaturesJSON();
// Or passed to the builder
const builder = new CGParse.FeatureBuilder(featureFile);
const featuresJSON = builder.toJSON();
Logger
The main classes (CGViewBuilder, SequenceFile, FeatureFile, and FeatureBuilder) contain a custom Logger with levels, icons, timestamps, and history. The logger can be accessed via the logger property on any instance.
The Logger can also be used on its own:
const logger = new CGParse.Logger({
logToConsole: true,
showTimestamps: true,
showIcons: true,
maxLogCount: undefined // No limit
});
logger.info('Processing started'); // ℹ️ Processing started
logger.warn('Invalid feature found'); // ⚠️ Invalid feature found
logger.error('Parse failed'); // 🛑 Parse failed
console.log(logger.count); // Total messages logged
console.log(logger.history()); // Full log history as a string
Live Test Page
The test page lets you upload or choose example files, view intermediate JSON, final CGView JSON, rendered maps, log output, and open results in Proksee.
Development
# Install dependencies
yarn install
# Run tests
yarn test
# Build for distribution
yarn build
Resources
- CGView.js - Circular genome viewer
- Proksee - Online genome visualization platform
- seq_to_json.py - Paul Stothard's sequence file Python parser
- EMBL Feature Table - Feature format reference
