penn-treebank-sample

v0.0.2

Published

4 years ago

a non-commercial, fair-use subset of the penn-treebank, in JSON


spencermountain

treebank

A small sample of the Penn Treebank part-of-speech-tagged English dataset, with tags from the nlp-compromise tagset.

This is simply a transformation of the fair-use subset of the Penn Treebank provided by the NLTK library, with cosmetic formatting changes for use in JavaScript.

This data is for non-commercial, fair-use purposes only, and all users are encouraged to purchase a license for the full dataset for any commercial projects.

The data is (only) 4,000 tagged sentences, with compromise tag mappings and some opinionated lumping of punctuation, contractions, etc.

972 KB uncompressed.

sample:

{ text: 'Another OTC bank stock involved in a buy-out deal, First Constitution Financial, was higher.',
  tags:
   [ 'Determiner',
     'Noun',
     'Noun',
     'Noun',
     'Verb',
     'Preposition',
     'Determiner',
     'Noun',
     'Noun',
     'Noun',
     'Noun',
     'Noun',
     'Verb',
     'Comparative'
   ]
}
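
In the sample above, the number of tags (14) matches the number of whitespace-separated words in `text`, with punctuation lumped onto the neighbouring word. Assuming that holds for other records too (this is an assumption, not something the package documents), a minimal sketch for pairing each word with its tag could look like the following. `pairTokens` is a hypothetical helper, not part of the package, and the record is copied inline rather than loaded, since how the package exports its data is not shown here.

```javascript
// One record, copied verbatim from the sample above.
const record = {
  text: 'Another OTC bank stock involved in a buy-out deal, First Constitution Financial, was higher.',
  tags: [
    'Determiner', 'Noun', 'Noun', 'Noun', 'Verb', 'Preposition', 'Determiner',
    'Noun', 'Noun', 'Noun', 'Noun', 'Noun', 'Verb', 'Comparative'
  ]
};

// Hypothetical helper: split the text on whitespace and zip it with the tags.
// Assumes one tag per whitespace-separated word; throws if the counts differ,
// so a record that breaks the assumption is caught rather than misaligned.
function pairTokens(rec) {
  const tokens = rec.text.split(/\s+/).filter(Boolean);
  if (tokens.length !== rec.tags.length) {
    throw new Error(
      `token/tag mismatch: ${tokens.length} tokens, ${rec.tags.length} tags`
    );
  }
  return tokens.map((token, i) => ({ token, tag: rec.tags[i] }));
}

const pairs = pairTokens(record);
console.log(pairs[0]);                // { token: 'Another', tag: 'Determiner' }
console.log(pairs[pairs.length - 1]); // { token: 'higher.', tag: 'Comparative' }
```

Throwing on a length mismatch, rather than silently truncating, makes it easy to spot any record where the whitespace assumption does not hold.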

Original statement in NLTK:

Copyright (C) 1995 University of Pennsylvania;
This is a 10% fragment of Penn Treebank, (C) LDC 1995, which has been dependency parsed.
It is made available under fair use for the purposes of illustrating NLTK tools for tokenizing, tagging, chunking and parsing.
This data is for non-commercial use only.;

Please file an issue if there are any copyright concerns about placing this on npm or GitHub.
