CorPy


Installation

$ pip3 install corpy

Only recent versions of Python 3 (3.6+) are supported by design.

What is CorPy?

A fancy plural for corpus ;) Also, a collection of handy but only loosely integrated tools for dealing with linguistic data. It abstracts away functionality which is often needed in practice for teaching and/or day-to-day work at the Czech National Corpus, without aspiring to be a fully featured or consistent NLP framework.

The short URL to the docs is: https://corpy.rtfd.io/

Here’s an idea of what you can do with CorPy:

  • add linguistic annotation to raw textual data using either UDPipe or MorphoDiTa

Note

Should I pick UDPipe or MorphoDiTa?

UDPipe is the successor to MorphoDiTa, extending and improving upon the original codebase. It has more features at the cost of being somewhat more complex: it does both morphological tagging (including lemmatization) and syntactic parsing, and it handles a number of different input and output formats. You can also download pre-trained models for many different languages.

By contrast, MorphoDiTa only has pre-trained models for Czech and English, and only performs morphological tagging (including lemmatization). However, its output is more straightforward: it just splits your text into tokens and annotates them, whereas UDPipe can (depending on the model) introduce additional tokens necessary for a more explicit analysis, add multi-word tokens, and so on. This is because UDPipe is tailored to the type of linguistic analysis conducted within the Universal Dependencies project, which uses the CoNLL-U data format.

MorphoDiTa can also help you if you just want to tokenize text and don’t have a language model available.
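To make the point about multi-word tokens more concrete, here is a minimal sketch of what they look like in CoNLL-U. It uses only the Python standard library and does not call CorPy or UDPipe at all. The sample is a hand-written fragment based on the Spanish phrase "vámonos al mar", not actual UDPipe output, and the helper names (`conllu_line`, `parse`, `Word`) are made up for this illustration. In CoNLL-U, a multi-word token is a line whose ID is a range such as `1-2`; it is followed by the syntactic words it covers.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A syntactic word: the unit that carries the annotation."""
    id: str
    form: str
    lemma: str


def conllu_line(id_, form, lemma="_"):
    """Build a 10-column CoNLL-U line; unspecified columns are left as '_'."""
    return "\t".join([id_, form, lemma] + ["_"] * 7)


# Hand-written sample (an assumption for illustration, not real tagger output).
SAMPLE = "\n".join([
    "# text = vámonos al mar",
    conllu_line("1-2", "vámonos"),       # multi-word token covering words 1 and 2
    conllu_line("1", "vamos", "ir"),
    conllu_line("2", "nos", "nosotros"),
    conllu_line("3-4", "al"),            # multi-word token covering words 3 and 4
    conllu_line("3", "a", "a"),
    conllu_line("4", "el", "el"),
    conllu_line("5", "mar", "mar"),
])


def parse(text):
    """Return (surface tokens, syntactic words) for one CoNLL-U sentence."""
    surface, words = [], []
    covered_until = 0  # highest word ID covered by a multi-word token so far
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        id_, form, lemma = cols[0], cols[1], cols[2]
        if "-" in id_:
            # Range ID: a surface token that spans several syntactic words.
            surface.append(form)
            covered_until = int(id_.split("-")[1])
        elif "." in id_:
            continue  # empty node (decimal ID), ignored in this sketch
        else:
            words.append(Word(id_, form, lemma))
            if int(id_) > covered_until:
                surface.append(form)  # word is its own surface token
    return surface, words


surface, words = parse(SAMPLE)
print(surface)                  # ['vámonos', 'al', 'mar']
print([w.form for w in words])  # ['vamos', 'nos', 'a', 'el', 'mar']
```

Three surface tokens thus correspond to five annotated words, which is the kind of expansion you should be prepared for when working with UDPipe output.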

License

Copyright © 2016–present ÚČNK/David Lukeš

Distributed under the GNU General Public License v3.
