CorPy

Installation

$ pip3 install corpy

Only recent versions of Python 3 (3.6+) are supported by design.
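If you want to confirm that an interpreter meets this requirement before installing, a small check along these lines should do. It is only a sketch: `MIN_PYTHON` and `python_is_supported` are names made up for this example and are not part of CorPy.

```python
import sys

# Minimum version taken from the installation note above.
# MIN_PYTHON and python_is_supported are illustrative names, not CorPy API.
MIN_PYTHON = (3, 6)


def python_is_supported(version_info=None):
    """Return True if the given (or the running) interpreter is recent enough."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the (major, minor) pair against the minimum.
    return tuple(version_info[:2]) >= MIN_PYTHON


if python_is_supported():
    print("This interpreter meets the stated minimum for CorPy.")
else:
    print("CorPy needs Python 3.6 or newer.")
```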

What is CorPy?

A fancy plural for corpus ;) Also, a collection of handy but not especially mutually integrated tools for dealing with linguistic data. It abstracts away functionality which is often needed in practice for teaching and/or day-to-day work at the Czech National Corpus, without aspiring to be a fully featured or consistent NLP framework.

The short URL to the docs is: https://corpy.rtfd.io/

Here’s an idea of what you can do with CorPy:

add linguistic annotation to raw textual data using either UDPipe or MorphoDiTa

Note

Should I pick UDPipe or MorphoDiTa?

UDPipe is the successor to MorphoDiTa, extending and improving upon the original codebase. It has more features at the cost of being somewhat more complex: it does both morphological tagging (including lemmatization) and syntactic parsing, and it handles a number of different input and output formats. You can also download pre-trained models for many different languages.

By contrast, MorphoDiTa only has pre-trained models for Czech and English, and only performs morphological tagging (including lemmatization). However, its output is more straightforward: it just splits your text into tokens and annotates them, whereas UDPipe can (depending on the model) introduce additional tokens necessary for a more explicit analysis, add multi-word tokens, etc. This is because UDPipe is tailored to the type of linguistic analysis conducted within the Universal Dependencies project, using the CoNLL-U data format.

MorphoDiTa can also help you if you just want to tokenize text and don’t have a language model available.
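To make the point about multi-word tokens more concrete, here is a rough sketch of how they show up in CoNLL-U. The fragment is written by hand for illustration (it is not actual UDPipe output), only the ID, FORM and LEMMA columns are filled in, and `split_rows` is a helper invented for this example that uses nothing but the standard library. In CoNLL-U, a multi-word token gets a range ID such as `1-2`, followed by one row per syntactic word it covers.

```python
# Hand-written CoNLL-U fragment (an illustration, not real UDPipe output):
# the Spanish surface token "vámonos" is analysed as two syntactic words.
# Columns: ID, FORM, LEMMA, then seven more fields left as "_".
CONLLU = "\n".join([
    "# text = vámonos",
    "\t".join(["1-2", "vámonos", "_", "_", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["1", "vamos", "ir", "_", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["2", "nos", "nosotros", "_", "_", "_", "_", "_", "_", "_"]),
])


def split_rows(conllu):
    """Separate multi-word token rows from syntactic word rows.

    Hypothetical helper for this example only; not part of CorPy or UDPipe.
    """
    multiword_tokens, words = [], []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        token_id, form, lemma = cols[0], cols[1], cols[2]
        if "-" in token_id:
            # Range ID: a surface token spanning several syntactic words.
            multiword_tokens.append(form)
        elif "." in token_id:
            continue  # decimal ID: an empty node, ignored in this sketch
        else:
            words.append((form, lemma))
    return multiword_tokens, words


multiword_tokens, words = split_rows(CONLLU)
print(multiword_tokens)
print(words)
```

A single word in the input thus yields three rows, which is the kind of extra structure you would not get from MorphoDiTa's plain token-per-item output.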

easily generate word clouds
generate phonetic transcripts of Czech texts
wrangle corpora in the vertical format originally devised for CWB and also used by (No)SketchEngine
plus some command line utilities
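For readers who have not met the vertical format mentioned above: it stores one token per line, with the token's attributes separated by tabs, and marks structures such as documents or sentences with XML-like tags on lines of their own. The sketch below is a deliberately naive, standard-library-only reader, not CorPy's own API; the sample data, the attribute layout (word, lemma, tag) and the `sentences` function are all assumptions made for this example.

```python
# Made-up sample in the vertical format: one token per line, attributes
# separated by tabs, structures as XML-like tags on their own lines.
VERTICAL = "\n".join([
    '<doc id="demo">',
    "<s>",
    "The\tthe\tDT",
    "cat\tcat\tNN",
    "sleeps\tsleep\tVBZ",
    "</s>",
    "<s>",
    "Dogs\tdog\tNNS",
    "bark\tbark\tVBP",
    "</s>",
    "</doc>",
])


def sentences(vertical, struct="s"):
    """Yield each <struct> element as a list of attribute tuples.

    Naive illustration, not CorPy's API: every line starting with "<"
    is treated as a structural tag, which a real reader could not assume.
    """
    open_tag = "<" + struct
    close_tag = "</" + struct + ">"
    current = None
    for line in vertical.splitlines():
        if line.startswith(close_tag):
            if current is not None:
                yield current
            current = None
        elif line.startswith(open_tag + ">") or line.startswith(open_tag + " "):
            current = []  # start collecting tokens for a new structure
        elif line.startswith("<"):
            continue  # some other structural tag, e.g. <doc ...> or </doc>
        elif current is not None and line:
            current.append(tuple(line.split("\t")))


for sent in sentences(VERTICAL):
    print(" ".join(word for word, lemma, tag in sent))
```

Real corpora are usually far too large to hold in a single string, so in practice you would iterate over a file line by line instead.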

User guides

API reference

License

Distributed under the GNU General Public License v3.
