CorPy

Installation

$ python3 -m pip install corpy

Only recent versions of Python 3 (3.10+) are supported by design.

Help and feedback

If you get stuck, it’s always a good idea to start by searching the documentation, available at the short URL https://corpy.rtfd.io/.

The project is developed on GitHub. You can ask for help via GitHub discussions and report bugs and give other kinds of feedback via GitHub issues. Support is provided gladly, time and other engagements permitting, but cannot be guaranteed.

What is CorPy?

A fancy plural for corpus ;) Also, a collection of handy but not especially mutually integrated tools for dealing with linguistic data. It abstracts away functionality which is often needed in practice for teaching and/or day-to-day work at the Czech National Corpus, without aspiring to be a fully featured or consistent NLP framework.

Here’s an idea of what you can do with CorPy:

- add linguistic annotation to raw textual data using either UDPipe or MorphoDiTa
- easily generate word clouds
- run code in a sanitized global environment (useful for debugging in interactive sessions, e.g. with Jupyter notebooks in JupyterLab)
- generate phonetic transcripts of Czech texts
- wrangle corpora in the vertical format devised originally for CWB, used also by (No)SketchEngine
- plus some command line utilities
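
To give a rough idea of what the vertical format mentioned above looks like, here is a minimal sketch using only the Python standard library. It deliberately does not use CorPy's own API: the sample data, the three positional attributes (word, lemma, tag) and the `iter_sentences` helper are all assumptions made up for illustration. The general shape is one token per line with tab-separated attributes, interleaved with XML-like structural tags.

```python
from typing import Iterator

# Hypothetical sample: word, lemma and tag separated by tabs, one token per line.
SAMPLE = """\
<doc id="example">
<s>
The\tthe\tDT
cat\tcat\tNN
sleeps\tsleep\tVBZ
</s>
<s>
It\tit\tPRP
purrs\tpurr\tVBZ
</s>
</doc>
"""


def iter_sentences(vertical: str) -> Iterator[list[tuple[str, ...]]]:
    """Yield each <s> structure as a list of attribute tuples, one per token."""
    sentence = None
    for line in vertical.splitlines():
        if line == "<s>" or line.startswith("<s "):
            sentence = []  # a new sentence opens
        elif line == "</s>":
            if sentence is not None:
                yield sentence
            sentence = None
        elif line.startswith("<"):
            # Simplification: any other line starting with "<" is treated as a
            # structural tag, so a literal "<" token would be skipped here.
            continue
        elif line and sentence is not None:
            sentence.append(tuple(line.split("\t")))


for sent in iter_sentences(SAMPLE):
    print(" ".join(word for word, lemma, tag in sent))
```

A real corpus will typically have more attributes and more kinds of structures, which is the sort of bookkeeping a dedicated tool is meant to take off your hands.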

Note

Should I pick UDPipe or MorphoDiTa?

Both are developed at ÚFAL MFF UK. UDPipe has more features at the cost of being somewhat more complex: it does both morphological tagging (including lemmatization) and syntactic parsing, and it handles a number of different input and output formats. You can also download pre-trained models for many different languages.

By contrast, MorphoDiTa only has pre-trained models for Czech and English, and only performs morphological tagging (including lemmatization). However, its output is more straightforward – it just splits your text into tokens and annotates them, whereas UDPipe can (depending on the model) introduce additional tokens necessary for a more explicit analysis, add multi-word tokens, etc. This is because UDPipe is tailored to the type of linguistic analysis conducted within the Universal Dependencies project, using the CoNLL-U data format.

MorphoDiTa can also help you if you just want to tokenize text and don’t have a language model available.
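
The multi-word tokens mentioned in the note can be easier to picture with a concrete case. The sketch below is plain Python, not CorPy or UDPipe API, and the `classify_id` and `surface_and_words` helpers are hypothetical names introduced for illustration. It relies on the CoNLL-U convention that each row has ten tab-separated fields and that the ID field is a plain integer for a syntactic word, a range such as `1-2` for a multi-word token, and a decimal such as `8.1` for an empty node. The Spanish sample follows the well-known "vámonos al mar" example, written with spaces here for readability and converted to tabs.

```python
# Rows are written with single spaces for readability; real CoNLL-U uses tabs.
ROWS = [
    "1-2 vámonos _ _ _ _ _ _ _ _",
    "1 vamos ir VERB _ _ 0 root _ _",
    "2 nos nosotros PRON _ _ 1 obj _ _",
    "3-4 al _ _ _ _ _ _ _ _",
    "3 a a ADP _ _ 5 case _ _",
    "4 el el DET _ _ 5 det _ _",
    "5 mar mar NOUN _ _ 1 obl _ _",
]
SAMPLE = "\n".join(row.replace(" ", "\t") for row in ROWS)


def classify_id(token_id: str) -> str:
    """Tell apart the three kinds of CoNLL-U IDs."""
    if "-" in token_id:
        return "multi-word token"
    if "." in token_id:
        return "empty node"
    return "word"


def surface_and_words(conllu: str) -> tuple[list[str], list[str]]:
    """Return (surface tokens as written, syntactic words as analyzed)."""
    surface, words = [], []
    covered_until = 0  # last word ID covered by the most recent range
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        fields = line.split("\t")
        token_id, form = fields[0], fields[1]
        kind = classify_id(token_id)
        if kind == "multi-word token":
            surface.append(form)
            covered_until = int(token_id.split("-")[1])
        elif kind == "word":
            words.append(form)
            if int(token_id) > covered_until:
                surface.append(form)  # not part of any multi-word token
    return surface, words


surface, words = surface_and_words(SAMPLE)
print(surface)  # ['vámonos', 'al', 'mar']
print(words)    # ['vamos', 'nos', 'a', 'el', 'mar']
```

Three surface tokens turn into five syntactic words, which is the kind of extra structure you should expect from UDPipe output and will not get from MorphoDiTa.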

License

Distributed under the GNU General Public License v3.
