CorPy
What is CorPy?
A fancy plural for corpus ;) Also, a collection of handy but not especially mutually integrated tools for dealing with linguistic data. It abstracts away functionality that is often needed in practice for teaching and/or day-to-day work at the Czech National Corpus, without aspiring to be a fully featured or consistent NLP framework.
The short URL to the docs is: https://corpy.rtfd.io/
Here’s an idea of what you can do with CorPy:
add linguistic annotation to raw textual data using either UDPipe or MorphoDiTa
Note
Should I pick UDPipe or MorphoDiTa?
UDPipe is the successor to MorphoDiTa, extending and improving upon the original codebase. It has more features at the cost of being somewhat more complex: it does both morphological tagging (including lemmatization) and syntactic parsing, and it handles a number of different input and output formats. You can also download pre-trained models for many different languages.
By contrast, MorphoDiTa only has pre-trained models for Czech and English, and only performs morphological tagging (including lemmatization). However, its output is more straightforward: it just splits your text into tokens and annotates them, whereas UDPipe can (depending on the model) introduce additional tokens necessary for a more explicit analysis, add multi-word tokens, etc. This is because UDPipe is tailored to the type of linguistic analysis conducted within the Universal Dependencies project, using the CoNLL-U data format.
MorphoDiTa can also help you if you just want to tokenize text and don’t have a language model available.
wrangle corpora in the vertical format, originally devised for CWB and also used by (No)SketchEngine
plus some command-line utilities
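To make the vertical format mentioned above more concrete, here is a minimal sketch in plain Python. It deliberately does not use CorPy's own API. The sample data, the attribute layout (word, lemma, tag), the use of <s> as the sentence structure and the helper name iter_sentences are all assumptions made for illustration. The general idea is that each token sits on its own line with tab-separated attributes, while structures such as documents and sentences are marked by XML-like tags on separate lines.

```python
from typing import Iterator, List, Tuple

# Hypothetical sample data: one token per line, tab-separated attributes
# (assumed here to be word, lemma, tag), structure tags on their own lines.
SAMPLE = """<doc id="1">
<s>
Dogs\tdog\tNNS
bark\tbark\tVBP
.\t.\t.
</s>
<s>
Cats\tcat\tNNS
meow\tmeow\tVBP
</s>
</doc>
"""


def iter_sentences(vertical: str) -> Iterator[List[Tuple[str, ...]]]:
    """Yield each <s>...</s> block as a list of attribute tuples.

    Simplified on purpose: any line that starts with "<" and ends
    with ">" is treated as a structure tag, not as a token.
    """
    sentence: List[Tuple[str, ...]] = []
    for line in vertical.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("<") and line.endswith(">"):
            # A closing sentence tag flushes the tokens collected so far.
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
            continue
        # Token line: split the positional attributes on tabs.
        sentence.append(tuple(line.split("\t")))
    if sentence:
        # Tokens after the last </s>, or a file with no sentence tags.
        yield sentence


sentences = list(iter_sentences(SAMPLE))
for sent in sentences:
    print(" ".join(token[0] for token in sent))
```

Real corpora often have more attributes per token and more structure types, and a token that itself looks like a tag would need more careful handling than this sketch provides.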
License
Copyright © 2016–present ÚČNK/David Lukeš
Distributed under the GNU General Public License v3.