corpy.morphodita
Convenient and easy-to-use MorphoDiTa wrappers.
class corpy.morphodita.Tokenizer(tokenizer_type)

A wrapper API around the tokenizers offered by MorphoDiTa.

Parameters:
    tokenizer_type (str) – Type of the requested tokenizer (cf. below for possible values).
tokenizer_type is typically one of:

"czech"
    a tokenizer tuned for Czech
"english"
    a tokenizer tuned for English
"generic"
    a generic tokenizer
"vertical"
    a simple tokenizer for the vertical format, which is effectively already tokenized (one word per line)
Specifically, the available tokenizers are determined by the new_*_tokenizer static methods on the MorphoDiTa tokenizer class described in the MorphoDiTa API reference.
static from_tagger(tagger_path)

Load the tokenizer associated with a tagger file.
tokenize(text, sents=False)

Tokenize text.

Parameters:
    text (str) – Text to tokenize.
    sents (bool) – Whether to signal sentence boundaries by outputting a sequence of lists (sentences).

Returns:
    An iterator over the tokenized text, possibly grouped into sentences if sents=True.
Note that MorphoDiTa performs both sentence splitting and tokenization at the same time, but this method iterates over tokens without sentence boundaries by default:
>>> from corpy.morphodita import Tokenizer
>>> t = Tokenizer("generic")
>>> for word in t.tokenize("foo bar baz"):
...     print(word)
...
foo
bar
baz
If you want to iterate over sentences (lists of tokens), set sents=True:

>>> for sentence in t.tokenize("foo bar baz", sents=True):
...     print(sentence)
...
['foo', 'bar', 'baz']
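The relationship between the two output shapes can be illustrated without MorphoDiTa itself: flattening the sentence lists produced with sents=True recovers the plain token stream produced by default. (The sample token lists below are hypothetical stand-ins for the tokenizer's output.)

```python
from itertools import chain

# Hypothetical output of t.tokenize(text, sents=True): one list per sentence.
sentences = [["foo", "bar", "baz"], ["qux", "quux"]]

# Flattening the sentence lists yields the default (sents=False) token stream.
flat = list(chain.from_iterable(sentences))
print(flat)  # ['foo', 'bar', 'baz', 'qux', 'quux']
```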
class corpy.morphodita.Tagger(tagger_path)

A MorphoDiTa morphological tagger and lemmatizer.

Parameters:
    tagger_path (str) – Path to the pre-compiled tagging models to load.
tag(text, *, sents=False, guesser=False, convert=None)

Perform morphological tagging and lemmatization on text.

If text is a string, sentence-split, tokenize and tag that string. If it's an iterable of iterables (typically a list of lists), then take each nested iterable as a separate sentence and tag it, honoring the provided sentence boundaries and tokenization.

Parameters:
    text (str, or iterable of iterables of str) – Input text. A string is sentence-split and tokenized by the tagger; an iterable of iterables represents individual, pre-tokenized sentences.
    sents (bool) – Whether to signal sentence boundaries by outputting a sequence of lists (sentences).
    guesser (bool) – Whether to use the morphological guesser provided with the tagger (if available).
    convert (str or None) – Conversion strategy to apply to lemmas and/or tags before outputting them; one of "pdt_to_conll2009", "strip_lemma_comment" or "strip_lemma_id", or None if no conversion is required.

Returns:
    An iterator over the tagged text, possibly grouped into sentences if sents=True.
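The two lemma-stripping strategies can be sketched in plain Python. This is a rough stand-in, not the real implementation (the actual converters ship with MorphoDiTa, and the exact lemma grammar assumed here is a simplification): a MorphoDiTa lemma such as zima-1 or sněžit_:T carries a technical id (-1) and/or a comment (_:T) after the raw lemma.

```python
import re

def strip_lemma_comment(lemma):
    # Treat everything from the first "_" on as the lemma comment.
    return lemma.split("_", 1)[0]

def strip_lemma_id(lemma):
    # Treat a trailing "-<digits>" as the technical lemma id.
    return re.sub(r"-\d+$", "", lemma)

print(strip_lemma_comment("sněžit_:T"))  # sněžit
print(strip_lemma_id("zima-1"))          # zima
```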
>>> tagger = Tagger("./czech-morfflex-pdt-161115.tagger")
>>> from pprint import pprint
>>> tokens = list(tagger.tag("Je zima. Bude sněžit."))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
 Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> tokens = list(tagger.tag([['Je', 'zima', '.'], ['Bude', 'sněžit', '.']]))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
 Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> sents = list(tagger.tag("Je zima. Bude sněžit.", sents=True))
>>> pprint(sents)
[[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
  Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
  Token(word='.', lemma='.', tag='Z:-------------')],
 [Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
  Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
  Token(word='.', lemma='.', tag='Z:-------------')]]
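The Token objects in the output above behave like named tuples with the three fields shown. A minimal stand-in (the real class is provided by corpy.morphodita; this sketch only mirrors the fields visible in the example output) looks like:

```python
from collections import namedtuple

# Stand-in for corpy.morphodita's Token, mirroring the fields seen above.
Token = namedtuple("Token", ["word", "lemma", "tag"])

tok = Token(word="zima", lemma="zima-1", tag="NNFS1-----A----")
print(tok.lemma)        # zima-1
word, lemma, tag = tok  # unpacks like any tuple
```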