`corpy.morphodita`

Convenient and easy-to-use MorphoDiTa wrappers.

class corpy.morphodita.Tokenizer(tokenizer_type)

A wrapper API around the tokenizers offered by MorphoDiTa.

Parameters: tokenizer_type (str) – Type of the requested tokenizer (see below for possible values).

tokenizer_type is typically one of:

"czech": a tokenizer tuned for Czech
"english": a tokenizer tuned for English
"generic": a generic tokenizer
"vertical": a simple tokenizer for the vertical format, which is effectively already tokenized (one word per line)

Specifically, the available tokenizers are determined by the new_*_tokenizer static methods on the MorphoDiTa tokenizer class described in the MorphoDiTa API reference.

static from_tagger(tagger_path)

Load the tokenizer associated with a tagger file.

tokenize(text, sents=False)

Tokenize text.

Parameters

text (str) – Text to tokenize.
sents (bool) – Whether to signal sentence boundaries by outputting a sequence of lists (sentences).

Returns

An iterator over the tokenized text, possibly grouped into sentences if sents=True.

Note that MorphoDiTa performs both sentence splitting and tokenization at the same time, but this method iterates over tokens without sentence boundaries by default:

>>> from corpy.morphodita import Tokenizer
>>> t = Tokenizer("generic")
>>> for word in t.tokenize("foo bar baz"):
...     print(word)
...
foo
bar
baz

If you want to iterate over sentences (lists of tokens), set sents=True:

>>> for sentence in t.tokenize("foo bar baz", sents=True):
...     print(sentence)
...
['foo', 'bar', 'baz']
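
Because sents=True yields one list of tokens per sentence, per-sentence statistics follow directly from the iterator. The sketch below is only an illustration and not part of corpy: sentence_lengths is a hypothetical helper name, and it assumes nothing beyond the documented tokenize(text, sents=True) call on whatever tokenizer object is passed in.

```python
from typing import List


def sentence_lengths(tokenizer, text: str) -> List[int]:
    """Return the number of tokens in each sentence of ``text``.

    ``tokenizer`` is assumed to expose the documented
    ``tokenize(text, sents=True)`` method, which yields lists of tokens.
    """
    # Each item produced by the iterator is one sentence (a list of tokens).
    return [len(sentence) for sentence in tokenizer.tokenize(text, sents=True)]
```

With corpy installed, the helper could be called as sentence_lengths(Tokenizer("generic"), "foo bar baz").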

class corpy.morphodita.Token(word, lemma, tag)

word: str

Word form (alias for field number 0).

lemma: str

Lemma (alias for field number 1).

tag: str

Morphological tag (alias for field number 2).
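
The "alias for field number" descriptions suggest that Token behaves like a standard Python named tuple. The sketch below works under that assumption and uses a locally defined stand-in class rather than importing from corpy, so it runs without the library; the field values are copied from the tagging example later in this document.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Local stand-in with the documented fields; not imported from corpy."""

    word: str
    lemma: str
    tag: str


# Values copied from the tagging example in this document.
token = Token(word="zima", lemma="zima-1", tag="NNFS1-----A----")

# Fields are reachable by name or by position.
by_name = token.lemma
by_index = token[1]

# Tuple unpacking works because a named tuple is still a tuple.
word, lemma, tag = token

# _asdict() converts the token to a dictionary keyed by field name.
as_dict = token._asdict()
```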

class corpy.morphodita.Tagger(tagger_path: Union[pathlib.Path, str])

A MorphoDiTa morphological tagger and lemmatizer.

Parameters: tagger_path (str or pathlib.Path) – Path to the pre-compiled tagging models to load.

tag(text, *, sents=False, guesser=False, convert=None) → Union[Iterator[corpy.morphodita.tagger.Token], Iterator[List[corpy.morphodita.tagger.Token]]]

Perform morphological tagging and lemmatization on text.

If text is a string, sentence-split, tokenize and tag that string. If it’s an iterable of iterables (typically a list of lists), then take each nested iterable as a separate sentence and tag it, honoring the provided sentence boundaries and tokenization.

Parameters

text (str, or iterable of iterables of str) – Input text. A str is sentence-split and tokenized by the tagger; an iterable of iterables is taken as a sequence of already tokenized sentences.
sents (bool) – Whether to signal sentence boundaries by outputting a sequence of lists (sentences).
guesser (bool) – Whether to use the morphological guesser provided with the tagger (if available).
convert (str, one of "pdt_to_conll2009", "strip_lemma_comment" or "strip_lemma_id", or None if no conversion is required) – Conversion strategy to apply to lemmas and/or tags before outputting them.

Returns

An iterator over the tagged text, possibly grouped into sentences if sents=True.

>>> from pprint import pprint
>>> from corpy.morphodita import Tagger
>>> tagger = Tagger("./czech-morfflex-pdt-161115.tagger")
>>> tokens = list(tagger.tag("Je zima. Bude sněžit."))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
 Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> tokens = list(tagger.tag([['Je', 'zima', '.'], ['Bude', 'sněžit', '.']]))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
 Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> sents = list(tagger.tag("Je zima. Bude sněžit.", sents=True))
>>> pprint(sents)
[[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
  Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
  Token(word='.', lemma='.', tag='Z:-------------')],
 [Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
  Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
  Token(word='.', lemma='.', tag='Z:-------------')]]
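
Since tag() yields Token objects one at a time by default, its output can be fed straight into ordinary iterator-based code. The following is a minimal sketch, not part of corpy: lemma_counts is a hypothetical helper name, and it relies only on the documented tag() signature and the lemma field of Token.

```python
from collections import Counter


def lemma_counts(tagger, text, convert=None):
    """Count how often each lemma occurs in ``text``.

    ``tagger`` is assumed to expose the documented
    ``tag(text, *, sents=False, guesser=False, convert=None)`` method.
    """
    # With the default sents=False, tag() yields Token objects one by one.
    return Counter(token.lemma for token in tagger.tag(text, convert=convert))
```

For the example text above, both 'Je' and 'Bude' are tagged with the lemma 'být', so that lemma would be counted twice. Passing one of the documented convert values forwards the chosen conversion strategy to tag() before the lemmas are counted.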

tag_untokenized(text, sents=False, guesser=False, convert=None) → Union[Iterator[corpy.morphodita.tagger.Token], Iterator[List[corpy.morphodita.tagger.Token]]]

The method tag() delegates to when text is a string. See the documentation of tag() for details about the parameters.

tag_tokenized(text, sents=False, guesser=False, convert=None) → Union[Iterator[corpy.morphodita.tagger.Token], Iterator[List[corpy.morphodita.tagger.Token]]]

The method tag() delegates to when text is an iterable of iterables of strings. See the documentation of tag() for details about the parameters.
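
The delegation described for tag() can be pictured as a simple type check on the input. The sketch below illustrates that documented behavior and is not corpy's actual implementation: choose_tag_method is a hypothetical name, and the use of isinstance is an assumption about how such a check might be written.

```python
def choose_tag_method(text) -> str:
    """Name the documented method that would handle ``text``."""
    # A plain string still needs sentence splitting and tokenization.
    if isinstance(text, str):
        return "tag_untokenized"
    # Anything else is treated as an iterable of pre-tokenized sentences.
    return "tag_tokenized"
```

The string check has to come first in a sketch like this, because a str is itself an iterable of strings and would otherwise be mistaken for pre-tokenized input.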
