corpy.morphodita

Convenient and easy-to-use MorphoDiTa wrappers.

class corpy.morphodita.Tokenizer(tokenizer_type)

A wrapper API around the tokenizers offered by MorphoDiTa.

Parameters

tokenizer_type (str) – Type of the requested tokenizer (see below for possible values).

tokenizer_type is typically one of:

  • "czech": a tokenizer tuned for Czech

  • "english": a tokenizer tuned for English

  • "generic": a generic tokenizer

  • "vertical": a simple tokenizer for the vertical format, which is effectively already tokenized (one word per line)

Specifically, the available tokenizers are determined by the new_*_tokenizer static methods on the MorphoDiTa tokenizer class described in the MorphoDiTa API reference.
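
The vertical format mentioned above can be produced from an existing token list with plain string handling. The sketch below is only an illustration: to_vertical is a hypothetical helper, not part of corpy, and the convention that an empty line separates sentences is an assumption based on MorphoDiTa's description of its vertical tokenizer.

```python
from typing import Iterable, List


def to_vertical(sentences: Iterable[List[str]]) -> str:
    """Render pre-tokenized sentences as vertical text (hypothetical helper)."""
    blocks = []
    for sentence in sentences:
        # One token per line, each line terminated by a newline.
        blocks.append("".join(token + "\n" for token in sentence))
    # Assumption: an empty line marks a sentence boundary.
    return "\n".join(blocks)


vertical_text = to_vertical([["foo", "bar"], ["baz"]])
```

A string of this shape is the kind of input a Tokenizer("vertical") is meant to receive.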

static from_tagger(tagger_path)

Load the tokenizer associated with the tagger file at tagger_path.

tokenize(text, sents=False)

Tokenize text.

Parameters
  • text (str) – Text to tokenize.

  • sents (bool) – Whether to signal sentence boundaries by outputting a sequence of lists (sentences).

Returns

An iterator over the tokenized text, possibly grouped into sentences if sents=True.

Note that MorphoDiTa performs both sentence splitting and tokenization at the same time, but this method iterates over tokens without sentence boundaries by default:

>>> from corpy.morphodita import Tokenizer
>>> t = Tokenizer("generic")
>>> for word in t.tokenize("foo bar baz"):
...     print(word)
...
foo
bar
baz

If you want to iterate over sentences (lists of tokens), set sents=True:

>>> for sentence in t.tokenize("foo bar baz", sents=True):
...     print(sentence)
...
['foo', 'bar', 'baz']
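
The two output shapes are related in a simple way: flattening the sentence lists gives a flat token stream again. The sketch below runs without corpy; the sentences value is a hand-written stand-in with the same shape as the sents=True output shown above.

```python
from itertools import chain

# Stand-in for list(t.tokenize("foo bar baz", sents=True)) from the example above.
sentences = [["foo", "bar", "baz"]]

# Flatten the per-sentence lists into a single flat list of tokens.
flat = list(chain.from_iterable(sentences))
```

This is handy when downstream code needs sentence boundaries in one place and a flat stream in another, because only the sents=True pass is then required.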

class corpy.morphodita.Token(word, lemma, tag)

word: str

Alias for field number 0

lemma: str

Alias for field number 1

tag: str

Alias for field number 2
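
The "Alias for field number N" wording indicates that Token behaves like a named tuple, so its fields can be read by name, by position, or by unpacking. The sketch below illustrates this with a locally defined stand-in that has the same three fields; it is an assumption for illustration and does not import the real corpy.morphodita.Token.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Stand-in with the same three fields as corpy.morphodita.Token."""

    word: str
    lemma: str
    tag: str


tok = Token(word="zima", lemma="zima-1", tag="NNFS1-----A----")

# Fields are reachable by name or by position.
lemma_by_name = tok.lemma
lemma_by_index = tok[1]

# A token unpacks like any 3-tuple.
word, lemma, tag = tok
```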

class corpy.morphodita.Tagger(tagger_path: Union[pathlib.Path, str])

A MorphoDiTa morphological tagger and lemmatizer.

Parameters

tagger_path (str or pathlib.Path) – Path to the pre-compiled tagging models to load.

tag(text, *, sents=False, guesser=False, convert=None) → Union[Iterator[corpy.morphodita.tagger.Token], Iterator[List[corpy.morphodita.tagger.Token]]]

Perform morphological tagging and lemmatization on text.

If text is a string, sentence-split, tokenize and tag that string. If it’s an iterable of iterables (typically a list of lists), then take each nested iterable as a separate sentence and tag it, honoring the provided sentence boundaries and tokenization.

Parameters
  • text (str or iterable of iterables of str) – Input text. A str is sentence-split and tokenized by the tagger; an iterable of iterables is taken as a sequence of already tokenized sentences.

  • sents (bool) – Whether to signal sentence boundaries by outputting a sequence of lists (sentences).

  • guesser (bool) – Whether to use the morphological guesser provided with the tagger (if available).

  • convert (str or None) – Conversion strategy to apply to lemmas and/or tags before outputting them: one of "pdt_to_conll2009", "strip_lemma_comment" or "strip_lemma_id", or None if no conversion is required.

Returns

An iterator over the tagged text, possibly grouped into sentences if sents=True.

>>> from corpy.morphodita import Tagger
>>> tagger = Tagger("./czech-morfflex-pdt-161115.tagger")
>>> from pprint import pprint
>>> tokens = list(tagger.tag("Je zima. Bude sněžit."))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
 Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> tokens = list(tagger.tag([['Je', 'zima', '.'], ['Bude', 'sněžit', '.']]))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
 Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> sents = list(tagger.tag("Je zima. Bude sněžit.", sents=True))
>>> pprint(sents)
[[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
  Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
  Token(word='.', lemma='.', tag='Z:-------------')],
 [Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
  Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
  Token(word='.', lemma='.', tag='Z:-------------')]]
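
Because the tagged output is a stream of named tuples, it can be post-processed with ordinary Python tools. The sketch below counts lemma frequencies. It runs without a tagger model by using a stand-in Token and a token list copied from the output above. The lemma_counts helper is hypothetical, and treating tags that start with "Z" as punctuation is an assumption based on the tag shown for "." in the example.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Token(NamedTuple):
    """Stand-in with the same three fields as corpy.morphodita.Token."""

    word: str
    lemma: str
    tag: str


# Copied from the example output above.
tokens = [
    Token("Je", "být", "VB-S---3P-AA---"),
    Token("zima", "zima-1", "NNFS1-----A----"),
    Token(".", ".", "Z:-------------"),
    Token("Bude", "být", "VB-S---3F-AA---"),
    Token("sněžit", "sněžit_:T", "Vf--------A----"),
    Token(".", ".", "Z:-------------"),
]


def lemma_counts(tagged: Iterable[Token], skip_punct: bool = True) -> Counter:
    """Count lemmas, optionally skipping tokens tagged as punctuation."""
    counts = Counter()
    for tok in tagged:
        # Assumption: a tag starting with "Z" marks punctuation.
        if skip_punct and tok.tag.startswith("Z"):
            continue
        counts[tok.lemma] += 1
    return counts


counts = lemma_counts(tokens)
```

With a real model, the same helper could be fed the iterator returned by tagger.tag(...) directly, since it only relies on the lemma and tag fields.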

tag_untokenized(text, sents=False, guesser=False, convert=None) → Union[Iterator[corpy.morphodita.tagger.Token], Iterator[List[corpy.morphodita.tagger.Token]]]

This is the method tag() delegates to when text is a string. See the documentation of tag() for details about the parameters.

tag_tokenized(text, sents=False, guesser=False, convert=None) → Union[Iterator[corpy.morphodita.tagger.Token], Iterator[List[corpy.morphodita.tagger.Token]]]

This is the method tag() delegates to when text is an iterable of iterables of strings. See the documentation of tag() for details about the parameters.
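
The delegation described for tag(), tag_untokenized() and tag_tokenized() can be pictured as a simple type check. The sketch below is not corpy's actual implementation; choose_method is a hypothetical function that only returns the name of the method that tag() is documented to delegate to for a given input.

```python
from typing import Iterable, Union


def choose_method(text: Union[str, Iterable[Iterable[str]]]) -> str:
    """Return the name of the method tag() is documented to delegate to."""
    # A str is itself iterable, so it has to be checked first.
    if isinstance(text, str):
        return "tag_untokenized"
    # Anything else is treated as an iterable of pre-tokenized sentences.
    return "tag_tokenized"


for_string = choose_method("Je zima. Bude sněžit.")
for_nested = choose_method([["Je", "zima", "."], ["Bude", "sněžit", "."]])
```

Calling the specific method directly may be preferable when the input type is known in advance, since it makes the intended handling explicit at the call site.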