corpy.morphodita#

Convenient and easy-to-use MorphoDiTa wrappers.

class corpy.morphodita.Tokenizer(tokenizer_type: str)#

A wrapper API around the tokenizers offered by MorphoDiTa.

Parameters:

tokenizer_type – Tokenizer type, see below for possible values.

tokenizer_type is typically one of:

  • "czech": a tokenizer tuned for Czech

  • "english": a tokenizer tuned for English

  • "generic": a generic tokenizer

  • "vertical": a simple tokenizer for the vertical format, which is effectively already tokenized (one word per line)

Specifically, the available tokenizers are determined by the new_*_tokenizer static methods on the MorphoDiTa tokenizer class described in the MorphoDiTa API reference.
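For instance, a minimal sketch instantiating two of the types listed above (unlike from_tagger() below, these built-in tokenizers shouldn't need any model files):

>>> from corpy.morphodita import Tokenizer
>>> czech_tokenizer = Tokenizer("czech")
>>> vertical_tokenizer = Tokenizer("vertical")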

classmethod from_tagger(tagger_path: str | Path) → t.Self#

Load tokenizer associated with tagger file.
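For instance, a hedged sketch reusing the Czech tagger file from the tagging examples below:

>>> tokenizer = Tokenizer.from_tagger("./czech-morfflex-pdt.tagger")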

tokenize(text: str, sents: Literal[False] = False) → Iterator[str]#
tokenize(text: str, sents: Literal[True]) → Iterator[list[str]]
tokenize(text: str, sents: bool = False) → Iterator[str] | Iterator[list[str]]

Tokenize text.

Parameters:
  • text – Text to tokenize.

  • sents – If True, return an iterator of lists of tokens, each list being a sentence, instead of a flat iterator of tokens.

Note that MorphoDiTa performs both sentence splitting and tokenization at the same time, but this method iterates over tokens without sentence boundaries by default:

>>> from corpy.morphodita import Tokenizer
>>> t = Tokenizer("generic")
>>> for word in t.tokenize("foo bar baz"):
...     print(word)
...
foo
bar
baz

If you want to iterate over sentences (lists of tokens), set sents=True:

>>> for sentence in t.tokenize("foo bar baz", sents=True):
...     print(sentence)
...
['foo', 'bar', 'baz']

class corpy.morphodita.Token(word, lemma, tag)#

A tagged token, represented as a named tuple of its word form, lemma and morphological tag.

word: str#

Alias for field number 0

lemma: str#

Alias for field number 1

tag: str#

Alias for field number 2
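
Since Token is a named tuple (as the field aliases above indicate), its fields can be accessed by attribute or by index, and a token unpacks cleanly. A small illustrative sketch with a hand-built token:

>>> from corpy.morphodita import Token
>>> token = Token("zima", "zima-1", "NNFS1-----A----")
>>> token.lemma
'zima-1'
>>> word, lemma, tag = token
>>> tag
'NNFS1-----A----'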

class corpy.morphodita.Tagger(tagger_path: Path | str)#

A MorphoDiTa morphological tagger and lemmatizer.

Parameters:

tagger_path – Path to the pre-compiled tagging models to load.

tag(text: str | Iterable[Iterable[str]], *, sents: Literal[False] = False, guesser: bool = False, convert: str | None = None) → Iterator[Token]#
tag(text: str | Iterable[Iterable[str]], *, sents: Literal[True], guesser: bool = False, convert: str | None = None) → Iterator[list[Token]]
tag(text: str | Iterable[Iterable[str]], *, sents: bool = False, guesser: bool = False, convert: str | None = None) → Iterator[Token] | Iterator[list[Token]]

Perform morphological tagging and lemmatization on text.

If text is a string, sentence-split, tokenize and tag that string. If it’s an iterable of iterables of strings (typically a list of lists of strings), then take each nested iterable as a separate sentence and tag it, honoring the provided sentence boundaries and tokenization.

Parameters:
  • text – Text to tag: either a string, or an iterable of iterables of strings (pre-split sentences of tokens).

  • sents – If True, return an iterator of lists of tokens, each list being a sentence, instead of a flat iterator of tokens.

  • guesser – If True, use the morphological guesser provided with the tagger (if available).

  • convert – Conversion strategy to apply to lemmas and / or tags before outputting them. One of "pdt_to_conll2009", "strip_lemma_comment" or "strip_lemma_id", or None if no conversion is required.

>>> from corpy.morphodita import Tagger
>>> tagger = Tagger("./czech-morfflex-pdt.tagger")
>>> from pprint import pprint
>>> tokens = list(tagger.tag("Je zima. Bude sněžit."))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AAI--'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AAI--'),
 Token(word='sněžit', lemma='sněžit', tag='Vf--------A-I--'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> tokens = list(tagger.tag([['Je', 'zima', '.'], ['Bude', 'sněžit', '.']]))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AAI--'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AAI--'),
 Token(word='sněžit', lemma='sněžit', tag='Vf--------A-I--'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> sents = list(tagger.tag("Je zima. Bude sněžit.", sents=True))
>>> pprint(sents)
[[Token(word='Je', lemma='být', tag='VB-S---3P-AAI--'),
  Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
  Token(word='.', lemma='.', tag='Z:-------------')],
 [Token(word='Bude', lemma='být', tag='VB-S---3F-AAI--'),
  Token(word='sněžit', lemma='sněžit', tag='Vf--------A-I--'),
  Token(word='.', lemma='.', tag='Z:-------------')]]
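
The convert argument post-processes the model's output before it is yielded. A hedged sketch, assuming "strip_lemma_id" drops the numeric homonym id from lemmas such as 'zima-1' above (the exact lemmas depend on the tagger model):

>>> for token in tagger.tag("Je zima.", convert="strip_lemma_id"):
...     print(token.lemma)
...
být
zima
.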
tag_untokenized(text: str, *, sents: Literal[False] = False, guesser: bool = False, convert: str | None = None) → Iterator[Token]#
tag_untokenized(text: str, *, sents: Literal[True], guesser: bool = False, convert: str | None = None) → Iterator[list[Token]]
tag_untokenized(text: str, *, sents: bool = False, guesser: bool = False, convert: str | None = None) → Iterator[Token] | Iterator[list[Token]]

This is the method tag() delegates to when text is a string. See the docstring of tag() for details about the parameters.
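
For instance, calling it directly should produce the same tokens as the first tag() example above (a sketch relying on that equivalence):

>>> list(tagger.tag_untokenized("Je zima. Bude sněžit.")) == tokens
True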

tag_tokenized(text: Iterable[Iterable[str]], *, sents: Literal[False] = False, guesser: bool = False, convert: str | None = None) → Iterator[Token]#
tag_tokenized(text: Iterable[Iterable[str]], *, sents: Literal[True], guesser: bool = False, convert: str | None = None) → Iterator[list[Token]]
tag_tokenized(text: Iterable[Iterable[str]], *, sents: bool = False, guesser: bool = False, convert: str | None = None) → Iterator[Token] | Iterator[list[Token]]

This is the method tag() delegates to when text is an iterable of iterables of strings. See the docstring of tag() for details about the parameters.
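
And symmetrically for pre-tokenized input (again a sketch mirroring the nested-list tag() example above):

>>> list(tagger.tag_tokenized([["Je", "zima", "."], ["Bude", "sněžit", "."]])) == tokens
True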