corpy.morphodita#

Convenient and easy-to-use MorphoDiTa wrappers.

class corpy.morphodita.Tokenizer(tokenizer_type: str)#

A wrapper API around the tokenizers offered by MorphoDiTa.

Parameters:

tokenizer_type – Tokenizer type, see below for possible values.

tokenizer_type is typically one of:

  • "czech": a tokenizer tuned for Czech

  • "english": a tokenizer tuned for English

  • "generic": a generic tokenizer

  • "vertical": a simple tokenizer for the vertical format, which is effectively already tokenized (one word per line)

Specifically, the available tokenizers are determined by the new_*_tokenizer static methods on the MorphoDiTa tokenizer class described in the MorphoDiTa API reference.
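For instance, a minimal sketch instantiating two of the types listed above (unlike from_tagger() below, these built-in tokenizers shouldn't need any model files):

>>> from corpy.morphodita import Tokenizer
>>> czech_tokenizer = Tokenizer("czech")
>>> vertical_tokenizer = Tokenizer("vertical")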

classmethod from_tagger(tagger_path: str | Path) → t.Self#

Load tokenizer associated with tagger file.
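For instance, a hedged sketch reusing the Czech tagger file from the tagging examples below:

>>> tokenizer = Tokenizer.from_tagger("./czech-morfflex-pdt.tagger")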

tokenize(text: str, sents: Literal[False] = False) → Iterator[str]#
tokenize(text: str, sents: Literal[True]) → Iterator[list[str]]
tokenize(text: str, sents: bool = False) → Iterator[str] | Iterator[list[str]]

Tokenize text.

Parameters:
  • text – Text to tokenize.

  • sents – If True, return an iterator of lists of tokens, each list being a sentence, instead of a flat iterator of tokens.

Note that MorphoDiTa performs both sentence splitting and tokenization at the same time, but this method iterates over tokens without sentence boundaries by default:

>>> from corpy.morphodita import Tokenizer
>>> t = Tokenizer("generic")
>>> for word in t.tokenize("foo bar baz"):
...     print(word)
...
foo
bar
baz

If you want to iterate over sentences (lists of tokens), set sents=True:

>>> for sentence in t.tokenize("foo bar baz", sents=True):
...     print(sentence)
...
['foo', 'bar', 'baz']

class corpy.morphodita.Token(word, lemma, tag)#

A tagged token, represented as a named tuple of its word form, lemma and morphological tag.

word: str#

Alias for field number 0

lemma: str#

Alias for field number 1

tag: str#

Alias for field number 2
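
Since Token is a named tuple (as the field aliases above indicate), its fields can be accessed by attribute or by index, and a token unpacks cleanly. A small illustrative sketch with a hand-built token:

>>> from corpy.morphodita import Token
>>> token = Token("zima", "zima-1", "NNFS1-----A----")
>>> token.lemma
'zima-1'
>>> word, lemma, tag = token
>>> tag
'NNFS1-----A----'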

class corpy.morphodita.Tagger(tagger_path: Path | str)#

A MorphoDiTa morphological tagger and lemmatizer.

Parameters:

tagger_path – Path to the pre-compiled tagging models to load.

tag(text: str | Iterable[Iterable[str]], *, sents: Literal[False] = False, guesser: bool = False, convert: str | None = None) → Iterator[Token]#
tag(text: str | Iterable[Iterable[str]], *, sents: Literal[True], guesser: bool = False, convert: str | None = None) → Iterator[list[Token]]
tag(text: str | Iterable[Iterable[str]], *, sents: bool = False, guesser: bool = False, convert: str | None = None) → Iterator[Token] | Iterator[list[Token]]

Perform morphological tagging and lemmatization on text.

If text is a string, sentence-split, tokenize and tag that string. If it’s an iterable of iterables of strings (typically a list of lists of strings), then take each nested iterable as a separate sentence and tag it, honoring the provided sentence boundaries and tokenization.

Parameters:
  • text – Text to tag: either a string, or an iterable of iterables of strings (pre-split sentences of tokens).

  • sents – If True, return an iterator of lists of tokens, each list being a sentence, instead of a flat iterator of tokens.

  • guesser – If True, use the morphological guesser provided with the tagger (if available).

  • convert – Conversion strategy to apply to lemmas and / or tags before outputting them. One of "pdt_to_conll2009", "strip_lemma_comment" or "strip_lemma_id", or None if no conversion is required.

>>> from corpy.morphodita import Tagger
>>> tagger = Tagger("./czech-morfflex-pdt.tagger")
>>> from pprint import pprint
>>> tokens = list(tagger.tag("Je zima. Bude sněžit."))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AAI--'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AAI--'),
 Token(word='sněžit', lemma='sněžit', tag='Vf--------A-I--'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> tokens = list(tagger.tag([['Je', 'zima', '.'], ['Bude', 'sněžit', '.']]))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AAI--'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AAI--'),
 Token(word='sněžit', lemma='sněžit', tag='Vf--------A-I--'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> sents = list(tagger.tag("Je zima. Bude sněžit.", sents=True))
>>> pprint(sents)
[[Token(word='Je', lemma='být', tag='VB-S---3P-AAI--'),
  Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
  Token(word='.', lemma='.', tag='Z:-------------')],
 [Token(word='Bude', lemma='být', tag='VB-S---3F-AAI--'),
  Token(word='sněžit', lemma='sněžit', tag='Vf--------A-I--'),
  Token(word='.', lemma='.', tag='Z:-------------')]]
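
The convert argument post-processes the model's output before it is yielded. A hedged sketch, assuming "strip_lemma_id" drops the numeric homonym id from lemmas such as 'zima-1' above (the exact lemmas depend on the tagger model):

>>> for token in tagger.tag("Je zima.", convert="strip_lemma_id"):
...     print(token.lemma)
...
být
zima
.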
tag_untokenized(text: str, *, sents: Literal[False] = False, guesser: bool = False, convert: str | None = None) → Iterator[Token]#
tag_untokenized(text: str, *, sents: Literal[True], guesser: bool = False, convert: str | None = None) → Iterator[list[Token]]
tag_untokenized(text: str, *, sents: bool = False, guesser: bool = False, convert: str | None = None) → Iterator[Token] | Iterator[list[Token]]

This is the method tag() delegates to when text is a string. See the docstring of tag() for details about the parameters.
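
For instance, calling it directly should produce the same tokens as the first tag() example above (a sketch relying on that equivalence):

>>> list(tagger.tag_untokenized("Je zima. Bude sněžit.")) == tokens
True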

tag_tokenized(text: Iterable[Iterable[str]], *, sents: Literal[False] = False, guesser: bool = False, convert: str | None = None) → Iterator[Token]#
tag_tokenized(text: Iterable[Iterable[str]], *, sents: Literal[True], guesser: bool = False, convert: str | None = None) → Iterator[list[Token]]
tag_tokenized(text: Iterable[Iterable[str]], *, sents: bool = False, guesser: bool = False, convert: str | None = None) → Iterator[Token] | Iterator[list[Token]]

This is the method tag() delegates to when text is an iterable of iterables of strings. See the docstring of tag() for details about the parameters.
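
And symmetrically for pre-tokenized input (again a sketch mirroring the nested-list tag() example above):

>>> list(tagger.tag_tokenized([["Je", "zima", "."], ["Bude", "sněžit", "."]])) == tokens
True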