corpy.morphodita
Convenient and easy-to-use MorphoDiTa wrappers.
- class corpy.morphodita.Tokenizer(tokenizer_type: str)
A wrapper API around the tokenizers offered by MorphoDiTa.
- Parameters:
tokenizer_type – Tokenizer type, see below for possible values.
tokenizer_type is typically one of:

- "czech": a tokenizer tuned for Czech
- "english": a tokenizer tuned for English
- "generic": a generic tokenizer
- "vertical": a simple tokenizer for the vertical format, which is effectively already tokenized (one word per line)

Specifically, the available tokenizers are determined by the new_*_tokenizer static methods on the MorphoDiTa tokenizer class described in the MorphoDiTa API reference.
- classmethod from_tagger(tagger_path: str | Path) → t.Self
Load tokenizer associated with tagger file.
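For example, a sketch of loading the tokenizer that matches a tagger model (the path to the .tagger file is an assumption, substitute your own):
>>> t = Tokenizer.from_tagger("./czech-morfflex-pdt.tagger")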
- tokenize(text: str, sents: Literal[False]) → Iterator[str]
- tokenize(text: str, sents: Literal[True]) → Iterator[list[str]]
- tokenize(text: str, sents: bool = False) → Iterator[str] | Iterator[list[str]]
Tokenize text.
- Parameters:
text – Text to tokenize.
sents – If True, return an iterator of lists of tokens, each list being a sentence, instead of a flat iterator of tokens.
Note that MorphoDiTa performs both sentence splitting and tokenization at the same time, but this method iterates over tokens without sentence boundaries by default:
>>> from corpy.morphodita import Tokenizer
>>> t = Tokenizer("generic")
>>> for word in t.tokenize("foo bar baz"):
...     print(word)
...
foo
bar
baz
If you want to iterate over sentences (lists of tokens), set sents=True:
>>> for sentence in t.tokenize("foo bar baz", sents=True):
...     print(sentence)
...
['foo', 'bar', 'baz']
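Both variants return lazy iterators, so tokens can also be consumed one at a time or collected with list() (a minimal sketch reusing the generic tokenizer from above):
>>> tokens = t.tokenize("foo bar baz")
>>> next(tokens)
'foo'
>>> list(tokens)
['bar', 'baz']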
- class corpy.morphodita.Token(word, lemma, tag)
- word: str
Alias for field number 0
- lemma: str
Alias for field number 1
- tag: str
Alias for field number 2
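Token is a named tuple, so its fields can be read by attribute, by index, or via unpacking (a minimal sketch using values from the tagging examples below):
>>> from corpy.morphodita import Token
>>> tok = Token(word='zima', lemma='zima-1', tag='NNFS1-----A----')
>>> tok.lemma
'zima-1'
>>> tok[1]
'zima-1'
>>> word, lemma, tag = tok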
- class corpy.morphodita.Tagger(tagger_path: Path | str)
A MorphoDiTa morphological tagger and lemmatizer.
- Parameters:
tagger_path – Path to the pre-compiled tagging models to load.
- tag(text: str | Iterable[Iterable[str]], *, sents: Literal[False] = False, guesser: bool = False, convert: str | None = None) → Iterator[Token]
- tag(text: str | Iterable[Iterable[str]], *, sents: Literal[True], guesser: bool = False, convert: str | None = None) → Iterator[list[Token]]
- tag(text: str | Iterable[Iterable[str]], *, sents: bool = False, guesser: bool = False, convert: str | None = None) → Iterator[Token] | Iterator[list[Token]]
Perform morphological tagging and lemmatization on text.
If text is a string, sentence-split, tokenize and tag that string. If it's an iterable of iterables of strings (typically a list of lists of strings), then take each nested iterable as a separate sentence and tag it, honoring the provided sentence boundaries and tokenization.
- Parameters:
text – Input text.
sents – If True, return an iterator of lists of tokens, each list being a sentence, instead of a flat iterator of tokens.
guesser – If True, use the morphological guesser provided with the tagger (if available).
convert – Conversion strategy to apply to lemmas and / or tags before outputting them. One of "pdt_to_conll2009", "strip_lemma_comment" or "strip_lemma_id", or None if no conversion is required (a sketch follows the examples below).
>>> tagger = Tagger("./czech-morfflex-pdt.tagger")
>>> from pprint import pprint
>>> tokens = list(tagger.tag("Je zima. Bude sněžit."))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AAI--'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AAI--'),
 Token(word='sněžit', lemma='sněžit', tag='Vf--------A-I--'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> tokens = list(tagger.tag([['Je', 'zima', '.'], ['Bude', 'sněžit', '.']]))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AAI--'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AAI--'),
 Token(word='sněžit', lemma='sněžit', tag='Vf--------A-I--'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> sents = list(tagger.tag("Je zima. Bude sněžit.", sents=True))
>>> pprint(sents)
[[Token(word='Je', lemma='být', tag='VB-S---3P-AAI--'),
  Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
  Token(word='.', lemma='.', tag='Z:-------------')],
 [Token(word='Bude', lemma='být', tag='VB-S---3F-AAI--'),
  Token(word='sněžit', lemma='sněžit', tag='Vf--------A-I--'),
  Token(word='.', lemma='.', tag='Z:-------------')]]
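As a sketch of the convert parameter, assuming "strip_lemma_id" removes the numeric ID that disambiguates homonymous lemmas (the 'zima-1' above):
>>> tokens = list(tagger.tag("Je zima.", convert="strip_lemma_id"))
>>> tokens[1].lemma  # assumed: 'zima-1' with the '-1' ID stripped
'zima'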
- tag_untokenized(text: str, *, sents: Literal[False] = False, guesser: bool = False, convert: str | None = None) → Iterator[Token]
- tag_untokenized(text: str, *, sents: Literal[True], guesser: bool = False, convert: str | None = None) → Iterator[list[Token]]
- tag_untokenized(text: str, *, sents: bool = False, guesser: bool = False, convert: str | None = None) → Iterator[Token] | Iterator[list[Token]]
This is the method tag() delegates to when text is a string. See docstring for tag() for details about parameters.
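Given that delegation, calling it directly on a string should be equivalent to calling tag() (a sketch reusing the tagger from the tag() examples):
>>> list(tagger.tag_untokenized("Je zima.")) == list(tagger.tag("Je zima."))
True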
- tag_tokenized(text: Iterable[Iterable[str]], *, sents: Literal[False] = False, guesser: bool = False, convert: str | None = None) → Iterator[Token]
- tag_tokenized(text: Iterable[Iterable[str]], *, sents: Literal[True], guesser: bool = False, convert: str | None = None) → Iterator[list[Token]]
- tag_tokenized(text: Iterable[Iterable[str]], *, sents: bool = False, guesser: bool = False, convert: str | None = None) → Iterator[Token] | Iterator[list[Token]]
This is the method tag() delegates to when text is an iterable of iterables of strings. See docstring for tag() for details about parameters.
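Similarly, a sketch of calling it directly with pre-tokenized sentences (same input as in the tag() examples):
>>> sents = list(tagger.tag_tokenized([['Je', 'zima', '.'], ['Bude', 'sněžit', '.']], sents=True))
>>> len(sents)
2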