corpy.morphodita
Convenient and easy-to-use MorphoDiTa wrappers.
- class corpy.morphodita.Tokenizer(tokenizer_type)
A wrapper API around the tokenizers offered by MorphoDiTa.
- Parameters
tokenizer_type (str) – Type of the requested tokenizer (cf. below for possible values).
tokenizer_type is typically one of:
- "czech": a tokenizer tuned for Czech
- "english": a tokenizer tuned for English
- "generic": a generic tokenizer
- "vertical": a simple tokenizer for the vertical format, which is effectively already tokenized (one word per line)
Specifically, the available tokenizers are determined by the new_*_tokenizer static methods on the MorphoDiTa tokenizer class described in the MorphoDiTa API reference.
- static from_tagger(tagger_path)
Load the tokenizer associated with a tagger file.
- tokenize(text, sents=False)
Tokenize text.
- Parameters
text (str) – Text to tokenize.
sents (bool) – Whether to signal sentence boundaries by outputting a sequence of lists (sentences).
- Returns
An iterator over the tokenized text, possibly grouped into sentences if sents=True.
Note that MorphoDiTa performs both sentence splitting and tokenization at the same time, but this method iterates over tokens without sentence boundaries by default:
>>> from corpy.morphodita import Tokenizer
>>> t = Tokenizer("generic")
>>> for word in t.tokenize("foo bar baz"):
...     print(word)
...
foo
bar
baz
If you want to iterate over sentences (lists of tokens), set sents=True:
>>> for sentence in t.tokenize("foo bar baz", sents=True):
...     print(sentence)
...
['foo', 'bar', 'baz']
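With sents=True the output is grouped into lists per sentence; if you later need a flat token stream again, the standard library handles that without touching the tokenizer. A minimal sketch using hard-coded sample data in the shape of the sentence-grouped output (not actual tokenizer calls):

```python
from itertools import chain

# Sample data in the shape produced by tokenize(..., sents=True):
# a sequence of sentences, each a list of word strings.
sentences = [["foo", "bar", "baz"], ["qux", "quux"]]

# chain.from_iterable flattens the sentence lists back into one token stream.
flat = list(chain.from_iterable(sentences))
print(flat)  # ['foo', 'bar', 'baz', 'qux', 'quux']
```

Because tokenize() returns an iterator, this flattening also works lazily if you skip the list() call.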
- class corpy.morphodita.Token(word, lemma, tag)
- word: str
Alias for field number 0
- lemma: str
Alias for field number 1
- tag: str
Alias for field number 2
- class corpy.morphodita.Tagger(tagger_path: Union[pathlib.Path, str])
A MorphoDiTa morphological tagger and lemmatizer.
- Parameters
tagger_path (str or pathlib.Path) – Path to the pre-compiled tagging models to load.
- tag(text, *, sents=False, guesser=False, convert=None) → Union[Iterator[corpy.morphodita.tagger.Token], Iterator[List[corpy.morphodita.tagger.Token]]]
Perform morphological tagging and lemmatization on text.
If text is a string, sentence-split, tokenize and tag that string. If it's an iterable of iterables (typically a list of lists), then take each nested iterable as a separate sentence and tag it, honoring the provided sentence boundaries and tokenization.
- Parameters
text (either str (tokenization is left to the tagger) or iterable of iterables (of str), representing individual sentences) – Input text.
sents (bool) – Whether to signal sentence boundaries by outputting a sequence of lists (sentences).
guesser (bool) – Whether to use the morphological guesser provided with the tagger (if available).
convert (str or None) – Conversion strategy to apply to lemmas and/or tags before outputting them; one of "pdt_to_conll2009", "strip_lemma_comment" or "strip_lemma_id", or None if no conversion is required.
- Returns
An iterator over the tagged text, possibly grouped into sentences if sents=True.
>>> tagger = Tagger("./czech-morfflex-pdt-161115.tagger")
>>> from pprint import pprint
>>> tokens = list(tagger.tag("Je zima. Bude sněžit."))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
 Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> tokens = list(tagger.tag([['Je', 'zima', '.'], ['Bude', 'sněžit', '.']]))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
 Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
 Token(word='.', lemma='.', tag='Z:-------------')]
>>> sents = list(tagger.tag("Je zima. Bude sněžit.", sents=True))
>>> pprint(sents)
[[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
  Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
  Token(word='.', lemma='.', tag='Z:-------------')],
 [Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
  Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
  Token(word='.', lemma='.', tag='Z:-------------')]]
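The lemmas in the example above illustrate what the lemma-oriented convert strategies are about: PDT-style lemmas can carry a numeric homonym id ("zima-1") and a technical comment introduced by an underscore ("sněžit_:T"). The actual transformations are performed by MorphoDiTa's tagset converters; the following is only a rough, unofficial regex sketch of the idea, not how corpy implements it:

```python
import re

def strip_lemma_comment(lemma: str) -> str:
    # Drop everything from the first "_" on (technical comments like "_:T").
    return re.sub(r"_.*$", "", lemma)

def strip_lemma_id(lemma: str) -> str:
    # Drop a trailing numeric homonym id such as "-1".
    return re.sub(r"-\d+(?=_|$)", "", lemma)

print(strip_lemma_comment("sněžit_:T"))  # sněžit
print(strip_lemma_id("zima-1"))          # zima
```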
- tag_untokenized(text, sents=False, guesser=False, convert=None) → Union[Iterator[corpy.morphodita.tagger.Token], Iterator[List[corpy.morphodita.tagger.Token]]]
This is the method tag() delegates to when text is a string. See the docstring of tag() for details about the parameters.
- tag_tokenized(text, sents=False, guesser=False, convert=None) → Union[Iterator[corpy.morphodita.tagger.Token], Iterator[List[corpy.morphodita.tagger.Token]]]
This is the method tag() delegates to when text is an iterable of iterables of strings. See the docstring of tag() for details about the parameters.
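The split between these two helpers is a dispatch on input shape: strings take the untokenized path, nested iterables the tokenized one. A simplified sketch of that dispatch, with trivial stand-in bodies so it can run without MorphoDiTa models (this is an illustration, not corpy's actual code):

```python
def tag(text, **kwargs):
    # Strings are handed to the tagger's own sentence splitter and tokenizer;
    # anything else is treated as pre-tokenized sentences.
    if isinstance(text, str):
        return tag_untokenized(text, **kwargs)
    return tag_tokenized(text, **kwargs)

# Stand-ins that just report which path was taken.
def tag_untokenized(text, **kwargs):
    return "untokenized"

def tag_tokenized(text, **kwargs):
    return "tokenized"

assert tag("Je zima.") == "untokenized"
assert tag([["Je", "zima", "."]]) == "tokenized"
```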