Tokenize and tag text with MorphoDiTa
Overview
The corpy.morphodita sub-package offers a more user-friendly wrapper around the default SWIG-generated Python bindings for the MorphoDiTa morphological tagging and lemmatization framework.
The target audiences are:
- beginner programmers interested in NLP
- seasoned programmers who want to use MorphoDiTa through a more Pythonic interface, without having to dig into the API reference and the examples, and who are not too worried about a possible performance hit as compared with full manual control
Pre-trained tagging models which can be used with MorphoDiTa can be found here. Currently, Czech and English models are available. Please respect their CC BY-NC-SA 3.0 license!
At the moment, only a subset of the functionality offered by the MorphoDiTa API (tokenization and tagging) is available through corpy.morphodita.
If you get stuck, check out the module’s API reference for more details.
Tokenization
When instantiating a Tokenizer, pass in a string which determines the type of tokenizer to create. Valid options are "czech", "english", "generic" and "vertical" (cf. also the new_*_tokenizer methods in the MorphoDiTa API reference).
>>> from corpy.morphodita import Tokenizer
>>> tokenizer = Tokenizer("generic")
>>> for word in tokenizer.tokenize("foo bar baz"):
...     print(word)
...
foo
bar
baz
Alternatively, if you want to use the tokenizer associated with a MorphoDiTa *.tagger file you have available, you can instantiate it using from_tagger().
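A minimal sketch of that alternative, assuming from_tagger() accepts the path to the *.tagger file as a string. The load_tokenizer helper and its file check are illustrative only and not part of corpy:

```python
from pathlib import Path


def load_tokenizer(tagger_path):
    """Return the tokenizer associated with a MorphoDiTa *.tagger file.

    Hypothetical helper, not part of corpy: it only adds an early,
    readable error when the model file is missing.
    """
    path = Path(tagger_path)
    if not path.is_file():
        raise FileNotFoundError(f"No tagger model found at {path}")
    # Imported lazily so that the path check above runs even where
    # corpy is not installed.
    from corpy.morphodita import Tokenizer

    # Assumption: from_tagger() takes the model path as a string.
    return Tokenizer.from_tagger(str(path))
```

With a model downloaded, load_tokenizer("./czech-morfflex-pdt-161115.tagger") should then return a tokenizer which can be used like the one created above.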
If you’re interested in sentence boundaries too, pass sents=True to tokenize():
>>> for sentence in tokenizer.tokenize("foo bar baz", sents=True):
...     print(sentence)
...
['foo', 'bar', 'baz']
Tagging
NOTE: Unlike tokenization, tagging in MorphoDiTa requires you to supply your own pre-trained tagging models (see Overview above).
Initialize a new tagger:
>>> from corpy.morphodita import Tagger
>>> tagger = Tagger("./czech-morfflex-pdt-161115.tagger")
Tokenize, tag and lemmatize a text represented as a string:
>>> from pprint import pprint
>>> tokens = list(tagger.tag("Je zima. Bude sněžit."))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
 Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
 Token(word='.', lemma='.', tag='Z:-------------')]
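The tags in this output are positional: each of the 15 characters encodes one morphological category, and "-" means the category does not apply. The sketch below shows one way to unpack them. The Token namedtuple is only a stand-in with the same fields as the objects shown above, decode_tag is a hypothetical helper rather than part of corpy, and the position names (in particular treating positions 13 and 14 as reserved) are an assumption you should check against the documentation of your model's tag set:

```python
from collections import namedtuple

# Stand-in with the same fields as the Token objects shown above;
# the real class comes from corpy.morphodita.
Token = namedtuple("Token", ["word", "lemma", "tag"])

# Names for the 15 tag positions. Assumption: positions 13 and 14 are
# treated as reserved here; newer tag sets may use them differently.
POSITIONS = (
    "pos", "subpos", "gender", "number", "case",
    "possgender", "possnumber", "person", "tense", "grade",
    "negation", "voice", "reserve1", "reserve2", "variant",
)


def decode_tag(tag):
    """Map a positional tag to {position name: value}, skipping '-'."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


token = Token(word="Je", lemma="být", tag="VB-S---3P-AA---")
features = decode_tag(token.tag)
# features["pos"] == "V", features["person"] == "3", features["tense"] == "P"
```

Keeping only the filled positions makes it easy to filter tokens, e.g. selecting all verbs with features.get("pos") == "V".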
With sentence boundaries:
>>> sents = list(tagger.tag("Je zima. Bude sněžit.", sents=True))
>>> pprint(sents)
[[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
  Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
  Token(word='.', lemma='.', tag='Z:-------------')],
 [Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
  Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
  Token(word='.', lemma='.', tag='Z:-------------')]]
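Lemmas such as zima-1 or sněžit_:T carry technical suffixes (a numeric lemma id, comments introduced by _ or `). If you only need the bare lemma for each sentence, something along the following lines may do. This is a sketch which approximates the suffix rules of the Czech models with a regular expression; raw_lemma is a hypothetical helper, not a corpy function, and the Token namedtuple is again just a stand-in for the real class:

```python
import re
from collections import namedtuple

# Stand-in with the same fields as the Token objects shown above.
Token = namedtuple("Token", ["word", "lemma", "tag"])

# A suffix starts at "_", "`" or "-<digit>", but never at the very first
# character (so that lemmas like "-" or "_" survive). Assumption: this
# mirrors the conventions of the Czech models; other models may differ.
_SUFFIX_START = re.compile(r"(?<=.)(?:[_`]|-(?=\d))")


def raw_lemma(lemma):
    """Strip the lemma id and comments, keeping only the bare lemma."""
    return _SUFFIX_START.split(lemma, maxsplit=1)[0]


# Same shape as the output of tagger.tag(..., sents=True) shown above.
sents = [
    [Token("Je", "být", "VB-S---3P-AA---"),
     Token("zima", "zima-1", "NNFS1-----A----"),
     Token(".", ".", "Z:-------------")],
    [Token("Bude", "být", "VB-S---3F-AA---"),
     Token("sněžit", "sněžit_:T", "Vf--------A----"),
     Token(".", ".", "Z:-------------")],
]

lemmas = [[raw_lemma(token.lemma) for token in sent] for sent in sents]
# lemmas == [['být', 'zima', '.'], ['být', 'sněžit', '.']]
```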
Tag and lemmatize an already sentence-split and tokenized piece of text, represented as an iterable of iterables of strings:
>>> tokens = list(tagger.tag([['Je', 'zima', '.'], ['Bude', 'sněžit', '.']]))
>>> pprint(tokens)
[Token(word='Je', lemma='být', tag='VB-S---3P-AA---'),
 Token(word='zima', lemma='zima-1', tag='NNFS1-----A----'),
 Token(word='.', lemma='.', tag='Z:-------------'),
 Token(word='Bude', lemma='být', tag='VB-S---3F-AA---'),
 Token(word='sněžit', lemma='sněžit_:T', tag='Vf--------A----'),
 Token(word='.', lemma='.', tag='Z:-------------')]
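The output above is flat even though the input was split into sentences. If all you have is such a flat list, the sentence structure of the input can be restored by slicing it according to the input sentence lengths. The sketch below assumes the tagger returns exactly one token per input string, in order; regroup is a hypothetical helper, and plain tuples stand in for the Token objects:

```python
from itertools import islice


def regroup(flat_tokens, sentences):
    """Split flat_tokens into lists matching the lengths of sentences."""
    iterator = iter(flat_tokens)
    return [list(islice(iterator, len(sentence))) for sentence in sentences]


pretokenized = [["Je", "zima", "."], ["Bude", "sněžit", "."]]

# Stand-in for list(tagger.tag(pretokenized)), as (word, lemma, tag) tuples.
flat = [
    ("Je", "být", "VB-S---3P-AA---"),
    ("zima", "zima-1", "NNFS1-----A----"),
    (".", ".", "Z:-------------"),
    ("Bude", "být", "VB-S---3F-AA---"),
    ("sněžit", "sněžit_:T", "Vf--------A----"),
    (".", ".", "Z:-------------"),
]

regrouped = regroup(flat, pretokenized)
# regrouped[0] holds the three tokens of the first sentence,
# regrouped[1] those of the second.
```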