corpy.udpipe
Tokenizing, tagging and parsing text with UDPipe.
- exception corpy.udpipe.UdpipeError
An error which occurred in the ufal.udpipe C extension.
- class corpy.udpipe.Model(model_path)
A UDPipe model for tagging and parsing text.
- Parameters:
model_path (str or pathlib.Path) – Path to the pre-compiled UDPipe model to load.
- process(text, *, tag=True, parse=True, in_format=None, out_format=None)
Process input text, yielding sentences one by one.
The text is always at least tokenized, and optionally morphologically tagged and syntactically parsed, depending on the values of the tag and parse arguments.
- Parameters:
text (str) – Text to process.
tag (bool) – Perform morphological tagging.
parse (bool) – Perform syntactic parsing.
in_format (None or str) – Input format (cf. below for possible values).
out_format (None or str) – Output format (cf. below for possible values).
The input text is a string in one of the following formats (specified by in_format):
None: freeform text, which will be sentence split and tokenized by UDPipe
"conllu": the CoNLL-U format
"horizontal": one sentence per line, word forms separated by spaces
"vertical": one word per line, empty lines denote sentence ends
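To make the difference between the "horizontal" and "vertical" layouts concrete, here is a small pure-Python sketch that rewrites one into the other. The function name horizontal_to_vertical is hypothetical and not part of corpy; the sketch only follows the one-line descriptions above and has not been checked against UDPipe's own readers.

```python
def horizontal_to_vertical(text):
    """Rewrite "horizontal" text as "vertical" text.

    Horizontal: one sentence per line, word forms separated by spaces.
    Vertical: one word per line, an empty line after each sentence.
    """
    out = []
    for line in text.splitlines():
        words = line.split()
        if words:  # ignore blank input lines
            # One word per line, then an empty line to close the sentence.
            out.append("\n".join(words) + "\n\n")
    return "".join(out)


horizontal = "Hello world .\nHow are you ?\n"
vertical = horizontal_to_vertical(horizontal)
print(vertical)
```

Either string could then be passed as text with the matching in_format value.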
The output format is specified by out_format:
None: native ufal.udpipe objects, suitable for further manipulation in Python
"conllu", "horizontal" or "vertical": cf. above
"epe": the EPE (Extrinsic Parser Evaluation 2017) interchange format
"matxin": the Matxin XML format
"plaintext": reconstruct text with original spaces, discarding annotations
New input and output formats may be added with new releases of UDPipe; for an up-to-date list, consult the UDPipe API reference.
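As an illustration of how the keyword arguments combine, the helper below tokenizes and tags freeform text without parsing it and returns the result as CoNLL-U. This is a minimal sketch: the name conllu_without_parsing is hypothetical, and it assumes that process() yields strings rather than native objects when out_format is set.

```python
def conllu_without_parsing(model, text):
    """Tokenize and tag ``text`` (no parsing); return one CoNLL-U string.

    ``model`` is expected to be a corpy.udpipe.Model, but any object with
    a compatible ``process()`` method will do.
    """
    chunks = model.process(
        text,
        tag=True,             # morphological tagging on
        parse=False,          # syntactic parsing off
        in_format=None,       # freeform text, tokenized by UDPipe
        out_format="conllu",  # serialize instead of yielding native objects
    )
    # Assumption: with out_format set, process() yields strings.
    return "".join(chunks)
```

With a model loaded via Model(model_path), calling conllu_without_parsing(model, "Some text.") would be expected to return the tagged sentences in CoNLL-U form, with the dependency columns left unfilled.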
- tag(sent)
Perform morphological tagging on a sentence.
Modifies sent in place.
- Parameters:
sent (ufal.udpipe.Sentence) – Sentence to tag.
- parse(sent)
Perform syntactic parsing on a sentence.
Modifies sent in place.
- Parameters:
sent (ufal.udpipe.Sentence) – Sentence to parse.
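Because tag() and parse() modify their argument in place, they can be applied step by step to sentences obtained elsewhere, for instance from process() called with tag=False and parse=False. The sketch below shows that pattern; annotate_in_place is a hypothetical helper name, and running the tagger before the parser is an assumption (parsers typically rely on the tags), not something this page states.

```python
def annotate_in_place(model, sents, parse=True):
    """Tag, and optionally parse, each sentence in place.

    ``model`` is expected to offer ``tag(sent)`` and ``parse(sent)`` as
    documented for corpy.udpipe.Model. Returns the same sentence objects
    collected in a list.
    """
    done = []
    for sent in sents:
        model.tag(sent)        # in place; nothing useful is returned
        if parse:
            model.parse(sent)  # in place, after tagging (assumed order)
        done.append(sent)      # same object as the input sentence
    return done
```

The returned list holds the original sentence objects, so no copy is made and any earlier reference to a sentence sees the new annotations.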
- corpy.udpipe.load(corpus, in_format='conllu')
Load a corpus in the given input format.
- Parameters:
corpus (str) – The data to load.
in_format (str) – Cf. the documentation of Model.process().
- Returns:
A generator of sentences (ufal.udpipe.Sentence).
- corpy.udpipe.dump(sent_or_sents, out_format='conllu')
Dump a sentence or sentences in the given output format.
- Parameters:
sent_or_sents – The data to dump.
out_format (str) – Cf. the documentation of Model.process().
- Returns:
A generator of strings, corresponding to the serialized sentences. One final additional string may contain any closing markup, if required by the output format.
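Since load() and dump() need no model, chaining them gives a plain format conversion. The sketch below joins every string that dump() yields, including the possible final chunk of closing markup. The helper name convert and its load/dump keyword arguments are hypothetical (they exist only so the function can be exercised without corpy installed), and passing the generator returned by load() straight to dump() is an assumption based on the "sentence or sentences" wording above.

```python
def convert(corpus, in_format="conllu", out_format="horizontal", *, load=None, dump=None):
    """Convert ``corpus`` from ``in_format`` to ``out_format``.

    By default, corpy.udpipe.load and corpy.udpipe.dump are used.
    """
    if load is None or dump is None:
        # Imported lazily so that defining the helper does not require corpy.
        import corpy.udpipe as udpipe
        load = load or udpipe.load
        dump = dump or udpipe.dump
    sents = load(corpus, in_format)
    # Keep every chunk: the last one may hold closing markup.
    return "".join(dump(sents, out_format))
```

Joining all chunks, rather than dropping the last one, matters for formats that need a closing element to be well-formed.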
- corpy.udpipe.pprint(obj)
Pretty-print an object.
This is a convenience wrapper over IPython.lib.pretty.pprint() for easier importing.
- corpy.udpipe.pprint_config(*, digest=True)
Configure pretty-printing of ufal.udpipe objects.
- Parameters:
digest (bool) – Show only attributes with interesting values (other than '' or -1).