corpy.udpipe

Tokenizing, tagging and parsing text with UDPipe.

class corpy.udpipe.Model(model_path)

A UDPipe model for tagging and parsing text.

Parameters

model_path (str) – Path to the pre-compiled UDPipe model to load.

process(text, *, tag=True, parse=True, in_format=None, out_format=None)

Process input text, yielding sentences one by one.

The text is always at least tokenized, and optionally morphologically tagged and syntactically parsed, depending on the values of the tag and parse arguments.

Parameters
  • text (str) – Text to process.

  • tag (bool) – Perform morphological tagging.

  • parse (bool) – Perform syntactic parsing.

  • in_format (None or str) – Input format (see below for possible values).

  • out_format (None or str) – Output format (see below for possible values).

The input text is a string in one of the following formats (specified by in_format):

  • None: freeform text, which will be split into sentences and tokenized by UDPipe

  • "conllu": the CoNLL-U format

  • "horizontal": one sentence per line, word forms separated by spaces

  • "vertical": one word per line, empty lines denote sentence ends
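
When text has already been tokenized elsewhere, it can be serialized into the "horizontal" or "vertical" layout described above before being handed to process(). The following is a minimal sketch, assuming sentences are held as lists of word-form strings; the helper names to_horizontal and to_vertical are hypothetical and not part of corpy.

```python
def to_horizontal(sentences):
    """Serialize sentences as one line each, word forms separated by spaces."""
    return "".join(" ".join(words) + "\n" for words in sentences)


def to_vertical(sentences):
    """Serialize sentences as one word per line, with an empty line after each sentence."""
    return "".join("\n".join(words) + "\n\n" for words in sentences)


# Hypothetical pre-tokenized input: a list of sentences, each a list of word forms.
sentences = [["Hello", "world", "!"], ["Bye", "."]]
horizontal = to_horizontal(sentences)  # intended for in_format="horizontal"
vertical = to_vertical(sentences)      # intended for in_format="vertical"
```

Because the horizontal layout separates word forms with spaces, it is only suitable when no word form itself contains a space.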

The output format is specified by out_format:

  • None: native ufal.udpipe objects, suitable for further manipulation in Python

  • "conllu", "horizontal" or "vertical": the same formats as described for in_format above

  • "epe": the EPE (Extrinsic Parser Evaluation 2017) interchange format

  • "matxin": the Matxin XML format

  • "plaintext": reconstruct text with original spaces, discarding annotations
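
As an illustration of the string-based output formats, the sketch below collects the CoNLL-U serialization of a text into a single string. It assumes that, when out_format is a format name, each item yielded by process() is a string; the function name annotate_as_conllu and the model path in the comments are hypothetical.

```python
def annotate_as_conllu(model, text):
    """Return the CoNLL-U serialization of `text` as one string.

    `model` is expected to be a corpy.udpipe.Model instance. Assumption:
    with out_format="conllu", each item yielded by process() is a string.
    """
    return "".join(model.process(text, out_format="conllu"))


# Possible use (needs corpy and a model file; the path is a placeholder):
# from corpy.udpipe import Model
# model = Model("path/to/model.udpipe")
# print(annotate_as_conllu(model, "Je zima. Bude sněžit."))
```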

New input and output formats may be added with new releases of UDPipe; for an up-to-date list, consult the UDPipe API reference.
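
With the default out_format=None, the yielded sentences are native ufal.udpipe objects. The sketch below shows one way to pull word forms and universal POS tags out of them; it relies on the words, id, form and upostag attributes of the ufal.udpipe sentence and word classes, and the function name forms_and_tags is hypothetical.

```python
def forms_and_tags(model, text):
    """Return one list of (form, upostag) pairs per sentence in `text`.

    `model` is expected to be a corpy.udpipe.Model instance.
    """
    result = []
    # parse=False skips syntactic parsing; tagging stays on by default.
    for sentence in model.process(text, parse=False):
        pairs = [
            (word.form, word.upostag)
            for word in sentence.words
            if word.id != 0  # skip the technical root node UDPipe puts first
        ]
        result.append(pairs)
    return result


# Possible use (needs corpy and a model file; the path is a placeholder):
# from corpy.udpipe import Model
# print(forms_and_tags(Model("path/to/model.udpipe"), "Je zima."))
```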

exception corpy.udpipe.UdpipeError

An error which occurred in the ufal.udpipe C extension.

corpy.udpipe.pprint(obj)

Pretty-print an object.

This is a convenience wrapper over IPython.lib.pretty.pprint() for easier importing.

corpy.udpipe.pprint_config(*, digest=True)

Configure pretty-printing of ufal.udpipe objects.

Parameters

digest (bool) – Show only attributes with informative values (i.e. values other than '' or -1).
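
To make the effect of digest concrete, the following purely illustrative sketch filters a dictionary of attribute values the way the description above suggests. It does not reproduce corpy's implementation; the function digest_attrs and the sample attribute dictionary are hypothetical.

```python
def digest_attrs(attrs, digest=True):
    """Drop attributes whose value is '' or -1 when `digest` is true."""
    if not digest:
        return dict(attrs)
    return {
        name: value
        for name, value in attrs.items()
        if value != "" and value != -1
    }


# Hypothetical attribute values of a tagged but unparsed word.
word = {"id": 1, "form": "Je", "lemma": "být", "misc": "", "head": -1}
shown = digest_attrs(word)
```

Here shown keeps only id, form and lemma, since misc is empty and head is -1.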