corpy.udpipe#

Tokenizing, tagging and parsing text with UDPipe.

exception corpy.udpipe.UdpipeError#

An error which occurred in the ufal.udpipe C extension.

class corpy.udpipe.Model(model_path)#

A UDPipe model for tagging and parsing text.

Parameters:

model_path (str or pathlib.Path) – Path to the pre-compiled UDPipe model to load.

process(text, *, tag=True, parse=True, in_format=None, out_format=None)#

Process input text, yielding sentences one by one.

The text is always at least tokenized, and optionally morphologically tagged and syntactically parsed, depending on the values of the tag and parse arguments.

Parameters:
  • text (str) – Text to process.

  • tag (bool) – Perform morphological tagging.

  • parse (bool) – Perform syntactic parsing.

  • in_format (None or str) – Input format (see below for possible values).

  • out_format (None or str) – Output format (see below for possible values).

The input text is a string in one of the following formats (specified by in_format):

  • None: freeform text, which will be sentence split and tokenized by UDPipe

  • "conllu": the CoNLL-U format

  • "horizontal": one sentence per line, word forms separated by spaces

  • "vertical": one word per line, empty lines denote sentence ends
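To make the difference between the two plain-text layouts concrete, the following minimal sketch converts "horizontal" input into "vertical" input using only the standard library. The helper name horizontal_to_vertical is hypothetical and not part of corpy, and the sketch assumes that word forms contain no spaces.

```python
def horizontal_to_vertical(text):
    """Convert "horizontal" text (one sentence per line) to "vertical" text."""
    # Each non-empty line is one sentence; word forms are separated by spaces.
    sentences = [line.split() for line in text.splitlines() if line.strip()]
    # One word per line, with an empty line closing every sentence.
    return "".join("\n".join(words) + "\n\n" for words in sentences)


horizontal = "The dog barks .\nIt is loud .\n"
vertical = horizontal_to_vertical(horizontal)
print(vertical)
```

Either string could then be passed as text, with in_format set to "horizontal" or "vertical" respectively.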

The output format is specified by out_format:

  • None: native ufal.udpipe objects, suitable for further manipulation in Python

  • "conllu", "horizontal" or "vertical": the same formats as described for in_format above

  • "epe": the EPE (Extrinsic Parser Evaluation 2017) interchange format

  • "matxin": the Matxin XML format

  • "plaintext": reconstruct text with original spaces, discarding annotations

New input and output formats may be added with new releases of UDPipe; for an up-to-date list, consult the UDPipe API reference.
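As an illustration, the sketch below wraps process() to turn freeform text into a single CoNLL-U string. The helper name to_conllu is hypothetical, and the sketch assumes that with a non-None out_format each yielded item is a string, as documented for dump(). The model is passed in as an argument so that the helper itself does not depend on any particular model file.

```python
def to_conllu(model, text):
    """Tokenize, tag and parse freeform text; return one CoNLL-U string.

    `model` is expected to be a corpy.udpipe.Model instance.
    """
    # in_format is left at None, so UDPipe splits sentences and tokenizes.
    chunks = model.process(text, tag=True, parse=True, out_format="conllu")
    # Assumption: every chunk is a string, so joining gives the full document.
    return "".join(chunks)


# Usage (requires corpy and a UDPipe model; the path is a placeholder):
#   from corpy.udpipe import Model
#   model = Model("path/to/model.udpipe")
#   print(to_conllu(model, "The dog barks. It is loud."))
```

Leaving out_format at None instead would yield ufal.udpipe.Sentence objects, which suits further processing in Python better than a serialized string.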

tag(sent)#

Perform morphological tagging on a sentence.

Modifies sent in place.

Parameters:

sent (ufal.udpipe.Sentence) – Sentence to tag.

parse(sent)#

Perform syntactic parsing on a sentence.

Modifies sent in place.

Parameters:

sent (ufal.udpipe.Sentence) – Sentence to parse.
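Because tag() and parse() modify sentences in place, they can be combined with process(tag=False, parse=False) to annotate selectively. The following is a minimal sketch under that assumption; the helper name tag_all_parse_some and the should_parse predicate are hypothetical and only serve the illustration.

```python
def tag_all_parse_some(model, text, should_parse):
    """Tokenize and tag every sentence, but parse only the selected ones.

    `model` is expected to be a corpy.udpipe.Model instance and
    `should_parse` a callable taking a sentence and returning a bool.
    """
    annotated = []
    # Only tokenize here; tagging and parsing are applied manually below.
    for sent in model.process(text, tag=False, parse=False):
        model.tag(sent)  # modifies sent in place
        if should_parse(sent):
            model.parse(sent)  # modifies sent in place
        annotated.append(sent)
    return annotated


# Usage (requires corpy and a UDPipe model; the path is a placeholder):
#   from corpy.udpipe import Model
#   model = Model("path/to/model.udpipe")
#   sents = tag_all_parse_some(model, "Some text.", lambda s: len(s.words) <= 40)
```

Skipping the parser for some sentences, for instance very long ones, is one possible reason to split the steps this way.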

corpy.udpipe.load(corpus, in_format='conllu')#

Load a corpus from a string in the given input format.

Parameters:
  • corpus (str) – The data to load.

  • in_format (str) – Input format; see the documentation of Model.process() for possible values.

Returns:

A generator of sentences (ufal.udpipe.Sentence).
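The sentences produced by load() can be inspected like any other ufal.udpipe.Sentence. The sketch below extracts (form, upostag) pairs from them; the helper name forms_and_tags is hypothetical, and the code relies on the ufal.udpipe convention that the first item of sent.words is a technical root rather than a real word.

```python
from itertools import islice


def forms_and_tags(sents):
    """Return one list of (form, upostag) pairs per sentence.

    `sents` is expected to be an iterable of ufal.udpipe.Sentence objects,
    e.g. the generator returned by corpy.udpipe.load().
    """
    result = []
    for sent in sents:
        # Skip the first item, assumed to be the technical root.
        words = islice(sent.words, 1, None)
        result.append([(word.form, word.upostag) for word in words])
    return result


# Usage (requires corpy; conllu_string is a placeholder for CoNLL-U data):
#   from corpy.udpipe import load
#   pairs = forms_and_tags(load(conllu_string, in_format="conllu"))
```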

corpy.udpipe.dump(sent_or_sents, out_format='conllu')#

Serialize a sentence or sentences in the given output format.

Parameters:
  • sent_or_sents – The data to dump: a single sentence or several sentences.

  • out_format (str) – Output format; see the documentation of Model.process() for possible values.

Returns:

A generator of strings, corresponding to the serialized sentences. One final additional string may contain any closing markup, if required by the output format.
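Since any closing markup arrives as just one more string, writing all yielded strings in order should give a complete document. The sketch below shows this with a hypothetical helper, write_serialized; the demo chunks are made-up placeholders standing in for the output of dump(), not real UDPipe output.

```python
import io


def write_serialized(chunks, out):
    """Write every chunk to a text file object; return the number written."""
    count = 0
    for chunk in chunks:
        out.write(chunk)
        count += 1
    return count


# Placeholder chunks for demonstration only; with corpy they would come
# from something like corpy.udpipe.dump(sents, out_format="matxin").
demo_chunks = ["sentence 1\n", "sentence 2\n", "closing markup\n"]
buffer = io.StringIO()
written = write_serialized(demo_chunks, buffer)
print(buffer.getvalue(), end="")
```

The same helper works with a file opened in text mode, which avoids building the whole serialized corpus in memory.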

corpy.udpipe.pprint(obj)#

Pretty-print an object.

This is a convenience wrapper over IPython.lib.pretty.pprint() for easier importing.

corpy.udpipe.pprint_config(*, digest=True)#

Configure pretty-printing of ufal.udpipe objects.

Parameters:

digest (bool) – Show only attributes with interesting values (i.e. other than '' or -1).