Tag and parse text with UDPipe

NOTE: When playing around with UDPipe interactively, it’s highly recommended to use IPython or a Jupyter notebook. You’ll automatically get nice pretty-printing.

Overview

UDPipe is a fast and convenient library for stochastic morphological tagging (including lemmatization) and syntactic parsing of text. corpy.udpipe aims to give easy access to the most commonly used features of the library; for more advanced use cases, you might need to use the lower-level ufal.udpipe package, on top of which this module is built.

In order to use UDPipe, you need a pre-trained model for your language of interest. Models are available for many languages; for more information, refer to the UDPipe website. When using the models, please make sure to respect their CC BY-NC-SA license!

In order to better understand how UDPipe represents tagged and parsed text, it is useful to familiarize yourself with the CoNLL-U data format. UDPipe data structures (sentences, words, multi-word tokens, empty nodes, comments) map onto concepts defined in this format.
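As a rough illustration of how these concepts show up in a CoNLL-U file, the sketch below classifies lines by their first column: an integer ID marks a word, a range such as 1-2 marks a multi-word token, a decimal such as 5.1 marks an empty node, and a leading # marks a comment. The helper name classify_conllu_line is hypothetical (it is not part of corpy or UDPipe), and the sample rows are made up for illustration.

```python
import re


def classify_conllu_line(line):
    """Return the kind of a CoNLL-U line (hypothetical helper, not part of corpy)."""
    line = line.rstrip("\n")
    if not line:
        return "sentence boundary"  # a blank line ends a sentence
    if line.startswith("#"):
        return "comment"
    token_id = line.split("\t", 1)[0]  # the ID is the first tab-separated column
    if re.fullmatch(r"\d+", token_id):
        return "word"
    if re.fullmatch(r"\d+-\d+", token_id):
        return "multi-word token"
    if re.fullmatch(r"\d+\.\d+", token_id):
        return "empty node"
    raise ValueError(f"unrecognized CoNLL-U line: {line!r}")


# Made-up sample rows, one per kind of line.
lines = [
    "# text = Je zima.",
    "1\tJe\tbýt\tVERB\t_\t_\t0\troot\t_\t_",
    "1-2\tvámonos\t_\t_\t_\t_\t_\t_\t_\t_",
    "5.1\tlikes\tlike\tVERB\t_\t_\t_\t_\t_\t_",
    "",
]
kinds = [classify_conllu_line(line) for line in lines]
print(kinds)
```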

In addition to this guide, there is also an API reference for corpy.udpipe. For an overview of the API of the underlying ufal.udpipe objects (listing available attributes and methods), see the documentation of the ufal.udpipe package.

Processing text

Tagging and parsing text using UDPipe is fairly simple. Just load a UDPipe Model:

>>> from corpy.udpipe import Model
>>> m = Model("./czech-pdt-ud-2.4-190531.udpipe")

And process some text using the process() method (the method returns a generator, so you’ll need e.g. list() to collect all of the sentences it yields):

>>> sents = list(m.process("Je zima. Bude sněžit."))
>>> sents
[<Swig Object of type 'sentence *' at 0x...>, <Swig Object of type 'sentence *' at 0x...>]

Ouch. This output is not really helpful. This is why IPython or Jupyter is recommended: at a regular Python REPL, the output of UDPipe is rendered as opaque Swig objects.

However, if the IPython package is at least installed, you can explicitly pretty-print the output using the pprint() function:

>>> from corpy.udpipe import pprint
>>> pprint(sents)
[Sentence(
   comments=['# newdoc', '# newpar', '# sent_id = 1', '# text = Je zima.'],
   words=[
     Word(id=0, <root>),
     Word(id=1,
          form='Je',
          lemma='být',
          xpostag='VB-S---3P-AA---',
          upostag='VERB',
          feats='Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act',
          head=0,
          deprel='root'),
     Word(id=2,
          form='zima',
          lemma='zima',
          xpostag='NNFS1-----A----',
          upostag='NOUN',
          feats='Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos',
          head=1,
          deprel='nsubj',
          misc='SpaceAfter=No'),
     Word(id=3,
          form='.',
          lemma='.',
          xpostag='Z:-------------',
          upostag='PUNCT',
          head=1,
          deprel='punct')]),
 Sentence(
   comments=['# sent_id = 2', '# text = Bude sněžit.'],
   words=[
     Word(id=0, <root>),
     Word(id=1,
          form='Bude',
          lemma='být',
          xpostag='VB-S---3F-AA---',
          upostag='AUX',
          feats='Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin|Voice=Act',
          head=2,
          deprel='aux'),
     Word(id=2,
          form='sněžit',
          lemma='sněžit',
          xpostag='Vf--------A----',
          upostag='VERB',
          feats='Aspect=Imp|Polarity=Pos|VerbForm=Inf',
          head=0,
          deprel='root',
          misc='SpaceAfter=No'),
     Word(id=3,
          form='.',
          lemma='.',
          xpostag='Z:-------------',
          upostag='PUNCT',
          head=2,
          deprel='punct',
          misc='SpaceAfter=No')])]

Much better! And again, calling pprint(sents) is not necessary when using IPython or Jupyter: you can just evaluate sents and it will be pretty-printed automatically.
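Pretty-printing is mostly for inspection; in a program you would typically read the attributes directly. The following is a minimal sketch that relies only on the attribute names visible in the output above (words, id, form, head, deprel) and assumes that a word's position in sentence.words equals its id, as the output suggests. The function name dependency_edges is hypothetical, and SimpleNamespace stand-ins replace real UDPipe objects so that the sketch runs without a model.

```python
from types import SimpleNamespace


def dependency_edges(sentence):
    """Yield (dependent form, relation, head form) for each non-root word."""
    words = sentence.words
    for word in words:
        if word.id == 0:
            continue  # skip the technical <root> node
        # Assumption: words[i].id == i, so head can be used as an index.
        yield word.form, word.deprel, words[word.head].form


def _word(id, form, head, deprel):
    # Stand-in for a UDPipe word, carrying only the attributes used above.
    return SimpleNamespace(id=id, form=form, head=head, deprel=deprel)


# Stand-in for the first sentence shown above; with a real model you would
# pass an element of list(m.process(...)) instead.
sent = SimpleNamespace(words=[
    _word(0, "<root>", -1, ""),
    _word(1, "Je", 0, "root"),
    _word(2, "zima", 1, "nsubj"),
    _word(3, ".", 1, "punct"),
])

edges = list(dependency_edges(sent))
print(edges)
```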

Pretty-printing options

The output of UDPipe can be quite verbose – the individual objects have many fields. However, some values are not really that interesting (e.g. the empty string for string attributes, or -1 for integer attributes). Therefore, they are hidden by the pretty-printer by default, so as to make the output more concise.

Sometimes though, you might want exhaustive pretty-printing, e.g. to learn about all of the possible attributes, even if your output happens to have no useful values in them. In order to do that, disable the digest option using the pprint_config() function:

>>> from corpy.udpipe import pprint_config
>>> pprint_config(digest=False)
>>> pprint(sents)
[Sentence(
   comments=['# newdoc', '# newpar', '# sent_id = 1', '# text = Je zima.'],
   words=[
     Word(id=0,
          form='<root>',
          lemma='<root>',
          xpostag='<root>',
          upostag='<root>',
          feats='<root>',
          head=-1,
          deprel='',
          deps='',
          misc=''),
     Word(id=1,
          form='Je',
          lemma='být',
          xpostag='VB-S---3P-AA---',
          upostag='VERB',
          feats='Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act',
          head=0,
          deprel='root',
          deps='',
          misc=''),
     Word(id=2,
          form='zima',
          lemma='zima',
          xpostag='NNFS1-----A----',
          upostag='NOUN',
          feats='Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos',
          head=1,
          deprel='nsubj',
          deps='',
          misc='SpaceAfter=No'),
     Word(id=3,
          form='.',
          lemma='.',
          xpostag='Z:-------------',
          upostag='PUNCT',
          feats='',
          head=1,
          deprel='punct',
          deps='',
          misc='')],
   multiwordTokens=[],
   emptyNodes=[]),
 Sentence(
   comments=['# sent_id = 2', '# text = Bude sněžit.'],
   words=[
     Word(id=0,
          form='<root>',
          lemma='<root>',
          xpostag='<root>',
          upostag='<root>',
          feats='<root>',
          head=-1,
          deprel='',
          deps='',
          misc=''),
     Word(id=1,
          form='Bude',
          lemma='být',
          xpostag='VB-S---3F-AA---',
          upostag='AUX',
          feats='Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin|Voice=Act',
          head=2,
          deprel='aux',
          deps='',
          misc=''),
     Word(id=2,
          form='sněžit',
          lemma='sněžit',
          xpostag='Vf--------A----',
          upostag='VERB',
          feats='Aspect=Imp|Polarity=Pos|VerbForm=Inf',
          head=0,
          deprel='root',
          deps='',
          misc='SpaceAfter=No'),
     Word(id=3,
          form='.',
          lemma='.',
          xpostag='Z:-------------',
          upostag='PUNCT',
          feats='',
          head=2,
          deprel='punct',
          deps='',
          misc='SpaceAfter=No')],
   multiwordTokens=[],
   emptyNodes=[])]
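Comparing the two outputs, the digest view appears to drop fields whose value is an empty string, -1 or an empty list. The sketch below approximates that filtering on a plain dictionary. It is an illustration of the observed behaviour, not the pretty-printer's actual implementation, and it does not reproduce the special Word(id=0, <root>) shorthand. The function name digest is hypothetical.

```python
def digest(fields):
    """Return a copy of fields without values that carry no information."""
    uninteresting = ("", -1, [])
    return {
        name: value
        for name, value in fields.items()
        if value not in uninteresting
    }


# All fields of the final "." word of the first sentence, as shown above.
full = {
    "id": 3,
    "form": ".",
    "lemma": ".",
    "xpostag": "Z:-------------",
    "upostag": "PUNCT",
    "feats": "",
    "head": 1,
    "deprel": "punct",
    "deps": "",
    "misc": "",
}
short = digest(full)
print(short)
```

Note that a head of 0 is kept, since 0 is a meaningful value (the word attaches to the root).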

Input and output formats

UDPipe supports a variety of input and output formats. For convenience, they are listed in the documentation of the corpy.udpipe.Model.process() method, but the most up-to-date reference list is always available in the UDPipe API docs.

One format which is particularly useful is the CoNLL-U format: it’s the format of the Universal Dependencies project, and as such, it’s intimately associated with UDPipe, which is also part of the project. Reading up on the CoNLL-U format can help you better understand how UDPipe represents tagged and parsed text, especially some of the less straightforward features (e.g. multi-word tokens and empty nodes).

Say you have a small two-sentence corpus in the “horizontal” format (one sentence per line, words separated by spaces), and you want to tag it, parse it, and output it in the CoNLL-U format. You can do it like so:

>>> horizontal = """Je zima.
... Bude sněžit."""
>>> conllu_sents = list(m.process(horizontal, in_format="horizontal", out_format="conllu"))
>>> conllu_sents
['# newdoc\n# newpar\n# sent_id = 1\n1\tJe\tbýt\tVERB\tVB-S---3P-AA---\tMood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act\t0\troot\t_\t_\n2\tzima.\tzima.\tPUNCT\tZ:-------------\t_\t1\tpunct\t_\t_\n\n', '# sent_id = 2\n1\tBude\tbýt\tVERB\tVB-S---3F-AA---\tMood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin|Voice=Act\t0\troot\t_\t_\n2\tsněžit.\tsněžit.\tPUNCT\tZ:-------------\t_\t1\tpunct\t_\t_\n\n']

That’s a bit messy, but trust me that conllu_sents is just a list of two strings, each string representing one sentence. Or, if you don’t trust me:

>>> len(conllu_sents)
2
>>> [type(x) for x in conllu_sents]
[<class 'str'>, <class 'str'>]
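Because each element is an ordinary string, it can be picked apart with plain string methods. The following is a minimal sketch, assuming well-formed input with exactly ten tab-separated columns per word line. The function name parse_conllu_sentence is hypothetical, and the column names follow the CoNLL-U specification (upos, xpos) rather than the attribute names used by UDPipe objects (upostag, xpostag).

```python
# The ten tab-separated columns of a CoNLL-U word line, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu_sentence(text):
    """Split one CoNLL-U sentence string into (comments, rows)."""
    comments, rows = [], []
    for line in text.splitlines():
        if not line:
            continue  # blank line terminating the sentence
        if line.startswith("#"):
            comments.append(line)
        else:
            rows.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    return comments, rows


# Copy of the second string in conllu_sents above.
sample = (
    "# sent_id = 2\n"
    "1\tBude\tbýt\tVERB\tVB-S---3F-AA---\t"
    "Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin|Voice=Act"
    "\t0\troot\t_\t_\n"
    "2\tsněžit.\tsněžit.\tPUNCT\tZ:-------------\t_\t1\tpunct\t_\t_\n"
    "\n"
)
comments, rows = parse_conllu_sentence(sample)
print(comments)
print([(row["form"], row["upos"], row["deprel"]) for row in rows])
```

All values stay strings here, including id and head; convert them with int() if you need numbers.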

To give you an idea of the format, let’s just join the sentences and print them out:

>>> print("".join(conllu_sents), end="")
# newdoc
# newpar
# sent_id = 1
1    Je      být     VERB    VB-S---3P-AA--- Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act    0       root    _       _
2    zima.   zima.   PUNCT   Z:------------- _       1       punct   _       _

# sent_id = 2
1    Bude    být     VERB    VB-S---3F-AA--- Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin|Voice=Act     0       root    _       _
2    sněžit. sněžit. PUNCT   Z:------------- _       1       punct   _       _
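
Notice that in the output above, zima. and sněžit. each come out as a single token: in the horizontal format, whatever is separated by spaces counts as one word, so punctuation has to be split off beforehand. If your text is not tokenized yet, the default input format used in the first example lets UDPipe do the tokenizing. Otherwise, a naive regex-based stand-in for a real tokenizer might look like the sketch below; the function name to_horizontal is hypothetical.

```python
import re


def to_horizontal(sentences):
    """Build horizontal-format input, splitting punctuation off words."""
    lines = []
    for sentence in sentences:
        # \w+ matches a run of word characters, [^\w\s] a single
        # punctuation mark; whitespace is discarded.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        lines.append(" ".join(tokens))
    return "\n".join(lines)


horizontal = to_horizontal(["Je zima.", "Bude sněžit."])
print(horizontal)
```

The resulting string has one sentence per line with the final period as a separate token, and can be passed to process() with in_format="horizontal" as shown earlier.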