==============================
Tag and parse text with UDPipe
==============================

**NOTE:** When playing around with UDPipe interactively, it's highly
recommended to use IPython_ or a Jupyter_ notebook. You'll automatically get
nice pretty-printing.

.. _IPython: https://ipython.org/
.. _Jupyter: https://jupyter.org/

Overview
========

UDPipe_ is a fast and convenient library for stochastic morphological tagging
(including lemmatization) and syntactic parsing of text. The
:mod:`corpy.udpipe` module aims to give easy access to the most commonly used
features of the library; for more advanced use cases, including speedups in
performance-critical code, you might need to use the lower-level ufal.udpipe_
package, on top of which this module is built.

.. _UDPipe: http://ufal.mff.cuni.cz/udpipe
.. _ufal.udpipe: https://pypi.org/project/ufal.udpipe/

In order to use UDPipe, you need a pre-trained model for your language of
interest. Models are available for many languages; for more information, refer
to the UDPipe_ website. **When using the models, please make sure to respect
their CC BY-NC-SA license!**

In order to better understand how UDPipe represents tagged and parsed text, it
is useful to familiarize yourself with the CoNLL-U_ data format. UDPipe data
structures (sentences, words, multi-word tokens, empty nodes, comments) map
onto concepts defined in this format.

.. _CoNLL-U: https://universaldependencies.org/format.html

In addition to this guide, there is also an API reference for
:mod:`corpy.udpipe`. For an overview of the API of the underlying
:mod:`ufal.udpipe` objects (listing available attributes and methods), see the
UDPipe API documentation.

Processing text
===============

Tagging and parsing text using UDPipe is fairly simple. Just load a UDPipe
:class:`~corpy.udpipe.Model`:

.. code:: python

   >>> from corpy.udpipe import Model
   >>> m = Model("./czech-pdt-ud-2.4-190531.udpipe")

Then process some text using the :meth:`~corpy.udpipe.Model.process` method.
The method returns a generator, so you'll need e.g. :func:`list` to tease all
of the elements out of it:

.. code:: python

   >>> sents = list(m.process("Je zima. Bude sněžit."))
   >>> sents
   [<Swig Object of type 'sentence *' at 0x...>, <Swig Object of type 'sentence *' at 0x...>]

Ouch. This output is not really helpful. This is why it's recommended to use
IPython_ or Jupyter_: at a regular Python REPL, the output of UDPipe is
rendered as opaque Swig_ objects.

.. _Swig: http://www.swig.org/

However, as long as the IPython package is installed, you can explicitly
pretty-print the output using the :func:`~corpy.udpipe.pprint` function:

.. code:: python

   >>> from corpy.udpipe import pprint
   >>> pprint(sents)
   [Sentence(
      comments=['# newdoc', '# newpar', '# sent_id = 1', '# text = Je zima.'],
      words=[
        Word(id=0, <root>),
        Word(id=1,
             form='Je',
             lemma='být',
             xpostag='VB-S---3P-AA---',
             upostag='VERB',
             feats='Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act',
             head=0,
             deprel='root'),
        Word(id=2,
             form='zima',
             lemma='zima',
             xpostag='NNFS1-----A----',
             upostag='NOUN',
             feats='Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos',
             head=1,
             deprel='nsubj',
             misc='SpaceAfter=No'),
        Word(id=3,
             form='.',
             lemma='.',
             xpostag='Z:-------------',
             upostag='PUNCT',
             head=1,
             deprel='punct')]),
    Sentence(
      comments=['# sent_id = 2', '# text = Bude sněžit.'],
      words=[
        Word(id=0, <root>),
        Word(id=1,
             form='Bude',
             lemma='být',
             xpostag='VB-S---3F-AA---',
             upostag='AUX',
             feats='Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin|Voice=Act',
             head=2,
             deprel='aux'),
        Word(id=2,
             form='sněžit',
             lemma='sněžit',
             xpostag='Vf--------A----',
             upostag='VERB',
             feats='Aspect=Imp|Polarity=Pos|VerbForm=Inf',
             head=0,
             deprel='root',
             misc='SpaceAfter=No'),
        Word(id=3,
             form='.',
             lemma='.',
             xpostag='Z:-------------',
             upostag='PUNCT',
             head=2,
             deprel='punct',
             misc='SpaceAfter=No')])]

Much better!
And again, calling ``pprint(sents)`` is not necessary when using IPython_ or
Jupyter_; you can just evaluate ``sents`` and it will be pretty-printed
automatically.

Pretty-printing options
=======================

The output of UDPipe can be quite verbose -- the individual objects have many
fields. However, some values are not really that interesting (e.g. the empty
string for string attributes, or ``-1`` for integer attributes). Therefore,
they are hidden by the pretty-printer by default, so as to make the output more
concise.

Sometimes though, you might want exhaustive pretty-printing, e.g. to learn
about all of the possible attributes, even if your output doesn't happen to
have any useful values in them. In order to do that, disable the ``digest``
option using the :func:`~corpy.udpipe.pprint_config` function:

.. code:: python

   >>> from corpy.udpipe import pprint_config
   >>> pprint_config(digest=False)
   >>> pprint(sents)
   [Sentence(
      comments=['# newdoc', '# newpar', '# sent_id = 1', '# text = Je zima.'],
      words=[
        Word(id=0,
             form='',
             lemma='',
             xpostag='',
             upostag='',
             feats='',
             head=-1,
             deprel='',
             deps='',
             misc=''),
        Word(id=1,
             form='Je',
             lemma='být',
             xpostag='VB-S---3P-AA---',
             upostag='VERB',
             feats='Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act',
             head=0,
             deprel='root',
             deps='',
             misc=''),
        Word(id=2,
             form='zima',
             lemma='zima',
             xpostag='NNFS1-----A----',
             upostag='NOUN',
             feats='Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos',
             head=1,
             deprel='nsubj',
             deps='',
             misc='SpaceAfter=No'),
        Word(id=3,
             form='.',
             lemma='.',
             xpostag='Z:-------------',
             upostag='PUNCT',
             feats='',
             head=1,
             deprel='punct',
             deps='',
             misc='')],
      multiwordTokens=[],
      emptyNodes=[]),
    Sentence(
      comments=['# sent_id = 2', '# text = Bude sněžit.'],
      words=[
        Word(id=0,
             form='',
             lemma='',
             xpostag='',
             upostag='',
             feats='',
             head=-1,
             deprel='',
             deps='',
             misc=''),
        Word(id=1,
             form='Bude',
             lemma='být',
             xpostag='VB-S---3F-AA---',
             upostag='AUX',
             feats='Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin|Voice=Act',
             head=2,
             deprel='aux',
             deps='',
             misc=''),
        Word(id=2,
             form='sněžit',
             lemma='sněžit',
             xpostag='Vf--------A----',
             upostag='VERB',
             feats='Aspect=Imp|Polarity=Pos|VerbForm=Inf',
             head=0,
             deprel='root',
             deps='',
             misc='SpaceAfter=No'),
        Word(id=3,
             form='.',
             lemma='.',
             xpostag='Z:-------------',
             upostag='PUNCT',
             feats='',
             head=2,
             deprel='punct',
             deps='',
             misc='SpaceAfter=No')],
      multiwordTokens=[],
      emptyNodes=[])]

Let's turn ``digest`` back on to save space below.

.. code:: python

   >>> pprint_config(digest=True)

Input and output formats
========================

UDPipe supports a variety of input and output formats. For convenience, they
are listed in the documentation of the :meth:`corpy.udpipe.Model.process`
method, but the most up-to-date reference list is always available in the
UDPipe API docs.

One format which is particularly useful is the CoNLL-U_ format: it's the format
of the UniversalDependencies_ project, and as such, it's intimately associated
with UDPipe, which is also part of the project. Reading up on the CoNLL-U_
format can help you better understand how UDPipe represents tagged and parsed
text, especially some of the less straightforward features (e.g. multi-word
tokens and empty nodes).

.. _UniversalDependencies: https://universaldependencies.org

Say you have a small two-sentence corpus in the "horizontal" format (one
sentence per line, words separated by spaces), and you want to tag it, parse
it, and output it in the CoNLL-U format. You can do it like so:

.. code:: python

   >>> horizontal = """Je zima .
   ... Bude sněžit ."""
   >>> conllu_sents = list(m.process(horizontal, in_format="horizontal", out_format="conllu"))
   >>> conllu_sents
   ['# newdoc\n# newpar\n# sent_id = 1\n1\tJe\tbýt\tVERB\tVB-S---3P-AA---\tMood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act\t0\troot\t_\t_\n2\tzima\tzima\tNOUN\tNNFS1-----A----\tCase=Nom|Gender=Fem|Number=Sing|Polarity=Pos\t1\tnsubj\t_\t_\n3\t.\t.\tPUNCT\tZ:-------------\t_\t1\tpunct\t_\t_\n\n', '# sent_id = 2\n1\tBude\tbýt\tAUX\tVB-S---3F-AA---\tMood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin|Voice=Act\t2\taux\t_\t_\n2\tsněžit\tsněžit\tVERB\tVf--------A----\tAspect=Imp|Polarity=Pos|VerbForm=Inf\t0\troot\t_\t_\n3\t.\t.\tPUNCT\tZ:-------------\t_\t2\tpunct\t_\t_\n\n']

That's a bit messy, but trust me that ``conllu_sents`` is just a list of two
strings, each string representing one sentence. Or, if you don't trust me:

.. code:: python

   >>> len(conllu_sents)
   2
   >>> [type(x) for x in conllu_sents]
   [<class 'str'>, <class 'str'>]

To give you an idea of the format, let's just join the sentences and print them
out:

.. code:: python

   >>> print("".join(conllu_sents), end="")  # doctest: +NORMALIZE_WHITESPACE
   # newdoc
   # newpar
   # sent_id = 1
   1 Je být VERB VB-S---3P-AA--- Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act 0 root _ _
   2 zima zima NOUN NNFS1-----A---- Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos 1 nsubj _ _
   3 . . PUNCT Z:------------- _ 1 punct _ _
   <BLANKLINE>
   # sent_id = 2
   1 Bude být AUX VB-S---3F-AA--- Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin|Voice=Act 2 aux _ _
   2 sněžit sněžit VERB Vf--------A---- Aspect=Imp|Polarity=Pos|VerbForm=Inf 0 root _ _
   3 . . PUNCT Z:------------- _ 2 punct _ _
   <BLANKLINE>

Format conversion
=================

The module can also be used just for loading/dumping data in any of the formats
supported by UDPipe. That's what the :func:`~corpy.udpipe.load` and
:func:`~corpy.udpipe.dump` functions are for. Input and output formats default
to CoNLL-U.

.. code:: python

   >>> from corpy.udpipe import load, dump
   >>> sents = list(load(horizontal, "horizontal"))
   >>> pprint(sents)
   [Sentence(
      comments=['# newdoc', '# newpar', '# sent_id = 1'],
      words=[
        Word(id=0, <root>),
        Word(id=1, form='Je'),
        Word(id=2, form='zima'),
        Word(id=3, form='.')]),
    Sentence(
      comments=['# sent_id = 2'],
      words=[
        Word(id=0, <root>),
        Word(id=1, form='Bude'),
        Word(id=2, form='sněžit'),
        Word(id=3, form='.')])]
   >>> print("".join(dump(sents)), end="")  # doctest: +NORMALIZE_WHITESPACE
   # newdoc
   # newpar
   # sent_id = 1
   1 Je _ _ _ _ _ _ _ _
   2 zima _ _ _ _ _ _ _ _
   3 . _ _ _ _ _ _ _ _
   <BLANKLINE>
   # sent_id = 2
   1 Bude _ _ _ _ _ _ _ _
   2 sněžit _ _ _ _ _ _ _ _
   3 . _ _ _ _ _ _ _ _
   <BLANKLINE>

You can mix and match this with tagging and parsing the data using a
:class:`~corpy.udpipe.Model`, if you prefer this more incremental approach:

.. code:: python

   >>> m.tag(sents[0])
   >>> m.parse(sents[0])
   >>> pprint(sents)
   [Sentence(
      comments=['# newdoc', '# newpar', '# sent_id = 1'],
      words=[
        Word(id=0, <root>),
        Word(id=1,
             form='Je',
             lemma='být',
             xpostag='VB-S---3P-AA---',
             upostag='VERB',
             feats='Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act',
             head=0,
             deprel='root'),
        Word(id=2,
             form='zima',
             lemma='zima',
             xpostag='NNFS1-----A----',
             upostag='NOUN',
             feats='Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos',
             head=1,
             deprel='nsubj'),
        Word(id=3,
             form='.',
             lemma='.',
             xpostag='Z:-------------',
             upostag='PUNCT',
             head=1,
             deprel='punct')]),
    Sentence(
      comments=['# sent_id = 2'],
      words=[
        Word(id=0, <root>),
        Word(id=1, form='Bude'),
        Word(id=2, form='sněžit'),
        Word(id=3, form='.')])]

As you can see, only the first sentence has been tagged and parsed. Note that
the :meth:`~corpy.udpipe.Model.tag` and :meth:`~corpy.udpipe.Model.parse`
methods modify the sentence in place!
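
Since CoNLL-U output is just tab-separated text, it can also be inspected without UDPipe at all. The following is a minimal sketch in plain Python, assuming the ten standard CoNLL-U columns. ``parse_conllu`` and ``CONLLU_FIELDS`` are hypothetical names made up for this illustration and are not part of :mod:`corpy.udpipe`. The sketch does not treat multi-word tokens or empty nodes specially.

```python
# Hypothetical helper for illustration only; NOT part of corpy.udpipe.
# Column names follow the ten-column CoNLL-U layout.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upostag", "xpostag",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu(text):
    """Split CoNLL-U text into a list of sentences.

    Each sentence is a dict with the comment lines and a list of
    per-token dicts mapping column names to raw string values.
    """
    sentences = []
    comments, words = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line terminates the current sentence.
            if comments or words:
                sentences.append({"comments": comments, "words": words})
                comments, words = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            words.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    # Flush a final sentence that is not followed by a blank line.
    if comments or words:
        sentences.append({"comments": comments, "words": words})
    return sentences


# The first sentence from the CoNLL-U output shown earlier in this guide.
sample = (
    "# newdoc\n# newpar\n# sent_id = 1\n"
    "1\tJe\tbýt\tVERB\tVB-S---3P-AA---\t"
    "Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act"
    "\t0\troot\t_\t_\n"
    "2\tzima\tzima\tNOUN\tNNFS1-----A----\t"
    "Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos\t1\tnsubj\t_\t_\n"
    "3\t.\t.\tPUNCT\tZ:-------------\t_\t1\tpunct\t_\t_\n\n"
)

parsed = parse_conllu(sample)
for word in parsed[0]["words"]:
    print(word["id"], word["form"], word["upostag"], word["head"], word["deprel"])
```

All values stay raw strings here (``head`` is ``"0"``, not ``0``), and unset fields keep the ``_`` placeholder. For real work, :func:`~corpy.udpipe.load` is the better choice, because it returns proper sentence objects.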