======================================
Wrangle corpora in the vertical format
======================================
Overview
========
Tools for parsing corpora in the vertical format devised originally for `CWB
`_, used also by `(No)SketchEngine
`_. It would have been nice if verticals
were just standards compliant XML, but they appeared before XML, so they're
not. Hence this.
NOTE: The examples below are currently not tested because they require the
:file:`syn2015.gz` vertical file to be available, which is large and should not
be freely distributed.
.. code:: python
>>> import pytest
>>> pytest.skip("examples not tested")
Iterating over positions in a vertical file
===========================================
This allows you to iterate over all positions while keeping track of the
structural attributes of the structures they're contained within, without
risking errors from hand-coding this logic every time you need it.
.. code:: python
>>> from corpy.vertical import Syn2015Vertical
>>> from pprint import pprint
>>> v = Syn2015Vertical("path/to/syn2015.gz")
>>> for i, position in enumerate(v.positions()):
... if i % 100 == 0:
... # structural attributes of position
... pprint(v.sattrs)
... print()
... # position itself
... pprint(position)
... print()
... elif i > 100:
... break
...
{'doc': {'audience': 'GEN: obecné publikum',
'author': 'Typlt, Jaromír',
'authsex': 'M: muž',
'biblio': 'Typlt, Jaromír (1993): Zápas s rodokmenem. Praha: Pražská '
'imaginace.',
'first_published': '1993',
'genre': 'X: neuvedeno',
'genre_group': 'X: neuvedeno',
'id': 'pi291',
'isbnissn': '80-7110-132-X',
'issue': '',
'medium': 'B: kniha',
'periodicity': 'NP: neperiodická publikace',
'publisher': 'Pražská imaginace',
'pubplace': 'Praha',
'pubyear': '1993',
'srclang': 'cs: čeština',
'subtitle': 'Groteskní mýtus',
'title': 'Zápas s rodokmenem',
'translator': 'X',
'transsex': 'X: neuvedeno',
'txtype': 'NOV: próza',
'txtype_group': 'FIC: beletrie'},
'p': {'id': 'pi291:1:1', 'type': 'normal'},
's': {'id': 'pi291:1:1:1'},
'text': {'author': '', 'id': 'pi291:1', 'section': '', 'section_orig': ''}}
Position(word='ZÁPAS', lemma='zápas', tag=UtklTag(pos='N', sub='N', gen='I', num='S', case='1', pgen='-', pnum='-', pers='-', tense='-', grad='-', neg='A', act='-', p13='-', p14='-', var='-', asp='-'), proc='T', afun='ExD', parent='0', eparent='0', prep='', p_lemma='', p_tag='', p_afun='', ep_lemma='', ep_tag='', ep_afun='')
{'doc': {'audience': 'GEN: obecné publikum',
'author': 'Typlt, Jaromír',
'authsex': 'M: muž',
'biblio': 'Typlt, Jaromír (1993): Zápas s rodokmenem. Praha: Pražská '
'imaginace.',
'first_published': '1993',
'genre': 'X: neuvedeno',
'genre_group': 'X: neuvedeno',
'id': 'pi291',
'isbnissn': '80-7110-132-X',
'issue': '',
'medium': 'B: kniha',
'periodicity': 'NP: neperiodická publikace',
'publisher': 'Pražská imaginace',
'pubplace': 'Praha',
'pubyear': '1993',
'srclang': 'cs: čeština',
'subtitle': 'Groteskní mýtus',
'title': 'Zápas s rodokmenem',
'translator': 'X',
'transsex': 'X: neuvedeno',
'txtype': 'NOV: próza',
'txtype_group': 'FIC: beletrie'},
'p': {'id': 'pi291:1:3', 'type': 'normal'},
's': {'id': 'pi291:1:3:2'},
'text': {'author': '', 'id': 'pi291:1', 'section': '', 'section_orig': ''}}
Position(word='chvil', lemma='chvíle', tag=UtklTag(pos='N', sub='N', gen='F', num='P', case='2', pgen='-', pnum='-', pers='-', tense='-', grad='-', neg='A', act='-', p13='-', p14='-', var='-', asp='-'), proc='M', afun='Atr', parent='-1', eparent='-1', prep='', p_lemma='několik', p_tag='Ca--4-----------', p_afun='Adv', ep_lemma='několik', ep_tag='Ca--4-----------', ep_afun='Adv')
Performing frequency distribution queries
=========================================
This can be done elegantly and fairly quickly with
:meth:`~corpy.vertical.Vertical.search`. All you have to do is provide a match
function, which identifies positions which the query should match, and a count
function, which specifies what should be counted for each match.
The return value is an index of occurrences and the total size of the corpus.
The index is a dictionary of numpy array of position indices within the corpus,
which can be further processed e.g. using :func:`~corpy.vertical.ipm` or
:func:`~corpy.vertical.arf` to compute different types of frequencies.
.. code:: python
>>> from corpy.vertical import Syn2015Vertical, ipm, arf
>>> v = Syn2015Vertical("path/to/syn2015.gz")
# log progress every 50M positions
>>> v.report = 50_000_000
>>> def match(posattrs, sattrs):
... # match all nouns within txtype_group "FIC: beletrie"
... return sattrs["doc"]["txtype_group"] == "FIC: beletrie" and posattrs.tag.pos == "N"
...
>>> def count(posattrs, sattrs):
... # at each matched position, record the txtype and lemma
... return sattrs["doc"]["txtype"], posattrs.lemma
...
>>> index, N = v.search(match, count)
Processed 0 lines in 0:00:00.007382.
Processed 50,000,000 lines in 0:05:58.185566.
Processed 100,000,000 lines in 0:11:35.394294.
**NOTE:** this was run on a desktop workstation, with the data being stored on
a networked filesystem. If the performance of any future versions on a similar
task becomes significantly worse than this ballpark, it should be considered a
bug.
.. code:: python
# absolute frequency
>>> len(index[("NOV: próza", "plíseň")])
211
# relative frequency (instances per million)
>>> ipm(index[("NOV: próza", "plíseň")], N)
1.747430618598555
# average reduced frequency (takes into account dispersion)
>>> arf(index[("NOV: próza", "plíseň")], N)
54.220727998809153
Subclass :class:`~corpy.vertical.Vertical` for your custom corpus
=================================================================
If you have a corpus with a different structure, you can easily adapt the tools
by subclassing :class:`~corpy.vertical.Vertical`. See its docstring for further
info, or the implementation of :class:`~corpy.vertical.Syn2015Vertical` for a
practical example.