Wrangle corpora in the vertical format


Tools for parsing corpora in the vertical format, originally devised for CWB and also used by (No)SketchEngine. It would have been nice if verticals were just standards-compliant XML, but the format predates XML, so they’re not. Hence this.

NOTE: The examples below are currently not tested because they require the syn2015.gz vertical file to be available, which is large and should not be freely distributed.

>>> import pytest
>>> pytest.skip("examples not tested")

Iterating over positions in a vertical file

The positions() method lets you iterate over all positions in a vertical while keeping track of the attributes of the structures that contain them (available as sattrs), so you don’t risk errors from hand-coding this logic every time you need it.

>>> from corpy.vertical import Syn2015Vertical
>>> from pprint import pprint
>>> v = Syn2015Vertical("path/to/syn2015.gz")
>>> for i, position in enumerate(v.positions()):
...     if i % 100 == 0:
...         # structural attributes of position
...         pprint(v.sattrs)
...         print()
...         # position itself
...         pprint(position)
...         print()
...     elif i > 100:
...         break
{'doc': {'audience': 'GEN: obecné publikum',
         'author': 'Typlt, Jaromír',
         'authsex': 'M: muž',
         'biblio': 'Typlt, Jaromír (1993): Zápas s rodokmenem. Praha: Pražská '
         'first_published': '1993',
         'genre': 'X: neuvedeno',
         'genre_group': 'X: neuvedeno',
         'id': 'pi291',
         'isbnissn': '80-7110-132-X',
         'issue': '',
         'medium': 'B: kniha',
         'periodicity': 'NP: neperiodická publikace',
         'publisher': 'Pražská imaginace',
         'pubplace': 'Praha',
         'pubyear': '1993',
         'srclang': 'cs: čeština',
         'subtitle': 'Groteskní mýtus',
         'title': 'Zápas s rodokmenem',
         'translator': 'X',
         'transsex': 'X: neuvedeno',
         'txtype': 'NOV: próza',
         'txtype_group': 'FIC: beletrie'},
 'p': {'id': 'pi291:1:1', 'type': 'normal'},
 's': {'id': 'pi291:1:1:1'},
 'text': {'author': '', 'id': 'pi291:1', 'section': '', 'section_orig': ''}}

Position(word='ZÁPAS', lemma='zápas', tag=UtklTag(pos='N', sub='N', gen='I', num='S', case='1', pgen='-', pnum='-', pers='-', tense='-', grad='-', neg='A', act='-', p13='-', p14='-', var='-', asp='-'), proc='T', afun='ExD', parent='0', eparent='0', prep='', p_lemma='', p_tag='', p_afun='', ep_lemma='', ep_tag='', ep_afun='')

{'doc': {'audience': 'GEN: obecné publikum',
         'author': 'Typlt, Jaromír',
         'authsex': 'M: muž',
         'biblio': 'Typlt, Jaromír (1993): Zápas s rodokmenem. Praha: Pražská '
         'first_published': '1993',
         'genre': 'X: neuvedeno',
         'genre_group': 'X: neuvedeno',
         'id': 'pi291',
         'isbnissn': '80-7110-132-X',
         'issue': '',
         'medium': 'B: kniha',
         'periodicity': 'NP: neperiodická publikace',
         'publisher': 'Pražská imaginace',
         'pubplace': 'Praha',
         'pubyear': '1993',
         'srclang': 'cs: čeština',
         'subtitle': 'Groteskní mýtus',
         'title': 'Zápas s rodokmenem',
         'translator': 'X',
         'transsex': 'X: neuvedeno',
         'txtype': 'NOV: próza',
         'txtype_group': 'FIC: beletrie'},
 'p': {'id': 'pi291:1:3', 'type': 'normal'},
 's': {'id': 'pi291:1:3:2'},
 'text': {'author': '', 'id': 'pi291:1', 'section': '', 'section_orig': ''}}

Position(word='chvil', lemma='chvíle', tag=UtklTag(pos='N', sub='N', gen='F', num='P', case='2', pgen='-', pnum='-', pers='-', tense='-', grad='-', neg='A', act='-', p13='-', p14='-', var='-', asp='-'), proc='M', afun='Atr', parent='-1', eparent='-1', prep='', p_lemma='několik', p_tag='Ca--4-----------', p_afun='Adv', ep_lemma='několik', ep_tag='Ca--4-----------', ep_afun='Adv')
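
For readers who have not seen a vertical before, the hand-coded logic that positions() saves you from can be sketched roughly as follows. This is a simplified illustration, not corpy's implementation: the sample data, the regular expressions and the iter_positions() helper are all made up for this example, and real verticals have corner cases (for instance tokens that look like tags) that this sketch ignores.

```python
import re

# A tiny made-up vertical: structure tags sit on their own lines, and each
# position (token) is one line of tab-separated positional attributes.
SAMPLE = """\
<doc id="d1" title="Example">
<s id="d1:1">
Hello\thello
world\tworld
</s>
<s id="d1:2">
Bye\tbye
</s>
</doc>
"""

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
OPEN_RE = re.compile(r"<(\w+)(\s[^>]*)?>$")
CLOSE_RE = re.compile(r"</(\w+)>$")


def iter_positions(lines):
    """Yield (posattrs, sattrs) pairs from an iterable of vertical lines.

    Hypothetical helper: posattrs is a tuple of the tab-separated fields,
    sattrs is a snapshot of the attributes of all currently open structures.
    """
    sattrs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        closing = CLOSE_RE.match(line)
        if closing:
            # Leaving a structure: forget its attributes.
            sattrs.pop(closing.group(1), None)
            continue
        opening = OPEN_RE.match(line)
        if opening:
            # Entering a structure: remember its attributes under its name.
            sattrs[opening.group(1)] = dict(ATTR_RE.findall(opening.group(2) or ""))
            continue
        # Anything else is a position; copy sattrs so later updates
        # do not change what was already yielded.
        yield tuple(line.split("\t")), {k: dict(v) for k, v in sattrs.items()}


for posattrs, sattrs in iter_positions(SAMPLE.splitlines()):
    print(posattrs, sattrs["s"]["id"])
```

Getting the bookkeeping for opening and closing tags right is easy to botch when rewritten ad hoc, which is the motivation for wrapping it in a reusable class.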

Performing frequency distribution queries

This can be done elegantly and fairly quickly with search(). All you have to do is provide two functions, each of which receives the positional attributes and the structural attributes of the current position: a match function, which identifies the positions the query should match, and a count function, which specifies what should be counted for each match.

The return value is an index of occurrences and the total size of the corpus. The index is a dictionary whose keys are the values returned by the count function and whose values are numpy arrays of position indices within the corpus. These arrays can be further processed, e.g. using ipm() or arf(), to compute different types of frequencies.
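
To make the division of labour between the two functions concrete, here is a toy, in-memory sketch of the idea. It is not corpy's search(): the toy_search() helper, the three-field Position tuple and the four-position corpus are invented for illustration, and plain lists stand in for numpy arrays.

```python
from collections import defaultdict, namedtuple

# Hypothetical, much simplified position type.
Position = namedtuple("Position", ["word", "lemma", "pos"])

# A made-up corpus: (positional attributes, structural attributes) pairs.
corpus = [
    (Position("Dogs", "dog", "N"), {"doc": {"txtype": "FIC"}}),
    (Position("bark", "bark", "V"), {"doc": {"txtype": "FIC"}}),
    (Position("dog", "dog", "N"), {"doc": {"txtype": "NFC"}}),
    (Position("dogs", "dog", "N"), {"doc": {"txtype": "FIC"}}),
]


def toy_search(positions, match, count):
    """Return (index, corpus size) for an iterable of positions."""
    index = defaultdict(list)
    size = 0
    for i, (posattrs, sattrs) in enumerate(positions):
        size += 1
        if match(posattrs, sattrs):
            # The value returned by count() becomes the key in the index;
            # the index of the matching position is recorded under it.
            index[count(posattrs, sattrs)].append(i)
    return dict(index), size


def match(posattrs, sattrs):
    # match all nouns in fiction documents
    return sattrs["doc"]["txtype"] == "FIC" and posattrs.pos == "N"


def count(posattrs, sattrs):
    # at each matched position, record the txtype and lemma
    return sattrs["doc"]["txtype"], posattrs.lemma


index, N = toy_search(corpus, match, count)
print(index, N)
```

Because the index stores positions rather than bare counts, dispersion-aware measures can be computed from it afterwards without another pass over the corpus.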

>>> from corpy.vertical import Syn2015Vertical, ipm, arf
>>> v = Syn2015Vertical("path/to/syn2015.gz")
>>> # log progress every 50M positions
>>> v.report = 50_000_000
>>> def match(posattrs, sattrs):
...     # match all nouns within txtype_group "FIC: beletrie"
...     return sattrs["doc"]["txtype_group"] == "FIC: beletrie" and posattrs.tag.pos == "N"
>>> def count(posattrs, sattrs):
...     # at each matched position, record the txtype and lemma
...     return sattrs["doc"]["txtype"], posattrs.lemma
>>> index, N = v.search(match, count)
Processed 0 lines in 0:00:00.007382.
Processed 50,000,000 lines in 0:05:58.185566.
Processed 100,000,000 lines in 0:11:35.394294.

NOTE: This was run on a desktop workstation, with the data stored on a networked filesystem. If a future version performs significantly worse than this ballpark on a similar task, that should be considered a bug.

>>> # absolute frequency
>>> len(index[("NOV: próza", "plíseň")])
>>> # relative frequency (instances per million)
>>> ipm(index[("NOV: próza", "plíseň")], N)
>>> # average reduced frequency (takes into account dispersion)
>>> arf(index[("NOV: próza", "plíseň")], N)
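
If you are wondering what the two measures compute, the following pure-Python sketch follows their commonly cited definitions. It is an assumption that corpy's ipm() and arf() match these in every detail, and the names ipm_sketch() and arf_sketch() are made up here to avoid confusion with the real functions.

```python
def ipm_sketch(occurrences, corpus_size):
    """Instances per million: absolute frequency scaled to a 1M-position corpus."""
    return 1_000_000 * len(occurrences) / corpus_size


def arf_sketch(occurrences, corpus_size):
    """Average reduced frequency, using the usual cyclic-distance definition.

    The corpus is treated as a circle and split into as many equal chunks as
    there are occurrences. Each gap between neighbouring occurrences
    contributes at most one chunk length, so clumped occurrences count for less.
    """
    freq = len(occurrences)
    if freq == 0:
        return 0.0
    positions = sorted(occurrences)
    chunk = corpus_size / freq
    # Distance from the last occurrence around the end of the corpus to the first.
    distances = [positions[0] + corpus_size - positions[-1]]
    # Distances between consecutive occurrences.
    distances += [b - a for a, b in zip(positions, positions[1:])]
    return sum(min(d, chunk) for d in distances) / chunk


# Two occurrences in a 10-position corpus, evenly spread vs. clumped together.
print(ipm_sketch([0, 5], 10))
print(arf_sketch([0, 5], 10))
print(arf_sketch([0, 1], 10))
```

With evenly spread occurrences the ARF equals the absolute frequency; the more the occurrences cluster in one part of the corpus, the further it drops towards 1.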

Subclass Vertical for your custom corpus

If you have a corpus with a different structure, you can easily adapt the tools by subclassing Vertical. See its docstring for further info, or the implementation of Syn2015Vertical for a practical example.