======================================
Wrangle corpora in the vertical format
======================================

Overview
========

Tools for parsing corpora in the vertical format, originally devised for CWB
and also used by (No)SketchEngine. It would have been nice if verticals were
just standards-compliant XML, but they appeared before XML, so they're not.
Hence this.

NOTE: The examples below are currently not tested because they require the
:file:`syn2015.gz` vertical file to be available, which is large and should not
be freely distributed.

.. code:: python

   >>> import pytest
   >>> pytest.skip("examples not tested")

Iterating over positions in a vertical file
===========================================

This allows you to iterate over all positions while keeping track of the
structural attributes of the structures they're contained within, without
risking errors from hand-coding this logic every time you need it.

.. code:: python

   >>> from corpy.vertical import Syn2015Vertical
   >>> from pprint import pprint
   >>> v = Syn2015Vertical("path/to/syn2015.gz")
   >>> for i, position in enumerate(v.positions()):
   ...     if i % 100 == 0:
   ...         # structural attributes of position
   ...         pprint(v.sattrs)
   ...         print()
   ...         # position itself
   ...         pprint(position)
   ...         print()
   ...     elif i > 100:
   ...         break
   ...
   {'doc': {'audience': 'GEN: obecné publikum',
            'author': 'Typlt, Jaromír',
            'authsex': 'M: muž',
            'biblio': 'Typlt, Jaromír (1993): Zápas s rodokmenem. Praha: Pražská '
                      'imaginace.',
            'first_published': '1993',
            'genre': 'X: neuvedeno',
            'genre_group': 'X: neuvedeno',
            'id': 'pi291',
            'isbnissn': '80-7110-132-X',
            'issue': '',
            'medium': 'B: kniha',
            'periodicity': 'NP: neperiodická publikace',
            'publisher': 'Pražská imaginace',
            'pubplace': 'Praha',
            'pubyear': '1993',
            'srclang': 'cs: čeština',
            'subtitle': 'Groteskní mýtus',
            'title': 'Zápas s rodokmenem',
            'translator': 'X',
            'transsex': 'X: neuvedeno',
            'txtype': 'NOV: próza',
            'txtype_group': 'FIC: beletrie'},
    'p': {'id': 'pi291:1:1', 'type': 'normal'},
    's': {'id': 'pi291:1:1:1'},
    'text': {'author': '', 'id': 'pi291:1', 'section': '', 'section_orig': ''}}

   Position(word='ZÁPAS', lemma='zápas', tag=UtklTag(pos='N', sub='N', gen='I', num='S', case='1', pgen='-', pnum='-', pers='-', tense='-', grad='-', neg='A', act='-', p13='-', p14='-', var='-', asp='-'), proc='T', afun='ExD', parent='0', eparent='0', prep='', p_lemma='', p_tag='', p_afun='', ep_lemma='', ep_tag='', ep_afun='')

   {'doc': {'audience': 'GEN: obecné publikum',
            'author': 'Typlt, Jaromír',
            'authsex': 'M: muž',
            'biblio': 'Typlt, Jaromír (1993): Zápas s rodokmenem. Praha: Pražská '
                      'imaginace.',
            'first_published': '1993',
            'genre': 'X: neuvedeno',
            'genre_group': 'X: neuvedeno',
            'id': 'pi291',
            'isbnissn': '80-7110-132-X',
            'issue': '',
            'medium': 'B: kniha',
            'periodicity': 'NP: neperiodická publikace',
            'publisher': 'Pražská imaginace',
            'pubplace': 'Praha',
            'pubyear': '1993',
            'srclang': 'cs: čeština',
            'subtitle': 'Groteskní mýtus',
            'title': 'Zápas s rodokmenem',
            'translator': 'X',
            'transsex': 'X: neuvedeno',
            'txtype': 'NOV: próza',
            'txtype_group': 'FIC: beletrie'},
    'p': {'id': 'pi291:1:3', 'type': 'normal'},
    's': {'id': 'pi291:1:3:2'},
    'text': {'author': '', 'id': 'pi291:1', 'section': '', 'section_orig': ''}}

   Position(word='chvil', lemma='chvíle', tag=UtklTag(pos='N', sub='N', gen='F', num='P', case='2', pgen='-', pnum='-', pers='-', tense='-', grad='-', neg='A', act='-', p13='-', p14='-', var='-', asp='-'), proc='M', afun='Atr', parent='-1', eparent='-1', prep='', p_lemma='několik', p_tag='Ca--4-----------', p_afun='Adv', ep_lemma='několik', ep_tag='Ca--4-----------', ep_afun='Adv')

Performing frequency distribution queries
=========================================

This can be done elegantly and fairly quickly with
:meth:`~corpy.vertical.Vertical.search`. All you have to do is provide a match
function, which identifies the positions that the query should match, and a
count function, which specifies what should be counted for each match. The
return value is a pair: an index of occurrences and the total size of the
corpus. The index is a dictionary of numpy arrays of position indices within
the corpus, which can be further processed, e.g. using
:func:`~corpy.vertical.ipm` or :func:`~corpy.vertical.arf`, to compute
different types of frequencies.

.. code:: python

   >>> from corpy.vertical import Syn2015Vertical, ipm, arf
   >>> v = Syn2015Vertical("path/to/syn2015.gz")
   >>> # log progress every 50M positions
   >>> v.report = 50_000_000
   >>> def match(posattrs, sattrs):
   ...     # match all nouns within txtype_group "FIC: beletrie"
   ...     return sattrs["doc"]["txtype_group"] == "FIC: beletrie" and posattrs.tag.pos == "N"
   ...
   >>> def count(posattrs, sattrs):
   ...     # at each matched position, record the txtype and lemma
   ...     return sattrs["doc"]["txtype"], posattrs.lemma
   ...
   >>> index, N = v.search(match, count)
   Processed 0 lines in 0:00:00.007382.
   Processed 50,000,000 lines in 0:05:58.185566.
   Processed 100,000,000 lines in 0:11:35.394294.

**NOTE:** this was run on a desktop workstation, with the data being stored on
a networked filesystem. If the performance of any future versions on a similar
task becomes significantly worse than this ballpark, it should be considered a
bug.

.. code:: python

   >>> # absolute frequency
   >>> len(index[("NOV: próza", "plíseň")])
   211
   >>> # relative frequency (instances per million)
   >>> ipm(index[("NOV: próza", "plíseň")], N)
   1.747430618598555
   >>> # average reduced frequency (takes into account dispersion)
   >>> arf(index[("NOV: próza", "plíseň")], N)
   54.220727998809153

Subclass :class:`~corpy.vertical.Vertical` for your custom corpus
=================================================================

If you have a corpus with a different structure, you can easily adapt the
tools by subclassing :class:`~corpy.vertical.Vertical`. See its docstring for
further info, or the implementation of
:class:`~corpy.vertical.Syn2015Vertical` for a practical example.
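
Before writing a subclass, it can help to see what "a different structure" amounts to in practice: a different set of structure names and a different list of positional attributes. The following standalone sketch does not use ``corpy`` at all. It is a toy parser over an invented three-column vertical, and every name in it (``STRUCT_NAMES``, ``Position``, ``positions``, the sample data) is hypothetical and chosen for illustration only; it is not how :class:`~corpy.vertical.Vertical` is implemented.

```python
import re
from collections import namedtuple

# Hypothetical toy corpus layout (not taken from corpy or from SYN2015):
# two kinds of structures and three tab-separated positional attributes.
STRUCT_NAMES = ["doc", "s"]
Position = namedtuple("Position", ["word", "lemma", "tag"])

VERTICAL = """\
<doc id="d1" genre="fiction">
<s id="d1:1">
Dogs\tdog\tN
bark\tbark\tV
</s>
<s id="d1:2">
Cats\tcat\tN
</s>
</doc>
"""

OPEN_TAG = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*>')
CLOSE_TAG = re.compile(r"</(\w+)>")
ATTR = re.compile(r'(\w+)="([^"]*)"')


def positions(text):
    """Yield (position, sattrs) pairs from a toy vertical string.

    sattrs maps each currently open structure to its attributes, which is
    the bookkeeping that is easy to get wrong when coded by hand.
    """
    sattrs = {}
    for line in text.splitlines():
        opening = OPEN_TAG.fullmatch(line)
        closing = CLOSE_TAG.fullmatch(line)
        if opening and opening.group(1) in STRUCT_NAMES:
            # a structure starts: remember its attributes
            sattrs[opening.group(1)] = dict(ATTR.findall(opening.group(2)))
        elif closing and closing.group(1) in STRUCT_NAMES:
            # a structure ends: forget its attributes
            del sattrs[closing.group(1)]
        elif line:
            # anything else is treated as a position; yield a copy of the
            # structural attributes so later tags cannot alter it
            yield Position(*line.split("\t")), {k: dict(v) for k, v in sattrs.items()}


# Collect the lemma and sentence id of every noun.
nouns = [
    (pos.lemma, sattrs["s"]["id"])
    for pos, sattrs in positions(VERTICAL)
    if pos.tag == "N"
]
print(nouns)  # [('dog', 'd1:1'), ('cat', 'd1:2')]
```

A real vertical needs more care than this toy (for instance, tokens that themselves look like tags), which is a good reason to rely on the library class and only describe your corpus's structures and attributes in the subclass.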