corpy.vertical

Parse and query corpora in the vertical format.

class corpy.vertical.Vertical(path)

Base class for a corpus in the vertical format.

To support a specific corpus, create a subclass that defines at least struct_names and posattrs as class attributes.

Parameters:

path (str) – Path to the vertical file to work with.

struct_names: List[str] = []

A list of expected structural attribute tag names.

posattrs: List[str] = []

A list of expected positional attributes.
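A minimal sketch of such a subclass for a made-up corpus. To keep the snippet runnable without the library installed, StandInVertical is a hypothetical stand-in for corpy.vertical.Vertical that models only the path argument and the two class attributes described above; MyCorpusVertical and its attribute values are likewise invented for illustration.

```python
class StandInVertical:
    """Hypothetical stand-in for corpy.vertical.Vertical (illustration only)."""

    struct_names = []
    posattrs = []

    def __init__(self, path):
        self.path = path


class MyCorpusVertical(StandInVertical):
    # Structural tags expected in this (invented) corpus.
    struct_names = ["doc", "s"]
    # Positional attributes, in the order the columns appear on each line.
    posattrs = ["word", "lemma", "tag"]


corpus = MyCorpusVertical("my_corpus.vert")
```

In real use, the subclass would presumably inherit from corpy.vertical.Vertical instead of the stand-in.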

open()

Open the vertical file in self.path.

Override this method in subclasses to specify alternative ways of opening, e.g. using gzip.open().
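The gzip case mentioned above might be sketched as follows. StandInVertical is a hypothetical stand-in for the real base class so that the snippet runs on its own, and text mode with UTF-8 decoding is an assumption about what the rest of the parser expects, not something this page confirms.

```python
import gzip


class StandInVertical:
    """Hypothetical stand-in for corpy.vertical.Vertical (illustration only)."""

    def __init__(self, path):
        self.path = path

    def open(self):
        return open(self.path, encoding="utf-8")


class GzippedVertical(StandInVertical):
    def open(self):
        # "rt" makes gzip.open() return decoded text lines instead of bytes.
        return gzip.open(self.path, mode="rt", encoding="utf-8")
```

Only open() changes; everything that consumes the resulting file object can stay as it is, which is the point of making the method overridable.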

parse_position(position)

Parse a single position from the vertical.

Override this method in subclasses to hook into the position parsing process.
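For orientation, a token line in a vertical file holds the positional attributes of one position, separated by tabs. The sketch below shows one plausible way to turn such a line into a record. The three attribute names, the namedtuple result, and the function name parse_position_sketch are assumptions made for illustration, not the library's confirmed behaviour.

```python
from collections import namedtuple

# Hypothetical attribute set; a real corpus defines its own in posattrs.
Position = namedtuple("Position", ["word", "lemma", "tag"])


def parse_position_sketch(line):
    """Split one tab-separated token line into a Position record."""
    return Position(*line.rstrip("\n").split("\t"))


pos = parse_position_sketch("dogs\tdog\tNNS\n")
```

An override of parse_position() could wrap logic like this, for example to normalize case or drop unused columns before the position is yielded.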

positions(parse_sattrs=True, ignore_fn=None, hook_fn=None)

Iterate over the positions in the vertical.

At any point during the iteration, the structural attributes corresponding to the current position are accessible via self.sattrs.

Parameters:
  • parse_sattrs (bool) – Whether to parse structural attributes into a dict (the default) or leave them as the original string, which is faster.

  • ignore_fn (function(posattrs, sattrs)) – If given, then evaluated at each position; if it returns True, then the position is completely ignored.

  • hook_fn (function(posattrs, sattrs)) – If given, then evaluated at each position.
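The callback protocol described above can be illustrated with a small stand-alone generator. This is a sketch of the described semantics, not the library's code: iterate_sketch, the tuple-shaped positions, the nested sattrs dict, and the order in which the two callbacks are applied are all assumptions.

```python
def iterate_sketch(pairs, ignore_fn=None, hook_fn=None):
    """Yield posattrs from (posattrs, sattrs) pairs, applying both callbacks."""
    for posattrs, sattrs in pairs:
        if ignore_fn is not None and ignore_fn(posattrs, sattrs):
            continue  # an ignored position is skipped entirely
        if hook_fn is not None:
            hook_fn(posattrs, sattrs)
        yield posattrs


def is_comma(posattrs, sattrs):
    return posattrs[0] == ","


seen_docs = []


def record_doc(posattrs, sattrs):
    seen_docs.append(sattrs["doc"]["id"])


pairs = [
    (("The", "the"), {"doc": {"id": "a"}}),
    ((",", ","), {"doc": {"id": "a"}}),
    (("dog", "dog"), {"doc": {"id": "b"}}),
]
kept = list(iterate_sketch(pairs, ignore_fn=is_comma, hook_fn=record_doc))
```

Here the comma is dropped before the hook runs, so the hook only sees the two remaining positions.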

search(match_fn, count_fn=None, **kwargs)

Search the vertical, creating an index of what’s been found.

Parameters:
  • match_fn (function(posattrs, sattrs)) – Evaluated at each position to see if the position matches the given search.

  • count_fn – Evaluated at each matching position to determine what should be counted at that position (in the sense of being tallied as part of the resulting frequency distribution). If it returns a list, it’s understood as a list of things to count.

  • kwargs – Passed on to positions().

Returns:

The frequency index of counted “things” and the size of the corpus.

Return type:

(dict, int)
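The interplay of match_fn and count_fn can be sketched without the library. The function below is an illustration under stated assumptions: search_sketch is a made-up name, both callbacks are assumed to receive (posattrs, sattrs), the index is assumed to map each counted thing to the list of positions where it occurred, and the behaviour when count_fn is omitted is not modelled.

```python
from collections import defaultdict


def search_sketch(pairs, match_fn, count_fn):
    """Tally counted things at matching positions; return (index, corpus size)."""
    index = defaultdict(list)
    size = 0
    for i, (posattrs, sattrs) in enumerate(pairs):
        size += 1
        if not match_fn(posattrs, sattrs):
            continue
        counted = count_fn(posattrs, sattrs)
        if not isinstance(counted, list):
            counted = [counted]  # a single value counts as a one-item list
        for thing in counted:
            index[thing].append(i)
    return dict(index), size


def is_noun(posattrs, sattrs):
    return posattrs[2] == "N"


def lemma_of(posattrs, sattrs):
    return posattrs[1]


pairs = [
    (("dogs", "dog", "N"), {}),
    (("bark", "bark", "V"), {}),
    (("dog", "dog", "N"), {}),
    (("cats", "cat", "N"), {}),
]
index, size = search_sketch(pairs, is_noun, lemma_of)
```

Keeping positions rather than bare counts in the index would let the same result feed both ipm() and arf(), which take a set of occurrences plus the corpus size.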

class corpy.vertical.Syn2015Vertical(path)

A subclass of Vertical for the SYN2015 corpus.

Refer to Vertical for API details.

struct_names: List[str] = ['doc', 'text', 'p', 's', 'hi', 'lb']

A list of expected structural attribute tag names.

posattrs: List[str] = ['word', 'lemma', 'tag', 'proc', 'afun', 'parent', 'eparent', 'prep', 'p_lemma', 'p_tag', 'p_afun', 'ep_lemma', 'ep_tag', 'ep_afun']

A list of expected positional attributes.

open()

Open the vertical file in self.path.

Override this method in subclasses to specify alternative ways of opening, e.g. using gzip.open().

parse_position(position)

Parse a single position from the vertical.

Override this method in subclasses to hook into the position parsing process.

corpy.vertical.ipm(occurrences, N)

Relative frequency of occurrences in corpus, in instances per million.
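Instances per million is the raw frequency scaled to a corpus of one million positions. A minimal sketch, assuming occurrences is a sequence of corpus positions (so its length is the raw frequency) and N is the corpus size; ipm_sketch is an illustrative name, not the library function.

```python
def ipm_sketch(occurrences, N):
    """Relative frequency in instances per million positions."""
    return 1e6 * len(occurrences) / N


# Three hits in a corpus of one million positions.
three_per_million = ipm_sketch([3, 10, 42], 1_000_000)
```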

corpy.vertical.arf(occurrences, N)

Average reduced frequency of occurrences in corpus.
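Average reduced frequency discounts occurrences that cluster together: with f occurrences in a corpus of N positions and average distance v = N / f, it sums min(d, v) over the distances d between consecutive occurrences (treating the corpus as circular) and divides by v. The sketch below follows that standard definition; it assumes occurrences holds 0-based integer positions smaller than N, and arf_sketch is an illustrative name, not the library's implementation.

```python
def arf_sketch(occurrences, N):
    """Average reduced frequency of a set of positions in a corpus of size N."""
    freq = len(occurrences)
    if freq == 0:
        return 0.0
    positions = sorted(occurrences)
    avg_dist = N / freq
    total = 0.0
    # Start from the last occurrence shifted back by N, so the first
    # distance wraps around the end of the corpus.
    prev = positions[-1] - N
    for pos in positions:
        total += min(pos - prev, avg_dist)
        prev = pos
    return total / avg_dist


evenly_spread = arf_sketch([0, 25, 50, 75], 100)
clustered = arf_sketch([0, 1, 2, 3], 100)
```

With evenly spread occurrences the result equals the raw frequency, while the clustered case comes out close to 1, reflecting that the four hits behave almost like a single burst.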

class corpy.vertical.ShuffledSyn2015Vertical(path)

A subclass of Vertical for the shuffled version of the SYN2015 corpus.

Refer to Vertical for API details.