corpy.vertical

Parse and query corpora in the vertical format.

class corpy.vertical.Vertical(path)

Base class for a corpus in the vertical format.

To support a specific corpus, create a subclass that defines at least the struct_names and posattrs class attributes.

Parameters

path (str) – Path to the vertical file to work with.

struct_names: List[str] = []

A list of expected structural attribute tag names.

posattrs: List[str] = []

A list of expected positional attributes.

open()

Open the vertical file in self.path.

Override this method in subclasses to specify alternative ways of opening, e.g. using gzip.open().
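The following is a minimal sketch of such a subclass, combining the two class attributes with a gzip-aware open(). So that it runs without corpy installed, it defines a small stand-in named Vertical; with the real library you would import corpy.vertical.Vertical instead. The class name GzippedVertical, the attribute values and the text-mode ("rt") return value are assumptions made for illustration, not details confirmed by the library.

```python
import gzip
import os
import tempfile


class Vertical:
    """Minimal stand-in for corpy.vertical.Vertical (an assumption, not the real class)."""

    struct_names = []
    posattrs = []

    def __init__(self, path):
        self.path = path

    def open(self):
        return open(self.path, encoding="utf-8")


class GzippedVertical(Vertical):
    # Hypothetical corpus layout: adjust both lists to the actual vertical.
    struct_names = ["doc", "s"]
    posattrs = ["word", "lemma", "tag"]

    def open(self):
        # "rt" yields decoded text lines, mirroring the plain-text stand-in above.
        return gzip.open(self.path, "rt", encoding="utf-8")


# Demo on a throwaway gzipped file.
sample = '<doc id="1">\n<s>\nHello\thello\tUH\n</s>\n</doc>\n'
with tempfile.NamedTemporaryFile(suffix=".vert.gz", delete=False) as tmp:
    tmp_path = tmp.name
with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
    f.write(sample)

with GzippedVertical(tmp_path).open() as f:
    lines = f.read().splitlines()
os.remove(tmp_path)
print(lines)
```

Only open() changes here; the rest of the class is inherited unchanged.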

parse_position(position)

Parse a single position from the vertical.

Override this method in subclasses to hook into the position parsing process.
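As a rough illustration of what parsing a position involves, the sketch below splits one line of a vertical into named fields. It assumes the common vertical layout of one token per line with tab-separated positional attributes. The names POSATTRS, Position and parse_position_sketch are hypothetical, and the real method may return a different type.

```python
from collections import namedtuple

# Hypothetical positional attributes; a real subclass would use its own posattrs.
POSATTRS = ["word", "lemma", "tag"]
Position = namedtuple("Position", POSATTRS)


def parse_position_sketch(line):
    """Split one tab-separated token line into a named tuple."""
    return Position(*line.rstrip("\n").split("\t"))


pos = parse_position_sketch("cats\tcat\tNNS\n")
print(pos.word, pos.lemma, pos.tag)
```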

positions(parse_sattrs=True, ignore_fn=None, hook_fn=None)

Iterate over the positions in the vertical.

At any point during the iteration, the structural attributes corresponding to the current position are accessible via self.sattrs.

Parameters
  • parse_sattrs (bool) – Whether to parse structural attributes into a dict (default) or leave them as the original string (faster).

  • ignore_fn (function(posattrs, sattrs)) – If given, it is evaluated at each position; if it returns True, the position is skipped entirely.

  • hook_fn (function(posattrs, sattrs)) – If given, it is evaluated at each position.

search(match_fn, count_fn=None, **kwargs)

Search the vertical, creating an index of what’s been found.

Parameters
  • match_fn (function(posattrs, sattrs)) – Evaluated at each position to see if the position matches the given search.

  • count_fn – Evaluated at each matching position to determine what should be counted at that position (in the sense of being tallied as part of the resulting frequency distribution). If it returns a list, it’s understood as a list of things to count.

  • kwargs – Passed on to positions().

Returns

The frequency index of counted “things” and the size of the corpus.

Return type

(dict, int)
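To make the roles of match_fn and count_fn concrete, here is a simplified model of the search loop over an in-memory list. It is not corpy's implementation. It assumes that both callbacks receive (posattrs, sattrs), that the index maps each counted item to the list of positions where it occurred, and that plain dicts stand in for the parsed attributes. count_fn is required here to avoid guessing at the library's default.

```python
from collections import defaultdict


def search_sketch(positions, match_fn, count_fn):
    """Tally what count_fn returns at every position accepted by match_fn."""
    index = defaultdict(list)
    size = 0
    for i, (posattrs, sattrs) in enumerate(positions):
        size += 1
        if not match_fn(posattrs, sattrs):
            continue
        counted = count_fn(posattrs, sattrs)
        # A list return value means "count each of these items separately".
        items = counted if isinstance(counted, list) else [counted]
        for item in items:
            index[item].append(i)
    return dict(index), size


# Hypothetical toy corpus: (posattrs, sattrs) pairs.
sattrs = {"genre": "news"}
positions = [
    ({"word": "The", "lemma": "the", "tag": "DT"}, sattrs),
    ({"word": "cats", "lemma": "cat", "tag": "NNS"}, sattrs),
    ({"word": "chased", "lemma": "chase", "tag": "VBD"}, sattrs),
    ({"word": "the", "lemma": "the", "tag": "DT"}, sattrs),
    ({"word": "cat", "lemma": "cat", "tag": "NN"}, sattrs),
]

index, size = search_sketch(
    positions,
    match_fn=lambda p, s: p["tag"].startswith("NN"),
    count_fn=lambda p, s: [p["lemma"], p["tag"]],
)
print(index, size)
```

In this toy run, each noun contributes both its lemma and its tag to the index, because count_fn returns a list.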

class corpy.vertical.Syn2015Vertical(path)

A subclass of Vertical for the SYN2015 corpus.

Refer to Vertical for API details.

open()

Open the vertical file in self.path.

Override this method in subclasses to specify alternative ways of opening, e.g. using gzip.open().

parse_position(position)

Parse a single position from the vertical.

Override this method in subclasses to hook into the position parsing process.

corpy.vertical.ipm(occurrences, N)

Relative frequency of occurrences in corpus, in instances per million.
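Instances per million scales a raw count to a corpus of one million positions. The sketch below assumes that occurrences is a sequence of the positions where an item was found and that N is the corpus size in positions; ipm_sketch is a hypothetical name, not the library function.

```python
def ipm_sketch(occurrences, N):
    """Instances per million: raw frequency scaled to a 1,000,000-position corpus."""
    return 1_000_000 * len(occurrences) / N


# Three hits in a corpus of two million positions.
value = ipm_sketch([3, 17, 42], 2_000_000)
print(value)
```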

corpy.vertical.arf(occurrences, N)

Average reduced frequency of occurrences in corpus.
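Average reduced frequency discounts occurrences that are clustered together. In the usual definition, for f occurrences in a corpus of N positions, v = N / f is the average distance, d_i are the cyclic distances between consecutive occurrences, and ARF = (1 / v) * sum(min(d_i, v)). The sketch below follows that definition in plain Python. It assumes that occurrences holds distinct positions in the range 0 to N - 1, and it may differ from the library's implementation in details; arf_sketch is a hypothetical name.

```python
def arf_sketch(occurrences, N):
    """Average reduced frequency of positions `occurrences` in a corpus of size N."""
    freq = len(occurrences)
    if freq == 0:
        return 0.0
    positions = sorted(occurrences)
    # Treat the corpus as a circle: the first gap wraps around from the last hit.
    previous = [positions[-1] - N] + positions[:-1]
    gaps = [cur - prev for cur, prev in zip(positions, previous)]
    avg_gap = N / freq
    return sum(min(gap, avg_gap) for gap in gaps) / avg_gap


even = arf_sketch([0, 25, 50, 75], 100)    # evenly spread hits
clustered = arf_sketch([0, 1, 2, 3], 100)  # same count, bunched together
print(even, clustered)
```

Evenly spread occurrences keep their full frequency, while the clustered ones are reduced to a value just above 1.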

class corpy.vertical.ShuffledSyn2015Vertical(path)

A subclass of Vertical for the shuffled version of the SYN2015 corpus.

Refer to Vertical for API details.