corpy.vertical
Parse and query corpora in the vertical format.
- class corpy.vertical.Vertical(path)
Base class for a corpus in the vertical format.
Create subclasses for specific corpora by specifying at least a list of struct_names and posattrs as class attributes.
- Parameters:
path (str) – Path to the vertical file to work with.
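As a minimal sketch of the subclassing pattern described above: the `Vertical` class below is only a stand-in so the snippet runs without corpy installed (with corpy available, `from corpy.vertical import Vertical` would take its place). The corpus name, the file path, and the attribute lists are hypothetical.

```python
from typing import List


class Vertical:
    """Stand-in for corpy.vertical.Vertical (an assumption, not the real class)."""

    struct_names: List[str] = []
    posattrs: List[str] = []

    def __init__(self, path: str) -> None:
        self.path = path


class MyCorpusVertical(Vertical):
    """Hypothetical corpus with three structures and three token columns."""

    # Tag names of the structures expected in the file, e.g. <doc>, <p>, <s>.
    struct_names = ["doc", "p", "s"]
    # Names of the tab-separated columns on each token line, in order.
    posattrs = ["word", "lemma", "tag"]


corpus = MyCorpusVertical("my_corpus.vert")  # hypothetical path
```

Declaring both lists as class attributes keeps the corpus description in one place, so every instance of the subclass shares it.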
- struct_names: List[str] = []
A list of expected structural attribute tag names.
- posattrs: List[str] = []
A list of expected positional attributes.
- open()
Open the vertical file in self.path.
Override this method in subclasses to specify alternative ways of opening, e.g. using gzip.open().
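An override for a gzip-compressed vertical might look roughly like the sketch below. It assumes that open() is expected to return a text-mode file object, which the description above does not state explicitly. `GzippedVertical` is a hypothetical stand-in class rather than a real corpy subclass, and the sample file is written to a temporary directory only so the snippet can run on its own.

```python
import gzip
import os
import tempfile


class GzippedVertical:
    """Hypothetical stand-in; in practice this would subclass Vertical."""

    def __init__(self, path):
        self.path = path

    def open(self):
        # "rt" decodes the compressed bytes into text lines (assumed UTF-8).
        return gzip.open(self.path, mode="rt", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.vert.gz")
    # Write a tiny compressed vertical so there is something to read back.
    with gzip.open(sample_path, mode="wt", encoding="utf-8") as handle:
        handle.write("<s>\nHello\thello\n</s>\n")
    with GzippedVertical(sample_path).open() as handle:
        lines = handle.read().splitlines()
```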
- parse_position(position)
Parse a single position from the vertical.
Override this method in subclasses to hook into the position parsing process.
- positions(parse_sattrs=True, ignore_fn=None, hook_fn=None)
Iterate over the positions in the vertical.
At any point during the iteration, the structural attributes corresponding to the current position are accessible via self.sattrs.
- Parameters:
parse_sattrs (bool) – Whether to parse structural attributes into a dict (default) or leave them as the original string (faster).
ignore_fn (function(posattrs, sattrs)) – If given, it is evaluated at each position; if it returns True, the position is ignored completely.
hook_fn (function(posattrs, sattrs)) – If given, it is evaluated at each position.
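To illustrate how ignore_fn and hook_fn interact with the iteration, here is a simplified standalone model. It is not corpy's implementation: `iter_positions`, the `Position` tuple, the regular expressions, and the sample data are all hypothetical. The sketch assumes one tag or one tab-separated token per line and ignores self-closing tags.

```python
import re
from collections import namedtuple

# Hypothetical positional attributes for the sample below.
Position = namedtuple("Position", ["word", "lemma", "tag"])

OPEN_TAG = re.compile(r"<(\w+)([^>]*)>")
CLOSE_TAG = re.compile(r"</(\w+)>")
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def iter_positions(lines, ignore_fn=None, hook_fn=None):
    """Yield token positions, tracking the currently open structures."""
    sattrs = {}  # structure name -> attributes of the open structure
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        closing = CLOSE_TAG.fullmatch(line)
        if closing:
            sattrs.pop(closing.group(1), None)
            continue
        opening = OPEN_TAG.fullmatch(line)
        if opening:
            sattrs[opening.group(1)] = dict(ATTRIBUTE.findall(opening.group(2)))
            continue
        position = Position(*line.split("\t"))
        # An ignored position is skipped before the hook ever sees it.
        if ignore_fn is not None and ignore_fn(position, sattrs):
            continue
        if hook_fn is not None:
            hook_fn(position, sattrs)
        yield position


SAMPLE = [
    '<doc id="d1">',
    "<s>",
    "The\tthe\tDET",
    "cat\tcat\tNOUN",
    ".\t.\tPUNCT",
    "</s>",
    "</doc>",
]

seen_docs = []
words = [
    position.word
    for position in iter_positions(
        SAMPLE,
        ignore_fn=lambda pos, sattrs: pos.tag == "PUNCT",
        hook_fn=lambda pos, sattrs: seen_docs.append(sattrs["doc"]["id"]),
    )
]
```

In this model the punctuation token is dropped by ignore_fn, so the hook runs only for the two remaining tokens.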
- search(match_fn, count_fn=None, **kwargs)
Search the vertical, creating an index of what’s been found.
- Parameters:
match_fn (function(posattrs, sattrs)) – Evaluated at each position to see if the position matches the given search.
count_fn – Evaluated at each matching position to determine what should be counted at that position (in the sense of being tallied as part of the resulting frequency distribution). If it returns a list, it’s understood as a list of things to count.
kwargs – Passed on to positions().
- Returns:
The frequency index of counted “things” and the size of the corpus.
- Return type:
(dict, int)
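The interplay of match_fn and count_fn can be sketched with a standalone function. Several things here are assumptions rather than documented behaviour: both callbacks are taken to receive (posattrs, sattrs), the index is taken to map each counted item to the list of offsets where it matched, and count_fn is required because the default for count_fn=None is not described above. `search_sketch` and the sample data are hypothetical.

```python
from collections import defaultdict


def search_sketch(positions, match_fn, count_fn):
    """Build an index of counted items and return it with the corpus size."""
    index = defaultdict(list)
    size = 0
    for offset, (posattrs, sattrs) in enumerate(positions):
        size = offset + 1
        if not match_fn(posattrs, sattrs):
            continue
        counted = count_fn(posattrs, sattrs)
        # A list result means several things are counted at this position.
        things = counted if isinstance(counted, list) else [counted]
        for thing in things:
            index[thing].append(offset)
    return dict(index), size


# Each item is (posattrs, sattrs); posattrs is (word, lemma, tag).
SAMPLE = [
    (("The", "the", "DET"), {}),
    (("cat", "cat", "NOUN"), {}),
    (("saw", "see", "VERB"), {}),
    (("the", "the", "DET"), {}),
    (("cats", "cat", "NOUN"), {}),
]

index, size = search_sketch(
    SAMPLE,
    match_fn=lambda posattrs, sattrs: posattrs[2] == "NOUN",
    count_fn=lambda posattrs, sattrs: posattrs[1],
)
```

Here nouns are matched and tallied by lemma, so both noun tokens end up under the single key "cat".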
- class corpy.vertical.Syn2015Vertical(path)
A subclass of Vertical for the SYN2015 corpus.
Refer to Vertical for API details.
- struct_names: List[str] = ['doc', 'text', 'p', 's', 'hi', 'lb']
A list of expected structural attribute tag names.
- posattrs: List[str] = ['word', 'lemma', 'tag', 'proc', 'afun', 'parent', 'eparent', 'prep', 'p_lemma', 'p_tag', 'p_afun', 'ep_lemma', 'ep_tag', 'ep_afun']
A list of expected positional attributes.
- open()
Open the vertical file in self.path.
Override this method in subclasses to specify alternative ways of opening, e.g. using gzip.open().
- parse_position(position)
Parse a single position from the vertical.
Override this method in subclasses to hook into the position parsing process.
- corpy.vertical.ipm(occurrences, N)
Relative frequency of occurrences in corpus, in instances per million.
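As a worked example of the instances-per-million calculation, the sketch below uses a hypothetical helper that takes a plain frequency count. Whether corpy's `occurrences` argument is a count or a collection of positions is not stated here, so the helper does not try to mirror that signature.

```python
def instances_per_million(frequency: int, corpus_size: int) -> float:
    """Scale an absolute frequency to a corpus of one million positions."""
    return 1_000_000 * frequency / corpus_size


# 250 hits in a corpus of 5 million positions.
value = instances_per_million(250, 5_000_000)
```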
- corpy.vertical.arf(occurrences, N)
Average reduced frequency of occurrences in corpus.
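For orientation, the usual definition of average reduced frequency can be sketched as follows: with f occurrences in a corpus of N positions, let v = N / f and let d_i be the distances between consecutive occurrences, wrapping around the end of the corpus; then ARF = (1 / v) * sum(min(d_i, v)). The helper below is hypothetical, takes a list of offsets as an assumption about the input, and may differ in detail from corpy's implementation.

```python
def average_reduced_frequency(offsets, corpus_size):
    """Average reduced frequency computed from occurrence offsets."""
    frequency = len(offsets)
    if frequency == 0:
        return 0.0
    ordered = sorted(offsets)
    segment = corpus_size / frequency  # v: segment length for an even spread
    # Cyclic distance from the last occurrence around to the first one...
    distances = [ordered[0] + corpus_size - ordered[-1]]
    # ...followed by the distances between consecutive occurrences.
    distances += [b - a for a, b in zip(ordered, ordered[1:])]
    return sum(min(d, segment) for d in distances) / segment


# Four evenly spread hits versus four hits clustered together.
even = average_reduced_frequency([0, 25, 50, 75], 100)
clustered = average_reduced_frequency([10, 11, 12, 13], 100)
```

Evenly spread occurrences keep their full frequency, while the clustered ones are reduced to little more than a single occurrence.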