corpy.vertical
Parse and query corpora in the vertical format.

class corpy.vertical.Vertical(path)
    Base class for a corpus in the vertical format.

    Create subclasses for specific corpora by specifying at least the lists struct_names and posattrs as class attributes.

    Parameters
        path (str) – Path to the vertical file to work with.
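
The subclassing pattern described above might look like the following sketch. VerticalStandIn is a hypothetical placeholder so that the snippet runs without corpy installed; in practice the base class would be corpy.vertical.Vertical. The attribute values and the sample data are invented for a toy corpus, and the layout (structures as tags, one token per line with tab-separated attributes) is an assumption about the input file.

```python
import re


class VerticalStandIn:
    """Hypothetical placeholder for corpy.vertical.Vertical (not the real class)."""

    struct_names = []
    posattrs = []

    def __init__(self, path):
        self.path = path


class ToyVertical(VerticalStandIn):
    # Structural tags expected in this (invented) corpus.
    struct_names = ["doc", "s"]
    # One name per tab-separated column of a token line.
    posattrs = ["word", "lemma", "tag"]


# Invented sample in the assumed layout: structures as tags, tokens as rows.
SAMPLE = (
    '<doc id="d1">\n'
    "<s>\n"
    "The\tthe\tDT\n"
    "cat\tcat\tNN\n"
    "</s>\n"
    "</doc>\n"
)

lines = SAMPLE.splitlines()
# Names of the opening tags that occur in the sample.
tags_seen = {m.group(1) for m in (re.match(r"<(\w+)", line) for line in lines) if m}
# Number of columns found on the token lines.
columns_seen = {len(line.split("\t")) for line in lines if not line.startswith("<")}
```

In this toy case the tags found in the sample match struct_names and every token line has as many columns as there are entries in posattrs, which is the correspondence the two class attributes are meant to declare.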

    open()
        Open the vertical file in self.path.

        Override this method in subclasses to specify alternative ways of opening, e.g. using gzip.open().
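
A minimal sketch of such an override for a gzip-compressed vertical is shown here. VerticalStandIn is a hypothetical placeholder for corpy.vertical.Vertical, and its default open() is an assumption about how the base class behaves; only gzip.open() itself is standard library API. The demo() helper is likewise invented for illustration.

```python
import gzip
import os
import tempfile


class VerticalStandIn:
    """Hypothetical placeholder for corpy.vertical.Vertical (not the real class)."""

    def __init__(self, path):
        self.path = path

    def open(self):
        # Assumed default behaviour: open a plain text file.
        return open(self.path, encoding="utf-8")


class GzippedVertical(VerticalStandIn):
    def open(self):
        # Mode "rt" makes gzip.open() return decoded text rather than bytes.
        return gzip.open(self.path, "rt", encoding="utf-8")


def demo():
    """Write a tiny gzipped file and read it back through the override."""
    fd, path = tempfile.mkstemp(suffix=".vrt.gz")
    os.close(fd)
    try:
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.write("cat\tcat\tNN\n")
        with GzippedVertical(path).open() as f:
            return f.read()
    finally:
        os.remove(path)
```

Because only open() changes, the rest of the class can keep treating the result as an ordinary text file object.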

    parse_position(position)
        Parse a single position from the vertical.

        Override this method in subclasses to hook into the position parsing process.
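
One way to hook into parsing is to call the parent implementation and post-process its result, as in this sketch. It rests on assumptions not confirmed by the text above: that a position arrives as one line of tab-separated values and that the parsed result is a named tuple with one field per entry in posattrs. VerticalStandIn is a hypothetical placeholder for corpy.vertical.Vertical.

```python
from collections import namedtuple


class VerticalStandIn:
    """Hypothetical placeholder for corpy.vertical.Vertical (not the real class)."""

    posattrs = []

    def parse_position(self, position):
        # Assumption: a position is one line of tab-separated attribute values.
        # (The tuple type is rebuilt on every call to keep the sketch short.)
        Position = namedtuple("Position", self.posattrs)
        return Position(*position.rstrip("\n").split("\t"))


class LowercasingVertical(VerticalStandIn):
    posattrs = ["word", "lemma", "tag"]

    def parse_position(self, position):
        # Let the base class do the parsing, then normalize the word form.
        parsed = super().parse_position(position)
        return parsed._replace(word=parsed.word.lower())


parsed = LowercasingVertical().parse_position("The\tthe\tDT\n")
```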

    posattrs = []
        A list of expected positional attributes.

    positions(parse_sattrs=True, ignore_fn=None, hook_fn=None)
        Iterate over the positions in the vertical.

        At any point during the iteration, the structural attributes corresponding to the current position are accessible via self.sattrs.

        Parameters
            parse_sattrs (bool) – Whether to parse structural attrs into a dict (default) or just leave the original string (faster).
            ignore_fn (function(posattrs, sattrs)) – If given, then evaluated at each position; if it returns True, then the position is completely ignored.
            hook_fn (function(posattrs, sattrs)) – If given, then evaluated at each position.
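
The two callbacks could be written along these lines. Everything except the (posattrs, sattrs) call signature is an assumption made for illustration: the Position fields, the tag values, the shape of sattrs, and positions_stand_in(), which is a hypothetical stand-in for positions() that applies ignore_fn before hook_fn (the real order is not stated above).

```python
from collections import Counter, namedtuple

# Hypothetical positional attributes for a toy corpus.
Position = namedtuple("Position", ["word", "lemma", "tag"])


def ignore_punctuation(posattrs, sattrs):
    """Candidate ignore_fn: skip positions whose (invented) tag is PUNCT."""
    return posattrs.tag == "PUNCT"


tag_counts = Counter()


def tally_tags(posattrs, sattrs):
    """Candidate hook_fn: record a side effect for each position it sees."""
    tag_counts[posattrs.tag] += 1


def positions_stand_in(data, ignore_fn=None, hook_fn=None):
    """Hypothetical stand-in for positions(); not the library's implementation."""
    for posattrs, sattrs in data:
        if ignore_fn is not None and ignore_fn(posattrs, sattrs):
            continue  # ignored positions are dropped entirely
        if hook_fn is not None:
            hook_fn(posattrs, sattrs)
        yield posattrs


DATA = [
    (Position("Cats", "cat", "NOUN"), {"doc": {"id": "d1"}}),
    (Position("sleep", "sleep", "VERB"), {"doc": {"id": "d1"}}),
    (Position(".", ".", "PUNCT"), {"doc": {"id": "d1"}}),
]

kept = [p.word for p in positions_stand_in(DATA, ignore_punctuation, tally_tags)]
```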

    search(match_fn, count_fn=None, **kwargs)
        Search the vertical, creating an index of what’s been found.

        Parameters
            match_fn (function(posattrs, sattrs)) – Evaluated at each position to see if the position matches the given search.
            count_fn – Evaluated at each matching position to determine what should be counted at that position (in the sense of being tallied as part of the resulting frequency distribution). If it returns a list, it’s understood as a list of things to count.
            kwargs – Passed on to positions().

        Returns
            The frequency index of counted “things” and the size of the corpus.

        Return type
            (dict, int)
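
The division of labour between the two callbacks can be sketched as follows. tally() is a simplified, hypothetical stand-in and not the library's search(): it assumes both callbacks receive (posattrs, sattrs), that the index maps each counted thing to a plain count, and that the corpus size is the number of positions visited. The Position fields and the data are invented.

```python
from collections import Counter, namedtuple

# Hypothetical positional attributes for a toy corpus.
Position = namedtuple("Position", ["word", "lemma", "tag"])


def match_nouns(posattrs, sattrs):
    """Candidate match_fn: does this position match the search?"""
    return posattrs.tag == "NOUN"


def count_lemma_and_doc(posattrs, sattrs):
    """Candidate count_fn: a list means several things are counted at once."""
    return [posattrs.lemma, "doc:" + sattrs["doc"]["id"]]


def tally(data, match_fn, count_fn):
    """Simplified stand-in for the search loop (assumptions in the lead-in)."""
    counts = Counter()
    size = 0
    for posattrs, sattrs in data:
        size += 1
        if not match_fn(posattrs, sattrs):
            continue
        things = count_fn(posattrs, sattrs)
        if not isinstance(things, list):
            things = [things]  # a single value counts as one thing
        counts.update(things)
    return dict(counts), size


DATA = [
    (Position("Cats", "cat", "NOUN"), {"doc": {"id": "d1"}}),
    (Position("sleep", "sleep", "VERB"), {"doc": {"id": "d1"}}),
    (Position("cat", "cat", "NOUN"), {"doc": {"id": "d2"}}),
]

index, size = tally(DATA, match_nouns, count_lemma_and_doc)
```

The real index may hold richer values than plain counts; the sketch only illustrates which positions are selected and what gets tallied for each of them.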

    struct_names = []
        A list of expected structural attribute tag names.

class corpy.vertical.Syn2015Vertical(path)
    A subclass of Vertical for the SYN2015 corpus.

    Refer to Vertical for API details.

    open()
        Open the vertical file in self.path.

        Override this method in subclasses to specify alternative ways of opening, e.g. using gzip.open().

    parse_position(position)
        Parse a single position from the vertical.

        Override this method in subclasses to hook into the position parsing process.

corpy.vertical.ipm(occurrences, N)
    Relative frequency of occurrences in corpus, in instances per million.
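
Instances per million is the relative frequency scaled by 1,000,000. The sketch below spells out that arithmetic under the assumption that the number of occurrences is passed as a plain count; the text above does not say whether the library function expects a count or a collection of positions, so ipm_from_count is a hypothetical helper and not the library function.

```python
def ipm_from_count(count, corpus_size):
    """Instances per million: count / corpus_size, scaled by 1,000,000."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return 1_000_000 * count / corpus_size
```

For example, 250 hits in a corpus of 5,000,000 positions come to 50 instances per million.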

corpy.vertical.arf(occurrences, N)
    Average reduced frequency of occurrences in corpus.
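
Average reduced frequency is usually defined as the sum of min(d, v) over all occurrences, divided by v, where v = N / f is the average distance between occurrences, f is the frequency, and d is the distance from the previous occurrence, measured cyclically so that the first occurrence is compared with the last. The sketch below implements that common definition; arf_sketch is a hypothetical helper, and it is an assumption that the library function follows the same definition and takes distinct token positions as input.

```python
def arf_sketch(positions, corpus_size):
    """ARF from distinct 0-based token positions in a corpus of corpus_size tokens."""
    freq = len(positions)
    if freq == 0:
        return 0.0
    ordered = sorted(positions)
    avg_dist = corpus_size / freq
    total = 0.0
    for i, cur in enumerate(ordered):
        prev = ordered[i - 1]  # for i == 0 this wraps around to the last occurrence
        dist = (cur - prev) % corpus_size
        if dist == 0:
            # Only happens with a single occurrence: the distance is a full circle.
            dist = corpus_size
        total += min(dist, avg_dist)
    return total / avg_dist
```

With evenly spread occurrences the result equals the plain frequency, while tightly clustered occurrences pull it towards 1, which is what makes the measure useful for discounting words concentrated in a few texts.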