corpy.phonetics.cs

Perform rule-based phonetic transcription of Czech.

Some frequent exceptions to the otherwise fairly regular orthography-to-phonetics mapping are overridden using a pronunciation lexicon.

class corpy.phonetics.cs.Phone(value: str, *, word_boundary: bool = False)

A single phone.

You probably don’t need to create these by hand. They’re used internally by ProsodicUnit to keep track of word boundaries while keeping all the phones in a flat list.

class corpy.phonetics.cs.ProsodicUnit(orthographic: List[str])

A prosodic unit which should be transcribed as a whole.

This means that various connected speech processes are emulated at word boundaries within the unit as well as within words.

Parameters:

orthographic (list of str) – The orthographic transcript of the prosodic unit.

phonetic(*, alphabet: str = 'SAMPA', hiatus=False, tagger: Tagger | None = None) → List[Tuple[str, ...]]

Phonetic transcription of ProsodicUnit.

corpy.phonetics.cs.transcribe(phrase: str | Iterable[str], *, alphabet='sampa', hiatus=False, tagger: Tagger | None = None, prosodic_boundary_symbols: Set[str] | None = None) → List[str | Tuple[str, ...]]

Phonetically transcribe phrase.

Note

It is highly recommended to provide an instance of corpy.morphodita.Tagger via the tagger argument. This enables smarter treatment of vowel sequences that arise from prefixing. Without a tagger, -eu- is transcribed as a diphthong in both neuron and neurozený, even though that is only appropriate in the former.

A few simple cases are covered even in the absence of a tagger via the exceptions mechanism: search for - in exceptions.tsv.

phrase is either a string (in which case it is split on whitespace) or an iterable of strings (in which case it’s considered as already tokenized by the user).

Transcription is attempted for tokens which consist purely of alphabetical characters and possibly hyphens (-). Other tokens are passed through unchanged. Hyphens have a special role: they prevent interactions between graphemes or phones from taking place, which means you can e.g. cancel assimilation of voicing in a cluster like tb by inserting a hyphen between the graphemes: t-b. Such hyphens are removed from the final output. If you want a literal hyphen, it must be inside a token with either no alphabetic characters, or at least one other non-alphabetic character (e.g. -, --- or -hlad?).
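The tokenization and pass-through rules described above can be sketched in a few lines of plain Python. This is only an illustration of the documented behaviour, not the library's implementation; the helper names tokenize and is_transcribable are hypothetical and do not exist in corpy.

```python
from typing import Iterable, List, Union


def tokenize(phrase: Union[str, Iterable[str]]) -> List[str]:
    """Split a string on whitespace; trust an iterable as already tokenized."""
    if isinstance(phrase, str):
        return phrase.split()
    return list(phrase)


def is_transcribable(token: str) -> bool:
    """Hypothetical check: letters and hyphens only, with at least one letter."""
    has_letter = any(ch.isalpha() for ch in token)
    only_letters_or_hyphens = all(ch.isalpha() or ch == "-" for ch in token)
    return has_letter and only_letters_or_hyphens


for token in tokenize("t-b hlad - --- -hlad?"):
    kind = "transcribe" if is_transcribable(token) else "pass through"
    print(f"{token!r}: {kind}")
```

Under this reading, t-b and hlad are candidates for transcription, while -, --- and -hlad? are passed through with their hyphens intact.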

Returns a list where transcribed tokens are represented as tuples of strings (phones) and non-transcribed tokens (which were just passed through as-is) as plain strings.
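Because the result mixes tuples and plain strings, downstream code usually has to branch on the item type. A minimal sketch of one way to do that follows; the render helper is hypothetical, and the sample list is hand-written placeholder data with the documented shape, not real transcriber output.

```python
from typing import List, Tuple, Union

Item = Union[str, Tuple[str, ...]]


def render(result: List[Item], phone_sep: str = " ") -> str:
    """Join a transcribe()-shaped result into one display string."""
    parts = []
    for item in result:
        if isinstance(item, tuple):
            # Transcribed token: a tuple of phone symbols.
            parts.append("/" + phone_sep.join(item) + "/")
        else:
            # Non-transcribed token: passed through as a plain string.
            parts.append(item)
    return " ".join(parts)


# Placeholder data only: "p1", "p2", "p3" stand in for phone symbols.
sample: List[Item] = [("p1", "p2"), "?", ("p3",)]
print(render(sample))  # → /p1 p2/ ? /p3/
```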

alphabet is one of SAMPA, IPA, CS or CNC (case-insensitive) and determines the symbol alphabet used in the phonetic transcript.

When hiatus=True, a /j/ phone is added between a high front vowel and a subsequent vowel.
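As a rough illustration of what the hiatus rule does to a phone sequence, the sketch below inserts a j after a high front vowel that is directly followed by another vowel. The symbol sets and the add_hiatus_glide helper are assumptions made for this example (a simplified SAMPA-like notation); the library's own inventory and logic may differ.

```python
from typing import List, Sequence

# Assumed, simplified symbol sets; not taken from corpy.
HIGH_FRONT_VOWELS = {"i", "i:"}
VOWELS = {"a", "e", "i", "o", "u", "a:", "e:", "i:", "o:", "u:"}


def add_hiatus_glide(phones: Sequence[str]) -> List[str]:
    """Insert "j" between a high front vowel and an immediately following vowel."""
    out: List[str] = []
    for idx, phone in enumerate(phones):
        out.append(phone)
        following = phones[idx + 1] if idx + 1 < len(phones) else None
        if phone in HIGH_FRONT_VOWELS and following in VOWELS:
            out.append("j")
    return out


print(add_hiatus_glide(["f", "i", "a", "l", "a"]))
# → ['f', 'i', 'j', 'a', 'l', 'a']
```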

Various connected speech processes (CSPs), such as assimilation of voicing, are emulated even across word boundaries. By default, this happens irrespective of intervening non-transcribed tokens. If you want some types of non-transcribed tokens to constitute an obstacle to interactions between phones, pass them as a set via the prosodic_boundary_symbols argument. E.g. prosodic_boundary_symbols={"?", ".."} will prevent CSPs from being emulated across ? and .. tokens.
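One way to picture the effect of prosodic_boundary_symbols is as a grouping step: words accumulate into a single unit until a boundary token closes it. The sketch below models only that grouping, under the assumption that this is a fair reading of the behaviour described above; group_into_units and is_word are hypothetical helpers, not part of corpy.

```python
from typing import Iterable, List, Optional, Set


def is_word(token: str) -> bool:
    """Letters and hyphens only, with at least one letter."""
    return any(ch.isalpha() for ch in token) and all(
        ch.isalpha() or ch == "-" for ch in token
    )


def group_into_units(
    tokens: Iterable[str], boundary_symbols: Optional[Set[str]] = None
) -> List[List[str]]:
    """Group words into units; only listed boundary tokens close a unit."""
    boundaries = boundary_symbols or set()
    units: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        if is_word(token):
            current.append(token)
        elif token in boundaries and current:
            units.append(current)
            current = []
        # Other non-word tokens are ignored and do not block interactions.
    if current:
        units.append(current)
    return units


tokens = ["máš", "hlad", "?", "dej", "si", "..", "chleba"]
print(group_into_units(tokens))
# → [['máš', 'hlad', 'dej', 'si', 'chleba']]
print(group_into_units(tokens, {"?", ".."}))
# → [['máš', 'hlad'], ['dej', 'si'], ['chleba']]
```

In the first call all five words sit in one unit, so processes at every word boundary could apply; in the second, nothing would interact across ? or .. tokens.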