Annotated Speech Corpora

CDAC, Kolkata has developed Annotated Speech Corpora for three East Indian languages, viz. Bangla, Assamese and Manipuri (sponsored by TDIL, DeitY). All informants in the corpora are professional voice-over artists. The speech was recorded in a studio environment and digitized at a sampling rate of 22,050 Hz with a resolution of 16 bits per sample, in PCM WAV format. Annotation has been done at both the text level and the speech level. At the text level, parts of speech (POS), phrases and clauses have been annotated, and the text files are also phonetically transcribed in the International Phonetic Alphabet (IPA). At the speech level, phonemes, syllables and breath pauses have been annotated. The total size of the speech corpora is about 8.5 GB, of which the majority (5.12 GB) is Bangla. Only the standard dialect of each language is included in the corpora.
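
As an illustration of the recording format described above, the following minimal Python sketch checks whether a WAV file has the stated sampling rate (22,050 Hz) and sample width (16 bits). It uses only the standard-library wave module. The function names matches_corpus_format and make_silent_wav are hypothetical and are not part of the corpus distribution. The number of channels is not stated in the description, so it is deliberately not checked here.

```python
import io
import wave

EXPECTED_RATE_HZ = 22050   # sampling rate stated in the corpus description
EXPECTED_SAMPLE_WIDTH = 2  # 16 bits per sample = 2 bytes


def matches_corpus_format(wav_source):
    """Return True if the WAV source has the stated rate and sample width.

    wav_source may be a file path or a binary file-like object.
    The channel count is not checked (it is not given in the description).
    """
    with wave.open(wav_source, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH)


def make_silent_wav(rate_hz, sample_width, n_frames=100, channels=1):
    """Build a small in-memory WAV of zero-valued samples (demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (n_frames * sample_width * channels))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    # Synthetic stand-ins for real recordings; no corpus file names are assumed.
    print(matches_corpus_format(make_silent_wav(22050, 2)))
    print(matches_corpus_format(make_silent_wav(16000, 2)))
```

With an actual recording, the path to the file would be passed to matches_corpus_format in place of the synthetic buffer.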

Figure: Screenshot of Annotated Speech Corpora

The content of the corpora has been designed to support various areas of speech research, such as speech synthesis, speech recognition and speaker recognition.