Common Voice Scripted Speech 24.0 - Palula
License:
CC0-1.0
Steward:
Common Voice
Task: ASR
Release Date: 12/5/2025
Format: MP3
Size: 591.03 MB
Description
A collection of scripted spoken phrases in Palula.
Specifics
Considerations
Forbidden Usage
It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.
Processes
Intended Use
This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.
Metadata
پالولا — Palula (phl)
This datasheet is for version 24.0 of the Mozilla Common Voice Scripted Speech dataset
for Palula (phl). The dataset contains 21158 clips representing 29.27 hours of recorded
speech (21.52 hours validated) from 20 speakers.
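The headline figures above imply a few derived averages that can help when planning ASR experiments. The following is a minimal sketch that uses only the numbers quoted in this datasheet; the function name `summarize` and the dictionary keys are illustrative choices, not part of any Common Voice tooling.

```python
# Figures copied from the datasheet text above.
TOTAL_CLIPS = 21158
TOTAL_HOURS = 29.27
VALIDATED_HOURS = 21.52
SPEAKERS = 20


def summarize(total_clips, total_hours, validated_hours, speakers):
    """Derive simple averages from the headline counts (hypothetical helper)."""
    return {
        "mean_clip_seconds": total_hours * 3600 / total_clips,
        "validated_share": validated_hours / total_hours,
        "mean_clips_per_speaker": total_clips / speakers,
        "mean_hours_per_speaker": total_hours / speakers,
    }


stats = summarize(TOTAL_CLIPS, TOTAL_HOURS, VALIDATED_HOURS, SPEAKERS)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

On these figures the mean clip is roughly 5 seconds long and a little under three quarters of the recorded hours are validated. The per-speaker values are plain averages and say nothing about how evenly clips are distributed across the 20 speakers.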
Language
Palula is an Indo-Aryan language of the Dardic group, closely related to Shina. It is spoken in several parts of Chitral District, including Ashret, Biori, Kalkatak, and parts of Shishi Koh. Beyond Chitral, it is also spoken in Gumandan (Dir Kohistan) and in Sao village in Kunar Province, Afghanistan. Estimates of the number of speakers range between 20,000 and 25,000.
Demographic information
The dataset includes the following distribution of age and gender.
Gender
Self-declared gender information. Percentages refer to the share of clips annotated with each gender.
| Gender | Percentage |
|---|---|
| Undefined | 100.0% |
Age
Self-declared age information. Percentages refer to the share of clips annotated with each age band.
| Age Band | Percentage |
|---|---|
| Undefined | 6.0% |
| Twenties | 48.0% |
| Thirties | 23.0% |
| Teens | 6.0% |
| Forties | 18.0% |
Text corpus
The Palula text corpus was compiled systematically through fieldwork conducted within the Palula-speaking community, encompassing a diverse range of speakers across age groups and social backgrounds. Data was gathered through structured and semi-structured interviews, oral storytelling sessions, and community-based linguistic elicitation. All recordings were carefully transcribed. The corpus comprises 4000 sentences.
Writing system
The Palula orthography is based on the Arabic writing system.
Symbol table
ا ب پ ت ث ٹ ج چ ڇ څ ح خ د ڈ ذ ر ز ژ ڙ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ں ݨ و ہ ھ ء ی ے
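A symbol table like this one can be used to screen transcripts for unexpected characters before training. The sketch below is one possible approach, not a Common Voice procedure: the helper `outside_symbol_table` is hypothetical, and the choice to ignore whitespace, combining marks (such as vowel diacritics), and punctuation is an assumption made here because the table lists base letters only.

```python
import unicodedata

# Base letters copied from the symbol table above.
SYMBOLS = set(
    "ا ب پ ت ث ٹ ج چ ڇ څ ح خ د ڈ ذ ر ز ژ ڙ س ش ݜ ص ض ط ظ ع غ ف ق "
    "ک گ ل م ن ں ݨ و ہ ھ ء ی ے".split()
)


def outside_symbol_table(sentence):
    """Return distinct characters not covered by SYMBOLS, in order of appearance.

    Assumption: whitespace, combining marks (Unicode category M*) and
    punctuation (category P*) are skipped rather than reported.
    """
    extras = []
    for ch in sentence:
        category = unicodedata.category(ch)
        if ch.isspace() or category.startswith(("M", "P")):
            continue
        if ch not in SYMBOLS and ch not in extras:
            extras.append(ch)
    return extras


# One of the sample sentences from this datasheet.
print(outside_symbol_table("تھی گھوݜٹہ کتی جانہ ہِنہ؟"))
# Latin letters are reported because they are not in the table.
print(outside_symbol_table("abc"))
```

Precomposed letters that are not listed in the table are reported by this check, so a non-empty result is a prompt for manual review rather than proof of a transcription error.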
Sample
There follows a randomly selected sample of five sentences from the corpus.
انی گھوݜٹہ کتی کمرے ہنہ؟
اندہ دُو کمرے آک بہ پراجمی دیرہ ہنوۡ۔
تھی انیۡ کُڈ خختمی دِتیۡ ہِنم کی بٹومی؟
انیوے بُٹھے خختم استعمال بھِلم ہِنم، بٹومی کڈ دتی ہِنیۡ۔
تھی گھوݜٹہ کتی جانہ ہِنہ؟
Automatic random samples
سعیدے ساجد بیۡ مقابلہ وے شامل بِھلہ۔
مِیشہ وسیع تھےۡ منِیتوۡ کی مہ کُھنہ یھئی بہ بُٹانہ سمئینی اِزدہ تھہ۔
اسحاقہ عسیٰ تھےۡ تاوان کے دِتیۡ؟
ہیوندہ بِیڈیۡ جڑے بِھلِم۔
خلکِیم تسی بات کاݨ تِھیلیۡ۔ سےۡ بیلچے تھسکُورہ گِھنیۡ گیہ تںِم یاب شرو تِھیلیۡ۔
Sources
Palula Matli (Proverbs) – Author: Naseem Haider
Palula Textbook – Author: Naseem Haider
Palula Gul-Dasta Ashaar (Poetry) – Author: Naseem Haider
Palula–Urdu–English Conversation Guide – Author: Naseem Haider
Palula Folk Tales – Author: Naseem Haider
Text domains
| Domain | Count |
|---|---|
| Undefined | 21120 |
| Automotive Transport | 4 |
| Healthcare | 10 |
| History Law Government | 24 |
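The domain counts can be turned into shares with a few lines of arithmetic. This is a minimal sketch using only the figures in the table above; the variable names are illustrative. The counts add up to 21158, the same as the clip total reported earlier, which suggests (as an inference, not something the datasheet states) that they count clips rather than the 4000 corpus sentences.

```python
# Counts copied from the text domains table above.
DOMAIN_COUNTS = {
    "Undefined": 21120,
    "Automotive Transport": 4,
    "Healthcare": 10,
    "History Law Government": 24,
}

total = sum(DOMAIN_COUNTS.values())
# Items that carry any domain label other than "Undefined".
labelled = total - DOMAIN_COUNTS["Undefined"]
shares = {domain: count / total for domain, count in DOMAIN_COUNTS.items()}

print(f"total: {total}, labelled: {labelled}")
for domain, share in shares.items():
    print(f"{domain}: {share:.4%}")
```

Only 38 items carry a domain label, so domain-specific evaluation on this release would rest on a very small sample.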
Processing
The process involved collecting texts, stories, and proverbs through community-based audio recordings. These recordings were transcribed, and selected sentences were curated for inclusion in the Common Voice dataset. The finalized dataset was then uploaded and subsequently recorded by multiple speakers.
Get involved!
Community links
Contribute
Acknowledgements
Datasheet authors
Naseem Haider naseemhaider78@gmail.com
Funding
This dataset was partially funded by the Open Multilingual Speech Fund managed by Mozilla Common Voice.
Licence
This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to attempt to determine the identity of speakers in the dataset.