Common Voice Scripted Speech 24.0 - Palula

Specifics

Licensing

CC0 1.0 Universal

https://creativecommons.org/publicdomain/zero/1.0/legalcode

Considerations

Forbidden Usage

It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.

Processes

Intended Use

This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.

Metadata

پالولا — Palula (`phl`)

This datasheet is for version 24.0 of the Mozilla Common Voice Scripted Speech dataset for Palula (phl). The dataset contains 21158 clips representing 29.27 hours of recorded speech (21.52 hours validated) from 20 speakers.
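As a quick sanity check on these figures, the sketch below derives the mean clip length and the validated share of the recorded hours. It uses only the numbers stated above; the variable names are illustrative.

```python
# Figures quoted in this datasheet for Palula (phl), version 24.0.
total_clips = 21158
total_hours = 29.27
validated_hours = 21.52

# Mean clip duration in seconds across all recorded clips.
mean_clip_seconds = total_hours * 3600 / total_clips

# Fraction of the recorded hours that has been validated.
validated_share = validated_hours / total_hours

print(f"mean clip length: {mean_clip_seconds:.2f} s")
print(f"validated share:  {validated_share:.1%}")
```

This works out to roughly 5 seconds per clip, with a little under three quarters of the recorded audio validated.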

Language

Palula is an Indo-Aryan language, specifically a branch of the Dardic group, closely related to Shina. Palula is spoken in several regions of Chitral District, including Ashret, Biori, Kalkatak, and parts of Shishi Koh. Beyond Chitral, it is also spoken in Gumandan (Dir Kohistan) and Sao village in Kunar Province, Afghanistan. The estimated number of speakers ranges between 20,000 and 25,000.

Demographic information

The dataset includes the following distribution of age and gender.

Gender

Self-declared gender information. Percentages refer to the share of clips annotated with each gender.

Gender	Percentage
Undefined	100.0%

Age

Self-declared age information. Percentages refer to the share of clips annotated with each age band.

Age Band	Percentage
Undefined	6.0%
Twenties	48.0%
Thirties	23.0%
Teens	6.0%
Forties	18.0%
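
The percentages in the gender and age tables are per-clip shares. A minimal sketch of how such a table could be recomputed from a clip-level metadata file follows. It assumes a tab-separated file with `age` and `gender` columns in which an empty field means the speaker declared nothing; the column names and the inline sample rows are illustrative assumptions, not taken from this datasheet.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for a clip-level TSV file.
SAMPLE_TSV = (
    "path\tage\tgender\n"
    "clip_0001.mp3\ttwenties\t\n"
    "clip_0002.mp3\ttwenties\t\n"
    "clip_0003.mp3\tthirties\t\n"
    "clip_0004.mp3\t\t\n"
)


def clip_share(tsv_text: str, column: str) -> dict[str, float]:
    """Return the percentage of clips per value of `column`.

    Empty fields are reported under the label "undefined".
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter((row[column] or "undefined") for row in reader)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


age_share = clip_share(SAMPLE_TSV, "age")
gender_share = clip_share(SAMPLE_TSV, "gender")
print(age_share)
print(gender_share)
```

Because each band is rounded separately, published shares need not total exactly 100%, which may explain why the age bands above add up to 101%.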

Text corpus

The Palula text corpus has been systematically compiled through fieldwork conducted within the Palula-speaking community, encompassing a diverse range of speakers across age groups and social backgrounds. Data was gathered through structured and semi-structured interviews, oral storytelling sessions, and community-based linguistic elicitation. All recordings were carefully transcribed. The corpus comprises 4000 sentences.

Writing system

The Palula orthography is based on the Arabic writing system.

Symbol table

ا ب پ ت ث ٹ ج چ ڇ څ ح خ د ڈ ذ ر ز ژ ڙ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ں ݨ و ہ ھ ء ی ے
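
One practical use of the symbol table is to flag characters in a transcript that fall outside it, for example before building an ASR vocabulary. The sketch below is one way to do that, assuming the table above is the complete letter inventory; the function names are illustrative.

```python
import unicodedata

# Symbol table copied from this datasheet (space-separated letters).
SYMBOLS = "ا ب پ ت ث ٹ ج چ ڇ څ ح خ د ڈ ذ ر ز ژ ڙ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ں ݨ و ہ ھ ء ی ے"
ALPHABET = set(SYMBOLS.replace(" ", ""))


def outside_symbol_table(text: str) -> list[str]:
    """Return the distinct non-whitespace characters of `text` that are
    not in the symbol table, in order of first appearance."""
    seen: list[str] = []
    for ch in text:
        if ch.isspace() or ch in ALPHABET or ch in seen:
            continue
        seen.append(ch)
    return seen


def describe(chars: list[str]) -> list[str]:
    """Label each character with its code point and Unicode name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, 'UNKNOWN')}" for c in chars]


# Example with an ASCII character that is clearly outside the table.
print(describe(outside_symbol_table(SYMBOLS[0] + "?")))
```

The table lists base letters only. The sample sentences in this datasheet also appear to use vowel diacritics and punctuation such as ؟ and ۔, so a check like this would report those unless they are added to the allowed set.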

Sample

There follows a randomly selected sample of five sentences from the corpus.

انی گھوݜٹہ کتی کمرے ہنہ؟
اندہ دُو کمرے آک بہ پراجمی دیرہ ہنوۡ۔
تھی انیۡ کُڈ خختمی دِتیۡ ہِنم کی بٹومی؟
انیوے بُٹھے خختم استعمال بھِلم ہِنم، بٹومی کڈ دتی ہِنیۡ۔
تھی گھوݜٹہ کتی جانہ ہِنہ؟

Automatic random samples

سعیدے ساجد بیۡ مقابلہ وے شامل بِھلہ۔
مِیشہ وسیع تھےۡ منِیتوۡ کی مہ کُھنہ یھئی بہ بُٹانہ سمئینی اِزدہ تھہ۔
اسحاقہ عسیٰ تھےۡ تاوان کے دِتیۡ؟
ہیوندہ بِیڈیۡ جڑے بِھلِم۔
خلکِیم تسی بات کاݨ تِھیلیۡ۔ سےۡ بیلچے تھسکُورہ گِھنیۡ گیہ تںِم یاب شرو تِھیلیۡ۔

Sources

Palula Matli (Proverbs) – Author: Naseem Haider
Palula Textbook – Author: Naseem Haider
Palula Gul-Dasta Ashaar (Poetry) – Author: Naseem Haider
Palula–Urdu–English Conversation Guide – Author: Naseem Haider
Palula Folk Tales – Author: Naseem Haider

Text domains

Domain	Count
Undefined	21120
Automotive Transport	4
Healthcare	10
History Law Government	24
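
The domain counts can be cross-checked against the clip total given in the metadata. The short sketch below sums the table and computes the share of entries carrying a domain label; it uses only figures from this datasheet, and the variable names are illustrative.

```python
# Domain counts as listed in the table above.
domain_counts = {
    "Undefined": 21120,
    "Automotive Transport": 4,
    "Healthcare": 10,
    "History Law Government": 24,
}

total = sum(domain_counts.values())
labelled = total - domain_counts["Undefined"]
labelled_share = labelled / total

print(f"total entries:  {total}")
print(f"with a domain:  {labelled} ({labelled_share:.2%})")
```

The total matches the 21158 clips reported for the dataset, which suggests the counts are per clip rather than per sentence. Under that reading, fewer than 0.2% of clips carry a domain label.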

Processing

The process involved collecting texts, stories, and proverbs through community-based audio recordings. These recordings were transcribed, and selected sentences were curated for inclusion in the Common Voice dataset. The finalized dataset was then uploaded and subsequently recorded by multiple speakers.

Get involved!

Community links

Contribute

Acknowledgements

Datasheet authors

Naseem Haider naseemhaider78@gmail.com

Funding

This dataset was partially funded by the Open Multilingual Speech Fund managed by Mozilla Common Voice.

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data, you agree not to determine the identity of speakers in the dataset.