Common Voice Scripted Speech 24.0 - Cornish
License:
CC0-1.0
Steward:
Common Voice
Task: ASR
Release Date: 12/5/2025
Format: MP3
Size: 260.69 MB
Description
A collection of scripted spoken phrases in Cornish.
Specifics
Considerations
Forbidden Usage
It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.
Processes
Intended Use
This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.
Metadata
Kernowek — Cornish (kw)
This datasheet is for version 24.0 of the Mozilla Common Voice Scripted Speech dataset
for Cornish (kw). The dataset contains 11268 clips representing 12.95 hours of recorded
speech (12.42 hours validated) from 10 speakers.
Language
Cornish, or Kernewek, is a Brythonic language, alongside Breton and Welsh, and part of the Celtic branch of the Indo-European language family. It is an indigenous language of the United Kingdom, with most speakers located in Cornwall. In the 2021 UK Census, 567 people reported Cornish as their main language. UNESCO has classified its status as "severely endangered".
Demographic information
The dataset includes the following distribution of age and gender.
Gender
Self-declared gender information. Each percentage is the share of clips annotated with that gender.
| Gender | Percentage |
|---|---|
| Undefined | 66.0% |
| Female/Feminine | 34.0% |
Age
Self-declared age information. Each percentage is the share of clips annotated with that age band.
| Age Band | Percentage |
|---|---|
| Undefined | 12.0% |
| Forties | 34.0% |
| Fifties | 47.0% |
| Sixties | 2.0% |
| Seventies | 5.0% |
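The per-clip percentages in the gender and age tables can be recomputed from clip metadata. The following is a minimal sketch, assuming each clip is available as a dictionary with `gender` and `age` keys and that an empty value means the speaker declared nothing. The function name `label_percentages`, the key names, and the example rows are illustrative assumptions and are not taken from this dataset.

```python
from collections import Counter


def label_percentages(rows, column):
    """Return {label: percentage of clips} for one metadata column.

    Empty or missing values are counted as "Undefined" (an assumption
    about how undeclared demographics are stored).
    """
    counts = Counter((row.get(column) or "Undefined") for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Hypothetical rows standing in for per-clip metadata records.
example_rows = [
    {"gender": "female_feminine", "age": "fifties"},
    {"gender": "", "age": "fifties"},
    {"gender": "", "age": "forties"},
    {"gender": "female_feminine", "age": ""},
]

gender_share = label_percentages(example_rows, "gender")
age_share = label_percentages(example_rows, "age")
```

Because the percentages count clips rather than speakers, a single prolific contributor can dominate a band, which is worth keeping in mind with only 10 speakers.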
Text corpus
The dataset contains 10.8 validated hours of speech from 10 unique contributors.
| Type | Count | Hours |
|---|---|---|
| Validated Clips | 9,357 | 10.8 |
| Invalidated Clips | 0 | 0.00 |
| Total Clips | 9,357 | 10.8 |
Average sentence length (tokens): 6.4
Average sentence length (characters): 31
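Average sentence lengths of this kind can be approximated as follows. This is a sketch only: it assumes tokens are whitespace-separated words and that characters include spaces and punctuation, which may differ from how the figures above were produced. The helper name `average_lengths` is hypothetical.

```python
def average_lengths(sentences):
    """Return (mean tokens per sentence, mean characters per sentence).

    Tokens are split on whitespace; characters include spaces and
    punctuation. Both choices are assumptions.
    """
    if not sentences:
        raise ValueError("at least one sentence is required")
    token_counts = [len(s.split()) for s in sentences]
    char_counts = [len(s) for s in sentences]
    n = len(sentences)
    return sum(token_counts) / n, sum(char_counts) / n


# Two sentences taken from the sample section of this datasheet.
sample = ["A yllyn ni redya hemma?", "Marthys ens i."]
avg_tokens, avg_chars = average_lengths(sample)
```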
Writing system
Cornish has several writing systems in place. The majority of this dataset uses the Standard Written Form, established in 2008.
Symbol table
The dataset uses the following characters:
' - ! , . ? a b c d e f g h i j k l m n o p r s t u v w x y z
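A symbol table like this one can be used to flag sentences that contain unexpected characters. The sketch below assumes that the space character is also allowed and that the comparison is case-insensitive, since the table lists lowercase letters only while the sentences use capitals. The names `ALLOWED` and `unexpected_characters` are illustrative.

```python
# Characters from the symbol table, plus the space (an assumption).
ALLOWED = set("'-!,.?abcdefghijklmnoprstuvwxyz ")


def unexpected_characters(sentence):
    """Return the sorted characters of `sentence` outside the symbol table."""
    return sorted({ch for ch in sentence.lower() if ch not in ALLOWED})


clean = unexpected_characters("Esos. Yth esos ta ena y'n kornel.")
flagged = unexpected_characters("Esos. Yth esos ta ena y\u2019n kornel.")
```

Here `clean` is an empty list, while `flagged` contains the typographic apostrophe U+2019, which is not in the table.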
Sample
There follows a randomly selected sample of five sentences from the corpus.
A yllyn ni redya hemma?
Marthys ens i.
Esos. Yth esos ta ena y'n kornel.
A wrussyn ni diwrosa yn uskis?
Dha leveryans yw nebes da.
Automatic random samples
Yma agan lyvrow genen.
Yma agan lyvrow genowgh.
Gwrussys. Ty a wrug kana fest yn teg.
Yth esons i ryb an wariva.
Gwell via genev a pe yeynna.
Sources
The text for this dataset comes from the following sources:
IndyLan Cornish course. Author: Cornish Language Office. Standard Written Form.
Individual sentences submitted by users through the Mozilla Common Voice interface (public domain)
Text domains
| Domain | Count |
|---|---|
| General | 12725 |
Recommended post-processing
Check the Cornish text for Unicode errors. Apostrophes should be the character '.
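One way to apply this check is to map apostrophe look-alikes to the plain ASCII apostrophe before training or evaluation. The sketch below assumes the recommendation refers to apostrophe variants. The list of variants is a guess at what might occur, not a list confirmed for this dataset, and the function name `normalise_apostrophes` is hypothetical.

```python
import unicodedata

# Look-alike characters mapped to the ASCII apostrophe (U+0027).
# This list is an assumption; extend it after inspecting the data.
APOSTROPHE_VARIANTS = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
    "\u0060": "'",  # grave accent
    "\u00b4": "'",  # acute accent
}
_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def normalise_apostrophes(text):
    """Apply NFC normalisation, then replace apostrophe variants."""
    return unicodedata.normalize("NFC", text).translate(_TABLE)


fixed = normalise_apostrophes("Yth esos ta ena y\u2019n kornel.")
```

NFC is used rather than NFKC so that only the listed characters are rewritten and no other compatibility mappings are applied.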
Get involved!
Community links
Contribute
Acknowledgements
Datasheet authors
Sam Rogerson cornishlanguage@cornwall.gov.uk
Funding
This dataset was partially funded by the Open Multilingual Speech Fund.
Licence
This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.