Items tagged with: speech
If you're a #language nerd like I am, then you won't have missed the @mozilla #CommonVoice v19 #speech #dataset release - which now features 131 languages! Here's my #dataviz, done in @observablehq of the v19 #metadata coverage.
I've updated the visualisation this time around with human-readable language names instead of their ISO-639 or BCP-47 language codes to make it easier to read.
There are some interesting observations:
▶ Catalan (ca) continues to be the leader in terms of data - speaking volumes about the efforts to revitalise culture and language in Catalunya. It's also one of the few languages that has data for all age groups, particularly older speakers - this sort of data is missing for most other languages.
▶ Kiswahili (sw) is one of the languages where there is more data for female-identifying speakers than for male-identifying speakers ♀ - although Japanese (ja), Western Mari (mrj) and Luganda (lg) do pretty well here, too!
▶ Sentence domains can now be categorised, and although most new sentences are "general", Albanian (sq) has a lot of sentences related to law and government.
▶ Tsonga (ts), a Bantu language spoken in Southern Africa, has dethroned Icelandic (is) as the language with the highest average utterance duration. I don't know enough about Tsonga to speculate why - it's a somewhat agglutinative language, but many Tsonga words are generally short.
▶ Bengali / Bangla (bn) has a significant amount of data that is not yet validated, and therefore does not appear in training / dev / test splits. There is a similar case for many languages new to Common Voice - it takes time to validate.
▶ The language with the highest average number of contributions per speaker is Taita (dav), a Bantu language from Kenya.
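If you'd like to see how per-language figures like these two averages can be derived, here's a minimal sketch in Python. It is not the code behind the visualisation: the `LanguageStats` fields, the placeholder language codes `xx` and `yy`, and all the numbers are made up for illustration, and the real Common Voice metadata may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class LanguageStats:
    """Hypothetical per-language summary; field names are assumptions."""
    code: str             # language code (placeholder values below)
    clips: int            # number of recorded clips
    speakers: int         # number of distinct contributors
    total_seconds: float  # total recorded duration in seconds


def clips_per_speaker(stats: LanguageStats) -> float:
    """Average number of contributions (clips) per speaker."""
    return stats.clips / stats.speakers if stats.speakers else 0.0


def average_utterance_seconds(stats: LanguageStats) -> float:
    """Average duration of a single clip, in seconds."""
    return stats.total_seconds / stats.clips if stats.clips else 0.0


# Invented sample data, NOT real Common Voice figures.
SAMPLE = [
    LanguageStats(code="xx", clips=1000, speakers=10, total_seconds=5000.0),
    LanguageStats(code="yy", clips=300, speakers=100, total_seconds=1200.0),
]

# Language with the most contributions per speaker in the sample.
most_active = max(SAMPLE, key=clips_per_speaker)

# Language with the longest average utterance in the sample.
longest = max(SAMPLE, key=average_utterance_seconds)

print(most_active.code, clips_per_speaker(most_active))
print(longest.code, average_utterance_seconds(longest))
```

Both helpers return 0.0 for a language with no speakers or no clips, which keeps brand-new languages from causing a division error when ranking.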
What do you make of the data visualisation? Are there any other insights you can see?
Big thanks to the CV team for all their efforts - EM, Jessica Rose, Dmitrij Feller and Justin Grant.
I gave a talk about #spiel and #speech #tts in #linux at #GUADEC #GUADEC2024 in Denver.
Check it out:
youtu.be/xseIsaxrlXo?feature=s…
GUADEC 2024 The Whole Spiel - A New Speech Synthesis API
Eitan Isaacson presents this GUADEC 2024 Day 3, Track 1 talk. Screen reader users have relied on speech synthesis for a long time. In recent years, speech int... (YouTube)
The Internet is for Everyone - Internet Society
Given by Vint Cerf at Computers, Freedom, and Privacy on April 7, 1999. How easy to say – how hard to achieve! Where are we in achieving this noble objective? The Internet is in its 11th year of annual doubling since 1988. (Internet Society)