Items tagged with: metadata
If you're a #language nerd like I am, then you won't have missed the @mozilla #CommonVoice v19 #speech #dataset release - which now features 131 languages! Here's my #dataviz, done in @observablehq of the v19 #metadata coverage.
I've updated the visualisation this time around with human-readable language names instead of their ISO-639 or BCP-47 language codes to make it easier to read.
There are some interesting observations:
▶ Catalan (ca) continues to be the leader in terms of data - speaking volumes about the efforts to revitalise culture and language in Catalunya. It's also one of the few languages that has data for all age groups, particularly older speakers - this sort of data is missing for most other languages.
▶ Kiswahili (sw) is one of the languages where there is more data for female-identifying speakers than for male-identifying speakers ♀ - although Japanese (ja), Western Mari (mrj) and Luganda (lg) do pretty well here, too!
▶ Sentence domains can now be categorised, and although most new sentences are "general", Albanian (sq) has a lot of sentences related to law and government.
▶ Tsonga (ts), a Bantu language spoken in Southern Africa, has dethroned Icelandic (is) as the language with the highest average utterance duration. I don't know enough about Tsonga to speculate why - it's a somewhat agglutinative language, but many Tsonga words are generally short.
▶ Bengali / Bangla (bn) has a significant amount of data that is not yet validated, and therefore does not appear in training / dev / test splits. There is a similar case for many languages new to Common Voice - it takes time to validate.
▶ The language with the highest number of average contributions per speaker is Taita (dav), a Bantu language from Kenya.
What do you make of the data visualisation? Are there any other insights you can see?
Big thanks to the CV team for all their efforts - EM, Jessica Rose, Dmitrij Feller and Justin Grant.
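As a rough illustration of the "average contributions per speaker" figure in the observations above, here is a minimal Python sketch. It assumes the figure is simply the number of clips divided by the number of distinct speakers, and that each row carries a `client_id` field identifying the speaker, as in the TSV files shipped with Common Voice releases. The sample rows are made up for illustration, and the post does not say how the visualisation actually computes the number.

```python
from collections import Counter


def clips_per_speaker(rows):
    """Return the mean number of clips per distinct speaker.

    `rows` is any iterable of dicts with a "client_id" key
    (assumed here to identify the speaker).
    """
    counts = Counter(row["client_id"] for row in rows)
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


# Hypothetical sample rows, not real Common Voice data.
sample = [
    {"client_id": "a", "path": "1.mp3"},
    {"client_id": "a", "path": "2.mp3"},
    {"client_id": "b", "path": "3.mp3"},
    {"client_id": "a", "path": "4.mp3"},
]

print(clips_per_speaker(sample))  # 4 clips / 2 speakers = 2.0
```

For a real release file, the rows could come from `csv.DictReader(f, delimiter="\t")`.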
Actually I thought that friendica strips all metadata by default when pictures are uploaded, something that #Diaspora* (apparently) does. I guess that's where my assumption came from. Has there been any discussion and/or decision on that?
Is there some addon that provides this?
I checked this because I'm diving into #exiftool and it's nice to be able to write copyright (CC-SA-NC), or things like "Artist", "Description" or "Comment" to pictures and videos.
What I did find in the settings was the opt-in option to publicly display the location metadata of pictures, yet what would really be neat is to be able to differentiate these things. In other words, to strip the location data but retain other data, such as specifically added data.
As an example, two images I uploaded with similar metadata, one on diaspora:
pod.geraspora.de/uploads/image…
(this was actually a .png so diaspora changed the container)
and the friendica upload:
tupambae.org/photo/71953797316…
To see in Linux (Debian) what metadata shows up, install exiftool:
apt-get install exiftool
To display the metadata:
exiftool -v filename.png
The -v in the command is optional and means "verbose", that is, it displays more data than a simple:
exiftool filename.png
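On the wish above to strip the location data but keep the other tags: exiftool itself can do this locally before uploading, by clearing only the GPS group with `-gps:all=`. The Python sketch below just builds and runs that command. The function names are hypothetical, and it says nothing about what friendica or diaspora do on the server side. Location stored in other metadata groups (for example XMP) may not be covered by this one flag.

```python
import shutil
import subprocess


def build_strip_gps_command(path, keep_backup=True):
    """Build an exiftool command that deletes GPS tags only.

    Other tags (Artist, Copyright, Comment, ...) are left untouched.
    By default exiftool keeps a backup named "<file>_original";
    -overwrite_original suppresses that backup.
    """
    cmd = ["exiftool", "-gps:all="]
    if not keep_backup:
        cmd.append("-overwrite_original")
    cmd.append(path)
    return cmd


def strip_gps(path, keep_backup=True):
    """Run the command above; raises if exiftool is missing or fails."""
    if shutil.which("exiftool") is None:
        raise RuntimeError("exiftool is not installed")
    subprocess.run(build_strip_gps_command(path, keep_backup), check=True)
```

Running `exiftool filename.png` afterwards shows whether the GPS tags are gone while the rest remains.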
Minimal Social Markup · Jens Oliver Meiert
Every website and app these days relies on so-called “social markup,” metadata for a richer and prettier display in social media and messaging tools. On the absolute minimum you may need.
meiert.com