


How this grassroots effort could make AI voices more diverse

Ryakitimbo collected Kiswahili voice data in Tanzania, Kenya and the Democratic Republic of Congo. She tells me that she wanted to collect the voices of a socioeconomically diverse set of Kiswahili speakers, and that she reached out to women, young and old, living in rural areas, who were not always literate and did not always have access to devices.

This type of data collection is challenging. The importance of collecting AI voice data may seem abstract to many people, especially if they are not familiar with the technologies. Ryakitimbo and the volunteers approached women in settings where they already felt safe, such as presentations on menstrual hygiene, and explained to them how the technology could, for example, help disseminate information about menstruation. For women who could not read, the team read sentences aloud, and the women repeated them for the recording.

The Common Voice project is based on the belief that languages are a very important part of identity. “We believe it’s not just about language, but also about transmitting culture and heritage and cherishing people’s particular cultural background,” says Lewis-Jong. “There are all kinds of idioms and cultural catchphrases that just don’t translate,” she adds.

Common Voice is the only audio dataset in which English does not dominate, says Willie Agnew, a researcher at Carnegie Mellon University who has studied audio datasets. “I’m very impressed with the quality of their work and how they created this data set that’s actually quite diverse,” Agnew says. “It feels like they’re way ahead of almost every other project we’ve looked at.”

I spent some time checking recordings of other Finnish speakers on the Common Voice platform. As their voices echoed through my office, I felt surprisingly touched. We were all united around the same cause: making AI data more inclusive and ensuring our culture and language are properly represented in the next generation of AI tools.

But I had big questions about what would happen to my voice if I donated it. Once it was in the dataset, I would have no control over how it might be used afterwards. The tech industry isn’t exactly known for giving people proper credit, and the data is accessible to everyone.

“While we want this to benefit local communities, there is a possibility that big tech will also use the same data and create something that then becomes a commercial product,” says Ryakitimbo. Although Mozilla doesn’t say who downloaded Common Voice, Lewis-Jong tells me that Meta and Nvidia have reported using it.

Open access to this rare and hard-won linguistic data is not something all minority groups want, says Harry H. Jiang, a researcher at Carnegie Mellon University who was part of the audit research team. For example, Indigenous groups have raised concerns.