Expanding language technologies means improving the capabilities and reach of natural language processing (NLP) systems such as machine translation, text summarization, and speech recognition. Approaches include enlarging training data sets, fine-tuning models on specialized tasks, and developing new architectures and techniques. The goal is for these systems to understand and generate human language more effectively, so they can be applied in a wider range of applications and industries.
Only a small percentage of the world’s 7,000 to 8,000 languages benefit from modern language technologies such as voice-to-text transcription, automatic captioning, instant translation, and voice recognition. Researchers at Carnegie Mellon University want to increase the number of languages supported by automatic speech recognition tools from around 200 to potentially 2,000.
“A lot of people around the world speak different languages, but language technology tools for all of them aren’t being developed,” said Xinjian Li, a Ph.D. student at the School of Computer Science’s Language Technologies Institute (LTI). “One of the goals of this research is to develop technology and a good language model for all people.”
Li is part of a research team aiming to simplify the data requirements for building a speech recognition model for a language. The team, which also includes LTI faculty members Shinji Watanabe, Florian Metze, David Mortensen, and Alan Black, presented their most recent work, “ASR2K: Speech Recognition for Around 2,000 Languages Without Audio,” at Interspeech 2022 in South Korea.
Most speech recognition models require two data sets: text and audio. Text data exists for thousands of languages. Audio data does not. The team hopes to eliminate the need for audio data by focusing on linguistic elements common across many languages.
Historically, speech recognition technologies have focused on a language’s phonemes. These distinct sounds that distinguish one word from another, like the “d” that differentiates “dog” from “log” and “cog,” are unique to each language. But languages also have phones, which describe how a word physically sounds. Multiple phones might correspond to a single phoneme. So even though separate languages may have different phonemes, their underlying phones could be the same.
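The phone and phoneme distinction can be made concrete with a small sketch. The Python below is purely illustrative and is not taken from the ASR2K work: the table `ALLOPHONES` and the helper names are hypothetical, and the two inventories are heavily simplified toy data. It shows one phone, the flap [ɾ], belonging to different phonemes in two languages while still being shared at the phone level.

```python
# Toy, heavily simplified allophone tables (an assumption for illustration):
# each phoneme maps to the set of phones that can realize it.
ALLOPHONES = {
    "english": {"/t/": {"t", "tʰ", "ɾ"}, "/d/": {"d", "ɾ"}},
    "spanish": {"/t/": {"t"}, "/d/": {"d", "ð"}, "/ɾ/": {"ɾ"}},
}


def phone_inventory(language):
    """Collect every phone that appears under any phoneme of the language."""
    phones = set()
    for realizations in ALLOPHONES[language].values():
        phones |= realizations
    return phones


def phonemes_for_phone(language, phone):
    """List the phonemes of a language that a given phone can realize."""
    table = ALLOPHONES[language]
    return sorted(p for p, phones in table.items() if phone in phones)


# Phones that both toy inventories have in common.
shared = phone_inventory("english") & phone_inventory("spanish")

print(sorted(shared))
print(phonemes_for_phone("english", "ɾ"))
print(phonemes_for_phone("spanish", "ɾ"))
```

In this toy data the flap is a variant of /t/ and /d/ in English but a phoneme of its own in Spanish, which is the kind of overlap a phone-based recognizer can reuse across languages.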
The LTI team is developing a speech recognition model that moves away from phonemes and instead relies on information about how phones are shared between languages, thereby reducing the effort to build separate models for each language. Specifically, the team pairs the model with a phylogenetic tree, a diagram that maps the relationships between languages, to help with pronunciation rules. Through their model and the tree structure, the team can approximate the speech model for thousands of languages without audio data.
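As a rough illustration of how a tree of language relationships might help, the sketch below ranks languages that already have data by their distance in the tree from a language that has none. This is a minimal sketch under stated assumptions, not the team's method: the `PARENT` table is a tiny hand-written fragment of the Indo-European family, and `tree_distance` and `nearest_known` are hypothetical helper names.

```python
# Tiny hand-written fragment of a language family tree (child -> parent).
# An assumption for illustration only, not the tree used in the research.
PARENT = {
    "germanic": "indo-european",
    "romance": "indo-european",
    "west-germanic": "germanic",
    "north-germanic": "germanic",
    "english": "west-germanic",
    "dutch": "west-germanic",
    "frisian": "west-germanic",
    "swedish": "north-germanic",
    "spanish": "romance",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root of the tree."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def tree_distance(a, b):
    """Count the edges between two nodes via their closest common ancestor."""
    depth_in_b = {n: i for i, n in enumerate(path_to_root(b))}
    for steps_from_a, n in enumerate(path_to_root(a)):
        if n in depth_in_b:
            return steps_from_a + depth_in_b[n]
    raise ValueError("no common ancestor")


def nearest_known(target, known, k=1):
    """Pick the k languages with existing data that sit closest to `target`."""
    ranked = sorted(known, key=lambda lang: (tree_distance(target, lang), lang))
    return ranked[:k]


# Suppose Frisian has no audio data, while these three languages do.
print(nearest_known("frisian", ["spanish", "swedish", "dutch"], k=2))
```

A language's closest relatives in the tree would then be natural candidates from which to borrow phone and pronunciation information, which is the intuition behind approximating a model for a language that has no audio data.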
“We are trying to remove this audio data requirement, which helps us move from 100 or 200 languages to 2,000,” Li said. “This is the first research to target such a large number of languages, and we’re the first team aiming to expand language tools to this scope.”
Still in an early stage, the research has improved existing language approximation tools by a modest 5%, but the team hopes it will serve as an inspiration not only for their future work but also for that of other researchers. For Li, the work means more than making language technologies available to all. It’s about cultural preservation.
“Each language is a very important factor in its culture. Each language has its own story, and if you don’t try to preserve languages, those stories might be lost,” Li said. “Developing this kind of speech recognition system and this tool is a step to try to preserve those languages.”