Shedding Light on the Diversity and Evolution of Languages

Scholars from the Max Planck Institute for Evolutionary Anthropology in Germany and the University of Auckland in New Zealand have created a new global repository of linguistic data. The project’s goal is to provide new insights into the evolution of words and sounds in the languages spoken around the world today. The Lexibank database contains standardized lexical data for over 2000 languages. It is the most extensive publicly available collection compiled so far.

Is it true that many languages around the world have words for “mother” and “father” that sound similar to “mama” and “papa”? If a language has a single word for “arm” and “hand,” does it also have a single word for “leg” and “foot”? How do languages manage to express so many ideas with such a small number of words? An interdisciplinary team of linguists, computational scientists, and psychologists has created a large public database that can be used to investigate these and other questions using computational methods.

“When our Department of Linguistic and Cultural Evolution was founded in 2014, I presented my colleagues with an ambitious goal: there are more than 7000 languages in the world. Create databases that document as much of this linguistic diversity as possible,” says Max Planck Director Russell Gray.

“Our inspiration came from GenBank,” Gray continues. “GenBank is a large genetic database where biologists from all over the world have deposited genomic data. GenBank changed the game. The abundance of freely available sequence data has transformed how we can analyze biological diversity. We hope that Lexibank, the first of our global linguistic databases, will help to revolutionize our understanding of linguistic diversity in a similar way.”

New standards and new software

The Lexibank repository provides data in the form of standardized wordlists for more than 2000 language varieties. “The work on Lexibank coincided with a push towards more consistent data formats in linguistic databases. Thus Lexibank can serve both as a large-scale example of the benefits of standardization and a catalyst for further standardization,” reports Robert Forkel, who led the computational part of the data collection. “We decided to create our own standards, called Cross-Linguistic Data Formats, which have now been used successfully in a multitude of projects in which our department is involved.”

The team’s new standards are accompanied by new software tools that greatly simplify linguists’ workflows. “We designed new computer-assisted workflows that allow existing language datasets to be made comparable,” says Johann-Mattis List, who oversaw the practical aspects of data curation. “We have dramatically increased the efficiency of data standardization and data curation with these workflows.”

Identifying patterns of language evolution

In addition to collecting and sharing the standardized language data, the authors also designed new computational techniques to answer questions about the evolution of linguistic diversity. They illustrate how these methods can be used by computing how languages differ or agree with respect to sixty different features.

“Thanks to our standardized representation of language data, it is now simple to see how many languages use words like ‘mama’ and ‘papa’ for ‘mother’ and ‘father,’” says List. “It turns out that this pattern can be found in many languages around the world and in very different regions,” says Simon J. Greenhill, one of the Lexibank project’s founders. “Because the languages with this pattern are not all closely related to one another, it reflects independent parallel evolution, as the great linguist Roman Jakobson proposed in 1968.”
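
To make the idea concrete, the following is a minimal Python sketch of how such a check could be run once kinship terms sit in a uniform table. It is an illustration only, not the Lexibank team's actual feature definition: the rule used here (the word for "mother" begins with a nasal, the word for "father" begins with a stop), the helper names, and the tiny three-language sample of orthographic forms are all assumptions made for this example.

```python
# Small illustrative sample of orthographic kinship terms.
# This is NOT Lexibank data; it only stands in for a standardized wordlist.
KIN_TERMS = {
    "Italian": {"MOTHER": "mamma", "FATHER": "papà"},
    "Swahili": {"MOTHER": "mama", "FATHER": "baba"},
    "Georgian": {"MOTHER": "deda", "FATHER": "mama"},
}

# Assumed, simplified sound classes based on the first letter of each word.
NASALS = ("m", "n")
STOPS = ("p", "b", "t", "d")


def has_mama_papa_pattern(terms):
    """Return True if 'mother' starts with a nasal and 'father' with a stop."""
    return terms["MOTHER"].startswith(NASALS) and terms["FATHER"].startswith(STOPS)


def languages_with_pattern(kin_terms):
    """List, in alphabetical order, the languages that show the pattern."""
    return sorted(
        language
        for language, terms in kin_terms.items()
        if has_mama_papa_pattern(terms)
    )


if __name__ == "__main__":
    print(languages_with_pattern(KIN_TERMS))
```

In this sample, Italian and Swahili match the rule while Georgian does not, since its word for "father" is the one that begins with a nasal. A real analysis would work on phonetic transcriptions rather than spelling.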

Expanding the data and developing new methods

The new data collection and automatically computed language features will help to shed light on unanswered questions about linguistic diversity and language evolution. “Nobody believes that the analysis must end with the examples we provide in our paper,” List says. “On the contrary, we hope that linguists, psychologists, and evolutionary scientists will be inspired to follow in our footsteps by expanding the data and developing new methods,” Forkel adds.

Even in their current study, the authors present findings that call for further research. “When we looked into which languages use the same word for ‘arm’ and ‘hand,’ we discovered that these languages also use the same word for ‘leg’ and ‘foot,’” List writes. “While this may appear to be a coincidental finding, it demonstrates that the lexicon of human languages is frequently much more structured than one might expect when studying one language in isolation.”
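
The arm/hand and leg/foot comparison can be sketched in a few lines of Python. The example below assumes a wordlist reduced to (language, concept, form) rows and cross-tabulates the two patterns. The language names, the forms, and the function names are invented placeholders for illustration; they do not come from Lexibank and do not reproduce the authors' own software.

```python
from collections import defaultdict

# Toy wordlist rows: (language, concept, form).
# All names and forms are invented placeholders, not real language data.
ROWS = [
    ("LangA", "ARM", "kana"), ("LangA", "HAND", "kana"),
    ("LangA", "LEG", "tipu"), ("LangA", "FOOT", "tipu"),
    ("LangB", "ARM", "solo"), ("LangB", "HAND", "mika"),
    ("LangB", "LEG", "ruta"), ("LangB", "FOOT", "peli"),
    ("LangC", "ARM", "ono"), ("LangC", "HAND", "ono"),
    ("LangC", "LEG", "bara"), ("LangC", "FOOT", "weto"),
]


def shares_word(rows, concept_a, concept_b):
    """Map each language to True if one form covers both concepts."""
    forms = defaultdict(lambda: defaultdict(set))
    for language, concept, form in rows:
        forms[language][concept].add(form)
    result = {}
    for language, by_concept in forms.items():
        # Languages that lack a word for either concept are left out.
        if concept_a in by_concept and concept_b in by_concept:
            result[language] = bool(by_concept[concept_a] & by_concept[concept_b])
    return result


def cross_tabulate(rows):
    """Count languages for each (arm/hand, leg/foot) combination."""
    arm_hand = shares_word(rows, "ARM", "HAND")
    leg_foot = shares_word(rows, "LEG", "FOOT")
    table = defaultdict(int)
    for language in arm_hand.keys() & leg_foot.keys():
        table[(arm_hand[language], leg_foot[language])] += 1
    return dict(table)


if __name__ == "__main__":
    for combination, count in sorted(cross_tabulate(ROWS).items()):
        print(combination, count)
```

With real data, a strong concentration of languages in the (True, True) and (False, False) cells would be the kind of structure the quoted finding describes. The toy rows here are too few to show anything beyond how the counting works.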