Since the 1700s, taxonomy has been the study of how living entities relate to one another as species. Taxonomists treat each species as a group of organisms that have common biological traits, despite the fact that scientists and philosophers have long argued about what constitutes a species. Because biology researchers and conservationists utilize species as a unit of analysis, discovering and describing new species is critical. Species are also economically significant to agriculture, hunting, and fishing, and they have unique legal protections, such as the Endangered Species Act in the United States.
Despite this, based on historical discovery trends, scientists have been able to properly name and characterize only around 10% of the world’s species. The Linnean deficit refers to this information gap. It’s unclear whether this discrepancy is due to inadequate research methods, conflicts over how to designate a species or other issues.
We’re evolutionary biologists, and figuring out how to better identify species is a big part of what we do. We were able to untangle hidden species that had been put together in a single group and anticipate where and what types they might be using genetic analysis and artificial intelligence. Our findings also hint to a possible cause for the lack of species identification: a lack of investment in taxonomy science.
THERE ARE STILL HIDDEN SPECIES TO BE FOUND. We decided to focus on animals for this research. We projected that a high majority of mammalian species had previously been identified due to their relatively large size and relevance to people as a source of food, companionship, and entertainment. The initial step was to identify any known species that could be made up of two or more species.
We did this by analyzing 1 million gene sequences from 4,300 identified species, identifying sequence clusters with significant genetic diversity, and fitting the data to an evolutionary model. We discovered hundreds of hitherto unknown species that were previously classed as a single group. This outcome was predicted, as it confirms earlier research findings, albeit on a greater scale.
WHAT ARE THESE HIDDEN SPECIES AND WHERE DO THEY LIVE? After we discovered the presence of these potentially hidden species, the next step was to figure out what characteristics they shared. To do so, we employed a data science technique known as random forest analysis, which is a type of machine learning that uses data from a vast number of distinct variables to predict a specific outcome. It’s similar to the method Netflix use to recommend shows you might enjoy watching.
In our scenario, we were trying to figure out if a known species had any hidden species. Environmental characteristics, such as the climate of typical mammalian habitats, and species-specific features, such as morphological attributes, geographic range, reproductive, and survival tendencies, were all included as predictor variables.
We also took into account research-based aspects such as the methods scientists utilized to conduct their research. To create our model, we gathered 3.8 million data points in total. We discovered that three sorts of predictor variables stood out the most in our model.
The first kind included species-specific characteristics such as body mass and geographic range. Small mammals with vast ranges are more likely to have hidden species, according to these findings. This makes sense because, all else being equal, scientists have a harder time recognizing physical distinctions in smaller creatures than in larger ones. The second kind was climate; in damp, warm places with a considerable change in day and night temperatures, there are likely to be more concealed species. This is most likely due to the fact that tropical rainforests have a lot of mammalian variety.
The third category was research effort, which included the geographic distribution of samples in museum collections as well as the number of recent publications citing a known species’ scientific name. This means that researchers are often successful in identifying new mammals, as the amount of attention paid to a particular mammal by the scientific community predicts whether or not that creature will be identified.
This is backed by the fact that the broad features we discovered match newly described mammalian species during the last 30 years, as well as the fact that our algorithm recognizes locations where scientists are already looking for hidden species. The evolutionary distribution of several mammals is depicted in this diagram.