Proteins are fundamental to all life forms, dictating the complex biochemical interactions that maintain and drive the existence of every species. The functionality of a protein hinges on its structural domain organization, and the protein’s length is a direct manifestation of this. Given that every species has evolved under varying evolutionary pressures, one would intuitively expect protein length distribution to differ significantly across species.
Well, we report in a paper just published in Genome Biology that this is not the case.
Unexpected Homogeneity in Protein Length Distribution
In our study, we examined the protein length distribution across 2,326 species encompassing 1,688 bacteria, 153 archaea, and 485 eukaryotes. Counter to expectations, we observed a striking consistency in protein length distribution across these species. Though eukaryotic proteins were somewhat longer, the variation in protein length distribution was notably low compared to other genomic features such as genome size, gene length, number of proteins, GC content, and isoelectric points of proteins.
Exceptions: Errors or Biological Peculiarities?
We did note a few atypical cases of protein length distribution, but these were typically due to inaccuracies in gene annotation: no well-annotated model species displayed enrichment in small proteins, and those with a high number of small proteins were more likely to have incomplete or fragmented genome annotations.
Indeed, the outliers tended to include many more genomes scoring low in BUSCO quality score. The only exception we observed was the prevalence of longer proteins in the Ustilago fungal genus and the Apicomplexa phylum, known for their intracellular parasitic lifestyles.
This suggests that the actual variation in protein length distribution might be even smaller than what we reported. Hopefully, resequencing and reannotation efforts will help solve this issue in the future: we already noticed a few species getting updated proteomes where the length distributions gets more similar to the typical one!
A Universal Selection Force at Play
The startling uniformity of protein length distribution across diverse species suggests a strong, universal selective pressure, maintaining a high proportion of the coding sequence within a specific length range. In the discussion part of the paper, we articulate a number of potential explanations, but these remain highly speculative.
More positively put, the evolutionary forces behind the uniformity of protein distribution and their potential impact on fitness remain exciting areas of exploration!
Protein Length Distribution: A New Criterion for Gene Quality?
This observation led us to propose the use of protein length distribution as a new criterion of protein-coding gene quality upon publication. Considering that the overabundance of spurious proteins could potentially bias downstream analyses, this quality measure could aid in identifying and rectifying annotation errors. We also encourage everyone to take a look at this simple criterion when selecting proteomes for comparative genomics analysis.
Story behind the paper
The basic premise of the paper, exploring protein length distribution across the tree of life, may seem straightforward at first glance. Not quite. It started as part of Yannis’s PhD in Odile Lecompte‘s lab in Strasbourg—and a few questions: what are the characteristics of the thousands of publicly available proteomes? How to decide which to include in large scale analyses? It took another three years of Yannis’s postdoc, with about half of that time spent in the peer-review process.
Perhaps the most revealing testament to the depth of this work is the supplementary PDF, a 68-page document filled with detailed data and analyses. Moreover, anyone interested in the peer-review history of our paper can delve into the 18-page record available here.
The journey is the reward, they say; well in this instance, we are quite happy to have reached our destination!
Reference:
Nevers, Y., Glover, N.M., Dessimoz, C, Lecompte, O. Protein length distribution is remarkably uniform across the tree of life. Genome Biol 24, 135 (2023). https://doi.org/10.1186/s13059-023-02973-2