Geneticists began estimating the number of human genes as early as 2000, when the human genome sequence was still being sketched. Nearly 20 years later, armed with real data, they still can’t agree on that number, a knowledge gap that has hampered their efforts to find disease-related mutations. Most recently, scientists have reported more than 21,000 protein-coding genes.
The latest count, which uses data from hundreds of human tissue samples, was posted on the BioRxiv preprint server on 29 May. It includes nearly 5,000 previously unidentified genes, nearly 1,200 of which carry instructions for making proteins. The overall total of more than 21,000 protein-coding genes is up from previous estimates of about 20,000.
However, many geneticists are not convinced that all the proposed genes will stand up to close scrutiny. Their criticism also highlights the difficulty of identifying and defining new genes.
“People have been trying to do this for 20 years, and we still don’t have an answer,” said biologist Steven Salzberg, whose team produced the latest gene count.
The final answer?
In 2000, as genomicists debated the number of human genes, Ewan Birney (now director of the European Bioinformatics Institute at Hinxton, UK) launched the GeneSweep contest. The first bets were placed in a bar at an annual genetics conference, and the contest eventually attracted more than 1,000 participants and a $3,000 prize. Bets on the number of genes ranged from more than 312,000 to just under 26,000, with an average of about 40,000. Since then, the range of estimates has narrowed to roughly 19,000 to 22,000, but disagreement remains.
Gene counts can vary depending on the data being analyzed, the tools used, and the criteria used to weed out false positives. The latest count used a larger data set and different calculation methods from previous ones, as well as broader criteria for gene definition.
Salzberg’s team used data from the Genotype-Tissue Expression (GTEx) project, which sequenced RNA (an intermediary between DNA and protein) from more than 30 different tissues taken from hundreds of deceased donors. To identify genes that code for proteins, as well as those that do not but still play important roles in cells, they assembled GTEx’s 900 billion small RNA fragments and aligned them with the human genome.
However, just because a piece of DNA is expressed as RNA does not necessarily mean it is a gene. So the team tried to filter out the noise using a variety of criteria. For example, they compared their results with the genomes of other species, reasoning that sequences shared by distantly related organisms are likely to have been preserved by evolution (because they are functional) and are therefore probably genes.
Ultimately, the team was left with 21,306 protein-coding genes and 21,856 non-coding genes, more than are listed in the two most widely used human gene databases. The GENCODE gene set maintained by the EBI includes 19,901 protein-coding genes and 15,779 non-coding genes, while RefSeq, a database managed by the National Center for Biotechnology Information, lists 20,203 protein-coding genes and 17,871 non-coding genes.
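The size of those gaps is easy to check by hand, and a few lines of Python make the comparison explicit. This is only an illustrative sketch built from the figures quoted in this article; the names `TALLIES`, `totals` and `excess_over` are made up for the example and do not come from GENCODE, RefSeq or the team's own software.

```python
# Gene tallies as quoted in the article: (protein-coding, non-coding).
TALLIES = {
    "Salzberg team": (21306, 21856),
    "GENCODE": (19901, 15779),
    "RefSeq": (20203, 17871),
}


def totals(tallies):
    """Return the combined gene count (coding + non-coding) per source."""
    return {name: coding + noncoding
            for name, (coding, noncoding) in tallies.items()}


def excess_over(tallies, baseline, reference="Salzberg team"):
    """Return how many more (coding, non-coding) genes `reference`
    lists than `baseline`."""
    ref_coding, ref_noncoding = tallies[reference]
    base_coding, base_noncoding = tallies[baseline]
    return ref_coding - base_coding, ref_noncoding - base_noncoding


if __name__ == "__main__":
    for name, total in totals(TALLIES).items():
        print(f"{name}: {total} genes in total")
    for database in ("GENCODE", "RefSeq"):
        coding, noncoding = excess_over(TALLIES, database)
        print(f"vs {database}: +{coding} protein-coding, "
              f"+{noncoding} non-coding")
```

On these figures, the new list exceeds GENCODE by 1,405 protein-coding and 6,077 non-coding genes, and RefSeq by 1,103 and 3,985 respectively.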
Kim Pruitt, a former RefSeq director, attributes the discrepancy partly to the large amount of data Salzberg’s team analyzed. Another major difference is that both GENCODE and RefSeq rely on manual curation, in which a person reviews the evidence for each gene and makes the final decision, while Salzberg’s group relies entirely on computer programs to sift through the data.
“If people like our gene list, then maybe in a few years we’ll be the arbiter of human genes,” Salzberg said.
What is the definition of a gene?
Many scientists say they need more evidence before they can be sure the list is accurate. Adam Frankish, an EBI computational biologist who coordinates GENCODE’s manual annotation, says he and his team have examined about 100 of the protein-coding genes identified by Salzberg’s team. According to their assessment, only one of them seemed to be a genuine protein-coding gene.
Members of Pruitt’s team looked at a dozen of the new protein-coding genes from Salzberg’s group, but found none that met RefSeq criteria. Some overlap with regions of the genome that appear to belong to retroviruses that invaded the genomes of our ancestors; others belong to other repetitive stretches that are rarely translated into proteins.
But Salzberg argues that some repetitive sequences can be considered genes. One example is ERV3-1, which appears in RefSeq and encodes a protein that is overexpressed in colorectal cancer. Salzberg also acknowledged that the new genes on his team’s list will need to be verified by his team and others.
Complicating matters further, the definition of a gene is itself variable and imprecise. Biologists used to think of genes as sequences that encode proteins, but it has since been discovered that some non-coding RNA molecules have important roles in cells. Disagreement over the criteria for calling a sequence a gene also explains some of the differences between the Salzberg count and other counts.
An accurate accounting of all human genes is important for uncovering the links between genes and disease. Salzberg notes that uncounted genes are often overlooked, even if they contain disease-causing mutations. But rushing to add genes to the main list also carries risks: a spurious gene can distract geneticists from the real problem.
“Biology is complex,” Pruitt added. Inconsistent gene counts between databases remain a problem for researchers, and a definitive answer is still being sought.