Skepticism and Science
As many of you who read science news will know, the ENCODE data set is currently lighting up the world of molecular biology. For the first time of note since the Human Genome Project, loads of genomic data are accessible to anyone with an internet connection; seriously, there’s even an iPhone/iPad app. Furthermore, where the Human Genome Project said, “Here’s the human genome,” ENCODE has now responded, “Here’s what the human genome does” - or at least has moved us closer to answering that question. Where the Human Genome Project is a dictionary, ENCODE is an encyclopedia - and a very valuable one at that. No doubt about it, ENCODE is awesome.
The press coverage of ENCODE, however, has not been up to par. While science journalism can be hit and miss at the best of times, ENCODE seems to have caused a giant “miss” among several major, highly regarded news sources (New York Times, I’m looking at you). While ENCODE has been lauded by the press for debunking “junk DNA”, some of the claims made about ENCODE’s research (however cool it may be) are just not true. In fact, I would argue that by misrepresenting the facts in news stories, journalists have clouded the amazing contribution ENCODE has made to molecular biology. It is a contribution no scientist will contest: a massive, 10-year international project featuring 442 scientists that has spawned 30 research papers in different journals - basically, some seriously hard-core research.
As a case in point, several sources (see below) have attributed the revelation that “junk DNA isn’t actually junk” to the ENCODE project. In fact, scientists have known for decades that protein-coding genes are regulated by non-coding DNA sequences - “gene switches” - found in the “junk DNA”, or non-protein-coding sequences, of our genome. That’s uncontested, and there are plenty of reviews on the subject (as in this review, and its references) that were written long before ENCODE’s publication.
As Mike White, Ph.D., a member of the Department of Genetics and the Center for Genome Sciences and Systems Biology at the Washington University School of Medicine and a frequent science blogger, says, “ENCODE is significant because they’ve provided a very useful data set, and not because they’ve a) shown that non-coding DNA is important (we knew that), or b) most of the genome has phenotypically important regulatory function (it does not) or c) that most of the genome is evolutionarily conserved (not true either). What they have shown is that much of the genome is covered by introns, and it’s hard to find biochemically inert DNA, which those of us who have tried to generate random, ‘neutral’ DNA sequences (for, say, spacers in synthetic promoter experiments) will agree with.”
T. Ryan Gregory, an evolutionary biologist at the University of Guelph in Canada, has compiled a list of news sources covering the ENCODE beat under the title “The ENCODE media hype machine.” Let’s have a look at just a few offenders.
The New York Times is perhaps most disappointing, confusing activity with necessity:
The human genome is packed with at least four million gene switches that reside in bits of DNA that were once dismissed as “junk” but that turn out to play critical roles in controlling how cells, organs, and other tissues behave. The discovery, considered a major medical and scientific breakthrough, has enormous implications for human health because many complex diseases appear to be caused by tiny changes in hundreds of gene switches…
As scientists delved into the “junk” - parts of the DNA that are not actual genes containing instructions for proteins - they discovered a complex system that controls genes. At least 80 percent of this DNA is active and needed.
According to the ENCODE project, at least 80% of this DNA is biochemically active - which is not the same thing as needed. Furthermore, ENCODE did not discover a complex system that controls genes; scientists already knew that network existed, and ENCODE showed just how complex it is.
USA Today also seems a bit lost, claiming that the entire 80% of the genome that ENCODE found biochemically active consists of gene switches such as promoters and enhancers:
International research teams have junked the notion of “junk” DNA, reporting that at least 80% of the human genetic blueprint contains gene switches, once thought useless, that controls the genes that make us healthy or sick.
Wired, meanwhile, is so confused that it’s hard to know where to start:
Molecules that didn’t form protein-coding genes were mostly overlooked, partly because they were considered less important, but also because new tools and techniques were needed to study them.
It’s definitely news to me they’re less important - if anything, they’re more important than the coding sequences, as we understand less about them and they serve to significantly regulate the protein-coding bits of our genome.
In the ENCODE data are thousands of newly identified structures known as pseudogenes, fossil genes and dead genes, which look like protein-coding genes but perform other functions.
Pseudogenes perform other functions? Oh really?
Sure, I’m being a bit pedantic. But honestly, science journalism has gotten out of control. There are obviously very reasonable parts to all of these articles as well, but they’re drowned in so much hype and “catch phrases” designed to grab attention that the end result is a total distortion of some totally awesome scientific research that deserves to make the front page for what it’s actually accomplished. Especially in an age when we have eminent Harvard researchers fabricating data, we really don’t need journalists drawing false conclusions about meticulously collected data just to jazz it up and make it more interesting to the layman.
Finally, I think the scientists quoted in these pieces are part of the problem. For example, NPR had a very good piece about ENCODE, by all accounts, but its credibility was slightly tarnished by this quotation:
"Most of the human genome is out there mainly to control the genes," said John Stamatoyannopoulos, a geneticist at the University of Washington School of Medicine, who also participated in the project.
There’s nothing empirically wrong with this statement, except that it’s drastically overblown; it would never be made at a genetics conference, where it would be torn to shreds. In fact, that’s the problem with most of the scientists I’ve seen quoted in these articles: They make absurdly broad claims for function using an extraordinarily loose definition (“reproducible biochemical activity”). It’s very, very tricky to demonstrate function. And, more importantly (those of you who hated statistics, prepare to groan), they’re operating without a serious null hypothesis: What exactly do you expect non-functional DNA to look like?
As Mike White also pointed out, it’s not going to be inert. “Nucleosomes have low sequence specificity, and so we expect, in a large genome, many regions that, just by chance, have a random piece of DNA that reproducibly positions nucleosomes. Transcription factors recognise short, degenerate sequences that occur, again, just by chance, all over the genome. And so again, in a large genome, we expect plenty of reproducible but functionally irrelevant TF binding. That’s going to lead to pervasive, tissue-specific transcription at low levels, along with various chromatin marks. Transcription factor binding sites turn over fairly rapidly by evolution, and so we expect dense, complicated networks just by chance,” he writes. If the biology terminology proved a bit much, basically what he’s saying is that it’s really hard to define “inert” DNA. Transcription factors, the proteins that bind DNA and switch transcription on or off, recognise short sequences that will appear by chance throughout supposedly neutral DNA, causing low-level biochemical activity (transcription) that doesn’t serve any valuable function. And because these binding sites come and go quickly over evolutionary time, dense, complicated-looking networks can arise just by chance.
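To put a rough number on “just by chance”, here is a back-of-the-envelope sketch in Python. It is my own illustration, not something from the ENCODE papers or from Mike White. It assumes a genome of about 3.1 billion base pairs, treats the sequence as random with all four bases equally likely and independent (real genomes are neither), and uses made-up motifs; the function name and the example motifs are hypothetical.

```python
# Rough null model: how many times should a short motif appear purely by
# chance in a genome-sized stretch of random DNA?
# Assumptions (mine, for illustration only): ~3.1 billion bp, all four
# bases equally likely and independent, both strands scanned.

GENOME_LENGTH = 3_100_000_000  # assumed approximate human genome size in bp


def expected_chance_sites(allowed_bases_per_position,
                          genome_length=GENOME_LENGTH,
                          both_strands=True):
    """Expected number of chance matches to a motif in random DNA.

    allowed_bases_per_position lists, for each motif position, how many
    of the four bases are tolerated there (1 = strict, 4 = anything goes).
    """
    p_match = 1.0
    for n_allowed in allowed_bases_per_position:
        p_match *= n_allowed / 4  # chance a random base fits this position
    start_positions = genome_length - len(allowed_bases_per_position) + 1
    strands = 2 if both_strands else 1  # ignores palindromic motifs
    return start_positions * p_match * strands


# Hypothetical motifs: a strict 6-mer, a degenerate 8-mer, a strict 12-mer.
strict_6mer = expected_chance_sites([1] * 6)
degenerate_8mer = expected_chance_sites([1, 1, 2, 4, 4, 2, 1, 1])
strict_12mer = expected_chance_sites([1] * 12)

print(f"strict 6-mer:     about {strict_6mer:,.0f} chance sites")
print(f"degenerate 8-mer: about {degenerate_8mer:,.0f} chance sites")
print(f"strict 12-mer:    about {strict_12mer:,.0f} chance sites")
```

Under these simplifying assumptions, a strict six-base motif turns up on the order of a million times, and a degenerate eight-base motif even more often. Only fairly long, strict motifs become rare. That is the kind of baseline a serious null hypothesis would need to beat before binding counts as evidence of function.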
The moral of the story? Be a skeptic! Just the other day, when I was writing my post “Harnessing Viruses”, I read the PLoS Genetics paper, a ScienceDaily article, and a PopSci article about the same set of results. I found some significant discrepancies between the PopSci article and the actual scientific paper - for example, the paper lauded weakening the tumour’s defenses as being paramount to fighting cancer, whereas PopSci interpreted that as “beefing up the body’s defenses.” It may seem insignificant, and PopSci’s claim may even be factually accurate, but it was not supported by the paper they cited as their source. Popularising science is a great goal, and I think it can really inspire people; I know how intimidating scientific papers can be to read, and communicating science is instrumental in getting everyone to see its importance and value, and in creating a more scientifically literate society. That being said, though, science has to be communicated carefully: When it’s done badly, as with ENCODE, it can be just as bad as having shoddy science in the first place.
Read. Ask questions. Be skeptical. And do celebrate ENCODE for the contributions it has made, because it’s an extraordinary data set that I have no doubt will contribute to molecular biology and medicine for years to come, much like its predecessor the Human Genome Project.
Images above: The ENCODE logo, and a picture taken from one of the ENCODE papers, by Gerstein et al. in Nature. The images are entitled: “Visualisations of networked linkages between genetic components broadly across the human genome (right), and a smaller, hierarchically arranged subset (left).”