A Googol of Genomes?
Earlier this week we took a look back at 2010 and offered our projections for the coming year in personal genomics. Topic #1, just as it was last year: the $1,000 genome.
In hindsight, it might have been ill-advised to offer predictions about the near-term future of genome sequencing during the same week in which one of the year’s major industry conferences (the JP Morgan annual Healthcare Conference) is taking place.
There have been a number of high-profile announcements from genome providers over the past two days. Life Technologies disclosed it had booked 60 orders for the recently unveiled Personal Genome Machine (PGM) and, more importantly, announced that the PGM’s output would be increased by an order of magnitude (10 megabases to 100 megabases) in Q1. Not to be outdone, Illumina, the current market leader in genome sequencing technology, responded later the same day by unveiling its new MiSeq machine. As both Matthew Herper and Keith Robison explain in detail, the MiSeq is a direct and formidable challenger to the Personal Genome Machine as a result of its price, speed and utilization of Illumina’s established sequencing platform.
But the biggest stories, at least by one metric, belong to sequencing newcomer Complete Genomics and Illumina (again). Complete Genomics announced this week that the Institute for Systems Biology (ISB) has ordered a whopping 615 whole genomes as part of the ISB’s ongoing research into the genetics of neurodegenerative diseases, including Huntington’s. Meanwhile Illumina, at the same time it was launching the MiSeq machine, disclosed that it “currently has a 1,000-genomes backlog” for its own whole-genome sequencing service.
Let’s forget, for a moment, about how much these whole-genome sequences cost and reflect on simply how many of them there are. Just over a decade ago, Bill Clinton and Tony Blair were lauding the first draft human genome sequence; and the Human Genome Project would not declare the first genome “complete” until the spring of 2003.
Again, seven and a half years ago, there was only a single genome sequence to be had anywhere in the world. And it took 13 years and $3 billion to get just the one. Today? We casually discuss hundreds and even thousands of genomes to be sequenced in a matter of months and for thousands of dollars apiece (not $1,000, but likely less than $10,000).
Even if the $1,000 genome does not arrive this year, 2011 will almost certainly see 1,000 genomes sequenced. And in many ways, that may be the worthier milestone to celebrate. Every significant increase in the number of sequenced genomes means a corresponding increase in the amount of genomic data available to elucidate the genetic bases of human traits and disease. There is still a tremendous amount of work to be done to make sense of all that data (and a tremendous amount of environmental, trait and other data that must also be correlated), but sequencing thousands of genomes is a significant step in that process.
_____________________________________________________________________________________
Excellent paper as usual!
I would need to track down where I got these figures, but it was published somewhere in the media last year that worldwide full-genome sequencing capacity should be:
50,000 genomes in 2011
250,000 in 2012
1m in 2013
5m in 2014
25m in 2015
I guess these numbers didn’t include extra instruments like the PGM and the MiSeq, but it gives an idea of the potential genomic wave coming on.
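As a rough check on how steep that projected wave is, here is a minimal sketch that takes the five capacity figures exactly as quoted above and computes the growth between successive figures. The helper names (`step_ratios`, `overall_growth_factor`) are made up for illustration, and the figures themselves are the commenter's recollection, not verified data.

```python
# Projected worldwide sequencing capacity, in whole genomes, in the order
# quoted in the comment above. These are the commenter's recalled figures,
# not verified data.
projected = [50_000, 250_000, 1_000_000, 5_000_000, 25_000_000]


def step_ratios(values):
    """Return the ratio of each figure to the one listed before it."""
    return [later / earlier for earlier, later in zip(values, values[1:])]


def overall_growth_factor(values):
    """Return the geometric-mean growth per step across the whole series."""
    steps = len(values) - 1
    return (values[-1] / values[0]) ** (1 / steps)


print(step_ratios(projected))  # [5.0, 4.0, 5.0, 5.0]
print(round(overall_growth_factor(projected), 2))
```

In other words, the quoted projection amounts to capacity growing roughly fivefold from each figure to the next, or 500-fold from the first figure to the last.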
Those 1,000 genomes this year and more to come next year will be more valuable the more phenotypic data come with them. I know you know this (and George Church has been saying it for years). I don’t know, because I haven’t been paying attention until recently, how much and what kinds of phenotypic data are likely to come with them or to what extent either the genomic or the phenotypic data are likely to be accessible to researchers. I’d welcome whatever you have to say about this in future posts.
I cross-posted this over at Genomes Unzipped and wanted to clarify here a few points that have been made in the comments there:
1) In discussing the number of genomes to be sequenced in 2011 I was considering only human genomes and only high-quality genomes (>30x coverage of >90% of the reference).
2) Ewen Callaway, in his recent post where he reminds us that, soon, only our mothers will care about our genomes, links to a Nature feature from last fall that projects more than 30,000 genomes in 2011.
That number includes high- and low-coverage, but doesn’t include any private biotech/pharma sequencing, which is significant but also difficult to measure.
3) @Ralph: You’re exactly right that the genomes in isolation are not going to be all that valuable. As for what type of phenotypic data are being collected, it depends heavily on the context. Obviously clinical and research whole-genome sequencing are very different. At the moment, far more genomes are being sequenced in a research setting – that may change at some point, but not in 2011 – and there it depends heavily on the study protocol.
Most are limited in the amount of phenotypic data they can collect (and even more limited when it comes to publication) because of privacy concerns. But projects like the Personal Genome Project, where confidentiality is not promised and identifiability is openly disclosed and accepted, are free to collect and to link (an important and separate step) almost any available phenotypic/trait data. Right now, PGP participants are able to link their Google Health accounts with their genomic data, which is a very good first step. (PGP-1K data is here; a better interface is coming soon.)
It seems incredible to say this, but I think for a few years now it has been easier, and possibly even cheaper, to generate high-quality genomic data than phenotypic data. Standardization is one major hurdle: it’s far easier to produce 1,000 genomes on the same platform and to the same data specifications than it is to do the same with 1,000 extended family medical histories. Even once we have standardized collection tools and data formats for phenotypic data, there remain all manner of data (environmental exposures, developmental data, etc.) that simply cannot be accurately reconstructed. Moving to electronic health records will help, but I think it may be decades before standardized, high-quality phenotypic data becomes routine.
It’s been clear for some time that it was going to be a lack of high-quality phenotypic/trait data, and not genomic data, that was going to be the rate-limiting step for a wide range of personalized medicine research. If we’re not there already, I think we will be very, very soon.