Can computers keep pace with genomics?

DNA data is too much for current systems, some say
By Mike Miliard

There's a lot of excited commentary out there about genomics and personalized medicine. And no doubt it's exciting. The prospect of being able to home in on hitherto inaccessible data at the DNA level - thus enabling perfectly tailored treatment plans for some of the most dangerous diseases - is thrilling, to say the least.
"We shouldn't expect immediate major breakthroughs, but there is no doubt we have embarked on one of the most exciting chapters of the book of life," said professor Allan Bradley of the UK's Wellcome Trust Sanger Institute after his team helped make a complete map of the human genome in 2003.
His predecessor at the institute, Sir John Sulston, predicted that scientists could "go on mining the data from the human genome forever." Bradley himself told the BBC that there was "a long road" of discovery ahead.
It's one we're still traveling, of course - but perhaps not as fast as we'd like. While our understanding of the human genome has grown by leaps and bounds in the past decade, the computing power necessary to more finely decode and chart treatment paths has not been able to keep adequate pace.
When it comes to the heavy horsepower necessary for computational genomics, healthcare has some catching up to do. A July 2012 story in MIT News shows that, since 2002, the speed with which we can sequence genomes has doubled roughly every four months; computing power, however, is only doubling every year-and-a-half or so.
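To see how quickly that gap widens, consider a rough back-of-the-envelope sketch in Python, using the rates cited above (our arithmetic, not MIT's): over a single decade, sequencing capacity grows roughly a billionfold while computing power grows only about a hundredfold.

    # Back-of-the-envelope comparison of the two doubling rates cited above
    # (a rough illustration, not a figure from the MIT News story).
    months = 120  # one decade

    sequencing_growth = 2 ** (months / 4)    # doubles every ~4 months
    computing_growth = 2 ** (months / 18)    # doubles every ~18 months

    print(f"Sequencing throughput: ~{sequencing_growth:,.0f}x")  # ~1,073,741,824x
    print(f"Computing power:       ~{computing_growth:,.0f}x")   # ~102x
    print(f"Gap after ten years:   ~{sequencing_growth / computing_growth:,.0f}x")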
That's not fast enough. And so researchers have been searching for new ways to gain an edge in the race to make the most of the genome map and learn more about gene and protein activity. That story pointed to joint MIT and Harvard University research that led to a new algorithm aimed at markedly reducing the time necessary to find a specific gene sequence in a genomic database.
"You have all this data, and clearly, if you want to store it, what people would naturally do is compress it," Bonnie Berger, a professor of applied math and computer science at MIT, told MIT News. "The problem is that eventually you have to look at it, so you have to decompress it to look at it. But our insight is that if you compress the data in the right way, then you can do your analysis directly on the compressed data. And that increases the speed while maintaining the accuracy of the analyses."
Still, much more is needed in the way of computing power if the vast and transformative potential of genomic discovery is going to be realized, says Ben Langmead, an assistant professor of computer science in Johns Hopkins University's Whiting School of Engineering.
This past July, Langmead, alongside Michael C. Schatz, an assistant professor of quantitative biology at Cold Spring Harbor Laboratory in New York, published an article in IEEE Spectrum titled "DNA and the Data Deluge."
It argues that genetic data is proliferating far too fast for computers to make sense of it all. The "main purpose of writing" it, said Langmead, "was to try to attract computer scientists to come solve the problem."
Despite all the sunny expectations for what unlocking this puzzle means, he says, few casual observers are aware of the challenges we face. "It's probably less appreciated than it ought to be that the computational problems and cost associated with computation in this field are very significant, still."
The problem, says Langmead, is that "there aren't enough computer scientists who are in the mix doing real problem solving for investigators who are doing genomics. I'm trying to get more people interested in solving the computational problems.
"It's easy to fall in love with the cool factor of DNA sequencing," he adds, "but if you really want to have an impact you've got to roll up your sleeves and figure out what scientists need done and how to do it."
One problem is that genomics and other life sciences fields aren't as attractive to the best and brightest in computer science as they might be.
"There are a lot of very well paying jobs in this field," Langmead reasons. "It is a little hard to waylay a student early enough and get them to stick with science early enough that they stay with it. A lot of undergrad computer students, myself included, go to school thinking they already know what they want to do. They go into undergrad thinking, 'I'm going to get an industry job when I get out: Google or Microsoft or Facebook or Twitter.'
"It doesn't necessarily occur to them that they can have an impact in the sciences," he adds. "They're usually interested in science, but they're also don't think of themselves as agents in the scientific endeavor until someone says, 'Hey, guess what? The human genome project would never have happened if it weren't for really clever computer scientists.'"
But now, more than ever, smart computer minds are needed to help make hay with the knowledge unlocked by the Human Genome Project. In many ways, in fact, the HGP was the easy part.
"A lot of the big interesting problems people work on now are the result of technology that's been developed since the human genome project - there's been a lot that's happened since 2001," says Langmead. "And the problems that have come about are chiefly computational."
One of these has to do with the first time a scientist ever sees the genome he or she is studying, he says: "We've never sequenced it before so we throw some DNA on a sequencer, we get a lot of fragments, snippets of DNA and now we have to put together the puzzle, essentially, without a picture of the completed puzzle. We have to put it together from scratch."
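That puzzle can be caricatured in a few lines of Python. The following greedy sketch is our own illustration, not a real assembler, which would rely on far more sophisticated graph-based methods: repeatedly merge the pair of fragments with the longest suffix-to-prefix overlap until a single contig remains.

    # A toy greedy assembler (our illustration; real assemblers use far more
    # sophisticated graph-based methods). Fragments are merged at their
    # longest suffix-to-prefix overlap until one contig remains.

    def overlap(a, b):
        # Length of the longest suffix of a that is also a prefix of b.
        for n in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:n]):
                return n
        return 0

    def assemble(fragments):
        frags = list(fragments)
        while len(frags) > 1:
            # Find the pair with the largest overlap and merge it.
            n, i, j = max((overlap(a, b), i, j)
                          for i, a in enumerate(frags)
                          for j, b in enumerate(frags) if i != j)
            merged = frags[i] + frags[j][n:]
            frags = [f for k, f in enumerate(frags) if k not in (i, j)]
            frags.append(merged)
        return frags[0]

    # Made-up overlapping snippets reassemble into a single sequence.
    print(assemble(["GTACGGA", "ACGTACG", "CGGATTC"]))  # -> ACGTACGGATTC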
Another challenge has to do with indexing: "If you take that reference genome that we already have, we have clever algorithms and data structures that make it easy to find occurrences of shorter sequences inside that longer sequence. That's where a lot of the interesting work in the last five years has been - how to make those indexes as efficient and small as possible, and how to make querying those indexes as fast as possible."
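The indexing idea can be illustrated with a bare-bones k-mer lookup table, sketched below in Python. This is our own simplification; real aligners, including Langmead's Bowtie, use compressed indexes built on the Burrows-Wheeler transform.

    # A bare-bones k-mer index (our sketch, not a production data structure).
    # The reference is preprocessed once; queries then jump straight to
    # candidate positions instead of rescanning the whole genome.

    def build_index(reference, k=4):
        index = {}
        for i in range(len(reference) - k + 1):
            index.setdefault(reference[i:i + k], []).append(i)
        return index

    def query(index, reference, pattern, k=4):
        # Look up the pattern's first k-mer, then verify each candidate hit.
        candidates = index.get(pattern[:k], [])
        return [i for i in candidates
                if reference[i:i + len(pattern)] == pattern]

    reference = "ACGTACGTGGTTACGT"
    index = build_index(reference)            # built once, reused for every query
    print(query(index, reference, "ACGTGG"))  # -> [4]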
One way to do that is to take an algorithm and run it on "lots of computers at once," says Langmead. "You double the number of computers, hope it gets twice as fast. But in practice that's hard to achieve. It's hard to divide up a problem just so, so that nobody's doing any redundant work and it really is twice as fast when you give it twice as many computers."
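Here is a small Python sketch of that divide-and-conquer pattern, our example rather than anything from Langmead's own tools. It sidesteps the coordination problem he describes by splitting the reads into fully independent chunks.

    # A small divide-and-conquer sketch (our example). The hard part Langmead
    # mentions - dividing the problem "just so" to avoid redundant work - is
    # sidestepped here by giving each worker an independent chunk of reads.

    from multiprocessing import Pool

    def count_gc(reads):
        # Count G and C bases in one chunk of reads.
        return sum(seq.count("G") + seq.count("C") for seq in reads)

    if __name__ == "__main__":
        reads = ["ACGT", "GGCC", "ATAT", "CGCG"] * 1000
        workers = 4
        chunks = [reads[i::workers] for i in range(workers)]  # divide
        with Pool(workers) as pool:
            partials = pool.map(count_gc, chunks)             # conquer in parallel
        print(sum(partials))                                  # combine: 10000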
Getting to a point where sequencing is easy and quick is not impossible. After all, he says, "Google indexes a very large sequence, which is the entire Internet. And they make it easy to query. It happens very quickly and we get useful answers back. Which is a miracle. We're very interested in making that a reality for sequencing data."
In the meantime, the problems persist. "Things have changed since the last time genomics was on page one of The New York Times (with the cracking of the human genome)," he says. "A lot of the ways the research has changed have made the computational problems much bigger."
That said, "I absolutely do think that when more computer scientists put their minds to this work we'll be able to make sequencing data much easier and more convenient to use for typical genomic researchers."
