How do you analyze 8,000 Tumor/Normal genomes in under two months?

It has been a long time coming: Hartwig Medical Foundation has transitioned to the GRCh38 (hg38) reference genome! For years, scientists using our database have been asking for this. Our analyses were still running on the older hg19 reference genome, even though the first version of GRCh38 was released back in 2013. For an organization like Hartwig Medical Foundation, making such a transition smoothly and reliably, while minimizing disruption to ongoing research, is quite a challenge. How do you redo ten years of computational work in just two months? And what advantages does this offer for the future? In this blog, I’ll give you a behind-the-scenes look at the project.
The reference genome – a bit of context
First, what is a reference genome used for, and what is the difference between these two versions?
A reference genome is a standardized version of the human genome—including a number of variations—that is used to piece together raw sequencing data. The sequencer we use produces relatively short DNA sequence fragments that still need to be mapped to the correct location. The reference genome essentially serves as the picture on the puzzle box.
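To make the puzzle analogy concrete, here is a deliberately toy sketch of what “mapping” means. The reference fragment and read below are made up, and a production aligner (such as BWA-MEM) performs indexed, mismatch-tolerant search across three billion bases rather than a simple substring scan:

```python
# Toy illustration of read mapping: find where a short sequenced
# fragment belongs on the reference. Real aligners tolerate mismatches
# and use genome indexes; this exact-match scan is only a sketch.
reference = "GGCTAACTTAGCATCGATCGGAT"  # made-up reference fragment
read = "GCATCGATC"                     # made-up short sequencing read

position = reference.find(read)
print(f"Read maps to reference position {position}")  # -> 10
```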
hg19 still contained quite a few gaps and was primarily based on Western data. GRCh38 is an updated version that incorporates more data from other parts of the world and contains fewer ambiguous regions. Several errors have also been corrected. This affects the numbering of all base pairs, meaning it’s not enough to simply add a few updates to the old version. The entire puzzle has to be reassembled.
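To see why the coordinates shift, here is a small illustration using the third-party pyliftover package (not part of our pipeline), which translates a single position from hg19 to GRCh38 via UCSC chain files. Lift-over is exactly the shortcut that is not good enough for a clinical-grade database, which is why we realigned all raw reads instead:

```python
# Illustrative only: translating one hg19 coordinate to GRCh38 with
# the pyliftover package (pip install pyliftover). Our actual
# transition realigned the raw reads rather than lifting coordinates.
from pyliftover import LiftOver

lo = LiftOver('hg19', 'hg38')  # fetches the UCSC chain file on first use

# Returns a list of (chromosome, position, strand, score) tuples;
# the GRCh38 position differs from the hg19 input (0-based coordinates).
print(lo.convert_coordinate('chr1', 1_000_000))
```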
Re-analyzing 8,000 genomes at scale
Fortunately, we already had a strong foundation. Hartwig’s database has been built over the past ten years and now includes data from approximately 8,000 patients who have made their data available for research. All of these samples had to be reanalyzed in as short a time as possible. At the same time, the quality had to meet the standards required for medical diagnostics, since the data is primarily used by clinical researchers. That is ultimately Hartwig’s value: providing data that supports treatment decisions.
Two years ago, Matthijs van Niekerk (IT Infrastructure Lead) and his team of software developers (Mathijs den Burger, Arne Roeters, Kasper Wolsink) began planning and preparing for this effort. It started with countless meetings, computational rework, and budgeting, leading to a detailed transition plan. Thanks in part to this preparation, we were finally able to launch the new analyses this past spring. One petabyte of (compressed) alignment data and dozens of terabytes of analysis files were recalculated in less than two months.
We kept costs as low as possible by optimizing software and running the analyses on machines that were not in use at the time. That made the process more complex, but for the software developers, that was part of the challenge and part of the fun. We reduced computing costs to an average of €11 per patient; when Hartwig first started ten years ago, this was about €250 per patient.
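Those “machines not in use” are the preemptible (spot) capacity that cloud providers sell at a steep discount, with the catch that a machine can be reclaimed at any moment. A minimal sketch of how a pipeline copes with that, where run_with_retries and PreemptedError are hypothetical stand-ins for the real job runner:

```python
import time

class PreemptedError(Exception):
    """Hypothetical signal that the cheap spot VM was reclaimed mid-run."""

def run_with_retries(task, max_attempts=5):
    """Run an idempotent pipeline task, re-submitting it whenever the
    preemptible machine it landed on is taken away."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()  # e.g. align and analyze one patient sample
        except PreemptedError:
            time.sleep(30 * attempt)  # back off, then try a fresh machine
    raise RuntimeError(f"Task still failing after {max_attempts} attempts")
```

The crucial property is that every task is safe to kill and re-run, so losing a machine costs a little time, never correctness.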
Under normal circumstances, we process around twenty new patients per week. For this project, we processed 500–600 samples per day. At times, as many as 5,000 machines were running simultaneously in the cloud. Partner organizations that do not use cloud infrastructure can easily spend a year on an analysis of this scale. But for projects like this, where you need to minimize disruption while operating at scale, access to enormous computing power within a short timeframe is essential.
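For the curious, the back-of-the-envelope arithmetic behind those numbers (taking 550 samples per day as the midpoint of our throughput):

```python
samples = 8_000
per_day = 550         # midpoint of the 500-600 samples/day we sustained
cost_per_sample = 11  # EUR per patient, versus ~250 EUR ten years ago

print(f"Days of processing: {samples / per_day:.0f}")              # ~15 days
print(f"Total compute cost: EUR {samples * cost_per_sample:,}")    # EUR 88,000
print(f"At routine pace (20/week): {samples / 20 / 52:.1f} years") # ~7.7 years
```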
International alignment
Another facet of this endeavor was international collaboration. This was a unique opportunity to better harmonize data so it can be exchanged and combined more effectively across borders. During the preparation phase, we had extensive discussions with international partners such as the International Consortium for Pediatric Oncology (ITCC) and Genomics England, which manages the largest whole-genome sequencing (WGS) database. Once they have also completed their transition, we will all be using the same reference genome, resulting in much more comparable analyses. This is a major step forward, since we know that independent validation of findings is essential for robust treatment decisions.
Next steps
Meanwhile, Matthijs’s team is not slowing down. This project is only the beginning: a kind of test case. The amount of data generated in the future will be many times larger than anything we have seen so far. Just last week there was great news from several companies: sequencers keep getting faster and more affordable (the $100 genome is rapidly moving from research into the clinic). Our goal is to move toward a model in which samples are sequenced locally (within hospitals, for example) and the data is analyzed centrally. That is much faster than transporting samples by courier and ultimately better for the patient.

This project demonstrates that it is possible to analyze such large volumes of data quickly, reliably, and affordably. The foundation has been laid. With these comprehensive DNA analyses of tumors becoming the clinical standard, it will be easier to predict which treatments will—and will not—be effective. That way, patients receive the best possible treatment while avoiding unnecessary side effects.
That’s what we ultimately work for.
Hartwig is ready. Thanks for reading.
Joep
Joep de Ligt – Lead Data – Hartwig Medical Foundation | Bringing data and people together to accelerate and enable cancer research