SAGE: New algorithm for analysis of tumor DNA reveals mutations previously not found

At Hartwig we “analyse sequenced tumor DNA”. This is something we try to continuously improve. These analyses are shared with hospitals to help them find the best treatment for the patient. In addition, we aggregate all these analyses into our database from which we service researchers so that they can improve care for tomorrow’s patients. 

We generally write about how researchers and hospitals use these analyses, but not so much about how we generate them. This article is about SAGE, a new algorithm in our suite of in-house developed algorithms, leading to richer analyses which benefit both patients and researchers.

Why does anything need to happen between sequencing a tumor biopsy and delivering analysed data?

  • the sequencer doesn’t sequence whole chromosomes. Instead the DNA is shattered first and the resulting fragments are sequenced individually. The result is that for a typical tumor, we sequence 2 billion fragments of DNA of 150 nucleotides that originated from the tumor biopsy. We need to piece together all these fragments, which is like trying to piece together a glass that was shattered on the ground into a million pieces.
  • an additional complexity is that the sequencer can make mistakes. This is like shattering a glass, then replacing a few pieces with other things, adding some shards from another glass and then trying to put it together.
  • while we do have an example of what the glass looked like, we do not exactly know how the sequenced genome originally looked before we shattered it. Even though humans are very similar in terms of DNA, they are still all unique! The result is that we have to put together the glass, with some pieces replaced or removed and only a rough idea of what the glass initially looked like. 
  • the next level of complexity comes from the fact that we want to know what is different in the tumor DNA compared with the DNA that the patient was born with. We need to compare two sets of DNA, while having only a rough idea of what either looks like. In addition, there is no set amount by which tumor DNA and healthy DNA should differ. While we know cancer originates from mutations in cell DNA, some tumors have accumulated only a few hundred mutations while others have gone beyond one hundred thousand mutations.
  • the final complexity comes from the fact that a tumor biopsy is not a homogeneous lump of cells. A tumor mass is in fact a collection of different cells. There may be, for example, blood vessels or stromal cells. Perhaps immune cells have infiltrated the tumor and are trying to kill it. Finally, the tumor cells are a hybrid, evolving “organism”, with different populations of tumor cells which share mutations, but also have their own unique mutations. This varies greatly from one tumor to the next, but all in all we have to assume that “the tumor biopsy” that we sequenced included DNA of various cells, some of which may be tumor cells which may or may not share their DNA make-up with each other.

Add to this the fact that we sequence and analyse whole genomes. While analysing the whole genome is great for individual patients as well as researchers, it does put a larger reliance on computers and algorithms to do the analysis. It is feasible to visually inspect 100 nucleotides to see if there are any substitutions, but this becomes impossible when assessing 3 billion nucleotides!

SAGE is our new algorithm to find these nucleotide substitutions. It is essentially our visual inspection at scale. At a high level, the goal of the algorithm is to find every nucleotide substitution that happened in the tumor DNA compared to the DNA the patient was born with, while not making any mistakes in the process. In other words, we want to be 100% sensitive and 100% precise. In practice there is always a trade-off between those two. We can be lenient, which means that we find a lot of the real substitutions, but this also leads to finding a lot of false positives. A typical tumor has tens of thousands of nucleotide substitutions. This may sound like an enormous number, but it is actually not much, considering the 3 billion nucleotides that could, in theory, be substituted.
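To make "sensitive" and "precise" concrete, here is a minimal sketch in Python. It is an illustration only, not code from SAGE: the function name and the toy mutation lists are invented for this example. Sensitivity is the share of real mutations that were found; precision is the share of reported mutations that are real.

```python
def sensitivity_and_precision(called, truth):
    """Return (sensitivity, precision) for a set of called mutations.

    Both arguments are collections of hashable mutation descriptions.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # Sensitivity: fraction of the real mutations that we found.
    sensitivity = true_positives / len(truth) if truth else 1.0
    # Precision: fraction of the mutations we reported that are real.
    precision = true_positives / len(called) if called else 1.0
    return sensitivity, precision


# Invented toy data: (chromosome, position, reference base, tumor base).
truth = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 75, "C", "T"),
    ("chr3", 10, "T", "G"),
}
called = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 75, "C", "T"),
    ("chr5", 999, "A", "G"),  # a false positive
}

sens, prec = sensitivity_and_precision(called, truth)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

In this toy case three of the four real mutations are found and one of the four reported mutations is wrong, so both numbers come out at 0.75.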

Generally, the algorithm specifies a cutoff for the amount of evidence required to deem a nucleotide that deviates from what we think is in the healthy DNA a “real” mutation. Setting the cutoff too low means we find all the real mutations in the tumor DNA but also add a lot of false positives. Setting the cutoff too high achieves the opposite: we are very precise but might miss real mutations. While this cutoff is the key parameter in our algorithm, we have made many tweaks to this basic rule. These tweaks are mostly inspired by previous cancer research, but also by our collaboration with the Netherlands Cancer Institute in the context of the WIDE study, where we have compared, in depth, sequencing analyses from routine care with whole genome sequencing on over 500 patients.
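The effect of moving the cutoff can be sketched with a few lines of Python. Everything here is invented for illustration: the evidence scores, the labels and the `count_calls` helper are assumptions, and SAGE's actual evidence model is more involved than a single number.

```python
# Invented candidates: (evidence score, whether the mutation is really there).
candidates = [
    (5, False), (8, False), (12, True), (15, False),
    (20, True), (30, True), (45, True),
]


def count_calls(candidates, cutoff):
    """Return (found, missed, false_positives) at a given evidence cutoff."""
    found = sum(1 for score, real in candidates if score >= cutoff and real)
    missed = sum(1 for score, real in candidates if score < cutoff and real)
    false_positives = sum(
        1 for score, real in candidates if score >= cutoff and not real
    )
    return found, missed, false_positives


for cutoff in (0, 10, 18, 40):
    found, missed, false_positives = count_calls(candidates, cutoff)
    print(f"cutoff={cutoff}: found={found} missed={missed} "
          f"false_positives={false_positives}")
```

With these made-up numbers, a cutoff of 0 finds all four real mutations but also reports three false ones, while a cutoff of 40 reports no false ones but misses three of the four real mutations.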

One example of such a tweak is that if we find weak evidence (below our cutoff) for a mutation that was demonstrated to drive cancer in previous research, we still classify the mutation as “real”. Conversely, if a mutation has evidence just above our cutoff but resides in a region of the genome that is known to be hard to sequence and prone to sequencing errors, we decide not to trust it.
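One simple way to picture these two tweaks is as a cutoff that depends on where the mutation sits. The sketch below is a simplification under stated assumptions, not SAGE's implementation: the three cutoff values, the hotspot list, the difficult region and all names are hypothetical, and treating each tweak as "a different cutoff" is our own shorthand for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    chromosome: str
    position: int
    alt: str          # the base seen in the tumor
    evidence: float   # hypothetical evidence score


# All values below are invented for illustration.
DEFAULT_CUTOFF = 20.0
HOTSPOT_CUTOFF = 8.0             # lower bar for known driver mutations
DIFFICULT_REGION_CUTOFF = 35.0   # higher bar for hard-to-sequence regions

KNOWN_HOTSPOTS = {("chr1", 1000, "T")}       # made-up driver mutation
DIFFICULT_REGIONS = [("chr2", 5000, 6000)]   # made-up (chrom, start, end)


def in_difficult_region(candidate):
    return any(
        chrom == candidate.chromosome and start <= candidate.position <= end
        for chrom, start, end in DIFFICULT_REGIONS
    )


def required_evidence(candidate):
    """Pick the cutoff that applies to this candidate."""
    key = (candidate.chromosome, candidate.position, candidate.alt)
    if key in KNOWN_HOTSPOTS:
        return HOTSPOT_CUTOFF
    if in_difficult_region(candidate):
        return DIFFICULT_REGION_CUTOFF
    return DEFAULT_CUTOFF


def is_real(candidate):
    return candidate.evidence >= required_evidence(candidate)


weak_hotspot = Candidate("chr1", 1000, "T", evidence=10.0)
shaky_region = Candidate("chr2", 5500, "G", evidence=25.0)
ordinary = Candidate("chr3", 42, "A", evidence=25.0)

for c in (weak_hotspot, shaky_region, ordinary):
    print(c, "->", "real" if is_real(c) else "rejected")
```

Here the weak hotspot candidate is accepted although its evidence is below the default cutoff, the candidate in the difficult region is rejected although its evidence is above it, and the ordinary candidate is judged against the default cutoff alone.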

Months of testing and tweaking have resulted in this new algorithm called SAGE. The impact of the final version has been tested on the data of 100 patients. This uncovered a decent increase in the number of mutations found (on average around 4% more) with a minimal number of additional false positives (less than 1%). For a patient, this means there is a better chance that we find relevant driver mutations. For researchers it means that our analysis is more complete, allowing them to gain more insights from our DNA analyses than before.

Please find the complete details of the algorithm on GitHub.