At Hartwig we analyse sequenced tumor DNA, and we continuously try to improve how we do this. These analyses are shared with hospitals to help them find the best treatment for the patient. In addition, we aggregate all these analyses into our database, from which we serve researchers so that they can improve care for tomorrow’s patients.
We generally write about how researchers and hospitals use these analyses, but not so much about how we generate them. This article is about SAGE, a new algorithm in our suite of in-house developed algorithms, leading to richer analyses which benefit both patients and researchers.
Why does anything need to happen between sequencing a tumor biopsy and delivering analysed data?
Add to this the fact that we sequence and analyse whole genomes. While analysing the whole genome is great for individual patients as well as researchers, it does increase our reliance on computers and algorithms to do the analysis. It is feasible to visually inspect 100 nucleotides to see if there are any substitutions, but this becomes impossible when assessing 3 billion nucleotides!
SAGE is our new algorithm to find these nucleotide substitutions. It is essentially our visual inspection at scale. On a high level, the goal of the algorithm is to find every nucleotide substitution that happened in the tumor DNA compared to the DNA the patient was born with, while not making any mistakes in the process. In other words, we want to be 100% sensitive and 100% precise. In practice there is always a trade-off between those two. We can be lenient, which means that we find a lot of the real substitutions, but this also leads to finding a lot of false positives. A typical tumor has tens of thousands of nucleotide substitutions. This may sound like an enormous number, but it is actually not much considering the 3 billion nucleotides that could, in theory, be substituted.
Generally, the algorithm specifies a cutoff: the amount of evidence required before a nucleotide that deviates from what we think is in the healthy DNA is deemed a “real” mutation. Setting the cutoff too low means we find nearly all the real mutations in the tumor DNA but also add a lot of false positives. Setting the cutoff too high achieves the opposite: we are very precise but might miss real mutations. While this cutoff is the key parameter in our algorithm, we have made many tweaks to this basic rule. These tweaks are mostly inspired by previous cancer research, but also by our collaboration with the Netherlands Cancer Institute in the context of the WIDE study, where we have compared, in depth, sequencing analyses from routine care and whole genome sequencing on over 500 patients.
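To make the trade-off concrete, here is a minimal sketch in Python. It is not SAGE's actual code: the `Candidate` class, the evidence scores, and the cutoff values are all made up for illustration, and in real data we of course do not know up front which candidates are real. The sketch simply counts how sensitivity and precision move as the cutoff is raised on a small toy benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate substitution in a toy benchmark (hypothetical structure)."""
    evidence: float  # made-up evidence score; higher means more support
    is_real: bool    # ground truth, only known in a benchmark setting


def sensitivity_and_precision(candidates, cutoff):
    """Call every candidate with evidence >= cutoff and score the result."""
    called = [c for c in candidates if c.evidence >= cutoff]
    true_positives = sum(c.is_real for c in called)
    total_real = sum(c.is_real for c in candidates)
    # Sensitivity: fraction of the real mutations that we actually called.
    sensitivity = true_positives / total_real if total_real else 0.0
    # Precision: fraction of our calls that are real (1.0 if we call nothing).
    precision = true_positives / len(called) if called else 1.0
    return sensitivity, precision


# Toy data: four real mutations and four sequencing artefacts.
CANDIDATES = [
    Candidate(95, True), Candidate(80, True),
    Candidate(60, True), Candidate(40, True),
    Candidate(70, False), Candidate(45, False),
    Candidate(30, False), Candidate(20, False),
]

for cutoff in (25, 50, 75):
    sens, prec = sensitivity_and_precision(CANDIDATES, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, precision={prec:.2f}")
```

On this toy data the lowest cutoff finds every real mutation but lets several artefacts through, while the highest cutoff makes no false calls but misses half of the real mutations.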
One example of such a tweak is that if we find weak evidence (below our cutoff) for a mutation that was demonstrated to drive cancer in previous research, we still classify the mutation as “real”. Conversely, if a mutation has evidence just above our cutoff but resides in a region of the genome that is known to be hard to sequence and therefore prone to sequencing errors, we decide not to trust it.
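One way to picture these two tweaks is as a cutoff that depends on where the candidate sits in the genome. The Python sketch below is an illustration only, not how SAGE is implemented: the three cutoff values, the `Candidate` class, the function names, and the choice to let a known driver position take precedence over a hard-to-sequence region are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical cutoffs, chosen only to illustrate the idea.
BASE_CUTOFF = 100.0         # default evidence required
HOTSPOT_CUTOFF = 50.0       # relaxed for known cancer driver positions
HARD_REGION_CUTOFF = 150.0  # stricter in hard-to-sequence regions


@dataclass(frozen=True)
class Candidate:
    position: int    # genomic position (single chromosome, for simplicity)
    evidence: float  # made-up evidence score


def required_evidence(position, known_hotspots, hard_regions):
    """Return the cutoff that applies at this position.

    known_hotspots: set of positions of previously demonstrated drivers.
    hard_regions: list of (start, end) half-open intervals.
    Assumption: a known hotspot wins over a hard-to-sequence region.
    """
    if position in known_hotspots:
        return HOTSPOT_CUTOFF
    if any(start <= position < end for start, end in hard_regions):
        return HARD_REGION_CUTOFF
    return BASE_CUTOFF


def is_real(candidate, known_hotspots, hard_regions):
    """Classify a candidate as 'real' if it meets its local cutoff."""
    cutoff = required_evidence(candidate.position, known_hotspots, hard_regions)
    return candidate.evidence >= cutoff


HOTSPOTS = {1_000}
HARD_REGIONS = [(5_000, 6_000)]

# Weak evidence at a known driver position is still accepted.
print(is_real(Candidate(1_000, 60.0), HOTSPOTS, HARD_REGIONS))
# Evidence just above the base cutoff is rejected in a hard region.
print(is_real(Candidate(5_500, 110.0), HOTSPOTS, HARD_REGIONS))
# The same evidence elsewhere in the genome is accepted.
print(is_real(Candidate(9_000, 110.0), HOTSPOTS, HARD_REGIONS))
```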
Months of testing and tweaking have resulted in this new algorithm called SAGE. The impact of the final version has been tested on the data of 100 patients. This uncovered a decent increase in mutations found (averaging around 4% more mutations) with a minimal number of additional false positives (less than 1%). For a patient, this means there is a better chance that we find relevant driver mutations. For researchers, it means that our analysis is more complete, allowing them to gain more insights from our DNA analyses than before.
Please find the complete details of the algorithm on GitHub.