15th April 2021


Key Takeaways from AACR21: ‘Data Science and Machine Learning: Will They Revolutionise Cancer Care and Research?’

by Henrietta Bull

As this year’s virtual AACR conference comes to a close, we wanted to share some highlights from one of our favourite sessions of the event. This session on the use of data science and machine learning in cancer research and clinical care, with talks given by Michael Brands (Bayer), Matthew Albert (Insitro) and Eliezer Van Allen (Dana-Farber Cancer Institute), was highly insightful on the power of data and AI across cancer drug discovery, immuno-oncology and precision medicine.

Talk 1: Role of Data Science and ML in Drug Discovery

Talk given by Michael Brands, Bayer AG, Berlin

Over the last two decades, the digital revolution has taken the world by storm, transforming almost every industry from logistics to mobility to electronics. The use of big data, advanced analytics and automation is growing continuously and has affected all of our lives in some way or another. However, digitalisation of R&D within the pharmaceutical industry appears to be lagging behind many other industries. This has been attributed to several factors, including the sheer complexity of biological systems, a lack of relevant data, an over-reliance on extrapolating from existing data to derive insights in new biological areas and, of course, the highly regulated nature of the pharmaceutical industry as a whole.

Digitalisation has huge potential for overcoming the well-documented productivity crisis in biopharma research, and could ultimately help to bring us closer to engineering better quality treatments for patients suffering from severe diseases, such as cancer. A number of computational tools are already being applied throughout all stages of the drug discovery process, from target identification to candidate delivery, and they are enabling the recognition of targets and/or pathways of high therapeutic relevance for cancer treatment.

In the context of target identification, the advent of modern sequencing methods and collation of larger patient-derived data pools has fundamentally transformed the way in which new drug targets are linked to pathways, phenotypes and disease conditions. Development of genomics-based target understanding has enabled deeper insights into the structural and functional characteristics of targets, which has helped to more accurately predict how ‘druggable’ a given target is. In lead generation, increases in computing power have enabled the use of large data libraries for virtual drug screening, particularly for target classes we have a solid structural understanding of, such as kinases. Many computational methods used in lead generation can also be further expanded to hit-validation and lead optimisation. Comprehensive data pools have been generated detailing molecular properties that influence drug absorption, metabolism, tolerability and drug-drug interactions, and exploration of these data pools using advanced analytics has enabled the development of tools that can be used for reliable in silico prediction of drug properties. Such tools are helping to guide scientists towards faster identification of clinical candidates with improved properties, ultimately allowing for greater success rates in the clinical phases of drug development.

Looking towards the future, it is clear the use of machine learning (ML) algorithms is growing within drug discovery, driving profound insights into disease targets and paving the way for a more integrated view on target validity, biomarker identification and target tractability. In addition, ML algorithms such as AlphaFold are enabling more accurate prediction of 3D protein structures from primary amino acid sequences, opening up a whole new avenue for de novo design of therapeutic proteins. In the long run, ML models combined with active learning are expected to have an increasingly strong influence on drug discovery and development, both within oncology and beyond. This should significantly reduce the need for expensive and time-consuming experimental efforts and enable scientists to focus their attention on the drug targets that appear the most promising in silico.

Talk 2: Leveraging machine learning and biology at scale with the aim to establish predictive models for immuno-oncology

Talk given by Matthew Albert, Insitro, South San Francisco

When looking at the use of machine learning in scientific research, we can see we are currently at a time of convergence, whereby advances in cell biology and biotechnology are facilitating the modelling of complex human biology and driving the generation of new, valuable data points at a truly unprecedented scale. Advances in ML have enabled algorithms to extract novel insights and identify subtle patterns in complex, high-content data, delivering predictions and outputs that go far beyond the levels of human capability. When considering these advances in the context of drug discovery, we see that ML models can be trained by end-to-end learning to discover relevant new features within complex datasets and use them to label data into appropriate categories, to a level that matches, if not exceeds, human abilities.

Due to the growing amounts of genomic data being generated, driven by the decreasing costs of DNA sequencing, and an increasing number of available human phenotypes (e.g. via UK Biobank data, which has collected data from across 500,000 individuals over the last 30 years), we are poised to make incredible medical advances in the coming years. The application of ML models to high-content histological images is already helping to increase the power of association studies and extract meaningful insights about disease pathogenesis, which could be used to improve patient biomarker assessment in the context of clinical trials. For example, an ML model that uses a multiple instance learning convolutional neural network (CNN) architecture has been applied to help score non-alcoholic steatohepatitis (NASH) patients on the four standard labels used by pathologists: fibrosis (Ishak) score, steatosis, hepatocyte ballooning and lobular inflammation. The model was trained using 4178 H&E-stained slide images of patient liver biopsies from across 5 NASH clinical trials, which were cut into tiles of 256x256 pixels, yielding around 2 million tiles for training. The model was then evaluated using 463 unrelated slides from patients it had not seen before, and the results showed that the correlation between the algorithm’s predicted scores and human pathologists’ scores was as good as that achieved in cross-pathologist comparative studies. In addition, performing a genome-wide association study (GWAS) using the CNN-generated NASH scores provided greater resolution and a quantitative estimate of progression/regression, which led to the identification of two novel, genome-wide significant variants.
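To make the tiling step more concrete, here is a minimal sketch of how a slide image might be divided into 256x256-pixel tiles. It is an illustration only, not the pipeline presented in the talk: the function names `tile_boxes` and `count_tiles` are hypothetical, and the sketch assumes non-overlapping tiles with partial tiles at the slide edges discarded, neither of which was specified by the speaker.

```python
from typing import Iterator, Tuple

TILE_SIZE = 256  # tile edge length in pixels, as quoted in the talk summary

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(width: int, height: int, tile: int = TILE_SIZE) -> Iterator[Box]:
    """Yield bounding boxes for non-overlapping, full-size tiles.

    Assumption (not from the talk): partial tiles at the right and bottom
    edges are dropped rather than padded.
    """
    if tile <= 0:
        raise ValueError("tile must be a positive number of pixels")
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            yield (left, top, left + tile, top + tile)


def count_tiles(width: int, height: int, tile: int = TILE_SIZE) -> int:
    """Number of full tiles that fit into a slide of the given size."""
    if tile <= 0:
        raise ValueError("tile must be a positive number of pixels")
    return (width // tile) * (height // tile)


if __name__ == "__main__":
    # Hypothetical slide dimensions, chosen only to show the edge handling.
    for box in tile_boxes(600, 520):
        print(box)
    print(count_tiles(600, 520), "full tiles")
```

Each box could then be passed to an image library to crop the corresponding tile. A real pipeline would probably also filter out background tiles that contain little or no tissue, a step this sketch leaves out.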

In summary, it is clear from these examples that ML is a powerful tool which can enhance the statistical analysis of imaging and molecular datasets to reveal new insights into disease biology. Looking ahead, researchers in this area now aim to apply these types of processes to analyse tumour pathogenesis and identify new drug candidates in immuno-oncology, with the goal of better predicting which therapeutic interventions will be the most effective and safe in which patient populations.


Image 1. Adapted from Matthew Albert, presenting at AACR 2021

Talk 3: Toward clinically integrated computational oncology for precision cancer medicine

Talk given by Eliezer Van Allen, Dana-Farber Cancer Institute, Boston

From a clinical point of view, the last few years have seen a significant expansion in the volume of patient data available at the point of care. This has led computational cancer biologists to ask whether expanded molecular profiling at the point of care can guide individualised (precision) treatment of patients in oncology. While simple predictive models showed some early indicators of success, building an algorithm that can reliably help to drive clinical decisions in cancer care is extremely challenging. Disappointing results have been seen across a number of clinical trials, including the 2015 SHIVA trial, which showed no statistically significant benefit, and the 2017 MOSCATO 01 trial, in which only 7% of the successfully screened patients benefitted from the approach.


Image 2. Adapted from Le Tourneau et al. (2015) and Massard et al. (2017)

To address the problems in this space and ultimately build better, more useful ML models for cancer care, it is important to define the key challenges in clinical interpretation and think about how we might address them. Three such challenges, each with a possible response, are:

Challenge 1: Technology in the clinic is still evolving – we are beginning to go beyond first-order genomics (DNA only) and are moving towards more complex, second-order interpretations (bulk RNA, immune assays, etc.).

Response: Build interpretation algorithms for evolving technology at the point-of-care, which can annotate and evaluate second-order genomic relationships we might have missed in the past.


Image 3. Adapted from Eliezer Van Allen, presenting at AACR 2021

Challenge 2: Knowledge about actionability is still evolving – we are still learning how data outputs can best be interpreted and applied to derive maximal value for patients.

Response: Add preclinical value to clinical interpretation. ML algorithms can be used to devise mechanisms which directly match patients to preclinical model systems closely resembling their own specific molecular profile. These models can then be used to test different therapeutic approaches and assess which are most likely to be effective for a given patient, thereby directly adding preclinical value to the precision oncology paradigm. This approach has already been applied to a group of melanoma datasets, where patient-cell line matches were made that could be inserted into an actionability report for use by clinicians.

Challenge 3: Clinicians need help understanding the data – algorithms need to produce reports free of complex jargon and more easily interpretable and actionable for clinicians

Response: Compare clinician responses to standard vs. enhanced/web-based reports. A randomised clinical trial has already been performed to test clinician responses to a standard report format vs. a new type of report that uses fundamental user interface principles to visualise information. The study found that the new format improved physicians’ understanding of the information; overall, however, it made no difference to which treatment physicians chose based on the data. Ultimately, the results demonstrated that there is a massive educational component to the successful integration of precision medicine approaches into oncology clinical practice, and this is likely to be a time-consuming and resource-heavy process.
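The patient-to-model matching described under Challenge 2 can be pictured as a similarity search. The sketch below is a deliberately simplified illustration and not the method used by Van Allen and colleagues: it assumes each patient and each preclinical model has already been reduced to a numeric feature vector of the same length, and it ranks models by cosine similarity. The names `cosine_similarity`, `rank_models` and the example cell-line labels are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero profile is treated as matching nothing
    return dot / (norm_a * norm_b)


def rank_models(
    patient: Sequence[float],
    models: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (model name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(patient, vec)) for name, vec in models.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy binary profiles: 1 means an alteration is present (made-up data).
    patient_profile = [1, 1, 0, 0]
    cell_lines = {
        "line_A": [1, 1, 0, 0],
        "line_B": [0, 0, 1, 1],
        "line_C": [1, 0, 0, 0],
    }
    for name, score in rank_models(patient_profile, cell_lines):
        print(f"{name}: {score:.3f}")
```

In practice, the choice of features and similarity measure would matter far more than the ranking code itself, and any match would need to be validated before being included in an actionability report.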

Taking all these considerations into account, Van Allen and colleagues are currently doing a lot of work in the computational biology space to develop an algorithm (nicknamed the Molecular Oncology Almanac) that addresses these challenges and can be made easily accessible to anyone in need of precision oncology recommendations. The team have access to patient molecular data (mutations, short insertion-deletions, copy number alterations), as well as germline information and RNA information (mutations, fusions, etc.), which can be integrated with both first- and second-order genomic factors, linked with preclinical model systems and, finally, collated into a visually appealing and actionable report that could be used by physicians at the point of care. Crucially, however, if algorithms and computational tools are to become a key part of medicine, the data is the most important part, and this may present some further challenges, particularly in terms of ancestral and socioeconomic diversity.

Want to know more about the use of Machine Learning and Data in oncology research? Read about how we supported the development of a machine learning ‘search engine for tumours’ here.
