Arielle, Repositive’s Oncology Community Manager, interviews Mikhail Zaslavskiy ahead of BioData 2018, as part of a series of profiles of key speakers at the conference in drug discovery, data, and oncology research. Mikhail is Head of Research at OWKIN, which creates AI tools to help collective human intelligence extract knowledge from an influx of data, to discover the medicine of tomorrow. Keep your eyes peeled for the next blog instalments and catch Mikhail at his talk “Federated learning: leveraging the power of private data” at 3:55pm on 29 November, at BioData 2018.
Arielle: Can you share a little of your background and how you came to work at OWKIN?
Mikhail: As part of my PhD at Mines ParisTech, specialising in machine learning models for graph structures, I researched techniques to identify similarities across large datasets and to display them in ways that can be easily interpreted, which has applications across drug discovery, drug design and image recognition. At the end of my PhD I moved out of academia to take on more intriguing and pressing problems at Cellectis, a genome engineering company. However, I ultimately wanted to explore more problems and fields where ML and data science could help to solve significant challenges, so I began data science consulting. I was persuaded to join OWKIN’s six-person team four years ago, and I’ve been developing OWKIN’s stand-alone platforms and addressing custom problems for individual clients ever since.
A: Your company bio lists you as a top 100 data scientist on Kaggle. Do you think these types of crowdsourced competitions are important for the development of machine learning and data science?
M: Yes! I first started participating in competitions during my PhD and relished the opportunity to work on projects across insurance, shipping, finance, banking, bioinformatics and social networks. Kaggle motivates individuals to improve their skills rapidly and to share best-practice data science techniques across fields. The feedback participants receive is also invaluable as they develop data science and programming skills and come to understand the broader applications of approaches beyond the single domain they first encountered them in. It’s a fantastic catalyst for inspiring individuals to develop new uses for data science in problem-solving. I was in the top 100 on Kaggle before joining OWKIN, but now all my competitions are happening at OWKIN! We also sometimes organise public data science competitions.
A: At BioData you’ll be talking about the use of federated datasets for research. Without giving away your talk, could you describe why federated datasets are needed?
M: When developing machine learning algorithms, a significant amount of data is required to build a quality training dataset for the model. Smaller datasets lead to incompletely trained algorithms, which in turn lowers the output quality. Any company that wants to apply machine learning to a problem for which they have some, but not sufficient, data has two options – either more data needs to be generated, or additional sources of data need to be found. OWKIN’s approach is to find the additional sources of data by approaching companies or research groups that are conducting similar research in the area and creating these “federated” datasets which can then be used to properly train efficient machine learning models.
A: How difficult is it to convince these companies or research groups to share their data?
M: This is a major challenge, as you can imagine, and there are two main reasons for that. First, when we are working with patient data it is very important to have solutions that do not compromise patient privacy, so merging multiple datasets together is not an option. Second, many organisations are reluctant to collaborate because datasets can be a major source of IP, and they worry that sharing them will reduce their competitiveness; finding suitable pre-competitive sweet spots for collaborating organisations can be a challenge – but it does happen. Even once an agreement is in place, there are more barriers to overcome: datasets are often hosted in different locations and cannot be merged, either for the competitive reasons above or because they follow a variety of different data standards (if any). So we have disparate datasets, in different locations, which cannot be merged.
This is why it is important to use the federated learning approach, which allows data stored in relative isolation to be accessed and combined without merging datasets.
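To make this concrete, here is a minimal, hypothetical sketch of a federated averaging scheme of the kind Mikhail describes (this is an illustration, not OWKIN's actual system): each site trains a model on its own private data, and a coordinator averages only the model parameters, so raw data never leaves any site.

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=50):
    """Gradient descent for least-squares regression on one site's private data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, sites):
    """One round: each site trains locally, then the coordinator takes a
    size-weighted average of the weights. Only parameters are exchanged."""
    local_ws = [local_train(global_w, X, y) for X, y in sites]
    sizes = [len(y) for _, y in sites]
    return np.average(local_ws, axis=0, weights=sizes)

# Three hypothetical sites with different amounts of private data,
# all generated from the same underlying relationship.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
sites = []
for n in (40, 60, 100):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.01 * rng.normal(size=n)
    sites.append((X, y))

w = np.zeros(2)
for _ in range(30):
    w = federated_round(w, sites)
print(np.round(w, 2))  # recovers weights close to true_w without pooling data
```

The key property is that each site's `X` and `y` stay local; the coordinator sees only the trained weight vectors, which is what makes the approach viable across organisations that cannot merge datasets.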
A: Are there any barriers or limitations to this approach?
M: The two main issues are data heterogeneity and privacy concerns. There are significant issues around sharing data with competitors, but also around ensuring that proprietary data remains the property of its owners and that the identities of people in patient or genomic datasets are protected. Preventing other entities from accessing ‘too much’ data, or from mounting malicious attacks to mine patient information, is a hot topic for several research groups.
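One common family of mitigations for the attacks Mikhail mentions (offered here as a general illustration, not as OWKIN's method) is to clip and add calibrated noise to each model update before it leaves a site, bounding how much any single patient record can influence, or be inferred from, the shared parameters:

```python
import numpy as np

def privatise_update(update, clip_norm=1.0, noise_scale=0.5, rng=None):
    """Clip an update to a maximum norm, then add Gaussian noise scaled to
    that norm, so no single record dominates the released parameters."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(scale=noise_scale * clip_norm, size=update.shape)

raw = np.array([3.0, 4.0])                 # norm 5, exceeds the clip bound
released = privatise_update(raw, rng=np.random.default_rng(1))
print(np.round(released, 2))               # clipped and noised before sharing
```

The trade-off is exactly the one researchers debate: more noise means stronger privacy guarantees but slower or less accurate federated training.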
A: Can you talk a little about OWKIN’s approach to using datasets and machine learning in drug discovery?
M: OWKIN primarily applies machine learning to open problems in medicine, such as helping to characterise heterogeneous patient or cell populations, supporting drug design or performing post hoc analysis on datasets from failed clinical trials, to identify whether any patient subpopulations could still benefit from the drug.
We work with multiple modalities of data, from electronic patient records – which can include data from blood tests, demographic information and treatment histories – to tumour biopsy histology or radiographic images. OWKIN parses the data and evaluates it for different signals depending on the questions being asked: we could be looking for new markers for response to drugs or identifying prognostic markers for overall patient outcomes, for example.
A: What are the main barriers to harnessing machine learning in drug discovery and development?
M: There’s a real need for connected datasets, and it isn’t locating the data that’s hard – it’s accessing it. There’s also a need to educate front-line physicians about how data shapes the process, because if they are not confident in the data supporting a drug’s approval, they’ll be more reluctant to prescribe it. Society also needs to develop the skills and infrastructure to support a data culture – to collect, curate, store and maintain data effectively.
A: What impact do you think machine learning will have on the medicine of tomorrow?
M: Improving clinical trial design, specifically by stratifying patients as part of the screening process and incorporating aspects of adaptive clinical trial design to allow for adjustments mid-trial, might improve success rates for drugs. Ideally, clinical trials would be iterative processes, with multiple data modalities collected throughout in a standardised fashion and fed into machine learning analysis to understand the outcomes and improve subsequent trial iterations. The inclusion of real-world data to inform the clinical trial design is something I’d love to delve into more too!