Posted by Charlie, November 2016

Why reuse publicly available human genomic data?

Recently I published a series of blog posts on sharing genomic data. They were mostly about the experiences our users had had when submitting their data to various genomic data repositories. However, in the first post I went into some detail about why researchers should share their data. One of the major reasons I mentioned is so that it can be re-used (or recycled) by the community for secondary analysis. This re-use increases the value of the data and decreases the relative cost of producing it. It also prevents the duplication of effort and opens doors to new research insights ^1.

Repositive is all about data sharing and open science ^2, and we have over 1 million datasets indexed on the Repositive platform that are publicly available. Our mission is to drive efficient and ethical access to human genomic data. Therefore, in this blog post I want to look into how data is being re-used by the research community, and to address the value of data recycling.

Reasons why human genome data should be reused:

  • Individuals, usually patients, have consented to their data being reused, so we should make the most of it!
  • Public money has been spent on creating this data - each time data is re-used its value goes up and relative cost goes down.
  • To reduce the duplication of effort, which is a waste of time & money.
  • To help make scientific advances, faster.

It’s a hot topic

I’m not the only one interested in who reuses publicly available data, why and how. Earlier this year Scientific Data published an [interview with Daniele Marinazzo](http://blogs.nature.com/scientificdata/2016/03/01/data-reuse-an-interview-with-daniele-marinazzo/). This post caught my eye as it is part of a series of posts they are publishing on data reuse. In Daniele’s group “... validation of methodologies is done on publicly available data, and the code is always shared.”

"For the first part I always prefer to use publicly available data, both because their quality is normally excellent and checked, and to allow other researchers to reproduce my results. For the second part, I either rely again on public data, or on my collaborators." Daniele Marinazzo

Furthermore, a huge amount of ‘chatter’ has been going around on social media and within the research community about the ‘research parasite’. This refers to an editorial on Data Sharing that was published in the NEJM earlier this year ^3 where two clinical researchers raised the concern that

“people who had nothing to do with the design and execution of a study use another group’s data for their own ends, possibly stealing from the research planned by the data gatherers, or even using the data to try to disprove what the original investigators had posited.”

They termed these individuals 'research parasites'.

Figure: Data Thief (from BlueCoat)

I have not followed the press about this closely enough to pass judgement on the authors of this editorial, and I (possibly naively) hope that their words were taken out of context: in the article they strongly push the idea of collaboration and symbiotic data sharing, a practice I wholeheartedly support. However, the editorial has portrayed the ‘data recyclers’ in a bad light and, for better or worse, has resulted in an uproar within the community. So much so that ‘research parasites’ are now being given awards for rigorous secondary data analysis ^4.

Therefore, I am clearly not the only one taking an interest in who reuses data, how and why. Furthermore, because human genomic data is surrounded by controversy and ethical obstacles, it is of paramount importance that we understand how the data is being reused and the social impact this is having.

Is it being reused?!

Preliminary findings from a survey I performed with the Earlham Institute earlier this year demonstrated that over 80% of researchers used externally sourced data in their research. You can read more about this study in my previous blog post. The survey is still open - if you want to participate and have your say CLICK HERE.


In their 2012 review 'Reuse of public genome-wide gene expression data' ^5, Johan Rung and Alvis Brazma highlight the following:

"It is a difficult problem to estimate third-party usage of public data and its impact on new research. When data is reused, the original study that has generated the data appears to be almost always credited in some way, but there are no easy ways to assess the accumulated reuse of data by third-party (added-value) databases. It has been suggested that using digital object identifiers (DOIs) may help to improve tracking the use of primary data and acknowledging their authors."

However, against these odds they described the following findings:

Nearly one in four studies used public data to address a biological problem without generating new data from samples. Such studies draw on the power in numbers: by combining many datasets, the power to detect weak signals is improved. In approximately 25% of the studies that they reviewed, the data from public resources was used in combination with new data, typically to provide a replication set from an independent source.

These findings are supported by a recently published report on open research by the Wellcome Trust, which found that 77% of respondents have used existing data in some way, while 23% have never reused existing data. However, 22% were using it not for research but as teaching material (see the figure below for details). The report also showed that researchers in genetics and molecular science, infection and immunobiology, and population health are more likely to reuse data ^6.

Main types of reuse:

Rung and Brazma ^5 also looked into the reasons why researchers reuse publicly available data:

  • To study a biological question.
  • To develop and evaluate a new method. When assessing the performance of a newly developed software tool or statistical method, public data archives can provide good material for testing.
  • To integrate, annotate and analyse primary data in order to build a new (added-value) data resource. Sometimes public data are combined with newly generated data to increase the number of samples covered.
  • To carry out meta-analyses that combine summary-level data, such as P-values or effect sizes from compared conditions. Such analysis is a very popular way of using public data because of the flexibility to include data from many different array platforms.
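
To make the last point a little more concrete, here is a minimal sketch of two standard ways of combining summary-level results from independent studies: Fisher's method for P-values and a fixed-effect, inverse-variance weighted average for effect sizes. It is not taken from Rung and Brazma's review; the function names are my own, and it assumes that the studies are independent and measure the same underlying effect.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P-values using Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null (k = number of studies).
    """
    if not p_values or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    # For an even number of degrees of freedom (2k), the chi-square upper
    # tail has a closed form: exp(-X/2) * sum_{j<k} (X/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


def fixed_effect_pool(effects, std_errors):
    """Pool effect sizes with inverse-variance weights (fixed-effect model)."""
    if len(effects) != len(std_errors) or not effects:
        raise ValueError("Need one standard error per effect size")
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    print(fisher_combined_p([0.04, 0.06, 0.05]))
    print(fixed_effect_pool([0.30, 0.10, 0.25], [0.10, 0.20, 0.15]))
```

Because only summary statistics are needed, this kind of pooling works even when the underlying studies used different array platforms, which is part of why it is such a popular form of reuse. A real analysis would also need to check for heterogeneity between studies before trusting a fixed-effect estimate.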

Figure: Reuse of existing data by respondents (N=578). From Van den Eynden et al., 2016 ^6

Use Cases

Bas E. Dutilh, Assistant Professor of Bioinformatics in the Metagenomics Group at Utrecht University:

"Data is the most important resource for a bioinformatics lab. We analyse publicly available metagenomes with our tools and approaches to discover new science."

A few months back I looked in depth at one use case - the story of Jonathan Coleman, a PhD student at King’s College London, who used data from the UK Biobank to examine genetic factors contributing to the relationship between depression and high BMI. You can read my previous blog post to learn more about Joni's journey.

An interesting example of high-impact data reuse was published earlier this year. A study ^7 by Fernando Mendez et al. from Stanford University came out in the American Journal of Human Genetics, showing that Neanderthal Y-chromosome genes are no longer present in the human genome.

The data for the study came from public gene sequencing databases. "We did not collect any data for this work," said Mendez. "It was all public data." ^8

This study was of interest because it is well known that traces of Neanderthal DNA still linger in modern humans. However, our relationship to this ancient Neanderthal population and our last common ancestor is still debated. Previous studies have looked at female Neanderthals or mitochondrial DNA, but this is the first study in which the Y-chromosome of a male Neanderthal has been compared to the modern-day human population. The lack of Neanderthal Y-chromosome DNA in the modern male population suggests a new timeline for the divergence of humans and Neanderthals.


How to find data to reuse

Data repositories are specifically designed to store data so that researchers can find it and reuse it, or at least that is the intention. Nowadays, almost all human genome data repositories have websites where they list what data they are storing. Therefore, you can, in theory, use Google to find everything you might need. However, there are some caveats to this situation:

  1. There are multiple repositories (in fact hundreds!) storing human genomic data.
  2. The data is formatted differently in each repository.
  3. The search algorithms vary widely.
  4. Much of the data is online but not in well-known repositories and is therefore not 'visible'.
  5. Lots of data is 'hidden': it has not been shared in online repositories, and is siloed in institutions and on servers around the globe.
  6. Different data from the same patient (for instance genomic and proteomic) may be stored in different repositories, so you will not have the full dataset because its parts are fragmented and held under different access policies.

Repositive's role

At Repositive, we are trying to address the caveats listed above. We intend to index all human genomic data, worldwide. We are not data custodians, and are therefore not storing the raw data - instead we are acting as a portal where you can search through 'all' the genomic data that is out there on one online platform. Then, once you have found what you are looking for you can go ahead and download it from the repository itself. In the future we also aim to simplify the download process so it's as easy as a 'click of a button'.

We are inputting the metadata onto our platform in a consistent manner so it is easier to cross-compare and search through. Furthermore, we are actively searching out all the small, obscure, less 'visible' sources of human genomic data to increase the visibility of 'hard to see' datasets - currently we know of over 350 human genome data sources. Finally, we are working hard to form relationships with companies, institutions and the genomics community to find and expose the 'hidden' data.
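
As a rough illustration of what 'consistent' metadata means in practice, the sketch below maps records from two made-up repositories, each with its own field names and vocabulary, onto one common shape so that a single search covers both. This is not a description of how the Repositive platform is implemented: the repository names, field names, accessions and synonym table are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Common shape that every source record is mapped onto."""
    source: str
    accession: str
    title: str
    assay: str


# Hypothetical: how each (made-up) repository names the same three fields.
FIELD_MAPS = {
    "repo_a": {"accession": "id", "title": "name", "assay": "experiment_type"},
    "repo_b": {"accession": "Accession", "title": "StudyTitle", "assay": "Technology"},
}

# Hypothetical: different labels that mean the same assay.
ASSAY_SYNONYMS = {
    "wgs": "whole genome sequencing",
    "whole-genome sequencing": "whole genome sequencing",
}


def normalise(source, raw):
    """Translate one raw metadata record into the common shape."""
    mapping = FIELD_MAPS[source]
    assay = str(raw.get(mapping["assay"], "")).strip().lower()
    return DatasetRecord(
        source=source,
        accession=str(raw.get(mapping["accession"], "")).strip(),
        title=str(raw.get(mapping["title"], "")).strip(),
        assay=ASSAY_SYNONYMS.get(assay, assay),
    )


def search(records, term):
    """Case-insensitive search over the normalised title and assay fields."""
    term = term.lower()
    return [r for r in records if term in r.title.lower() or term in r.assay]


if __name__ == "__main__":
    records = [
        normalise("repo_a", {"id": "A-001", "name": "Cohort one", "experiment_type": "WGS"}),
        normalise("repo_b", {"Accession": "B-042", "StudyTitle": "Cohort two",
                             "Technology": "Whole-genome sequencing"}),
    ]
    for hit in search(records, "whole genome"):
        print(hit.source, hit.accession, hit.assay)
```

Without the normalisation step, a search for "whole genome" would miss the record labelled only as "WGS", which is exactly the kind of gap that differently formatted repositories create.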

References

^1: Assessing the research potential of access to clinical trial data. Varnai P., Rentel MC., Simmonds P., Sharp, TA., Mostert, B., de Jongh, T. (2014). Final report to the Wellcome Trust. Study led by Technopolis Group (UK).

^2: DNAdigest and Repositive: Connecting the World of Genomic Data. Nadezda V. Kovalevskaya et al. PLOS Biology, March 24, 2016 | doi:10.1371/journal.pbio.1002418

^3: Data Sharing. Dan L. Longo, and Jeffrey M. Drazen. N Engl J Med 2016; 374:276-277. January 21, 2016 | doi: 10.1056/NEJMe1516564

^4: The Parasite Awards

^5: Reuse of public genome-wide gene expression data. Johan Rung & Alvis Brazma. Nature Reviews Genetics | AOP, published online 27 December 2012 | doi:10.1038/nrg3394

^6: Survey of Wellcome researchers and their attitudes to open research. Van den Eynden, Veerle; Knight, Gareth; Vlad, Anca; Radler, Barry; Tenopir, Carol; Leon, David; Manista, Frank; Whitworth, Jimmy; Corti, Louise. (2016): figshare | doi:10.6084/m9.figshare.4055448.v1

^7: The Divergence of Neandertal and Modern Human Y Chromosome. Fernando L. Mendez et al. The American Journal of Human Genetics. Volume 98, Issue 4, p728–734, 7 April 2016 | doi:10.1016/j.ajhg.2016.02.023

^8: Y chromosome genes from Neanderthals likely extinct in modern men. Jennie Dusheck. Stanford News Center. April 2016.