Posted by Charlie, August 2016

Submitting genomic data to repositories: a necessary nightmare?!

One of the primary goals of Repositive is to drive efficient and ethical genomic data sharing. However, in my travels, through discussions with researchers and interviews with our users, I have come across much apathy and negativity towards the sharing of genomic data. When I dug a bit deeper, I found that, aside from the fear of being scooped and concerns about legal issues, the main hurdle people struggled with was actually uploading the data to online repositories.

Therefore, in this post I will go over the main reasons why one would be inclined to share data, and discuss how one can share data online. Then, over the next couple of weeks, I will release four more posts focussing on four major online genomic data repositories (GEO, ArrayExpress, EGA and SRA) and our users' perspectives on submitting data to these repositories.

Why share genomic data?

“Increased data availability and accessibility is key to make breakthroughs in precision medicine and diagnostics. Medical research in genomics requires both specificity and sensitivity, which is only possible by accessing and comparing large volumes of data. Easier data access will accelerate research and lower the cost of making new discoveries, which will provide benefits for both clinicians and patients in the form of new and better treatments.”^n

To further science & healthcare:

By aggregating and analysing large amounts of genomic data it may be possible to discover patterns that would otherwise remain obscure. In principle, this integration of genomic data and clinical information could reveal the genetic bases of cancer, inherited disease and infectious diseases — illnesses that have touched nearly every person and family across the globe.^1

Furthermore, as future technological advances are made, shared genomic data will become more and more important to researchers and clinicians worldwide as raw data can be reanalysed with newer, more powerful tools.

Because funders require it:

Most funding bodies that support research require sharing of the resulting data with the scientific community to ensure the translation of research results into knowledge, products, and procedures that improve human health. For example, the National Institutes of Health (NIH) published its Genomic Data Sharing policy in 2014 to set expectations around the sharing of genomic data. Cancer Research UK, the Wellcome Trust and the Medical Research Council (to mention only a few) are also dedicated to maximising the value of research data by ensuring it is made as widely available as possible.

For financial reasons:

The deposition of all genome-wide data into a single resource worldwide is essential for financial reasons. Though the cost of sequencing is decreasing, it is still by no means cheap,^2 and the number of samples required for valuable scientific insights can run into the thousands. Each study is so expensive and so many studies are needed that it is essential to share genomic data to avoid redundant efforts and a massive waste of (often public) money.

More details on sharing genomic data, including the problems behind it and the reasons for doing it, can be found in our recently published PLOS Biology paper (http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002418#pbio-1002418-t001).^3

Figure: To deposit or not to deposit, that is the question. Researchers can be reluctant to share their data publicly because of real and/or perceived individual costs. Illustration credit: Ainsley Seago. doi:10.1371/journal.pbio.1001779.g001 (http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001779#pbio-1001779-g001)

How to share genomic data?

Before submitting data to a repository, there are two main things you need to consider. The first is the type of data you want to submit: for instance, array-based data would usually be submitted to repositories such as GEO or ArrayExpress, while unaligned raw sequence data would usually go to repositories such as SRA or ENA. The second is the type of access the data requires: public access allows researchers complete and open access to all files, while controlled access requires researchers to apply for access to the files.

This figure from the EBI website gives an idea of which of their repositories you would submit data to depending on the type and access requirements.


In the next four blog posts I will discuss how to submit data to each of these repositories, and our users' experiences of doing so:

  • GEO - open access, expression data.
  • ArrayExpress - open access, array-based and expression data.
  • EGA - controlled access, all data types.
  • SRA - open access, raw sequence data.
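For readers who like to see a decision written down, the two questions above (what type of data, and what type of access) can be sketched as a small lookup. This is only an illustration built from the summary in this post: the category names and the `suggest_repositories` function are my own hypothetical labels, not part of any repository's software, and each repository's own guidance should always have the final say.

```python
# Illustrative only: a lookup based on the summary in this post.
# The category names and this function are hypothetical labels,
# not an API offered by GEO, ArrayExpress, EGA, SRA or ENA.

OPEN_ACCESS_REPOSITORIES = {
    "array": ["GEO", "ArrayExpress"],
    "expression": ["GEO", "ArrayExpress"],
    "raw_sequence": ["SRA", "ENA"],
}


def suggest_repositories(data_type, access):
    """Return candidate repositories for a data type and access level."""
    if access == "controlled":
        # The post lists EGA as controlled access for all data types.
        return ["EGA"]
    if access != "open":
        raise ValueError(f"unknown access level: {access!r}")
    if data_type not in OPEN_ACCESS_REPOSITORIES:
        raise ValueError(f"unknown data type: {data_type!r}")
    # Return a copy so callers cannot modify the lookup table.
    return list(OPEN_ACCESS_REPOSITORIES[data_type])


if __name__ == "__main__":
    print(suggest_repositories("raw_sequence", "open"))
    print(suggest_repositories("array", "controlled"))
```

A plain dictionary keeps the mapping easy to read and to correct as repositories change what they accept.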

Looking to the future

After reading these posts you may start to think that all this is rather depressing. We can clearly see the reasons and necessity for sharing genomic data, but the mechanisms in place for doing so seem to be a complete nightmare to deal with. Currently that does mostly seem to be the case. However, do not lose hope! Things are changing. Many repositories are now focusing more and more on improving the data submission workflow and are investigating ways to change this process for the better.

For example, at the EBI (where they host as many as 19 data repositories, 9 of which contain genomic data), they are trying to change their submission processes to make them more usable. The EBI web development team have hired experts in usability and user experience, who are actively trying to understand the issues researchers face when trying to submit their data to EBI repositories. Furthermore, they realise that sharing data is only of value if that data can be reused. Therefore, they are also trying to understand the requirements of the research community for the data that they store. They are investigating how EBI-hosted data is reused, and what is required to make it reusable, to then modify their data submission processes.

In my personal opinion, the best way to make the data submission process less of a nightmare is to be vocal. Researchers need to explain their frustrations, blockers and needs to the repositories, so that together we can change the process. The repositories are blind without feedback from the researchers.

Get in touch to tell me about your experiences with submitting data to repositories, or comment below to get the discussion going!

Related Blog Posts

Submitting genomic data to repositories: Gene Expression Omnibus - GEO
Submitting genomic data to repositories: ArrayExpress
Submitting genomic data to repositories: European Genome-Phenome Archive - EGA
Submitting genomic data to repositories: Sequence Read Archive - SRA


^1: Global Alliance for Genomics and Health (GA4GH)
^2: Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP)
^3: Kovalevskaya NV, Whicher C, Richardson TD, Smith C, Grajciarova J, Cardama X, et al. (2016) DNAdigest and Repositive: Connecting the World of Genomic Data. PLoS Biol 14(3): e1002418. doi:10.1371/journal.pbio.1002418 (http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002418#pbio-1002418-t001)