Special thanks to Gary Saunders, curator of the European Variation Archive at the EMBL-EBI, for writing this guest blog post.
Human genetic variation data is everywhere: custom FTP sites, supplementary files associated with publications, various online databases, probably even on a flash drive in your desk drawer right now. Everywhere.
In 2014, we at EMBL-EBI decided to launch a portal to store genetic variation data in a somewhat more regimented manner. We called it the European Variation Archive (EVA; www.ebi.ac.uk/eva). The objective of the EVA is to serve as a ‘one-stop-shop’ of open-access genetic variation datasets; to negate the need for researchers, pre-doctoral students, reviewers (anyone, really) to search various locations to access human genetic variation data. Instead, we would load all datasets to a single repository at EMBL-EBI – all the data we can get our hands on.
Recently, the EVA was added as a data source to Repositive.
In the next few paragraphs I shall explain more about the EVA service, the work we do and how we try to serve the scientific community.
What is the European Variation Archive at EMBL-EBI?
Basics first. The EVA aims to serve as a genetic variation one-stop-shop to the scientific community. Datasets are added to the resource directly by submitters, or are curated (from publications, for example) by EVA staff.
Oh, right. So it isn’t only an archive of variants from Europeans! What data is in there?
We’ve gone with the tried and tested EMBL-EBI service nomenclature of “European <data_type> Archive”. (Notable others include the European Nucleotide Archive (ENA; www.ebi.ac.uk/ena) and the European Genome-phenome Archive (EGA; www.ebi.ac.uk/ega).) So we have EVA: an archive of open-access genetic variation data that is in Europe, not an archive of only European variants.
As for what data we store, there are currently over 150 human short and structural variant datasets at the EVA! The data we house can most easily be seen in the EVA Study Browser.
Can you tell me more about how these datasets are submitted to EVA?
Data is submitted to EVA in valid Variant Call Format (VCF) files. The word “valid” matters: in the early days of EVA we found that the majority of VCFs submitted to us did not conform to the file format specification. Indeed, over 90% of the first 400 VCFs submitted to the EVA contained an error of some description that rendered the file invalid. This has severe consequences for downstream use of the file.
So, in response, we created a VCF file validator. Each VCF that is submitted to EVA must be valid, and each VCF that is downloaded from EVA is guaranteed to conform to the file format specification.
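To give a flavour of the kind of checks involved, here is a minimal sketch in Python. It is not the EVA validator and covers only a handful of rules from the VCF specification (the `##fileformat` line, the eight fixed header columns, tab-separated data lines, an integer POS and a REF made of A, C, G, T or N). The function name `check_vcf` is our own invention for this illustration.

```python
# Illustrative sketch only: a few basic VCF checks, not the EVA validator.
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_vcf(lines):
    """Return a list of (line_number, message) problems found in VCF text lines."""
    problems = []
    seen_header = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # The specification requires the fileformat line to come first.
        if number == 1 and not line.startswith("##fileformat=VCFv4"):
            problems.append((number, "first line must be ##fileformat=VCFv4.x"))
        if line.startswith("##"):
            continue  # meta-information lines are not inspected in this sketch
        if line.startswith("#"):
            # The column header line must begin with the eight fixed columns.
            if line.split("\t")[:8] != FIXED_COLUMNS:
                problems.append((number, "header line lacks the eight fixed columns"))
            seen_header = True
            continue
        if not line:
            continue  # blank lines are skipped here for simplicity
        if not seen_header:
            problems.append((number, "data line appears before the #CHROM header"))
        fields = line.split("\t")
        if len(fields) < 8:
            problems.append(
                (number, f"expected at least 8 tab-separated fields, found {len(fields)}")
            )
            continue
        if not fields[1].isdigit():
            problems.append((number, "POS is not a non-negative integer"))
        if not fields[3] or any(base not in "ACGTN" for base in fields[3].upper()):
            problems.append((number, "REF contains characters other than A, C, G, T, N"))
    return problems
```

A common failure this catches is a file whose columns are separated by spaces instead of tabs: every data line then collapses into a single field and is reported.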
This sounds interesting, how can I access the data?
We provide access to the datasets at the EVA in three main ways:
1. The EVA Study Browser
Here you can browse all of the datasets available at the EVA. Each project is given its own webpage that gives more information about the study. There are also links to download the VCFs from a particular study by FTP.
2. The EVA Variant Browser
We provide a lightweight browser to access the variant data housed at the EVA. This browser is split by species/assembly and there are filtering options such as variant consequences, allele frequencies and protein substitution scores.
3. The EVA API
We are a bioinformatics institute. We love to simplify the lives of the developers who want to interact with our services and databases. Therefore we also provide access to the EVA variant data programmatically via our REST API. Plugging into our API is undoubtedly the most efficient way to access the data, but it does require some programming knowledge, so it may not be the most suitable route for all users.
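For readers who have not called a REST API before, the sketch below shows the general shape of such a request in Python using only the standard library. Please treat the specifics as assumptions: the base URL, the `segments/<region>/variants` path, the `species` and `limit` parameters and the `hsapiens_grch37` value are illustrative, and the helper names are hypothetical. The EVA API documentation is the authority on the real endpoints.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumption: illustrative base URL and path layout; check the EVA API docs.
EVA_REST_BASE = "https://www.ebi.ac.uk/eva/webservices/rest/v1"


def region_variants_url(region, species, base=EVA_REST_BASE, limit=10):
    """Build a URL asking for variants in a region such as '1:3000000-3100000'."""
    path = f"{base}/segments/{quote(region, safe=':-')}/variants"
    query = urlencode({"species": species, "limit": limit})
    return f"{path}?{query}"


def fetch_json(url, timeout=30):
    """Send a GET request and decode the JSON body of the response."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example usage (performs a network request, so it is left commented out):
# url = region_variants_url("1:3000000-3100000", "hsapiens_grch37")
# payload = fetch_json(url)
```

Keeping the URL construction separate from the request makes it easy to inspect exactly what will be sent before any data is fetched.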
Great, but isn’t it confusing to view all these data from different datasets, that have been described, and maybe annotated, in different ways?
And therein lies the rub. It is very true that the variation data submitted to EVA has been described and annotated in different ways. Importantly, the EVA normalizes all variant data and annotates this homogeneous variant population with a single variant consequence predictor: Ensembl’s Variant Effect Predictor.
Additionally, we calculate allele frequencies in a standardized manner, and we group variants from samples that belong to a particular population in order to calculate population allele frequency values.
The result of our normalization and annotation processes is that the variants from different datasets can be grouped together for analysis. Because we have done this, the many groups across the globe interested in the datasets we house do not need to run these complicated and computationally intensive processes themselves!
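As a rough illustration of what a population allele frequency is, the sketch below counts non-reference alleles per population from VCF-style genotype strings at a single variant site. This is a simplified toy, not the EVA pipeline: the function name, the input dictionaries and the population labels in the usage note are all hypothetical, and every non-reference allele is pooled into one count.

```python
from collections import defaultdict


def population_allele_frequencies(genotypes, sample_population):
    """Compute the non-reference allele frequency per population at one site.

    genotypes: sample name -> GT string such as '0/1', '1|1' or './.'
    sample_population: sample name -> population label
    """
    alt_counts = defaultdict(int)
    called_counts = defaultdict(int)
    for sample, genotype in genotypes.items():
        population = sample_population[sample]
        # Phased ('|') and unphased ('/') genotypes are treated alike here.
        for allele in genotype.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing calls do not count towards the total
            called_counts[population] += 1
            if allele != "0":
                alt_counts[population] += 1
    return {
        population: alt_counts[population] / total
        for population, total in called_counts.items()
        if total
    }
```

For example, with two samples labelled "POP_A" carrying `0/1` and `1|1`, three of the four called alleles are non-reference, so the frequency for "POP_A" is 0.75.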
You can read more about our variant processing steps here: FAQ
Ok, Ok, I am sure this is for me, but I would like to see one more thing added to the EVA, how could I tell you?
We are always eager for feedback from users, potential users, other resource providers, funders, reviewers, everyone and anyone!
If you would like to suggest new features for the EVA, or to chat to us about pretty much anything, please email us or start a ticket at our GitHub page.