Posted by Charlie, October 2016

Simons Genome Diversity Project - Now Featured on Repositive

Earlier this month the Simons Genome Diversity Project (SGDP) data from the Simons Foundation website was added to the Repositive platform. The SGDP is a highly diverse set of high-quality human whole genome sequences consisting of data from 278 genomes: this includes the ‘Primary dataset’, which consists of 260 genomes from 127 different global populations, and 18 genomes from a previously published dataset.

The Simons Foundation is the organisation behind the Simons Genome Diversity Project. It was founded by Jim and Marilyn Simons in 1994 with the “mission to advance the frontiers of research in mathematics and the basic sciences”, which it pursues by supporting scientists in the form of direct grants.

The aim of this project is to gain a complete picture of human genetic diversity. They have done this by sequencing individuals from many different populations with large numbers of present-day people that span human genetic, linguistic and cultural variation. This cohort is significantly more diverse than its ‘nearest neighbor’ in the field, the 1000 Genomes Project, which analysed 26 populations.

All of the genomes were sequenced using Illumina technology with at least 30x coverage. For comparison, most studies looking into diverse populations use 4-6x coverage. Details on how to access the raw data from this project can be found here: download readme. You will have to apply for a certificate to download the ~10 terabytes of data from their FTP site.

300 genomes from 142 diverse populations ^1

We released this data onto the Repositive platform to coincide with the publication of the new Nature article from this project ^1. In this article, the authors describe 300 genomes from 142 different global populations; details for accessing the dataset from this paper can be found at the bottom of this post.

In this study, the authors used this dataset to investigate human diversity and the landscape of human genome variation. You will have to read the paper yourself to understand its full impact; however, there are two key aspects that I found really interesting:

  1. They found high levels of Denisovan ancestry in some South Asians compared to other Eurasians. They postulate that this hasn’t been found before because previous studies have mostly excluded South Asian populations.
  2. They provide evidence that lifestyle changes and exposure to new environments were the driving forces behind the rapid developments seen in human behaviour over the last 50,000 years. This is in opposition to previous models that have attributed this change to a small number of mutations in neurological genes.

These findings highlight the crucial need for studies like this to have large numbers of genomes from a highly diverse set of populations. However, this is only the beginning! The data in this study offers huge potential for secondary analysis by the research community, not only to look at human diversity and ancestry, but also as healthy reference material for multiple ethnicities. This is further supported by almost all of this data being freely available. I’m excited to see what questions the research community will apply to this hugely rich dataset!

We mapped their data!

A map of where the individuals who donated their samples are from can be found on the SGDP website. However, we decided to see if we could plot it as well with D3.js (v4.0) using the metadata supplied.


The green dots on the map above correspond to the geographical coordinates of each of the populations featured in the dataset.

To build this visualisation we used the D3.js library to draw a world map, onto which we projected the geographical coordinates from the SGDP dataset. We downloaded the Admin 0 - Countries file from naturalearthdata.com. This dataset contains geographical information about the countries of the world at a scale of 1:110 million.

The world-atlas tool, published by D3 creator Mike Bostock, was used to transform it into a file format which D3 understands: in this case TopoJSON, which is an extension of the GeoJSON format. We applied a Mercator projection to create the two-dimensional map above.
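
For anyone curious what the projection step actually does, here is a minimal sketch of the spherical Mercator formula in plain JavaScript. This is not the code behind our map (D3 handles the projection for you); the function name and the width, height and scale defaults are assumptions chosen purely for illustration.

```javascript
// Minimal spherical Mercator projection, written out by hand for illustration.
// width, height and scale are hypothetical defaults, not values from our map.
function mercator(lonDeg, latDeg, { width = 960, height = 500, scale = 153 } = {}) {
  const lambda = (lonDeg * Math.PI) / 180; // longitude in radians
  const phi = (latDeg * Math.PI) / 180;    // latitude in radians
  // Longitude maps linearly to x, centred on the middle of the canvas.
  const x = width / 2 + scale * lambda;
  // Screen y grows downwards, so northern latitudes get smaller y values.
  const y = height / 2 - scale * Math.log(Math.tan(Math.PI / 4 + phi / 2));
  return [x, y];
}

// The point at 0° longitude, 0° latitude lands in the centre of the canvas.
const [x0, y0] = mercator(0, 0);
// A point at 60° north is drawn above the equator (smaller y).
const [, yNorth] = mercator(0, 60);
```

Note that the formula stretches high latitudes and is undefined at the poles, which is why Mercator maps are usually cropped before they reach them.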

Next, we extracted the geographical coordinates of the populations represented in SGDP from the metadata found on Repositive.io. They were manually transformed into TopoJSON and passed to the visualisation. Unlike a static image of a world map, the projection provided by D3 ensures that these points are placed correctly on the map across environments.
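
To give a feel for this step, here is a rough sketch of turning a list of population coordinates into point features. It uses plain GeoJSON rather than the TopoJSON we produced by hand, simply because it is easier to write out. The field names and the coordinates below are hypothetical and are not taken from the SGDP metadata.

```javascript
// Hypothetical metadata rows: the field names and values are made up for
// illustration and do not come from the SGDP metadata.
const populations = [
  { name: "Population A", latitude: 10.5, longitude: 20.25 },
  { name: "Population B", latitude: -33.0, longitude: 151.0 },
];

// Build a GeoJSON FeatureCollection with one Point feature per population.
// GeoJSON positions are [longitude, latitude], in that order.
function toFeatureCollection(rows) {
  return {
    type: "FeatureCollection",
    features: rows.map((row) => ({
      type: "Feature",
      properties: { name: row.name },
      geometry: { type: "Point", coordinates: [row.longitude, row.latitude] },
    })),
  };
}

const points = toFeatureCollection(populations);
```

The longitude-first ordering is the easiest thing to get wrong here: swap the two values and the dots end up in entirely the wrong place on the map.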

A more detailed portrayal of this process and the embedding of a D3 visualisation into an Ember.js application will be published as a separate blog post in the future.

Special thanks to Dennis for contributing his technical knowledge of D3.js to this blog post.

Getting the ‘300’ genomes:

279 of the genomes are consented for fully public data release. They can be found in ENA with accession number PRJEB9586. The remaining 21 genomes are deposited in EGA with accession number EGAS00001001959.
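
If you would rather list the public files programmatically, one option is to build a query for ENA's file report service. The sketch below only constructs the URL and assumes the ENA Portal API `filereport` endpoint and its `accession`, `result`, `fields` and `format` parameters as I understand them at the time of writing; the helper name and the choice of fields are my own, so check the ENA documentation before relying on it.

```javascript
// Build an ENA Portal API "filereport" URL for a study accession.
// Assumption: the endpoint and parameter names below match the current ENA
// Portal API. The default field list is only an example.
function enaFileReportUrl(
  accession,
  fields = ["run_accession", "sample_accession", "fastq_ftp"]
) {
  const url = new URL("https://www.ebi.ac.uk/ena/portal/api/filereport");
  url.search = new URLSearchParams({
    accession,                // study accession, e.g. PRJEB9586
    result: "read_run",       // one row per sequencing run
    fields: fields.join(","), // columns to include in the report
    format: "tsv",
  }).toString();
  return url.toString();
}

const reportUrl = enaFileReportUrl("PRJEB9586");
```

Fetching `reportUrl` should return a tab-separated table with one row per run. This only applies to the public ENA records, not to the genomes deposited in EGA.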


^1: The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Swapan Mallick et al., Nature 538, 201–206 (13 October 2016) | doi:10.1038/nature18964

Posted by
Charlie Whicher

Product Manager