Special thanks to Kemi Ifeonu for coordinating this blog post.
The Human Microbiome Project (HMP) was established by the NIH Common Fund in 2008 to generate resources that would enable the comprehensive characterization of the human microbiome and help us understand its role in human health and disease.
The first phase of the HMP was designed to provide a “healthy” microbiome baseline. Around 300 healthy individuals were sampled at 15 (men) or 18 (women) body sites. Over the five years of the project, 16S sequence data were generated for ~10,000 samples and whole metagenome shotgun (WMS) sequence data were generated for ~2,400 samples. In addition, the whole genomes of about 2,800 reference strains isolated from the human body were sequenced.
The second phase of the HMP is known as the Integrative Human Microbiome Project (iHMP). Using multiple omics technologies, this project integrates longitudinal datasets from both the microbiome and host from three different studies of microbiome-associated conditions: Pregnancy & Preterm Birth, Onset of Inflammatory Bowel Disease (IBD), and Onset of Type 2 Diabetes.
Some of the major datasets generated by the HMP projects are presented here:
About 2,800 reference genomes sequenced from strains isolated from human body sites.
HMP Reference Genomes Project Catalog. This resource provides access to metadata collected for the reference genomes including information such as the body site the organism was isolated from, taxonomy, annotation information and much more.
- Sequence and annotation data are available through the HMP DACC website and at NCBI.
How can you use these datasets?
The reference genome sequences can serve as a database for read mapping, and the resulting alignments aid in taxonomic assignment and functional annotation. Users can also build a BLAST database from a subset of the reference sequences (for example, by body site) and search their own sequences against it.
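As a rough illustration of the body-site subsetting idea, here is a minimal Python sketch. It assumes you have exported the project catalog as a CSV and downloaded the reference sequences as a FASTA file whose record IDs match the catalog. The column names (`genome_id`, `body_site`), the toy records, and both helper functions are hypothetical and only stand in for the real files.

```python
import csv
import io


def ids_for_body_site(catalog_file, body_site,
                      id_column="genome_id", site_column="body_site"):
    """Return the set of genome IDs whose body site matches (case-insensitive).

    The column names are assumptions; adjust them to the catalog export you use.
    """
    wanted = body_site.strip().lower()
    reader = csv.DictReader(catalog_file)
    return {row[id_column] for row in reader
            if row[site_column].strip().lower() == wanted}


def filter_fasta(fasta_lines, wanted_ids):
    """Yield the lines of every FASTA record whose ID is in wanted_ids.

    The record ID is taken to be the first whitespace-separated token
    after '>' on the header line.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in wanted_ids
        if keep:
            yield line


# Toy stand-ins for the real catalog and FASTA downloads (hypothetical data).
CATALOG = """genome_id,organism,body_site
G1,Example organism A,Oral
G2,Example organism B,Skin
G3,Example organism C,Oral
"""

FASTA = """>G1 contig1
ACGT
>G2 contig1
GGCC
>G3 contig1
TTAA
TTGG
"""

oral_ids = ids_for_body_site(io.StringIO(CATALOG), "oral")
oral_fasta = "".join(filter_fasta(io.StringIO(FASTA), oral_ids))
print(oral_fasta)
```

With real files, you would write the filtered records to disk (say, a hypothetical `oral_subset.fasta`) and then build the database with something like `makeblastdb -in oral_subset.fasta -dbtype nucl -out oral_refs` before running your BLAST searches against it.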
Metagenomic datasets for healthy individuals
We collected samples from 300 healthy adult men and women between the ages of 18 and 40, from five major body areas: oral cavity, nasal cavity, skin, gastrointestinal tract and urogenital tract. 16S rRNA and metagenomic WMS sequencing were performed on the samples using the 454 and Illumina platforms, respectively.
- Raw reads for both 16S and WMS sequencing, as well as processed 16S reads.
- Assemblies generated from WMS reads, both for each sample and for multiple samples grouped by body site.
- Annotated gene sequences by sample and a non-redundant version by body site.
- Community profiling data, which estimates organism abundance per sample and per body site. These datasets were generated using different tools and approaches: Shotgun community profiling, Shotgun MetaPhlAn community profiling, Mothur community profiling, and QIIME community profiling (http://www.hmpdacc.org/HMQCP).
- Metabolic analysis was performed using the HMP Unified Metabolic Analysis Network (HUMAnN), and results are available here.
- HMP Metagenomic Project Catalog. This resource provides access to metadata collected for the HMP metagenomic samples, including sample name, body site, de-identified subject ID, host gender, and visit number. Other clinical metadata (including age, smoking status, BMI, race, etc.) is kept confidential by NIH and is accessible only to authorized users through NCBI dbGaP.
How can you use this dataset?
Users can download the raw reads and perform independent analyses on either the full dataset or a subset. Subsets can be generated based on the metadata available through the project catalog or the clinical metadata available through NCBI dbGaP.
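To make the subsetting step concrete, the sketch below picks out samples by metadata before any reads are downloaded. It assumes a CSV export of the project catalog. The header names (`sample_name`, `body_site`, `subject_id`, `host_gender`, `visit_number`) mirror the fields described above but are hypothetical spellings, and the rows are invented toy data.

```python
import csv
import io


def select_samples(catalog_file, body_site=None, host_gender=None,
                   visit_number=None):
    """Return catalog rows matching every filter that is not None.

    All comparisons are done on strings, because csv.DictReader
    returns every field as text.
    """
    selected = []
    for row in csv.DictReader(catalog_file):
        if body_site is not None and row["body_site"] != body_site:
            continue
        if host_gender is not None and row["host_gender"] != host_gender:
            continue
        if visit_number is not None and row["visit_number"] != str(visit_number):
            continue
        selected.append(row)
    return selected


# Toy catalog export (hypothetical column names and values).
SAMPLE_CATALOG = """sample_name,body_site,subject_id,host_gender,visit_number
S1,stool,subj01,male,1
S2,stool,subj01,male,2
S3,tongue_dorsum,subj02,female,1
S4,stool,subj02,female,1
"""

first_visit_stool = select_samples(io.StringIO(SAMPLE_CATALOG),
                                   body_site="stool", visit_number=1)
first_visit_names = [row["sample_name"] for row in first_visit_stool]
print(first_visit_names)
```

The resulting list of sample names can then drive whichever download step you use, so that only the reads for the chosen subset are fetched.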
The processed datasets (e.g., gene indices, community profiles, metabolic pathways) can also be mined for interesting trends.
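One simple way to start mining a community profile is to summarize each sample with a diversity score. The sketch below computes the Shannon index from per-taxon abundances. It is only an illustration: the in-memory table, the taxon labels, and the sample names are made up, and the real profile files would need to be parsed into this shape first.

```python
import math


def shannon_index(abundances):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero abundance.

    Accepts raw counts or relative abundances; values are normalized here.
    Returns 0.0 for an empty or all-zero profile.
    """
    total = sum(abundances)
    if total <= 0:
        return 0.0
    h = 0.0
    for value in abundances:
        if value > 0:
            p = value / total
            h -= p * math.log(p)
    return h


# Hypothetical community profile: sample -> {taxon: abundance}.
profile = {
    "sample_A": {"Taxon1": 50, "Taxon2": 50},
    "sample_B": {"Taxon1": 100, "Taxon2": 0},
}

diversity = {sample: shannon_index(list(taxa.values()))
             for sample, taxa in profile.items()}

for sample, h in sorted(diversity.items()):
    print(f"{sample}\t{h:.3f}")
```

Comparing such scores across body sites or visits is one possible starting point; the same pattern applies to other per-sample summaries of the gene or pathway tables.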
The iHMP is in progress and has started generating integrated longitudinal datasets of biological properties. These datasets include 16S rRNA gene surveys, host genome sequences, host transcriptomes, whole metagenome shotgun sequences, metatranscriptomes, proteomes, and metabolomes. Some of these data are already available, and more datasets will be released soon.