Enterotypes: Reference-Based Assignments

Enterotype Assignments

An enterotype classification model, fit on 278 MetaHIT samples (E. Le Chatelier et al., 2013), assigns enterotypes to taxonomic profiles generated from new human gut samples. To ensure that the model is applicable, the tool first checks whether the input samples have a microbial composition similar to that of other stool samples from the HMP and MetaHIT studies (C. Huttenhower et al., 2012; E. Le Chatelier et al., 2013).

Please select a tab-delimited file with relative abundances summarized at genus level. An example file is available here. You can run the classifier directly on this example dataset.


Background information

Enterotype assignments (ET_B, ET_P or ET_F, representing the Bacteroides, Prevotella and Firmicutes enterotypes, respectively) provided by this classification tool are independent of de novo clustering. This allows for robust identification based on the original enterotype definition (M. Arumugam et al., 2011), even in datasets that are too small for de novo clustering.

For this purpose we trained a classifier on 278 MetaHIT samples (E. Le Chatelier et al., 2013) and showed that it accurately recovers the intrinsic clustering found in a large Chinese microbiome study on type 2 diabetes (J. Qin et al., 2012) and in the fecal microbiome data from the HMP (C. Huttenhower et al., 2012) (see Supplementary Figure 10 in the manuscript).

To ensure meaningful results, i.e. applicability of the enterotype classifier, the tool checks whether input samples are similar in genus composition to previously observed stool samples, based on their distances to the HMP (C. Huttenhower et al., 2012) and MetaHIT (E. Le Chatelier et al., 2013) datasets. We note that this is the case for samples from a Chinese type 2 diabetes study (J. Qin et al., 2012), but not, for example, for Malawian infant gut microbiome samples (from T. Yatsunenko et al., 2012) or for samples from other human body sites (collected by the HMP, C. Huttenhower et al., 2012). For each input sample, we provide a boolean value indicating whether it is likely to be drawn from the reference space, so that users can judge the applicability of enterotype assignments to their data.
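The distance-based check described above can be illustrated with a minimal sketch. It is not the server's implementation: the distance measure (Jensen-Shannon distance), the nearest-neighbour rule, the threshold value and the function names js_distance and within_reference_space are all assumptions made for illustration, and the reference profiles are made-up numbers rather than HMP or MetaHIT data.

```python
from math import log2, sqrt

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 divergence)."""
    def kl(x, m):
        # Terms with x_i == 0 contribute nothing to the divergence.
        return sum(xi * log2(xi / mi) for xi, mi in zip(x, m) if xi > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

def within_reference_space(sample, reference, threshold):
    """Hypothetical rule: True if the nearest reference profile is close enough."""
    nearest = min(js_distance(sample, ref) for ref in reference)
    return nearest <= threshold

# Made-up genus-level profiles (each sums to 1), for illustration only.
reference = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
similar = within_reference_space([0.55, 0.35, 0.10], reference, threshold=0.2)
outlier = within_reference_space([0.0, 0.0, 1.0], reference, threshold=0.2)
```

With these toy values the first profile falls inside the assumed threshold and the second does not; a real check would depend on the metric and cut-off actually used by the classifier.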


The flow diagram presents the steps for determining enterotype assignments from microbial abundance data. Two main routes to enterotyping are depicted: either de novo identification of enterotypes (discovery, for which the methodology was previously described) or assignment based on a given reference dataset, the latter of which is implemented here (highlighted by the yellow box). The suitability of existing models imposed on the data to describe the composition landscape (1) can, for instance, be assessed by determining the existence of cluster structure, using one of the proposed clustering validation measures or a DMM modeling framework (I. Holmes et al., 2012).

This web server facilitates reference-based enterotype assignment (2). As a first step it checks whether samples are within the enterotype (reference) space, based on similarity in composition to adult human stool samples from the HMP (C. Huttenhower et al., 2012) and MetaHIT (E. Le Chatelier et al., 2013) studies. There are many reasons why fecal samples may have a different compositional structure (3), for example when they come from non-Western individuals or from infants. Technical issues that can skew taxonomic profiles also need to be considered; these include DNA extraction, PCR primers and bioinformatics preprocessing. The consistency of the separation (4) obtained from the classifier can be validated using a Silhouette index.
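As an illustration of the Silhouette index mentioned for step (4), the following minimal sketch computes the mean silhouette width of a set of labelled profiles. The Euclidean distance is an assumption chosen for simplicity (the text above does not state which distance to use), the function names are hypothetical, and the profiles are made-up values.

```python
from math import sqrt

def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def silhouette_index(profiles, labels, distance=euclidean):
    """Mean silhouette width over all samples, given one label per sample."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(distance(profiles[i], profiles[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(distance(profiles[i], profiles[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Made-up profiles: two Bacteroides-rich and two Prevotella-rich samples.
profiles = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.7, 0.1]]
consistent = silhouette_index(profiles, ["ET_B", "ET_B", "ET_P", "ET_P"])
scrambled = silhouette_index(profiles, ["ET_B", "ET_P", "ET_B", "ET_P"])
```

Values close to 1 indicate that samples sit much closer to their own group than to any other; values near or below 0, as for the scrambled labels here, indicate an inconsistent separation.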

Input format

The classifier is trained on taxonomic profiles summarized at genus level (as described in the manuscript and used for the initial enterotype discovery by M. Arumugam et al., 2011). Thus, the input should be a tab-delimited file, with genera as rows and samples as columns. It can optionally be compressed with GZIP or BZIP2. The first line must contain the list of sample IDs. All other lines should start with the genus taxonomic name (which will internally be matched to the NCBI taxonomy database) or a semicolon (;)-separated taxonomic lineage as used in e.g. Greengenes, followed by the relative abundance values for each sample. It is important that these relative abundance values sum to 1 for each sample (i.e. each column).

For example, the file may look like:
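As a hypothetical illustration of this layout, the sketch below embeds a small made-up table (the sample IDs and abundance values are invented, not taken from the server's example file) and checks it before upload. The function name read_genus_table, the tolerance and the handling of lineages (keeping the last non-empty entry as the genus) are assumptions; the server's own parsing may differ.

```python
import csv
import io

def read_genus_table(text, tolerance=1e-6):
    """Parse a tab-delimited genus table and list samples whose column does not sum to 1."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    if len(rows) < 2:
        raise ValueError("expected a header line and at least one genus row")
    header, body = rows[0], rows[1:]
    n_samples = len(body[0]) - 1
    if n_samples < 1:
        raise ValueError("no abundance columns found")
    # The header may or may not have a (possibly empty) cell above the genus column.
    sample_ids = header[-n_samples:]
    table = {}
    for row in body:
        if len(row) != n_samples + 1:
            raise ValueError(f"row {row[0]!r} has {len(row) - 1} values, expected {n_samples}")
        # For a semicolon-separated lineage, keep the last non-empty entry.
        parts = [p.strip() for p in row[0].split(";") if p.strip()]
        genus = parts[-1]
        values = [float(v) for v in row[1:]]
        if genus in table:  # merge rows that resolve to the same genus
            table[genus] = [a + b for a, b in zip(table[genus], values)]
        else:
            table[genus] = values
    off_columns = [
        sid for j, sid in enumerate(sample_ids)
        if abs(sum(v[j] for v in table.values()) - 1.0) > tolerance
    ]
    return sample_ids, table, off_columns

# Made-up example: sampleB deliberately sums to 0.9 rather than 1.
example = (
    "\tsampleA\tsampleB\n"
    "Bacteroides\t0.5\t0.2\n"
    "Prevotella\t0.25\t0.6\n"
    "Ruminococcus\t0.25\t0.1\n"
)
sample_ids, table, off_columns = read_genus_table(example)
```

Here off_columns lists the samples that would need renormalizing before submission. A compressed file could be read first with Python's gzip or bz2 modules and the decoded text passed to the same function.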


Output format

The classifier produces two values for each sample. The first is the enterotype assignment, encoded as "ET_B" (Bacteroides enriched), "ET_P" (Prevotella enriched) or "ET_F" (Firmicutes enriched). The second is a boolean value indicating whether the given sample is compositionally similar to other stool samples from the reference space. If this value is false, the sample is strongly dissimilar to the stool microbiome profiles of the reference space, and the enterotype assignment for this (outlier) sample might be inappropriate. If more than 50% of the values in this column are false, it is likely that the uploaded dataset has a considerable batch effect (or indeed does not come from human stool samples), and reference-based enterotype assignments should not be trusted. The raw analysis output summarizes the results as a tab-separated text file with sample identifiers, enterotype assignments and stool-reference similarity in its columns (see header).
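The 50% rule above can be applied to the raw output with a few lines of code. This is a minimal sketch under assumptions: the column order (identifier, enterotype, similarity flag), the header names in the made-up example and the accepted spellings of "false" are hypothetical, so the indices should be adjusted to the header of the actual output file.

```python
import csv
import io

def summarize_assignments(tsv_text, et_col=1, flag_col=2):
    """Count enterotypes and report the fraction of samples flagged as outliers."""
    rows = [r for r in csv.reader(io.StringIO(tsv_text), delimiter="\t") if r]
    records = rows[1:]  # skip the header line
    counts = {}
    n_outliers = 0
    for rec in records:
        counts[rec[et_col]] = counts.get(rec[et_col], 0) + 1
        # Assumed encodings of a false similarity flag.
        if rec[flag_col].strip().lower() in ("false", "f", "0"):
            n_outliers += 1
    fraction = n_outliers / len(records) if records else 0.0
    # More than half of the samples outside the reference space: do not trust.
    return counts, fraction, fraction > 0.5

# Made-up output with hypothetical column names.
raw_output = (
    "sample_id\tenterotype\twithin_reference_space\n"
    "s1\tET_B\tTRUE\n"
    "s2\tET_P\tFALSE\n"
    "s3\tET_F\tFALSE\n"
    "s4\tET_B\tFALSE\n"
)
counts, outlier_fraction, untrusted = summarize_assignments(raw_output)
```

In this toy example three of four samples are flagged, so the dataset as a whole would be treated as unsuitable for reference-based assignment.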