1. Introduction

phyloseq (McMurdie & Holmes, 2013) is a package created for the analysis of high-throughput microbiome data. Specifically, it was designed to import, analyze, and graphically display phylogenetic sequencing data clustered into Operational Taxonomic Units (OTUs), or Amplicon Sequence Variants (ASVs), depending on the raw sequence analysis. phyloseq uses a specialized system of S4 classes, A.K.A phyloseq objects, to store all necessary data and metadata into a single object. This makes working with complex microbiome datasets much easier. Read more here.

As opposed to amplicon sequencing of the 16S rRNA gene used for phylogenetic studies, taxonomic annotation of reads after shotgun metagenomic sequencing does not result in OTUs or ASVs. However, datasets do have a similar structure, i.e. taxonomy, read counts, and metadata, making it possible to use tools such as phyloseq for an easier analysis.

Therefore, this walkthrough shows how to:
- wrangle non-OTU sequencing data using tidyverse (Wickham et al., 2019) into data that phyloseq can use as an input,
- initially inspect data with ordination plots and stacked barplots.

Data used for this walkthrough (Table 1) originates from shotgun metagenomic sequencing reads that were taxonomically classified using Kaiju (Menzel et. al., 2016) and the NCBI-nr database.¹

Data structure

After classification, the taxonomy table looks like this:

Table 1

The first column (SampleID) is taxonomy, which is delimited by semicolons, and other columns are sample names.

The matching metadata table looks like this:

Table 2

phyloseq requires three tables:
- an OTU table
- a taxonomy table
- a metadata table.

2. Data input and wrangling

First, we will load the libraries and then input the tables.

As Table 1 consists of both taxonomy and the number of assigned reads, we will first input the whole table, and then split it up so that the first column becomes a taxonomy table, and the other columns become an OTU table. Then we will input the metadata table, and create a phyloseq object that unifies all three.

library(tidyverse)
library(phyloseq)

2.1. OTU table

OTUs have to be row names and therefore unique, which is the crucial feature. In Table 1, the row names are not OTU names, but they are all unique, so we will use this first column of Table 1 as the row names for the OTU table.

otu<- read.delim("otutable.species.txt", sep = "\t", header = TRUE)

#always check the structure of the object to make sure everything is as you expect
str(otu)
## 'data.frame':    20635 obs. of  10 variables:
##  $ SampleID: Factor w/ 20635 levels "Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium ailaaui",..: 1 2 3 5 6 7 8 9 11 12 ...
##  $ RA0_A   : int  263 218 4 229 919 626 254 246 215 8 ...
##  $ RA1_A   : int  181 169 4 196 640 462 231 218 146 6 ...
##  $ RA1_B   : int  184 146 5 159 493 384 151 156 140 4 ...
##  $ RA1_C   : int  213 172 4 196 635 455 218 189 171 7 ...
##  $ RA3_A   : int  220 184 7 211 597 469 216 232 180 5 ...
##  $ RA0_B   : int  143 123 0 153 425 337 133 160 116 1 ...
##  $ RA3_B   : int  262 220 4 254 713 520 233 2 173 3 ...
##  $ RA3_C   : int  177 153 5 195 560 392 184 192 155 4 ...
##  $ RA0_C   : int  86 60 1 89 255 209 90 74 65 0 ...

#OTU ID's have to be the rownames
rownames(otu)<- otu$SampleID

#check if the number of reads per sample is correct 
# (e.g. sample RA0_A has 1364214 bacterial reads, so make sure that is correct)
count <- colSums(otu[, c(2:ncol(otu))])

count
##   RA0_A   RA1_A   RA1_B   RA1_C   RA3_A   RA0_B   RA3_B   RA3_C   RA0_C 
## 1364214  913696  744653  825750 1164963 1147683 1251035  986966  728167

2.2. Taxonomy table

The taxonomy table is created from the SampleID column of Table 1. This column consists of taxonomy levels separated by semicolons, which we will split into different columns.

taxa<- otu %>% 
  select(SampleID) %>% 
  separate(SampleID, c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"),
           "; ")  
## Warning: Expected 7 pieces. Missing pieces filled with `NA` in 1 rows [20635].

 
#This warning appears because the last row are Unassigned reads, and they do not have taxa levels.

str(taxa) 
## 'data.frame':    20635 obs. of  7 variables:
##  $ Domain : chr  "Bacteria" "Bacteria" "Bacteria" "Bacteria" ...
##  $ Phylum : chr  "Acidobacteria" "Acidobacteria" "Acidobacteria" "Acidobacteria" ...
##  $ Class  : chr  "Acidobacteriia" "Acidobacteriia" "Acidobacteriia" "Acidobacteriia" ...
##  $ Order  : chr  "Acidobacteriales" "Acidobacteriales" "Acidobacteriales" "Acidobacteriales" ...
##  $ Family : chr  "Acidobacteriaceae" "Acidobacteriaceae" "Acidobacteriaceae" "Acidobacteriaceae" ...
##  $ Genus  : chr  "Acidobacterium" "Acidobacterium" "Acidobacterium" "Bryocella" ...
##  $ Species: chr  "Acidobacterium ailaaui" "Acidobacterium capsulatum" "NA" "Bryocella elongata" ...

#the output is a data frame of characters, and we need taxa to be recognized as factors
taxa<- taxa %>% 
  mutate_if(is.character, as.factor)

str(taxa)
## 'data.frame':    20635 obs. of  7 variables:
##  $ Domain : Factor w/ 2 levels "Bacteria","Unassigned": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Phylum : Factor w/ 136 levels "","Acidobacteria",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Class  : Factor w/ 74 levels "Acidimicrobiia",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Order  : Factor w/ 170 levels "Acholeplasmatales",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Family : Factor w/ 385 levels "Acaryochloridaceae",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ Genus  : Factor w/ 2017 levels "Abiotrophia",..: 27 27 27 277 356 664 831 831 831 831 ...
##  $ Species: Factor w/ 19188 levels "'Deinococcus soli' Cha et al. 2014",..: 317 318 11137 2924 4095 7112 8421 8422 8424 11137 ...


#add the first column from the OTU table so that OTU and taxa tables match
taxa<- cbind(otu$SampleID, taxa)

#rename the first column
colnames(taxa)[1]<- "SampleID"
str(taxa)
## 'data.frame':    20635 obs. of  8 variables:
##  $ SampleID: Factor w/ 20635 levels "Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium ailaaui",..: 1 2 3 5 6 7 8 9 11 12 ...
##  $ Domain  : Factor w/ 2 levels "Bacteria","Unassigned": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Phylum  : Factor w/ 136 levels "","Acidobacteria",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Class   : Factor w/ 74 levels "Acidimicrobiia",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Order   : Factor w/ 170 levels "Acholeplasmatales",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Family  : Factor w/ 385 levels "Acaryochloridaceae",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ Genus   : Factor w/ 2017 levels "Abiotrophia",..: 27 27 27 277 356 664 831 831 831 831 ...
##  $ Species : Factor w/ 19188 levels "'Deinococcus soli' Cha et al. 2014",..: 317 318 11137 2924 4095 7112 8421 8422 8424 11137 ...

#OTU ID's have to be the rownames, same as in the OTU table
rownames(taxa)<- taxa$SampleID

#now you can delete the first columns of both OTU and taxa tables
# because they are now the rownames, and therefore redundant in the first column
otu<- otu %>% 
  select(-SampleID)

taxa<- taxa %>% 
  select(-SampleID)

2.3. Metadata table

The metadata table contains a description of sample names (Table 2). This can include everything that describes a sample; sample origin, type, sampling time etc., depending on the experimental design.
Samples in this walkthrough are soil samples taken from fields of agricultural reclamation sites of different age, e.g. RA1_A = one year old reclamation field, plot A.

meta<- read.delim("metadata.txt", sep = "\t", header = TRUE)
str(meta)
## 'data.frame':    9 obs. of  3 variables:
##  $ SampleID: Factor w/ 9 levels "RA0_A","RA0_B",..: 1 4 5 6 7 2 8 9 3
##  $ Age     : Factor w/ 3 levels "RA0","RA1","RA3": 1 2 2 2 3 1 3 3 1
##  $ Plot    : Factor w/ 3 levels "a","b","c": 1 1 2 3 1 2 2 3 3

#make sample names row names 
rownames(meta)<- meta$SampleID

#and delete the first column because it is now redundant
meta<- meta %>% 
        select(-SampleID)

#make proper levels of the factor so that they are in the wanted order in figures
meta$Age<- factor(meta$Age, levels = c("RA0", "RA1", "RA3"))
meta$Plot<- factor(meta$Plot, levels = c("a", "b", "c"))

3. phyloseq

OTU and taxonomy tables have to be matrices, and then all three tables have to be transformed to phyloseq objects before unifying them into one object.

otu_mat<- as.matrix(otu)
tax_mat<- as.matrix(taxa)

#transform data to phyloseq objects
phylo_OTU<- otu_table(otu_mat, taxa_are_rows = TRUE)
phylo_TAX<- tax_table(tax_mat)
phylo_samples<- sample_data(meta)

#and put them in one object
phylo_object<- phyloseq(phylo_OTU, phylo_TAX, phylo_samples)

#check if everything looks good
sample_sums(phylo_object)       #should sum up the number of all reads per sample
##   RA0_A   RA1_A   RA1_B   RA1_C   RA3_A   RA0_B   RA3_B   RA3_C   RA0_C 
## 1364214  913696  744653  825750 1164963 1147683 1251035  986966  728167

sample_names(phylo_object)      #sample names
## [1] "RA0_A" "RA1_A" "RA1_B" "RA1_C" "RA3_A" "RA0_B" "RA3_B" "RA3_C" "RA0_C"

rank_names(phylo_object)        #taxa levels  
## [1] "Domain"  "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

sample_variables(phylo_object)  #factors
## [1] "Age"  "Plot"

otu_table(phylo_object)[1:3, 1:2]
## OTU Table:          [3 taxa and 2 samples]
##                      taxa are rows
##                                                                                                                         RA0_A
## Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium ailaaui      263
## Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium capsulatum   218
## Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; NA                            4
##                                                                                                                         RA1_A
## Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium ailaaui      181
## Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium capsulatum   169
## Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; NA                            4

taxa_names(phylo_object)[1:5]
## [1] "Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium ailaaui"                 
## [2] "Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium capsulatum"              
## [3] "Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; NA"                                     
## [4] "Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Bryocella; Bryocella elongata"                          
## [5] "Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Candidatus Koribacter; Candidatus Koribacter versatilis"

3.1. Relativization

Subsequent analyses are based on relative data. Since here we have only the reads that were assigned to Bacteria, we will relativize them to the total number of bacterial reads per sample and show them as percentages.

phylo_rel<- transform_sample_counts(phylo_object, function(x) x*100 / sum(x))

#check if everything looks good
sample_sums(phylo_rel)          #should sum up to 100% per sample
## RA0_A RA1_A RA1_B RA1_C RA3_A RA0_B RA3_B RA3_C RA0_C 
##   100   100   100   100   100   100   100   100   100

sample_names(phylo_rel)         #sample names
## [1] "RA0_A" "RA1_A" "RA1_B" "RA1_C" "RA3_A" "RA0_B" "RA3_B" "RA3_C" "RA0_C"

rank_names(phylo_rel)           #taxa levels  
## [1] "Domain"  "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

sample_variables(phylo_rel)     #factors
## [1] "Age"  "Plot"

otu_table(phylo_rel)[1:3, 1:2]
## OTU Table:          [3 taxa and 2 samples]
##                      taxa are rows
##                                                                                                                                RA0_A
## Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium ailaaui    0.0192785003
## Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium capsulatum 0.0159798976
## Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; NA                        0.0002932091
##                                                                                                                                RA1_A
## Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium ailaaui    0.0198096522
## Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium capsulatum 0.0184963051
## Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; NA                        0.0004377824

taxa_names(phylo_rel)[1:5]
## [1] "Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium ailaaui"                 
## [2] "Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium capsulatum"              
## [3] "Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Acidobacterium; NA"                                     
## [4] "Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Bryocella; Bryocella elongata"                          
## [5] "Bacteria; Acidobacteria; Acidobacteriia; Acidobacteriales; Acidobacteriaceae; Candidatus Koribacter; Candidatus Koribacter versatilis"

#to start getting the feel of the data, let´s check what had the highest abundance

max(phylo_rel@otu_table)        #the highest relative abundance
## [1] 2.744208

rownames(phylo_rel@otu_table)[which.max(apply(phylo_rel@otu_table,MARGIN=1,max))] #row name
## [1] "Bacteria; Acidobacteria; NA; NA; NA; NA; NA"
# of the highest relative abundance

#the NAs appear when reads were not assigned down to the respective taxonomy level.  

phylo_rel@otu_table['Bacteria; Acidobacteria; NA; NA; NA; NA; NA'] #extract that whole row
## OTU Table:          [1 taxa and 9 samples]
##                      taxa are rows
##                                                RA0_A    RA1_A   RA1_B    RA1_C
## Bacteria; Acidobacteria; NA; NA; NA; NA; NA 1.853741 2.357568 2.33921 2.728671
##                                                RA3_A   RA0_B    RA3_B    RA3_C
## Bacteria; Acidobacteria; NA; NA; NA; NA; NA 2.178181 1.15145 2.744208 2.177583
##                                                RA0_C
## Bacteria; Acidobacteria; NA; NA; NA; NA; NA 1.184206

3.2. Visual inspection of the data

3.2.1. Ordination

To get the first visual overview of the data, we will plot an ordination. phyloseq has built-in functions (incorporated from the vegan package (Oksanen et al., 2019)), which calculate different ordinations based on different distance matrices. Here, we will make a non-metric multidimensional scaling (NMDS) and a PCoA ordination based on the Bray-Curtis dissimilarity and then plot them.

#NMDS plot
phylo_rel_nmds<- ordinate(phylo_rel, method = "NMDS", distance = "bray")
## Run 0 stress 6.744911e-05 
## Run 1 stress 9.675111e-05 
## ... Procrustes: rmse 0.00264946  max resid 0.007314832 
## ... Similar to previous best
## Run 2 stress 9.774256e-05 
## ... Procrustes: rmse 0.006300876  max resid 0.01746 
## Run 3 stress 9.842386e-05 
## ... Procrustes: rmse 0.1186001  max resid 0.2961391 
## Run 4 stress 8.647363e-05 
## ... Procrustes: rmse 0.05988903  max resid 0.1604638 
## Run 5 stress 0.2510796 
## Run 6 stress 9.9043e-05 
## ... Procrustes: rmse 0.01669719  max resid 0.04634518 
## Run 7 stress 9.064597e-05 
## ... Procrustes: rmse 0.06823131  max resid 0.1814567 
## Run 8 stress 9.468441e-05 
## ... Procrustes: rmse 0.09917985  max resid 0.2547392 
## Run 9 stress 9.700604e-05 
## ... Procrustes: rmse 0.01930361  max resid 0.05359984 
## Run 10 stress 9.882028e-05 
## ... Procrustes: rmse 0.05874377  max resid 0.1575471 
## Run 11 stress 9.311463e-05 
## ... Procrustes: rmse 0.0526409  max resid 0.1418732 
## Run 12 stress 8.836167e-05 
## ... Procrustes: rmse 0.02634627  max resid 0.07220011 
## Run 13 stress 9.96323e-05 
## ... Procrustes: rmse 0.01126281  max resid 0.0312384 
## Run 14 stress 8.379194e-05 
## ... Procrustes: rmse 0.1037141  max resid 0.2647588 
## Run 15 stress 0.2786511 
## Run 16 stress 9.28683e-05 
## ... Procrustes: rmse 0.1266287  max resid 0.311991 
## Run 17 stress 9.805165e-05 
## ... Procrustes: rmse 0.02206489  max resid 0.06127762 
## Run 18 stress 9.599063e-05 
## ... Procrustes: rmse 0.004330536  max resid 0.01194683 
## Run 19 stress 9.377851e-05 
## ... Procrustes: rmse 0.01804013  max resid 0.04962881 
## Run 20 stress 9.033382e-05 
## ... Procrustes: rmse 0.07469049  max resid 0.1973813 
## *** Solution reached
## Warning in metaMDS(veganifyOTU(physeq), distance, ...): stress is (nearly) zero:
## you may have insufficient data


#we can check what the stressplot (Shepard plot) looks like
vegan::stressplot(phylo_rel_nmds)

#R2 is 1, but the plot looks strange and the warning after the NMDS calculation suggest that NMDS
# might not be the best choice for this data.

#NMDS plot
plot_ordination(phylo_rel, phylo_rel_nmds,
                color = "Age", shape = "Plot") +
  theme_bw()

#PCoA
phylo_rel_pcoa<- ordinate(phylo_rel, method = "PCoA", distance = "bray")

#PCoA plot
plot_ordination(phylo_rel, phylo_rel_pcoa,
                color = "Age", shape = "Plot") +
  theme_bw()

3.2.2. Stacked barplots

For stacked barplots, we will work with means of replicates using the merge_samples function. The default of this function is merge_samples(x, group, fun= mean).

However, in further description of the arguments, specifically fun, it says:

Default is mean. Note that this is (currently) ignored for the otu_table, where the equivalent function is sum, but evaluated via rowsum for efficiency.

Therefore, in order to really get the average, we have to divide the result we get with the default function by the number of replicates.

phylo_rel_mean <- merge_samples(phylo_rel, "Age")

sample_sums(phylo_rel_mean) #sums up to 300!
## RA0 RA1 RA3 
## 300 300 300

#now divide all relative OTU abundances by the number of replicates
phylo_rel_mean<- transform_sample_counts(phylo_rel_mean, function(x) x / 3)

sample_sums(phylo_rel_mean)
## RA0 RA1 RA3 
## 100 100 100

Stacked barplots can show taxonomic levels of choice. We will create a plot that shows the phylum level. First, we will agglomerate data to the phylum level using the tax_glom function, and then plot the agglomerated data.

Be careful when using the tax_glom function because the default behaviour is to remove NAs (those which are unassigned on the desired level), or agglomerate all of them together. This also includes empty cells or any strange/bad cells. Read more about this on ?tax_glom before use. Always check this using the sample_sums function, and by plotting the resulting object. With stacked barplots, the NA stack will be large if there are a lot of NAs.

You can fix this by manually changing the cells. For example, if you are doing an analysis of highly abundant bacterial families and you end up with a lot of NAs, you can check the taxonomy table and see rows that look something like this:

‘Bacteria; Chloroflexi; NA; NA; NA; NA; Chloroflexi bacterium OLB14’

You can manually change this row to replace the NA with a value in the family column:

physeq_top_fam@tax_table[‘Bacteria; Chloroflexi; NA; NA; NA; NA; Chloroflexi bacterium OLB14’,‘Family’]<- “uncultured Chloroflexi”

By doing this, the NA stack in your stacked barplot will shrink, and a new “uncultured Chloroflexi” stack will appear. You can do this for as many rows as you want.

#check taxonomy level names
rank_names(phylo_rel_mean)
## [1] "Domain"  "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

# agglomeration on the phylum level
phylo_rel_rel_mean_phylum<- tax_glom(phylo_rel_mean, taxrank = "Phylum")

#check the number of taxa in the whole phyloseq object
phylo_rel_mean 
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 20635 taxa and 3 samples ]
## sample_data() Sample Data:       [ 3 samples by 2 sample variables ]
## tax_table()   Taxonomy Table:    [ 20635 taxa by 7 taxonomic ranks ]

#check to how many phyla were these assigned to
phylo_rel_rel_mean_phylum 
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 135 taxa and 3 samples ]
## sample_data() Sample Data:       [ 3 samples by 2 sample variables ]
## tax_table()   Taxonomy Table:    [ 135 taxa by 7 taxonomic ranks ]

#check sample sums to make sure nothing was deleted due to NAs (should sum up 100%)
sample_sums(phylo_rel_rel_mean_phylum)
## RA0 RA1 RA3 
## 100 100 100

plot_bar(phylo_rel_rel_mean_phylum) +
  theme_bw()

#if we try to color by Phylum to see their names, 
# we can see that 135 are too many to see in a stacked barplot
plot_bar(phylo_rel_rel_mean_phylum, fill = "Phylum")

Since 135 phyla are too many to plot in a stacked barplot, we will filter the low abundant ones and put them into one category. To do this, we will use the power of tidyverse again. First, we will create a normal data frame out of the phyloseq object and then add another column where all taxa with abundance lower than 1% will be renamed to “< 1%”.

#transform phyloseq object to a data frame
phylo_rel_rel_mean_phylumDF<- psmelt(phylo_rel_rel_mean_phylum)

str(phylo_rel_rel_mean_phylumDF)
## 'data.frame':    405 obs. of  7 variables:
##  $ OTU      : chr  "Bacteria; Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Sphingomonas; Sphingomonas jaspsi" "Bacteria; Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Sphingomonas; Sphingomonas jaspsi" "Bacteria; Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Sphingomonas; Sphingomonas jaspsi" "Bacteria; Actinobacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces; NA" ...
##  $ Sample   : chr  "RA0" "RA3" "RA1" "RA0" ...
##  $ Abundance: num  45.3 41.6 40.5 21.9 17.6 ...
##  $ Age      : num  1 3 2 1 3 2 1 2 3 2 ...
##  $ Plot     : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ Domain   : Factor w/ 1 level "Bacteria": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Phylum   : Factor w/ 135 levels "Acidobacteria",..: 128 128 128 2 2 2 5 1 5 5 ...

#make the phyla characters, not factors
phylo_rel_rel_mean_phylumDF$Phylum<- as.character(phylo_rel_rel_mean_phylumDF$Phylum)

#add new column with renamed low abundant taxa
phylo_rel_rel_mean_phylumDF<- phylo_rel_rel_mean_phylumDF %>% 
  mutate(Phylum2 = replace(Phylum, Abundance < 1, "< 1%"))

#check all phyla names
unique(phylo_rel_rel_mean_phylumDF$Phylum2)
##  [1] "Proteobacteria"   "Actinobacteria"   "Bacteroidetes"    "Acidobacteria"   
##  [5] "Chloroflexi"      "Planctomycetes"   "Verrucomicrobia"  "NA"              
##  [9] "Firmicutes"       "Gemmatimonadetes" "Cyanobacteria"    "Nitrospirae"     
## [13] "< 1%"

#there are some reads that were assigned only to the domain level, 
# i.e. NA on the phylum level, so we will rename them
phylo_rel_rel_mean_phylumDF<- phylo_rel_rel_mean_phylumDF %>% 
  mutate(Phylum2 = replace(Phylum2, Phylum2 == "NA", "unassigned Bacteria"))

#reorder the phyla so that they are stacked according to abundance
phylo_rel_rel_mean_phylumDF$Phylum2<- reorder(phylo_rel_rel_mean_phylumDF$Phylum2,
                                              phylo_rel_rel_mean_phylumDF$Abundance)

#check how many unique phyla are there to find discrete colors for them
unique(phylo_rel_rel_mean_phylumDF$Phylum2)
##  [1] Proteobacteria      Actinobacteria      Bacteroidetes      
##  [4] Acidobacteria       Chloroflexi         Planctomycetes     
##  [7] Verrucomicrobia     unassigned Bacteria Firmicutes         
## [10] Gemmatimonadetes    Cyanobacteria       Nitrospirae        
## [13] < 1%               
## 13 Levels: < 1% Nitrospirae Cyanobacteria Gemmatimonadetes ... Proteobacteria

ggplot(phylo_rel_rel_mean_phylumDF, aes(Sample, Abundance, fill=Phylum2)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#bd6b8f",
                              "#6db543",
                              "#7661cd",
                              "#c0b047",
                              "#c35abc",
                              "#60bf8b",
                              "#d13f73",
                              "#3e8149",
                              "#ca5340",
                              "#45b0cf",
                              "#cc8444",
                              "#7882c9",
                              "#7a7732")) +
  labs(y= "Relative abundance [%]",
       fill= "Phlya") +
  theme_bw()

Tip: it can be rather difficult to find acceptable colors when you have many discrete variables. Check out this great open-source tool, which allows you to randomly generate as many colors as you need, and gives them in a nice, easily copy-pasteable form - https://medialab.github.io/iwanthue/

4. Conclusion

There are a lot of bioinformatic tools out there that were designed for specific data types. If we learn how to transform and adapt our data, we can use the power of great open-source tools without having to “reinvent the wheel” for every new dataset.

However, one should always use these tools with care. Most of us probably don´t look “under the hood” to check how every function works, which can result in mistakes or unwanted outcomes. To minimize this, one should always strive to be mindful during the analysis, and double-check all the steps to make sure the output is reasonable and correct.

Acknowledgments

A big thank you to everyone who posts their problems and solutions on the internet, and to the wonderful #rstats online community. Special thanks goes to Dr. Antonios Michas for discussions and ideas, and Johan Sebastián Sáenz for great feedback and suggestions that improved this walkthrough.

References

McMurdie, P. J., Holmes, S. (2013) pyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 8(4):e61217.

Menzel, P., Ng, K.L., and Krogh, A. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications 7: 11257.

Jari Oksanen, F. Guillaume Blanchet, Michael Friendly, Roeland Kindt, Pierre Legendre, Dan McGlinn, Peter R. Minchin, R. B. O’Hara, Gavin L. Simpson, Peter Solymos, M. Henry H. Stevens, Eduard Szoecs and Helene Wagner (2019). vegan: Community Ecology Package. R package version 2.5-6.

Wickham et al. (2019) Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.

Operation and session info

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] phyloseq_1.24.2 forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3    
##  [5] purrr_0.3.3     readr_1.3.1     tidyr_1.0.0     tibble_2.1.3   
##  [9] ggplot2_3.2.1   tidyverse_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.40.0      httr_1.4.1          splines_3.6.1      
##  [4] jsonlite_1.6        foreach_1.4.7       modelr_0.1.5       
##  [7] assertthat_0.2.1    stats4_3.6.1        cellranger_1.1.0   
## [10] yaml_2.2.0          pillar_1.4.2        backports_1.1.5    
## [13] lattice_0.20-38     glue_1.3.1          digest_0.6.22      
## [16] XVector_0.20.0      rvest_0.3.5         colorspace_1.4-1   
## [19] htmltools_0.4.0     Matrix_1.2-17       plyr_1.8.4         
## [22] pkgconfig_2.0.3     broom_0.5.2         haven_2.2.0        
## [25] zlibbioc_1.26.0     scales_1.1.0        mgcv_1.8-28        
## [28] farver_2.0.1        generics_0.0.2      IRanges_2.14.12    
## [31] ellipsis_0.3.0      withr_2.1.2         BiocGenerics_0.26.0
## [34] lazyeval_0.2.2      cli_1.1.0           survival_3.1-7     
## [37] magrittr_1.5        crayon_1.3.4        readxl_1.3.1       
## [40] evaluate_0.14       fs_1.3.1            nlme_3.1-140       
## [43] MASS_7.3-51.4       xml2_1.2.2          vegan_2.5-6        
## [46] tools_3.6.1         data.table_1.12.6   hms_0.5.2          
## [49] lifecycle_0.1.0     Rhdf5lib_1.2.1      S4Vectors_0.18.3   
## [52] munsell_0.5.0       reprex_0.3.0        cluster_2.1.0      
## [55] Biostrings_2.48.0   ade4_1.7-13         compiler_3.6.1     
## [58] rlang_0.4.1         rhdf5_2.24.0        grid_3.6.1         
## [61] iterators_1.0.12    biomformat_1.8.0    rstudioapi_0.10    
## [64] igraph_1.2.4.1      labeling_0.3        rmarkdown_1.17     
## [67] gtable_0.3.0        codetools_0.2-16    multtest_2.36.0    
## [70] DBI_1.0.0           reshape2_1.4.3      R6_2.4.1           
## [73] lubridate_1.7.4     knitr_1.26          zeallot_0.1.0      
## [76] permute_0.9-5       ape_5.3             stringi_1.4.3      
## [79] parallel_3.6.1      Rcpp_1.0.3          vctrs_0.2.0        
## [82] dbplyr_1.4.2        tidyselect_0.2.5    xfun_0.11

When using an approach of this kind to analyze metagenomic data, it might be worth to extract 16S reads and analyze them for comparison (however, the number of reads assigned to 16S in metagenomic datasets is usually quite low).↩︎

A work by Miljenka Vuko

miljenkavuko@gmail.com

Wrangling shotgun metagenomic data for analysis with phyloseq

Miljenka Vuko

12/9/2019