[R-sig-eco] How to deal with rare species post-rarefaction - subsampling or not?

Wed Oct 7 16:34:36 CEST 2015

Dear list,

I have six fields with 60 samples, and I want to analyze the microbial 
diversity based on high-throughput sequencing.
The read numbers varied between samples by about one order of magnitude 
(i.e. the samples with the most reads had about tenfold more than those 
with the fewest).

I have done rarefaction based on Hill numbers (Chao and Jost, 2014) and 
found that I reached full coverage for n=1 and n=2, and that the curves 
for n=0 were reaching a plateau in all samples. My lowest sample 
completeness value was 0.995, for a sample with about 30,000 
observations.
Here is an example from one of the six fields (from top to bottom: 
rarefaction based on species richness, linearized Simpson, linearized 
Shannon; the dots mark the end of each observed curve, beyond which the 
curve was extrapolated according to the aforementioned paper):

http://s21.postimg.org/fm0nhp4w7/image.png
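
For reference, and assuming the usual definitions, the three curves 
correspond to Hill numbers of order 0, 1 and 2. The order is written n 
above; the literature usually calls it q. With p_i the relative 
abundance of species i and S the number of species:

```latex
% Hill number of order q (q \neq 1)
{}^{q}D = \left( \sum_{i=1}^{S} p_i^{\,q} \right)^{1/(1-q)}

% Special cases
{}^{0}D = S                                                    % species richness
{}^{1}D = \lim_{q \to 1} {}^{q}D
        = \exp\!\left( -\sum_{i=1}^{S} p_i \ln p_i \right)     % exponential of Shannon entropy
{}^{2}D = \frac{1}{\sum_{i=1}^{S} p_i^{2}}                     % inverse Simpson concentration
```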

I made species richness boxplots based on uncorrected species richness 
and on corrected values (for which I used the "double reference" 
approach; see the paper), and they were substantially different, with 
some of the high-read samples losing about 20% of their observed 
species richness.

Now, on to the question(s):

- One of my aims is to identify shared species and core species sets 
across all six fields or subsets of them. I would like to use my entire 
dataset without subsampling, since I have such high sample coverage, 
but this obviously has an impact on the interpretation of the data. 
However, if I subsample, don't I have to do it in many permutations? 
And wouldn't subsampling have a severe impact on the interpretive power 
of my analysis as well?
- I have yet to find a nice subsampling routine in R for community data 
that lets me do further calculations on the entire set of n subsets, 
possibly stored in lists.
- As a bonus, if I want to use Chao-1 as an index of expected species 
richness, do I compute it on subsampled datasets or on the samples as 
they are? I would rather do it on the raw data (because this is what I 
have measured), but I am worried about sample comparability.
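
To make the three points above concrete, here is a minimal sketch of 
the kind of workflow I have in mind. It is written in Python only to pin 
down the logic; it is not the R routine I am looking for. All function 
names and the toy data are invented for illustration. It assumes 
subsampling without replacement to the smallest read depth, the 
bias-corrected form of Chao-1, and "core" meaning present in every 
sample.

```python
import random
from collections import Counter


def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement from one sample.

    `counts` maps a species label to its read count.
    """
    pool = [sp for sp, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    return Counter(rng.sample(pool, depth))


def chao1(counts):
    """Bias-corrected Chao-1: S_obs + f1 * (f1 - 1) / (2 * (f2 + 1))."""
    observed = [n for n in counts.values() if n > 0]
    f1 = sum(1 for n in observed if n == 1)  # singletons
    f2 = sum(1 for n in observed if n == 2)  # doubletons
    return len(observed) + f1 * (f1 - 1) / (2 * (f2 + 1))


def repeated_rarefaction(samples, depth, n_perm, seed=1):
    """Return a list holding n_perm rarefied copies of the whole dataset."""
    rng = random.Random(seed)
    return [
        {name: rarefy(c, depth, rng) for name, c in samples.items()}
        for _ in range(n_perm)
    ]


def core_species(dataset):
    """Species present in every sample of one (raw or rarefied) dataset."""
    sets = [{sp for sp, n in c.items() if n > 0} for c in dataset.values()]
    return set.intersection(*sets)


# Toy data, invented purely for illustration
samples = {
    "field1_a": {"otu1": 500, "otu2": 300, "otu3": 2, "otu4": 1},
    "field1_b": {"otu1": 50, "otu2": 40, "otu3": 10},
}

# Common depth: the read total of the smallest sample
depth = min(sum(c.values()) for c in samples.values())
subsets = repeated_rarefaction(samples, depth, n_perm=200)

# Fraction of permutations in which each species belongs to the core set
core_freq = Counter(sp for d in subsets for sp in core_species(d))
for sp, k in sorted(core_freq.items()):
    print(sp, k / len(subsets))

# Chao-1 on raw data versus its mean over the rarefied copies
for name, c in samples.items():
    mean_rarefied = sum(chao1(d[name]) for d in subsets) / len(subsets)
    print(name, chao1(c), mean_rarefied)
```

In this toy case a rare species such as otu3 is in the core set of the 
raw data but only in some of the rarefied copies, which is exactly the 
interpretation problem I am asking about.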

I think I have now shifted the problem of subsampling to the area of the 
rare and very rare biosphere.
Sorry for bothering you, and many thanks for reading.

-- 
Tim Richter-Heitmann (M.Sc.)
PhD Candidate

International Max-Planck Research School for Marine Microbiology
University of Bremen
Microbial Ecophysiology Group (AG Friedrich)
FB02 - Biologie/Chemie
Leobener Straße (NW2 A2130)
D-28359 Bremen
Tel.: 0049(0)421 218-63062
Fax: 0049(0)421 218-63069