[R-sig-eco] Rarefying metagenome data table

Jacob Cram cramjaco at gmail.com
Thu Mar 10 18:32:57 CET 2016


Dear List,
I have a large data table (it's the TARA Oceans metagenomic data and
can be found here: http://ocean-microbiome.embl.de/companion.html).
Essentially, the columns of the data table are samples taken in
different parts of the ocean and the rows are different genes found in
those locations. The numbers in the body of the table are the number
of times an instrument detects that gene in a sample. Because the
machine returns more reads (number of observed genes) in some samples
than in others, the data need to be "rarefied", that is, subsampled
such that there are the same number of genes assigned to each station.
I am trying to use the rrarefy function in the vegan package to do
this, but I keep getting an error message:

    ## Get TARA oceans KEGG metagenomes
    temp = tempfile()
    download.file("http://ocean-microbiome.embl.de/data/TARA243.KO.profile.release.gz",
                  temp)
    taraKEGG = read.delim(gzfile(temp, "TARA243.KO.profile.release"))
    unlink(temp)

    ## some processing
    ids = taraKEGG[,1]
    data = taraKEGG[,2:dim(taraKEGG)[2]]
    minsamples = min(colSums(data))
    mtx= as.matrix(data)
    imtx=as.integer(as.matrix(data))
    data2 = data
    data2 = sapply(round(data, 0), as.integer)

    ## load library
    library(vegan)

    ## rarefy samples
    rare = rrarefy(data2, minsamples)

> Error in if (sum(x[i, ]) <= sample[i]) next :
>   missing value where TRUE/FALSE needed
> In addition: Warning messages:
> 1: In rrarefy(data2, minsamples) :
>   Some row sums < 'sample' and are not rarefied
> 2: In sum(x[i, ]) : integer overflow - use sum(as.numeric(.))
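
The second warning may be related to the error itself. The sketch below uses made-up numbers, not the TARA data, and only base R. It shows that summing integers past the integer limit gives NA, and that an NA inside an if() condition fails with the same kind of message as above. The variable names are just for illustration.

```r
# Two made-up integer counts, each below R's integer limit (about 2.1e9)
big <- c(2000000000L, 2000000000L)

# Their integer sum exceeds the limit, so R returns NA (with a warning)
int_sum <- suppressWarnings(sum(big))

# Summing as doubles, as the warning suggests, avoids the overflow
num_sum <- sum(as.numeric(big))

# An NA in an if() condition raises an error; capture its message
msg <- tryCatch(
  if (int_sum <= 10L) "ok",
  error = function(e) conditionMessage(e)
)

print(int_sum)
print(num_sum)
print(msg)
```

If this is what happens inside rrarefy, then storing the counts as doubles rather than integers might sidestep the overflow, though that is an assumption that would need checking against the real table.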

It looks like the problem is a line in rrarefy where it tries to build
a huge vector that has as many elements as there are gene counts in
one of the samples:

` row <- sample(rep(nm, times = x[i, ]), sample[i])`

Since there are sometimes millions to tens of millions of genes, this
ends up being a really big vector and the program gives up in order to
save my computer's memory.
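
For comparison, here is a minimal sketch of one way to subsample counts without expanding them with rep(). It walks through the genes and draws, for each one, how many of the remaining reads fall in that gene using base R's rhyper(). This is not how vegan does it. The helper name `rarefy_counts` and the toy counts are hypothetical, and the behaviour for totals beyond the integer range has not been checked.

```r
# Hypothetical helper: subsample `depth` reads without replacement from a
# vector of per-gene counts, using sequential hypergeometric draws.
rarefy_counts <- function(counts, depth) {
  gene_names <- names(counts)
  counts <- as.numeric(counts)        # doubles avoid integer overflow in sum()
  stopifnot(depth <= sum(counts))
  out <- numeric(length(counts))
  remaining_total <- sum(counts)      # reads not yet considered
  remaining_draws <- depth            # draws still to be allocated
  for (j in seq_along(counts)) {
    if (remaining_draws == 0) break
    # reads belonging to genes after gene j
    remaining_total <- remaining_total - counts[j]
    # how many of the remaining draws land in gene j
    out[j] <- rhyper(1, m = counts[j], n = remaining_total, k = remaining_draws)
    remaining_draws <- remaining_draws - out[j]
  }
  names(out) <- gene_names
  out
}

# Made-up counts for one sample
set.seed(1)
toy <- c(geneA = 5e6, geneB = 3e6, geneC = 10, geneD = 0)
sub <- rarefy_counts(toy, depth = 1000)
print(sub)
```

Memory use here grows with the number of genes rather than the number of reads, which is the point of the sketch. It would have to be applied to one sample at a time.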

So, I am trying to figure out how to proceed from here. Perhaps I am
using this function incorrectly and there is a different way to use it
to not have this problem. Alternatively, perhaps I should be using a
different tool. I have heard of rarefaction programs in Python, or,
for that matter, the bioinformatics universe of QIIME, but I'd rather
stay within R if at all possible. Does anyone have any suggestions?
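
One thing that may be worth checking is the orientation of the table. The warning about "row sums" suggests rrarefy subsamples each row, whereas in this table the samples are in columns. The sketch below uses a made-up 4-gene by 3-sample table (not the TARA data) to show the transposition with t(). The call to rrarefy is guarded so that it only runs if vegan is installed.

```r
# Made-up table in the layout described above: genes in rows, samples in columns
toy <- matrix(c(10, 0, 5,
                20, 3, 7,
                 1, 9, 2,
                 4, 4, 4),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("K", 1:4), c("st1", "st2", "st3")))

depth <- min(colSums(toy))   # smallest per-sample total, like minsamples
by_sample <- t(toy)          # now samples are rows and genes are columns

# After transposing, the row sums are the per-sample totals
print(rowSums(by_sample))

if (requireNamespace("vegan", quietly = TRUE)) {
  set.seed(42)
  rare_toy <- vegan::rrarefy(by_sample, sample = depth)
  print(rowSums(rare_toy))   # each sample should now total `depth`
}
```

Whether this alone resolves the error on the full table is untested; it only addresses the mismatch between column-wise totals and row-wise subsampling.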

Thank you for your time.
Sincerely,
Jacob Cram


