[BioC] baySeq issues

Thu Aug 4 16:21:32 CEST 2011

Hello list

I'm having some problems with the baySeq package for RNA-Seq analysis. 

Basically if I filter my data (non-normalised counts at the gene level) to get rid of genes with low expression (<1 count / million as per edgeR manual - this is just a filter choice) I cannot get the plotMA.CD command in baySeq to work. It fails with:

Error in plot.window(...) : need finite 'ylim' values
In addition: Warning message:
In max(abs((log2(Adata) - log2(Bdata))[-union(Azeros, Bzeros)])) :
no non-missing arguments to max; returning -Inf

From the help for this plot it can handle zero values and indeed if I use unfiltered data then it plots without issue. 

So perhaps I should use unfiltered data - trying this, the getPriors.NB function falls down with 

Finding priors...Error in quantile.default(nzSD, 1:sqnum/sqnum) : 
missing values and NaN's not allowed if 'na.rm' is FALSE

I have tried setting 'na.rm=T' thus:
CDPriors <- getPriors.NB(CD, samplesize = 1000, estimation = "QL", cl = cl, na.rm = T)

with no success.

Here is my code so far using unfiltered data (there's some bookeeping up front I'm afraid):

rm(list=ls())

library(baySeq)
library(edgeR)
library(org.Bt.eg.db)
#library(snow) - if I ever get access to a cluster
library(goseq)

#set the number of cores in our cluster (none ATM)
cl <- NULL

#first we need to get the data in
#for simplicity let's read in the data with edgeR, create a suitable matrix and use that.
#we can define the replicates etc from the edgeR $samples table

#which files are we loading
targets <- read.delim(file = "targets.csv", stringsAsFactors=F)

#get only files for AF and Neg at two hours and drop unused columns
targetFiles <- targets[grep('InfAF|Cont',targets$group),]
targetFiles <- targetFiles[targetFiles$time == 2,]

#make DGEList object 
d <- readDGE(targetFiles, path = '/made/up/path/to/files/for/this/post/')

#now create a matrix from the d$counts and the replicate structure from the d$samples$groups
countMatrix <- as.matrix(d$counts)
colnames(countMatrix) <- paste(d$samples$group, d$sample$description, sep=':')

#prepare annotation DF (gene names )
annot <- data.frame(rownames(countMatrix))

#define the replicates
reps <- as.integer(as.factor(d$samples$group))

#rearrange the countData so we have all group 1 and all group 2 together
#remove row and col names from matrix (i think this is required - maybe not)

countMatrix <- countMatrix[,order(reps)]
colnames(countMatrix) <- NULL; rownames(countMatrix) <- NULL
reps <- reps[order(reps)]

#We are working on the gene level so for baySeq we need the length of each gene. We can get these using the getlengths function of the goseq package

geneLengths <- getlength(as.character(annot[,1]),'bosTau4','ensGene')

#define the group structure

groups <- list(NDE = rep(1,12), DE = c(rep(1,6), rep(2,6)))# NDE = not diff exp, DE = diff exp

#finally let's get the library sizes

libSze <- colSums(countMatrix)

#We begin by combining the count data, library sizes, gene lengths and user-defined groups into a countData object.

CD <- new("countData", data = countMatrix, replicates = reps, libsizes = libSze, groups = groups, seglens = geneLengths, annotation = annot)

#SEEMS OK TO HERE THEN FAILS TO MAKE MA PLOT if data is filtered
#d <- d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size, dim(d)) > 1) >= 9, ]
#prior to creating the various components of the CD object

#an MA plot
plotMA.CD(CD, samplesA = 1:6, samplesB = 7:12)

# We first estimate an empirical distribution on the parameters of the Negative Binomial distribution by bootstrapping from the data, taking individual counts and finding the quasi-likelihood parameters for a Negative Binomial distribution.

CDPriors <- getPriors.NB(CD, samplesize = 1000, estimation = "QL", cl = cl, na.rm = T)

#The above function falls over if using unfiltered data

#then calculate the posteriors

CDPosterior <- getLikelihoods.NB(CDPriors, pET = "BIC", cl = cl, bootstraps = 3, cl = cl)

#get the DE genes
topTags(CDPosterior, group = 2)

#END OF CODE

Does anyone have any advice about where I'm going wrong?

Thanks

Iain

 sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8    
 [5] LC_MONETARY=C             LC_MESSAGES=en_GB.utf8   
 [7] LC_PAPER=en_GB.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] goseq_1.4.0            geneLenDataBase_0.99.7 BiasedUrn_1.03        
 [4] org.Bt.eg.db_2.5.0     RSQLite_0.9-4          DBI_0.2-5             
 [7] AnnotationDbi_1.14.1   Biobase_2.10.0         edgeR_2.2.5           
[10] baySeq_1.6.0          

loaded via a namespace (and not attached):
 [1] biomaRt_2.6.0         Biostrings_2.20.1     BSgenome_1.20.0      
 [4] GenomicFeatures_1.4.3 GenomicRanges_1.4.6   grid_2.13.1          
 [7] IRanges_1.10.4        lattice_0.19-30       limma_3.6.6          
[10] Matrix_0.999375-50    mgcv_1.7-6            nlme_3.1-101         
[13] RCurl_1.5-0           rtracklayer_1.12.4    tools_2.13.1         
[16] XML_3.4-0            
>