[BioC] baySeq issues
Iain Gallagher
iaingallagher at btopenworld.com
Thu Aug 4 16:21:32 CEST 2011
Hello list
I'm having some problems with the baySeq package for RNA-Seq analysis.
Basically if I filter my data (non-normalised counts at the gene level) to get rid of genes with low expression (<1 count / million as per edgeR manual - this is just a filter choice) I cannot get the plotMA.CD command in baySeq to work. It fails with:
Error in plot.window(...) : need finite 'ylim' values
In addition: Warning message:
In max(abs((log2(Adata) - log2(Bdata))[-union(Azeros, Bzeros)])) :
no non-missing arguments to max; returning -Inf
From the help for this plot it can handle zero values and indeed if I use unfiltered data then it plots without issue.
So perhaps I should use unfiltered data - trying this, the getPriors.NB function falls down with
Finding priors...Error in quantile.default(nzSD, 1:sqnum/sqnum) :
missing values and NaN's not allowed if 'na.rm' is FALSE
I have tried setting 'na.rm=T' thus:
CDPriors <- getPriors.NB(CD, samplesize = 1000, estimation = "QL", cl = cl, na.rm = T)
with no success.
Here is my code so far using unfiltered data (there's some bookeeping up front I'm afraid):
rm(list=ls())
library(baySeq)
library(edgeR)
library(org.Bt.eg.db)
#library(snow) - if I ever get access to a cluster
library(goseq)
#set the number of cores in our cluster (none ATM)
cl <- NULL
#first we need to get the data in
#for simplicity let's read in the data with edgeR, create a suitable matrix and use that.
#we can define the replicates etc from the edgeR $samples table
#which files are we loading
targets <- read.delim(file = "targets.csv", stringsAsFactors=F)
#get only files for AF and Neg at two hours and drop unused columns
targetFiles <- targets[grep('InfAF|Cont',targets$group),]
targetFiles <- targetFiles[targetFiles$time == 2,]
#make DGEList object
d <- readDGE(targetFiles, path = '/made/up/path/to/files/for/this/post/')
#now create a matrix from the d$counts and the replicate structure from the d$samples$groups
countMatrix <- as.matrix(d$counts)
colnames(countMatrix) <- paste(d$samples$group, d$sample$description, sep=':')
#prepare annotation DF (gene names )
annot <- data.frame(rownames(countMatrix))
#define the replicates
reps <- as.integer(as.factor(d$samples$group))
#rearrange the countData so we have all group 1 and all group 2 together
#remove row and col names from matrix (i think this is required - maybe not)
countMatrix <- countMatrix[,order(reps)]
colnames(countMatrix) <- NULL; rownames(countMatrix) <- NULL
reps <- reps[order(reps)]
#We are working on the gene level so for baySeq we need the length of each gene. We can get these using the getlengths function of the goseq package
geneLengths <- getlength(as.character(annot[,1]),'bosTau4','ensGene')
#define the group structure
groups <- list(NDE = rep(1,12), DE = c(rep(1,6), rep(2,6)))# NDE = not diff exp, DE = diff exp
#finally let's get the library sizes
libSze <- colSums(countMatrix)
#We begin by combining the count data, library sizes, gene lengths and user-defined groups into a countData object.
CD <- new("countData", data = countMatrix, replicates = reps, libsizes = libSze, groups = groups, seglens = geneLengths, annotation = annot)
#SEEMS OK TO HERE THEN FAILS TO MAKE MA PLOT if data is filtered
#d <- d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size, dim(d)) > 1) >= 9, ]
#prior to creating the various components of the CD object
#an MA plot
plotMA.CD(CD, samplesA = 1:6, samplesB = 7:12)
# We first estimate an empirical distribution on the parameters of the Negative Binomial distribution by bootstrapping from the data, taking individual counts and finding the quasi-likelihood parameters for a Negative Binomial distribution.
CDPriors <- getPriors.NB(CD, samplesize = 1000, estimation = "QL", cl = cl, na.rm = T)
#The above function falls over if using unfiltered data
#then calculate the posteriors
CDPosterior <- getLikelihoods.NB(CDPriors, pET = "BIC", cl = cl, bootstraps = 3, cl = cl)
#get the DE genes
topTags(CDPosterior, group = 2)
#END OF CODE
Does anyone have any advice about where I'm going wrong?
Thanks
Iain
sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C
[3] LC_TIME=en_GB.utf8 LC_COLLATE=en_GB.utf8
[5] LC_MONETARY=C LC_MESSAGES=en_GB.utf8
[7] LC_PAPER=en_GB.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] goseq_1.4.0 geneLenDataBase_0.99.7 BiasedUrn_1.03
[4] org.Bt.eg.db_2.5.0 RSQLite_0.9-4 DBI_0.2-5
[7] AnnotationDbi_1.14.1 Biobase_2.10.0 edgeR_2.2.5
[10] baySeq_1.6.0
loaded via a namespace (and not attached):
[1] biomaRt_2.6.0 Biostrings_2.20.1 BSgenome_1.20.0
[4] GenomicFeatures_1.4.3 GenomicRanges_1.4.6 grid_2.13.1
[7] IRanges_1.10.4 lattice_0.19-30 limma_3.6.6
[10] Matrix_0.999375-50 mgcv_1.7-6 nlme_3.1-101
[13] RCurl_1.5-0 rtracklayer_1.12.4 tools_2.13.1
[16] XML_3.4-0
>
More information about the Bioconductor
mailing list