[BioC] Help With RNA-seq
Thomas J Hardcastle
tjh48 at cam.ac.uk
Wed Feb 1 18:20:51 CET 2012
Well, this is embarrassing...
A bug in the code for getPriors.Pois meant that it tripped into an
infinite loop. I've posted a fix to the release track of Bioconductor;
version 1.8.2 should allow getPriors.Pois to work. I suppose that nobody
finding this before means that most people are using the negative
binomial approach - as they should be!
Tom
On 01/02/12 16:50, Valerie Obenchain wrote:
> Thinking about this a bit more ...
>
> Have you tried modifying the 'maxit' argument to getPriors.Pois() ?
> It's possible this threshold could be reduced; it would also confirm
> that the algorithm is converging (if you reduced it to a point where
> you saw an error).
>
> If I understand correctly, you are not seeing any error messages but
> getPirors.Pois() and getPrior.NB() are taking a long time to run. (fyi
> per your comment below, CDP.Poi at priors is just accessing the data in
> the slot of the object; it is not a function). Without having your
> data to test on it is difficult to know what is going on. It would be
> useful to know what kind of machine you are working on, how may cpus,
> how much memory etc.
>
> I'm cc'ing Thomas (package author) in case he has ideas.
>
> Valerie
>
>
> On 01/31/2012 07:41 PM, Valerie Obenchain wrote:
>> Hi Tina,
>>
>> I'm pasting in your message below so we can keep all communication on
>> the mailing list.
>>
>> The 'samplesize' argument looks wrong in your call to compute the
>> priors.
>>
>> CDP.Poi<- getPriors.Pois(CD, samplesize = 2, takemean = TRUE, cl
>> = cl)
>>
>> Why have you set this to 2? It should be a much larger number. Try
>> using the default (1e5),
>>
>> CDP.Poi<- getPriors.Pois(CD, cl = cl)
>>
>> Does this speed it up?
>>
>> Valerie
>>
>> ##
>> -----------------------------------------------------------------------
>>
>> Hi Valerie,
>>
>> Thank you once again for all your help.
>>
>> As you requested in the previous email for a clearer explanation to
>> the problem I am encountering at present.
>>
>> head(D)
>> R1L1Kidney R1L2Liver R1L3Kidney R1L4Liver R1L6Liver R1L7Kidney
>> R1L8Liver
>> 10 0 0 0 0 0
>> 2 1
>> 15 4 35 7 32 31
>> 3 29
>> 17 0 2 0 0 0
>> 1 0
>> 18 110 177 131 135 141
>> 149 148
>> 19 12685 9246 13204 9312 8746
>> 12403 8496
>> 22 0 1 0 0 0
>> 0 0
>> R2L2Kidney R2L3Liver R2L6Kidney
>> 10 4 0 3
>> 15 6 34 7
>> 17 1 0 0
>> 18 112 145 118
>> 19 13031 9070 13268
>> 22 1 0 0
>>
>> names(D)
>> [1] "R1L1Kidney" "R1L2Liver" "R1L3Kidney" "R1L4Liver" "R1L6Liver"
>> [6] "R1L7Kidney" "R1L8Liver" "R2L2Kidney" "R2L3Liver" "R2L6Kidney"
>>
>> dim(D)
>> [1] 22490 10
>>
>> g<- gsub("R[1-2]L[1-8]", "", colnames(D))
>>
>>> g
>> [1] "Kidney" "Liver" "Kidney" "Liver" "Liver" "Kidney" "Liver"
>> "Kidney"
>> [9] "Liver" "Kidney"
>>
>>
>>> d<- DGEList(counts = D, group = substr(colnames(D), 5, 30))
>> Calculating library sizes from column totals.
>>
>>> d
>> An object of class "DGEList"
>> $samples
>> group lib.size norm.factors
>> R1L1Kidney Kidney 1804977 1
>> R1L2Liver Liver 1691734 1
>> R1L3Kidney Kidney 1855190 1
>> R1L4Liver Liver 1696308 1
>> R1L6Liver Liver 1630795 1
>> R1L7Kidney Kidney 1742426 1
>> R1L8Liver Liver 1575425 1
>> R2L2Kidney Kidney 1927517 1
>> R2L3Liver Liver 1767339 1
>> R2L6Kidney Kidney 1963420 1
>>
>> $counts
>> R1L1Kidney R1L2Liver R1L3Kidney R1L4Liver R1L6Liver R1L7Kidney
>> R1L8Liver
>> 10 0 0 0 0 0
>> 2 1
>> 15 4 35 7 32 31
>> 3 29
>> 17 0 2 0 0 0
>> 1 0
>> 18 110 177 131 135 141
>> 149 148
>> 19 12685 9246 13204 9312 8746
>> 12403 8496
>> R2L2Kidney R2L3Liver R2L6Kidney
>> 10 4 0 3
>> 15 6 34 7
>> 17 1 0 0
>> 18 112 145 118
>> 19 13031 9070 13268
>> 22485 more rows ...
>>
>> $all.zeros
>> 10 15 17 18 19
>> FALSE FALSE FALSE FALSE FALSE
>> 22485 more elements ...
>>
>> d$samples
>> group lib.size norm.factors
>> R1L1Kidney Kidney 1804977 1
>> R1L2Liver Liver 1691734 1
>> R1L3Kidney Kidney 1855190 1
>> R1L4Liver Liver 1696308 1
>> R1L6Liver Liver 1630795 1
>> R1L7Kidney Kidney 1742426 1
>> R1L8Liver Liver 1575425 1
>> R2L2Kidney Kidney 1927517 1
>> R2L3Liver Liver 1767339 1
>> R2L6Kidney Kidney 1963420 1
>>
>>> names(d)
>> [1] "samples" "counts" "all.zeros"
>>
>>
>>> dim(d)
>> [1] 22490 10
>>
>> Yes the plot does work .
>>
>> However, it is the call to CDP.Poi at priors that is taking that long.
>>
>> With regards to the rest of the code it will follow the same approach
>> with as the one above with slight changes.
>>
>> Also the call to getPriors.NB() is also taking very long as well.
>>
>> These two calls are in essence the main contributions if the rest of
>> the code is to work.
>>
>> Thank you so much for your help.
>>
>> Have a pleasant day.
>>
>>
>> ##
>> -----------------------------------------------------------------------
>>
>>
>>
>>
>>
>> On 01/29/12 20:54, Valerie Obenchain wrote:
>>> On 01/29/12 07:49, Tina Asante Boahene wrote:
>>>> Hi all,
>>>>
>>>> I am still having problems with bayseq
>>>>
>>>> Having followed the pdf document associated with it and also
>>>> tailoring it to the Marioni et al data I am using, it seems that
>>>> the code has been running for over two days without any results.
>>>>
>>>> I am wondering this code be down to my code.
>>>>
>>>> I have therefore attached my code to this email hoping that someone
>>>> can help me solve this problem, thank you.
>>>>
>>>>
>>>> library(baySeq)
>>>> library(edgeR)
>>>> library(limma)
>>>> library(snow)
>>>>
>>>> cl<- makeCluster(4, "SOCK")
>>>>
>>>>
>>>> ##Calculating normalization factors##
>>>> D=MA.subsetA$M
>>>> head(D)
>>>> names(D)
>>>> dim(D)
>>> Please provide the output of these (i.e., head, names, dim).
>>>>
>>>> g<- gsub("R[1-2]L[1-8]", "", colnames(D))
>>>> d<- DGEList(counts = D, group = substr(colnames(D), 5, 30))
>>>> d$samples
>>>> names(d)
>>>> dim(d)
>>> These values would be helpful too.
>>>>
>>>>
>>>> CD<- new("countData", data = as.matrix(MA.subsetA$M), libsizes =
>>>> as.integer(d$samples$lib.size), replicates = g)
>>>> groups(CD)<- list(rep(1, ncol(CD)), g)
>>>>
>>>> CD at libsizes<- getLibsizes(CD)
>>>>
>>>> plotMA.CD(CD, samplesA = c(1,3,6,8,10), samplesB = c(2,4,5,7,9),
>>>> col = c(rep("red",
>>>> 100), rep("black", 900)))
>>>
>>> Did this work? Your original question was about plotMA.CD not
>>> recognizing your groups. Does the plot work for you now?
>>>>
>>>> ## Optionally adding annotation details to the @annotation slot of
>>>> the countData object. ##
>>>> CD at annotation<- data.frame(name = paste("gene", 1:1000, sep = "_"))
>>>>
>>>>
>>>>
>>>>
>>>> ### Poisson-Gamma Approach ###
>>>>
>>>> CDP.Poi<- getPriors.Pois(CD, samplesize = 2, takemean = TRUE, cl = cl)
>>>>
>>>> CDP.Poi at priors ## This takes time ###
>>> Is the call to getPriors.Pois() that has been running for over 2
>>> days? If not, please specify which function call is taking so long.
>>> Did the rest of the code below work for you?
>>>
>>> Valerie
>>>>
>>>> CDPost.Poi<- getLikelihoods.Pois(CDP.Poi, pET = "BIC", cl = cl)
>>>> CDPost.Poi at estProps
>>>>
>>>> CDPost.Poi at posteriors[1:10, ] ## A list of the posterior
>>>> likelihoods each model for the first 10 genes ##
>>>> CDPost.Poi at posteriors[101:110, ] ## A list of the posterior
>>>> likelihoods each model for the genes from 101 to 110 ##
>>>>
>>>>
>>>> ### Negative-Binomial Approach ###
>>>>
>>>> CDP.NBML<- getPriors.NB(CD, samplesize = 1000, estimation = "QL",
>>>> cl = cl)
>>>>
>>>> CDPost.NBML<- getLikelihoods.NB(CDP.NBML, pET = 'BIC', cl = cl)
>>>>
>>>> CDPost.NBML at estProps
>>>>
>>>> CDPost.NBML at posteriors[1:10,]
>>>>
>>>> CDPost.NBML at posteriors[101:110,]
>>>>
>>>>
>>>> Kind Regards
>>>>
>>>> Tina
>>>> ________________________________________
>>>> From: Valerie Obenchain [vobencha at fhcrc.org]
>>>> Sent: 25 January 2012 18:14
>>>> To: Tina Asante Boahene
>>>> Cc: bioconductor at stat.math.ethz.ch
>>>> Subject: Re: [BioC] Help With RNA-seq
>>>>
>>>> Hi Tina,
>>>>
>>>> It's difficult to help without knowing what your data look like or
>>>> what
>>>> error message you are seeing. Both pieces of information would be
>>>> helpful.
>>>>
>>>> For starters I think you need to provide 'replicate' and 'groups'
>>>> arguments when you create your new "countData" object. Depending on
>>>> what
>>>> order your data are in you need something like,
>>>>
>>>> groups<- list(NDE = c(1,1,1,1,1,1,1,1,1,1), DE =
>>>> c(1,2,1,2,2,1,2,1,2,1))
>>>> replicates<- c("Kidney", "Liver", "Kidney", "Liver", "Liver',
>>>> "Kidney", "Liver", "Kidney", "Liver", "Kidney")
>>>>
>>>> Then create your "countData" with these variables,
>>>>
>>>> CD<- new("countData", data = as.matrix(MA.subsetA$M), libsizes =
>>>> as.integer(d$samples$lib.size),
>>>> replicates = replicates, groups = groups)
>>>>
>>>> Now look at the CD object and make sure the columns are labeled as
>>>> they
>>>> should be and the other slot values make sense. The MA plot call would
>>>> look something like,
>>>>
>>>> plotMA.CD(CD, samplesA = "Kidney", samplesB = "Liver")
>>>>
>>>> The author used the red and black colors for the vignette plot because
>>>> there was a known structure to the data; the first 100 counts showed
>>>> differential expression and the last 900 did not. You probably have a
>>>> different situation in your data so using the same color scheme may
>>>> not
>>>> make sense.
>>>>
>>>> Valerie
>>>>
>>>>
>>>> On 01/23/2012 06:13 AM, Tina Asante Boahene wrote:
>>>>> Hi all,
>>>>>
>>>>> I am conducting some analysis using the Marioni et al data.
>>>>>
>>>>> However, I am having a bit of trouble using my data to conduct the
>>>>> analysis based on the baySeq package.
>>>>>
>>>>> And I was wondering if you could stir me in the right direction.
>>>>>
>>>>> I have already used edgeR to find the library sizes for the ten
>>>>> libraries I have for my data as well as for the groups (Liver and
>>>>> Kidney) as stated below.
>>>>>
>>>>>
>>>>> library(baySeq)
>>>>> library(edgeR)
>>>>> library(limma)
>>>>> library(snow)
>>>>>
>>>>> cl<- makeCluster(4, "SOCK")
>>>>>
>>>>>
>>>>> ##Calculating normalization factors##
>>>>> D=MA.subsetA$M
>>>>> head(D)
>>>>> names(D)
>>>>> dim(D)
>>>>>
>>>>> g<- gsub("R[1-2]L[1-8]", "", colnames(D))
>>>>> d<- DGEList(counts = D, group = substr(colnames(D), 5, 30))
>>>>> d$samples
>>>>> names(d)
>>>>> dim(d)
>>>>>
>>>>>
>>>>> I will like to know how to model my code in order to produce the
>>>>> MA plot for count data
>>>>>
>>>>>
>>>>> This is what I have, however it runs with the wrong response.
>>>>>
>>>>> Can someone help me fix this please.
>>>>>
>>>>> CD<- new("countData", data = as.matrix(MA.subsetA$M), libsizes =
>>>>> as.integer(d$samples$lib.size))
>>>>>
>>>>> plotMA.CD(CD, samplesA = 1:5, samplesB = 6:10, col = c(rep("red",
>>>>> 100), rep("black", 900)))
>>>>>
>>>>>
>>>>> How can I get it to recognise the "groups" as "g" (Library and
>>>>> Kidney)
>>>>>
>>>>> This is the output for the groups [1] "Kidney" "Liver" "Kidney"
>>>>> "Liver" "Liver" "Kidney" "Liver" "Kidney" "Liver" "Kidney"
>>>>>
>>>>> thank you.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Kind Regards
>>>>>
>>>>> Tina
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Dr. Thomas J. Hardcastle
Department of Plant Sciences
University of Cambridge
Downing Street
Cambridge, CB2 3EA
United Kingdom
More information about the Bioconductor
mailing list