[BioC] RNAseq machine learning classifier

Michael Love michaelisaiahlove at gmail.com
Thu Jul 18 11:44:43 CEST 2013


hi Jianping,

On Thu, Jul 18, 2013 at 1:15 AM,  <jhua at tgen.org> wrote:
> Hi, Mike:
>
> Thanks for the explanation.  It is really helpful.
>
> My concern is about the size factors estimation.  I'm not familiar with the details of VST so my understanding might be wrong.
>
> My understanding is that the VST is applied to the count-normalized data, i.e., size factors must be estimated from the data.  I believe that they are also data-dependent.  So I'm wondering: when one uses the frozen parameters for future data, how are the size factors computed?  Does DESeq2 use the loggeomeans of the future data to estimate the size factors, or does it keep them in the frozen parameters for reuse?  Or does the choice matter to the VST?


Now I see your point. You are correct that the size factors would be
computed using the log geometric means of the new data.

If you want to generate size factors for a new dataset that are
commensurate with the old dataset, you could do:

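# Pool new and old counts (both objects need the same genes in the same row order)
allCounts <- cbind(counts(ddsNew), counts(ddsOld))
# Estimate size factors jointly so new and old samples share one reference
allSF <- estimateSizeFactorsForMatrix(allCounts)
# The new samples come first in allCounts, so take the leading entries
sizeFactors(ddsNew) <- allSF[seq_len(ncol(ddsNew))]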
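
A rough sketch of the alternative raised in the question above, i.e.
reusing the log geometric means of the old (training) data when
computing size factors for new samples. This is hand-rolled base R in
the spirit of the median-of-ratios estimator, not a DESeq2 call;
frozen_size_factors, train and test_counts are made-up names, and the
counts are simulated:

```r
# Median-of-ratios size factors against a FROZEN reference.
# 'frozen_size_factors' is a hypothetical helper, not part of DESeq2.
frozen_size_factors <- function(counts, ref_log_geomeans) {
  stopifnot(nrow(counts) == length(ref_log_geomeans))
  # Genes with a zero count in any reference sample have a log
  # geometric mean of -Inf and cannot be used.
  usable <- is.finite(ref_log_geomeans)
  apply(counts, 2, function(cnts) {
    keep <- usable & cnts > 0
    exp(median(log(cnts[keep]) - ref_log_geomeans[keep]))
  })
}

# Simulated data (assumption: Poisson counts around a per-gene baseline)
set.seed(1)
n_genes <- 200
base_expr <- rgamma(n_genes, shape = 2, scale = 50) + 5
train_depth <- c(0.5, 1, 1.5, 2, 1, 0.8)
train <- sapply(train_depth, function(d) rpois(n_genes, base_expr * d))

# Freeze the reference once, from the training data only
ref_log_geomeans <- rowMeans(log(train))

# Later: a small test set, normalized against the frozen reference
test_counts <- cbind(new1 = rpois(n_genes, base_expr * 1),
                     new2 = rpois(n_genes, base_expr * 2))
sf_new <- frozen_size_factors(test_counts, ref_log_geomeans)
norm_new <- sweep(test_counts, 2, sf_new, "/")
```

The point of the design is that sf_new depends only on the new counts
and the stored reference, so adding or removing test samples does not
change the normalization of the others.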


>
> And in the case I encountered, somehow the VST has little effect on the data (there might be a problem with how well the model fits our data).  So we decided to just stick to count-normalized data via counts(dds, normalized = TRUE).  Hence which loggeomeans to use does matter.  And for the future data, which is usually a small testing set, we plan to use the loggeomeans of our large training data to calculate the size factors.
>


Are you then taking the log of the size-factor-normalized data plus a
pseudocount? It might be good to compare the data with and without the
VST using meanSdPlot, as we do in the section "Effects of
transformations on the variance" in the vignette.  If you have
elevated variance at low counts without using the VST, this could be
detrimental to the performance of a classifier.
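
As a minimal illustration of that check without any packages, the
sketch below simulates counts, takes log2(count + 1), and summarizes
the per-gene standard deviation by rank of the mean. The negative
binomial settings (size = 10) and equal sequencing depth are
assumptions chosen for the example, not values from your data:

```r
# Simulated counts: per-gene means spanning roughly 1 to 4096
set.seed(42)
n_genes <- 1000
n_samples <- 8
mu <- 2^runif(n_genes, 0, 12)
counts <- matrix(rnbinom(n_genes * n_samples,
                         mu = rep(mu, n_samples), size = 10),
                 nrow = n_genes)

# Equal depth is assumed here, so size factors are taken as 1
log_expr <- log2(counts + 1)

# Per-gene mean and standard deviation on the log scale
gene_mean <- rowMeans(log_expr)
gene_sd <- apply(log_expr, 1, sd)

# Four equal-sized bins by rank of the mean (1 = lowest expression)
rank_bin <- cut(rank(gene_mean, ties.method = "first"),
                breaks = 4, labels = FALSE)
sd_by_bin <- tapply(gene_sd, rank_bin, median)
print(sd_by_bin)
```

If the median SD in the lowest bin is clearly above that of the
highest bin, the log transform is not stabilizing the variance at low
counts. plot(rank(gene_mean), gene_sd) gives a rough picture along the
lines of meanSdPlot.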

Mike


>
> Jianping
>
>
>
> On Jul 17, 2013, at 2:54 PM, Michael Love wrote:
>
>> hi Jianping,
>>
>> I thought the discussion was about normalization using the VST; I'm not
>> sure what is meant by normalization otherwise.
>>
>> by the way, the VST parameters can be re-assigned in DESeq2 like so:
>>
>> dispersionFunction(ddsNew) <- dispersionFunction(ddsOld)
>>
>> The call to the transformation should then specify blind=FALSE, so as
>> to bypass the internal re-estimation of dispersions.  At the moment,
>> you also need to have some dispersions estimated for ddsNew (or to set
>> the dispersions to any numeric values), to avoid re-estimation of
>> dispersion internally, although I will fix this so that the VST only
>> checks for an existing dispersion function.
>>
>> Mike
>>
>> On Wed, Jul 17, 2013 at 9:03 PM,  <jhua at tgen.org> wrote:
>>> Hi, Wolfgang:
>>>
>>> Thanks for pointing this out.  This sounds really convenient.  I'll definitely check it out on how to freeze the parameters.
>>>
>>> Also, how about normalization?  Is there a similar procedure by which I can freeze the profile for normalizing future samples?  Right now I do it with my own simple routine.  But it would be wonderful if this could be done internally. Thanks.
>>>
>>>
>>> Jianping
>>>
>>>
>>>
>>> On Jul 17, 2013, at 11:49 AM, Wolfgang Huber wrote:
>>>
>>>> Hi Jianping
>>>>
>>>> good point about the parameter-dependence (i.e. dataset-dependence) of the variance stabilising transformations (VST) in DESeq2.
>>>> However, once the typical coverage and noise characteristics of the RNA-Seq assay used are established, one can 'freeze' the VST parameters and then just use that for future samples.
>>>>
>>>> As always, QC of new data for compliance with the expectations from the learned ('frozen') characteristics will be needed.
>>>>
>>>>      Best wishes
>>>>      Wolfgang
>>>>
>>>> On 17 Jul 2013, at 20:16, <jhua at tgen.org> wrote:
>>>>
>>>>> This sounds like an OK approach to me.
>>>>>
>>>>> One thing you might take into consideration is that classifier design usually involves independent validation data.  If you are going to validate your classifier with the same type of RNAseq data, in general you need to normalize/variance stabilize all of them as one cohort.  But sometimes the validation data are not collected until I report really positive results on the training data alone, which ends up requiring another round of full normalization, training, and testing...
>>>>>
>>>>> Jianping Hua, Ph. D.
>>>>> Research Assistant Professor
>>>>> Computational Biology Division
>>>>> Translational Genomics Research Institute (TGen)
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Steve!
>>>>>>
>>>>>> I was thinking along these same lines: estimating dispersions then using a
>>>>>> variance stabilizing transformation. However, I am not sure how proper this
>>>>>> approach is?
>>>>>>
>>>>>> Can anyone confirm the validity of this approach?
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 15, 2013 at 3:58 PM, Steve Lianoglou
>>>>>> <lianoglou.steve at gene.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Mon, Jul 15, 2013 at 2:42 PM, Michael Breen
>>>>>>> <breenbioinformatics at gmail.com> wrote:
>>>>>>>> Hi all,
>>>>>>>> We have a large RNAseq data set. Apart from identifying differentially
>>>>>>>> expressed genes with these data we are also interested in classification
>>>>>>>> in terms of developing a prognostic and diagnostic classifier.
>>>>>>>>
>>>>>>>> Normally, our approach would utilize a machine learning classifier,
>>>>>>>> such as an SVM, and typically proceed with a nested cross-validation
>>>>>>>> approach.
>>>>>>>>
>>>>>>>>
>>>>>>>> The vast majority of these programs and packages have been designed
>>>>>>>> utilizing microarray data.
>>>>>>>>
>>>>>>>> Are there any reasonable biases which one should consider before using
>>>>>>>> such already published approaches on RNAseq data?
>>>>>>>>
>>>>>>>> Do the distributions of the different data types matter at all?
>>>>>>>>
>>>>>>>> If so, does an application exist using an SVM taking into consideration
>>>>>>>> RNAseq raw counts?
>>>>>>>
>>>>>>> One approach would be to take the output from one of the variance
>>>>>>> stabilizing transformations in DESeq2 as the input to your machine
>>>>>>> learning approach.
>>>>>>>
>>>>>>> See:
>>>>>>>
>>>>>>> R> library(DESeq2)
>>>>>>> R> ?varianceStabilizingTransformation
>>>>>>>
>>>>>>> and Section 7 of the DESeq2 vignette (count data transformations):
>>>>>>>
>>>>>>>
>>>>>>> http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
>>>>>>>
>>>>>>> HTH,
>>>>>>> -steve
>>>>>>>
>>>>>>> --
>>>>>>> Steve Lianoglou
>>>>>>> Computational Biologist
>>>>>>> Bioinformatics and Computational Biology
>>>>>>> Genentech
>>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>


