[BioC] edgeR design matrix
lpascual
Laura.pascual at avignon.inra.fr
Mon Oct 17 13:52:13 CEST 2011
That works perfect, thanks so much Gordon and Mark
On 10/14/2011 09:27 PM, Mark Robinson wrote:
> Hi Laura,
>
> I think there was a minor typo there. The glmLRT() call should be:
>
> lrt<- glmLRT(d,fit,coef=2:8)
>
> Also, if you wish, you could create your factor with less keystrokes:
>
> cultivar<- factor(rep(1:8,each=2))
>
> Hope that helps,
> Mark
>
>
> On 14.10.2011, at 17:13, lpascual wrote:
>
>> Hi Gordon,
>>
>> I try it like you said, but I get a mistake, I do not know how to solve it??
>>
>> Thanks in advance
>>
>> Laura
>>
>>> cultivar<- factor(c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8)
>> + )
>>> design<-model.matrix(~cultivar)
>>> design
>> (Intercept) cultivar2 cultivar3 cultivar4 cultivar5 cultivar6 cultivar7
>> 1 1 0 0 0 0 0 0
>> 2 1 0 0 0 0 0 0
>> 3 1 1 0 0 0 0 0
>> 4 1 1 0 0 0 0 0
>> 5 1 0 1 0 0 0 0
>> 6 1 0 1 0 0 0 0
>> 7 1 0 0 1 0 0 0
>> 8 1 0 0 1 0 0 0
>> 9 1 0 0 0 1 0 0
>> 10 1 0 0 0 1 0 0
>> 11 1 0 0 0 0 1 0
>> 12 1 0 0 0 0 1 0
>> 13 1 0 0 0 0 0 1
>> 14 1 0 0 0 0 0 1
>> 15 1 0 0 0 0 0 0
>> 16 1 0 0 0 0 0 0
>> cultivar8
>> 1 0
>> 2 0
>> 3 0
>> 4 0
>> 5 0
>> 6 0
>> 7 0
>> 8 0
>> 9 0
>> 10 0
>> 11 0
>> 12 0
>> 13 0
>> 14 0
>> 15 1
>> 16 1
>> attr(,"assign")
>> [1] 0 1 1 1 1 1 1 1
>> attr(,"contrasts")
>> attr(,"contrasts")$cultivar
>> [1] "contr.treatment"
>>
>>> d<- estimateGLMCommonDisp(d, design)
>>> d
>> An object of class "DGEList"
>> $samples
>> files group description lib.size norm.factors
>> 1 s_1_CE_1_1counts.txt CER Cervil_CE 54428965 1
>> 2 s_2_CE_1_2counts.txt CER Cervil_CE 47946147 1
>> 3 s_6_CE_6_1counts.txt CRI Criollo_CE 49025258 1
>> 4 s_7_CE_6_2counts.txt CRI Criollo_CE 43322654 1
>> 5 s_1_CE_10_1counts.txt FER Ferum_CE 53574521 1
>> 11 more rows ...
>>
>> $counts
>> 1 2 3 4 5 6 7 8 9 10 11 12 13 14
>> GATCAAGAAAATAAAATGAA 203 136 344 195 286 182 438 492 418 273 97 208 256 345
>> GATCTCTGTCTCATCTTTCG 122 87 113 131 210 332 384 399 221 309 138 127 329 272
>> GATCTTCTTTTGAAGTAAAC 116 141 125 52 158 248 170 141 137 135 166 196 111 108
>> GATCAGCCAGGTCCAAGGCC 56 15 41 44 36 47 40 39 80 56 26 46 31 46
>> GATCCTCCGGAACCACAAGA 370 226 11 4 12 0 23 5 129 1 38 22 5 9
>> 15 16
>> GATCAAGAAAATAAAATGAA 154 69
>> GATCTCTGTCTCATCTTTCG 112 94
>> GATCTTCTTTTGAAGTAAAC 134 74
>> GATCAGCCAGGTCCAAGGCC 81 66
>> GATCCTCCGGAACCACAAGA 151 7
>> 73310 more rows ...
>>
>> $common.dispersion
>> [1] 0.1259328
>>
>>> fit<- glmFit(d,design,dispersion = d$common.dispersion)
>>> lrt<- glmLRT(d,fit,design,coef=2:8)
>> Error in glmfit$coefficients %*% contrast : non-conformable arguments
>>
>>
>>
>>
>> On 10/14/2011 07:00 AM, Gordon K Smyth wrote:
>>> Dear Laura,
>>>
>>>> Date: Wed, 12 Oct 2011 14:23:26 +0200
>>>> From: lpascual<Laura.pascual at avignon.inra.fr>
>>>> To: Bioconductor at r-project.org
>>>> Subject: [BioC] edgeR design matrix
>>>>
>>>> Hi all,
>>>>
>>>> I am new with RNA-seq data analysis. I have data from illumina DGE experiments from 8 different cultivars (duplicates). I also have 4 extra samples, the F1 hybrids (duplicates), obtained from crossing the 8 cultivars as follows (1x2, 3x4, 5x6, 7x8). In total I have 24 DGE illumina reactions, where all the samples came from the same developmental stage and treatment. So the only different factor I have is the genotype.
>>>>
>>>> I am trying to analyze the data with the edgeR package, however I am not sure if I can use the GML function for multiple testings or how should I made my design matrix.
>>>>
>>>> To find the differences among the 8 parental lines, I know I can made two ways testing between the different samples, but I was wondering, if I could be able to do a test were the null hypothesis will be there is no difference between the cultivars, and the alternative hypothesis, at least there is difference for one cultivar.
>>>>
>>>> To do that, it is possible to use one design matrix like these:
>>>>
>>>> Cultivar1 Cultivar2 Cultivar3 cultivar4 Cultivar5 Cultivar6 Cultivar7 Cultivar8
>>>> 1-rep1 1 0 0 0 0 0 0 0
>>>> 1-rep2 1 0 0 0 0 0 0 0
>>>> 2-rep1 0 1 0 0 0 0 0 0
>>>> 2-rep2 0 1 0 0 0 0 0 0
>>>> 3-rep1 0 0 1 0 0 0 0 0
>>>> 3-rep2 0 0 1 0 0 0 0 0
>>>> 4-rep1 0 0 0 1 0 0 0 0
>>>> 4-rep2 0 0 0 1 0 0 0 0
>>>> 5-rep1 0 0 0 0 1 0 0 0
>>>> 5-rep2 0 0 0 0 1 0 0 0
>>>> 6-rep1 0 0 0 0 0 1 0 0
>>>> 6-rep2 0 0 0 0 0 1 0 0
>>>> 7-rep1 0 0 0 0 0 0 1 0
>>>> 7-rep2 0 0 0 0 0 0 1 0
>>>> 8-rep1 0 0 0 0 0 0 0 1
>>>> 8-rep2 0 0 0 0 0 0 0 1
>>>>
>>>> I have try it, but to calculate the null hypothesis the gmlLRT seams
>>>> just to remove the last column??
>>> You could use that design matrix, but it's inconvenient. If you want to find transcripts that are different between any of the cultivars, you would use:
>>>
>>> design<- model.matrix(~Cultivar)
>>> fit<- glmFit(y,design)
>>> lrt<- glmLRT(y,fit,design,coef=2:8)
>>> topTags(lrt)
>>>
>>> This gives you a test on 7 degrees of freedom (one for each coefficient being tested) for differences between the 8 cultivars.
>>>
>>> Of course you have to estimate the dispersions as well, but I assume you know how to do that.
>>>
>>>
>>>> Besides I also want to test the differences between two parent lines and
>>>> the F1 that I have derived from them I have try a design matrix as follows,
>>>>
>>>>
>>>> Cultivar1 Cultivar2 F1
>>>> 1-rep1 1 0 0
>>>> 1-rep2 1 0 0
>>>> 2-rep1 0 1 0
>>>> 2-rep2 0 1 0
>>>> F1-rep1 0 0 1
>>>> F1-rep2 0 0 1
>>>>
>>>> But I again face the above mentioned problem.
>>> The above example should give you a clue as to how to do this.
>>>
>>>
>>>> Besides I was wondering if I can employ a design matrix where I will be
>>>> modeling how many alleles of each cultivar have my sample, doing
>>>> something like these:
>>>>
>>>> Cultivar1 Cultivar2
>>>> 1-rep1 2 0
>>>> 1-rep2 2 0
>>>> 2-rep1 0 2
>>>> 2-rep2 0 2
>>>> F1-rep1 0 1
>>>> F1-rep2 0 1
>>>>
>>>> I will appreciate any advice about the proper way to face these analysis.
>>> You can put a covariate for allele number as a column in your design matrix, but you obviously need a column for the intercept term as well. Basically you need to start using the model.matrix() function to build your design matrices.
>>>
>>>
>>>> An just an extra question, in the manual it is recommended to eliminate
>>>> all the tags with low counts, as they are no going to be sadistically
>>>> significance and they will slow the process. First it is said that the
>>>> tags with less than 5 counts are not useful, but then all the tags with
>>>> lees than one tag per million (TPM) are removed, however in my case
>>>> having 50 million sequences that is all the tags with less than 50 counts.
>>>>
>>>> In fact when I do it my tags decrease a lot
>>>>
>>>>> dim(d)
>>>> [1] 2029802 16
>>>>
>>>>> d<-d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size,
>>>> dim(d))>1)>= 2, ]
>>> You can just use cpm(d) to compute the counts-per-million.
>>>
>>>>> dim (d)
>>>> [1] 73310 16
>>>>
>>>> That is normal? Why did you choose to remove less than 1TPM?
>>> It depends on your library size. A cutoff of 1 cpm was appropriate for the case study, but if you library sizes are smaller, then you will need to use a smaller cutoff.
>>>
>>> Best wishes
>>> Gordon
>>>
>>>> Thanks in advance
>>>> Laura Pascual
>>> ______________________________________________________________________
>>> The information in this email is confidential and inte...{{dropped:6}}
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list