[BioC] edgeR design matrix
Mark Robinson
mark.robinson at imls.uzh.ch
Fri Oct 14 21:27:31 CEST 2011
Hi Laura,
I think there was a minor typo there. The glmLRT() call should be:
lrt <- glmLRT(d,fit,coef=2:8)
Also, if you wish, you could create your factor with less keystrokes:
cultivar <- factor(rep(1:8,each=2))
Hope that helps,
Mark
On 14.10.2011, at 17:13, lpascual wrote:
> Hi Gordon,
>
> I try it like you said, but I get a mistake, I do not know how to solve it??
>
> Thanks in advance
>
> Laura
>
> > cultivar <- factor(c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8)
> + )
> > design <-model.matrix(~cultivar)
> > design
> (Intercept) cultivar2 cultivar3 cultivar4 cultivar5 cultivar6 cultivar7
> 1 1 0 0 0 0 0 0
> 2 1 0 0 0 0 0 0
> 3 1 1 0 0 0 0 0
> 4 1 1 0 0 0 0 0
> 5 1 0 1 0 0 0 0
> 6 1 0 1 0 0 0 0
> 7 1 0 0 1 0 0 0
> 8 1 0 0 1 0 0 0
> 9 1 0 0 0 1 0 0
> 10 1 0 0 0 1 0 0
> 11 1 0 0 0 0 1 0
> 12 1 0 0 0 0 1 0
> 13 1 0 0 0 0 0 1
> 14 1 0 0 0 0 0 1
> 15 1 0 0 0 0 0 0
> 16 1 0 0 0 0 0 0
> cultivar8
> 1 0
> 2 0
> 3 0
> 4 0
> 5 0
> 6 0
> 7 0
> 8 0
> 9 0
> 10 0
> 11 0
> 12 0
> 13 0
> 14 0
> 15 1
> 16 1
> attr(,"assign")
> [1] 0 1 1 1 1 1 1 1
> attr(,"contrasts")
> attr(,"contrasts")$cultivar
> [1] "contr.treatment"
>
> > d <- estimateGLMCommonDisp(d, design)
> > d
> An object of class "DGEList"
> $samples
> files group description lib.size norm.factors
> 1 s_1_CE_1_1counts.txt CER Cervil_CE 54428965 1
> 2 s_2_CE_1_2counts.txt CER Cervil_CE 47946147 1
> 3 s_6_CE_6_1counts.txt CRI Criollo_CE 49025258 1
> 4 s_7_CE_6_2counts.txt CRI Criollo_CE 43322654 1
> 5 s_1_CE_10_1counts.txt FER Ferum_CE 53574521 1
> 11 more rows ...
>
> $counts
> 1 2 3 4 5 6 7 8 9 10 11 12 13 14
> GATCAAGAAAATAAAATGAA 203 136 344 195 286 182 438 492 418 273 97 208 256 345
> GATCTCTGTCTCATCTTTCG 122 87 113 131 210 332 384 399 221 309 138 127 329 272
> GATCTTCTTTTGAAGTAAAC 116 141 125 52 158 248 170 141 137 135 166 196 111 108
> GATCAGCCAGGTCCAAGGCC 56 15 41 44 36 47 40 39 80 56 26 46 31 46
> GATCCTCCGGAACCACAAGA 370 226 11 4 12 0 23 5 129 1 38 22 5 9
> 15 16
> GATCAAGAAAATAAAATGAA 154 69
> GATCTCTGTCTCATCTTTCG 112 94
> GATCTTCTTTTGAAGTAAAC 134 74
> GATCAGCCAGGTCCAAGGCC 81 66
> GATCCTCCGGAACCACAAGA 151 7
> 73310 more rows ...
>
> $common.dispersion
> [1] 0.1259328
>
> > fit <- glmFit(d,design,dispersion = d$common.dispersion)
> > lrt <- glmLRT(d,fit,design,coef=2:8)
> Error in glmfit$coefficients %*% contrast : non-conformable arguments
>
>
>
>
> On 10/14/2011 07:00 AM, Gordon K Smyth wrote:
>> Dear Laura,
>>
>>> Date: Wed, 12 Oct 2011 14:23:26 +0200
>>> From: lpascual <Laura.pascual at avignon.inra.fr>
>>> To: Bioconductor at r-project.org
>>> Subject: [BioC] edgeR design matrix
>>>
>>> Hi all,
>>>
>>> I am new with RNA-seq data analysis. I have data from illumina DGE experiments from 8 different cultivars (duplicates). I also have 4 extra samples, the F1 hybrids (duplicates), obtained from crossing the 8 cultivars as follows (1x2, 3x4, 5x6, 7x8). In total I have 24 DGE illumina reactions, where all the samples came from the same developmental stage and treatment. So the only different factor I have is the genotype.
>>>
>>> I am trying to analyze the data with the edgeR package, however I am not sure if I can use the GML function for multiple testings or how should I made my design matrix.
>>>
>>> To find the differences among the 8 parental lines, I know I can made two ways testing between the different samples, but I was wondering, if I could be able to do a test were the null hypothesis will be there is no difference between the cultivars, and the alternative hypothesis, at least there is difference for one cultivar.
>>>
>>> To do that, it is possible to use one design matrix like these:
>>>
>>> Cultivar1 Cultivar2 Cultivar3 cultivar4 Cultivar5 Cultivar6 Cultivar7 Cultivar8
>>> 1-rep1 1 0 0 0 0 0 0 0
>>> 1-rep2 1 0 0 0 0 0 0 0
>>> 2-rep1 0 1 0 0 0 0 0 0
>>> 2-rep2 0 1 0 0 0 0 0 0
>>> 3-rep1 0 0 1 0 0 0 0 0
>>> 3-rep2 0 0 1 0 0 0 0 0
>>> 4-rep1 0 0 0 1 0 0 0 0
>>> 4-rep2 0 0 0 1 0 0 0 0
>>> 5-rep1 0 0 0 0 1 0 0 0
>>> 5-rep2 0 0 0 0 1 0 0 0
>>> 6-rep1 0 0 0 0 0 1 0 0
>>> 6-rep2 0 0 0 0 0 1 0 0
>>> 7-rep1 0 0 0 0 0 0 1 0
>>> 7-rep2 0 0 0 0 0 0 1 0
>>> 8-rep1 0 0 0 0 0 0 0 1
>>> 8-rep2 0 0 0 0 0 0 0 1
>>>
>>> I have try it, but to calculate the null hypothesis the gmlLRT seams
>>> just to remove the last column??
>>
>> You could use that design matrix, but it's inconvenient. If you want to find transcripts that are different between any of the cultivars, you would use:
>>
>> design <- model.matrix(~Cultivar)
>> fit <- glmFit(y,design)
>> lrt <- glmLRT(y,fit,design,coef=2:8)
>> topTags(lrt)
>>
>> This gives you a test on 7 degrees of freedom (one for each coefficient being tested) for differences between the 8 cultivars.
>>
>> Of course you have to estimate the dispersions as well, but I assume you know how to do that.
>>
>>
>>> Besides I also want to test the differences between two parent lines and
>>> the F1 that I have derived from them I have try a design matrix as follows,
>>>
>>>
>>> Cultivar1 Cultivar2 F1
>>> 1-rep1 1 0 0
>>> 1-rep2 1 0 0
>>> 2-rep1 0 1 0
>>> 2-rep2 0 1 0
>>> F1-rep1 0 0 1
>>> F1-rep2 0 0 1
>>>
>>> But I again face the above mentioned problem.
>>
>> The above example should give you a clue as to how to do this.
>>
>>
>>> Besides I was wondering if I can employ a design matrix where I will be
>>> modeling how many alleles of each cultivar have my sample, doing
>>> something like these:
>>>
>>> Cultivar1 Cultivar2
>>> 1-rep1 2 0
>>> 1-rep2 2 0
>>> 2-rep1 0 2
>>> 2-rep2 0 2
>>> F1-rep1 0 1
>>> F1-rep2 0 1
>>>
>>> I will appreciate any advice about the proper way to face these analysis.
>>
>> You can put a covariate for allele number as a column in your design matrix, but you obviously need a column for the intercept term as well. Basically you need to start using the model.matrix() function to build your design matrices.
>>
>>
>>> An just an extra question, in the manual it is recommended to eliminate
>>> all the tags with low counts, as they are no going to be sadistically
>>> significance and they will slow the process. First it is said that the
>>> tags with less than 5 counts are not useful, but then all the tags with
>>> lees than one tag per million (TPM) are removed, however in my case
>>> having 50 million sequences that is all the tags with less than 50 counts.
>>>
>>> In fact when I do it my tags decrease a lot
>>>
>>> > dim(d)
>>> [1] 2029802 16
>>>
>>> > d <-d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size,
>>> dim(d)) >1) >= 2, ]
>>
>> You can just use cpm(d) to compute the counts-per-million.
>>
>>> > dim (d)
>>>
>>> [1] 73310 16
>>>
>>> That is normal? Why did you choose to remove less than 1TPM?
>>
>> It depends on your library size. A cutoff of 1 cpm was appropriate for the case study, but if you library sizes are smaller, then you will need to use a smaller cutoff.
>>
>> Best wishes
>> Gordon
>>
>>> Thanks in advance
>>> Laura Pascual
>>
>> ______________________________________________________________________
>> The information in this email is confidential and inte...{{dropped:6}}
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list