[BioC] edgeR design matrix

Mon Oct 17 13:52:13 CEST 2011

That works perfect, thanks so much Gordon and Mark

On 10/14/2011 09:27 PM, Mark Robinson wrote:
> Hi Laura,
>
> I think there was a minor typo there.  The glmLRT() call should be:
>
>   lrt<- glmLRT(d,fit,coef=2:8)
>
> Also, if you wish, you could create your factor with less keystrokes:
>
>   cultivar<- factor(rep(1:8,each=2))
>
> Hope that helps,
> Mark
>
>
> On 14.10.2011, at 17:13, lpascual wrote:
>
>> Hi Gordon,
>>
>> I try it like you said, but I get a mistake, I do not know how to solve it??
>>
>> Thanks in advance
>>
>> Laura
>>
>>> cultivar<- factor(c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8)
>> + )
>>> design<-model.matrix(~cultivar)
>>> design
>>    (Intercept) cultivar2 cultivar3 cultivar4 cultivar5 cultivar6 cultivar7
>> 1            1         0         0         0         0         0         0
>> 2            1         0         0         0         0         0         0
>> 3            1         1         0         0         0         0         0
>> 4            1         1         0         0         0         0         0
>> 5            1         0         1         0         0         0         0
>> 6            1         0         1         0         0         0         0
>> 7            1         0         0         1         0         0         0
>> 8            1         0         0         1         0         0         0
>> 9            1         0         0         0         1         0         0
>> 10           1         0         0         0         1         0         0
>> 11           1         0         0         0         0         1         0
>> 12           1         0         0         0         0         1         0
>> 13           1         0         0         0         0         0         1
>> 14           1         0         0         0         0         0         1
>> 15           1         0         0         0         0         0         0
>> 16           1         0         0         0         0         0         0
>>    cultivar8
>> 1          0
>> 2          0
>> 3          0
>> 4          0
>> 5          0
>> 6          0
>> 7          0
>> 8          0
>> 9          0
>> 10         0
>> 11         0
>> 12         0
>> 13         0
>> 14         0
>> 15         1
>> 16         1
>> attr(,"assign")
>> [1] 0 1 1 1 1 1 1 1
>> attr(,"contrasts")
>> attr(,"contrasts")$cultivar
>> [1] "contr.treatment"
>>
>>> d<- estimateGLMCommonDisp(d, design)
>>> d
>> An object of class "DGEList"
>> $samples
>>                   files group description lib.size norm.factors
>> 1  s_1_CE_1_1counts.txt   CER   Cervil_CE 54428965            1
>> 2  s_2_CE_1_2counts.txt   CER   Cervil_CE 47946147            1
>> 3  s_6_CE_6_1counts.txt   CRI  Criollo_CE 49025258            1
>> 4  s_7_CE_6_2counts.txt   CRI  Criollo_CE 43322654            1
>> 5 s_1_CE_10_1counts.txt   FER    Ferum_CE 53574521            1
>> 11 more rows ...
>>
>> $counts
>>                        1   2   3   4   5   6   7   8   9  10  11  12  13  14
>> GATCAAGAAAATAAAATGAA 203 136 344 195 286 182 438 492 418 273  97 208 256 345
>> GATCTCTGTCTCATCTTTCG 122  87 113 131 210 332 384 399 221 309 138 127 329 272
>> GATCTTCTTTTGAAGTAAAC 116 141 125  52 158 248 170 141 137 135 166 196 111 108
>> GATCAGCCAGGTCCAAGGCC  56  15  41  44  36  47  40  39  80  56  26  46  31  46
>> GATCCTCCGGAACCACAAGA 370 226  11   4  12   0  23   5 129   1  38  22   5   9
>>                       15 16
>> GATCAAGAAAATAAAATGAA 154 69
>> GATCTCTGTCTCATCTTTCG 112 94
>> GATCTTCTTTTGAAGTAAAC 134 74
>> GATCAGCCAGGTCCAAGGCC  81 66
>> GATCCTCCGGAACCACAAGA 151  7
>> 73310 more rows ...
>>
>> $common.dispersion
>> [1] 0.1259328
>>
>>> fit<- glmFit(d,design,dispersion = d$common.dispersion)
>>> lrt<- glmLRT(d,fit,design,coef=2:8)
>> Error in glmfit$coefficients %*% contrast : non-conformable arguments
>>
>>
>>
>>
>> On 10/14/2011 07:00 AM, Gordon K Smyth wrote:
>>> Dear Laura,
>>>
>>>> Date: Wed, 12 Oct 2011 14:23:26 +0200
>>>> From: lpascual<Laura.pascual at avignon.inra.fr>
>>>> To: Bioconductor at r-project.org
>>>> Subject: [BioC] edgeR design matrix
>>>>
>>>> Hi all,
>>>>
>>>> I am new with RNA-seq data analysis. I have data from illumina DGE experiments from 8 different cultivars (duplicates). I also have 4 extra samples, the F1 hybrids (duplicates), obtained from crossing the 8 cultivars as follows (1x2, 3x4, 5x6, 7x8). In total I have 24 DGE illumina reactions, where all the samples came from the same developmental stage and treatment. So the only different factor I have is the genotype.
>>>>
>>>> I am trying to analyze the data with the edgeR package, however I am not sure if I can use the GML function for multiple testings or how should I made my design matrix.
>>>>
>>>> To find the differences among the 8 parental lines, I know I can made two ways testing between the different samples, but I was wondering, if I could be able to do a test were the null hypothesis will be there is no difference between the cultivars, and the alternative hypothesis, at least there is difference for one cultivar.
>>>>
>>>> To do that, it is possible to use one design matrix like these:
>>>>
>>>> Cultivar1 Cultivar2 Cultivar3 cultivar4 Cultivar5 Cultivar6 Cultivar7 Cultivar8
>>>> 1-rep1 1 0 0 0 0 0 0 0
>>>> 1-rep2 1 0 0 0 0 0 0 0
>>>> 2-rep1 0 1 0 0 0 0 0 0
>>>> 2-rep2 0 1 0 0 0 0 0 0
>>>> 3-rep1 0 0 1 0 0 0 0 0
>>>> 3-rep2 0 0 1 0 0 0 0 0
>>>> 4-rep1 0 0 0 1 0 0 0 0
>>>> 4-rep2 0 0 0 1 0 0 0 0
>>>> 5-rep1 0 0 0 0 1 0 0 0
>>>> 5-rep2 0 0 0 0 1 0 0 0
>>>> 6-rep1 0 0 0 0 0 1 0 0
>>>> 6-rep2 0 0 0 0 0 1 0 0
>>>> 7-rep1 0 0 0 0 0 0 1 0
>>>> 7-rep2 0 0 0 0 0 0 1 0
>>>> 8-rep1 0 0 0 0 0 0 0 1
>>>> 8-rep2 0 0 0 0 0 0 0 1
>>>>
>>>> I have try it, but to calculate the null hypothesis the gmlLRT seams
>>>> just to remove the last column??
>>> You could use that design matrix, but it's inconvenient.  If you want to find transcripts that are different between any of the cultivars, you would use:
>>>
>>>   design<- model.matrix(~Cultivar)
>>>   fit<- glmFit(y,design)
>>>   lrt<- glmLRT(y,fit,design,coef=2:8)
>>>   topTags(lrt)
>>>
>>> This gives you a test on 7 degrees of freedom (one for each coefficient being tested) for differences between the 8 cultivars.
>>>
>>> Of course you have to estimate the dispersions as well, but I assume you know how to do that.
>>>
>>>
>>>> Besides I also want to test the differences between two parent lines and
>>>> the F1 that I have derived from them I have try a design matrix as follows,
>>>>
>>>>
>>>> Cultivar1 Cultivar2 F1
>>>> 1-rep1 1 0 0
>>>> 1-rep2 1 0 0
>>>> 2-rep1 0 1 0
>>>> 2-rep2 0 1 0
>>>> F1-rep1 0 0 1
>>>> F1-rep2 0 0 1
>>>>
>>>> But I again face the above mentioned problem.
>>> The above example should give you a clue as to how to do this.
>>>
>>>
>>>> Besides I was wondering if I can employ a design matrix where I will be
>>>> modeling how many alleles of each cultivar have my sample, doing
>>>> something like these:
>>>>
>>>> Cultivar1 Cultivar2
>>>> 1-rep1 2 0
>>>> 1-rep2 2 0
>>>> 2-rep1 0 2
>>>> 2-rep2 0 2
>>>> F1-rep1 0 1
>>>> F1-rep2 0 1
>>>>
>>>> I will appreciate any advice about the proper way to face these analysis.
>>> You can put a covariate for allele number as a column in your design matrix, but you obviously need a column for the intercept term as well. Basically you need to start using the model.matrix() function to build your design matrices.
>>>
>>>
>>>> An just an extra question, in the manual it is recommended to eliminate
>>>> all the tags with low counts, as they are no going to be sadistically
>>>> significance and they will slow the process. First it is said that the
>>>> tags with less than 5 counts are not useful, but then all the tags with
>>>> lees than one tag per million (TPM) are removed, however in my case
>>>> having 50 million sequences that is all the tags with less than 50 counts.
>>>>
>>>> In fact when I do it my tags decrease a lot
>>>>
>>>>> dim(d)
>>>> [1] 2029802 16
>>>>
>>>>> d<-d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size,
>>>> dim(d))>1)>= 2, ]
>>> You can just use cpm(d) to compute the counts-per-million.
>>>
>>>>> dim (d)
>>>> [1] 73310 16
>>>>
>>>> That is normal? Why did you choose to remove less than 1TPM?
>>> It depends on your library size.  A cutoff of 1 cpm was appropriate for the case study, but if you library sizes are smaller, then you will need to use a smaller cutoff.
>>>
>>> Best wishes
>>> Gordon
>>>
>>>> Thanks in advance
>>>> Laura Pascual
>>> ______________________________________________________________________
>>> The information in this email is confidential and inte...{{dropped:6}}
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor