[BioC] edgeR design matrix

Fri Oct 14 21:27:31 CEST 2011

Hi Laura,

I think there was a minor typo there.  The glmLRT() call should be:

 lrt <- glmLRT(d,fit,coef=2:8)

Also, if you wish, you could create your factor with less keystrokes:

 cultivar <- factor(rep(1:8,each=2))

Hope that helps,
Mark

On 14.10.2011, at 17:13, lpascual wrote:

> Hi Gordon,
> 
> I try it like you said, but I get a mistake, I do not know how to solve it??
> 
> Thanks in advance
> 
> Laura
> 
> > cultivar <- factor(c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8)
> + )
> > design <-model.matrix(~cultivar)
> > design
>   (Intercept) cultivar2 cultivar3 cultivar4 cultivar5 cultivar6 cultivar7
> 1            1         0         0         0         0         0         0
> 2            1         0         0         0         0         0         0
> 3            1         1         0         0         0         0         0
> 4            1         1         0         0         0         0         0
> 5            1         0         1         0         0         0         0
> 6            1         0         1         0         0         0         0
> 7            1         0         0         1         0         0         0
> 8            1         0         0         1         0         0         0
> 9            1         0         0         0         1         0         0
> 10           1         0         0         0         1         0         0
> 11           1         0         0         0         0         1         0
> 12           1         0         0         0         0         1         0
> 13           1         0         0         0         0         0         1
> 14           1         0         0         0         0         0         1
> 15           1         0         0         0         0         0         0
> 16           1         0         0         0         0         0         0
>   cultivar8
> 1          0
> 2          0
> 3          0
> 4          0
> 5          0
> 6          0
> 7          0
> 8          0
> 9          0
> 10         0
> 11         0
> 12         0
> 13         0
> 14         0
> 15         1
> 16         1
> attr(,"assign")
> [1] 0 1 1 1 1 1 1 1
> attr(,"contrasts")
> attr(,"contrasts")$cultivar
> [1] "contr.treatment"
> 
> > d <- estimateGLMCommonDisp(d, design)
> > d
> An object of class "DGEList"
> $samples
>                  files group description lib.size norm.factors
> 1  s_1_CE_1_1counts.txt   CER   Cervil_CE 54428965            1
> 2  s_2_CE_1_2counts.txt   CER   Cervil_CE 47946147            1
> 3  s_6_CE_6_1counts.txt   CRI  Criollo_CE 49025258            1
> 4  s_7_CE_6_2counts.txt   CRI  Criollo_CE 43322654            1
> 5 s_1_CE_10_1counts.txt   FER    Ferum_CE 53574521            1
> 11 more rows ...
> 
> $counts
>                       1   2   3   4   5   6   7   8   9  10  11  12  13  14
> GATCAAGAAAATAAAATGAA 203 136 344 195 286 182 438 492 418 273  97 208 256 345
> GATCTCTGTCTCATCTTTCG 122  87 113 131 210 332 384 399 221 309 138 127 329 272
> GATCTTCTTTTGAAGTAAAC 116 141 125  52 158 248 170 141 137 135 166 196 111 108
> GATCAGCCAGGTCCAAGGCC  56  15  41  44  36  47  40  39  80  56  26  46  31  46
> GATCCTCCGGAACCACAAGA 370 226  11   4  12   0  23   5 129   1  38  22   5   9
>                      15 16
> GATCAAGAAAATAAAATGAA 154 69
> GATCTCTGTCTCATCTTTCG 112 94
> GATCTTCTTTTGAAGTAAAC 134 74
> GATCAGCCAGGTCCAAGGCC  81 66
> GATCCTCCGGAACCACAAGA 151  7
> 73310 more rows ...
> 
> $common.dispersion
> [1] 0.1259328
> 
> > fit <- glmFit(d,design,dispersion = d$common.dispersion)
> > lrt <- glmLRT(d,fit,design,coef=2:8)
> Error in glmfit$coefficients %*% contrast : non-conformable arguments
> 
> 
> 
> 
> On 10/14/2011 07:00 AM, Gordon K Smyth wrote:
>> Dear Laura,
>> 
>>> Date: Wed, 12 Oct 2011 14:23:26 +0200
>>> From: lpascual <Laura.pascual at avignon.inra.fr>
>>> To: Bioconductor at r-project.org
>>> Subject: [BioC] edgeR design matrix
>>> 
>>> Hi all,
>>> 
>>> I am new with RNA-seq data analysis. I have data from illumina DGE experiments from 8 different cultivars (duplicates). I also have 4 extra samples, the F1 hybrids (duplicates), obtained from crossing the 8 cultivars as follows (1x2, 3x4, 5x6, 7x8). In total I have 24 DGE illumina reactions, where all the samples came from the same developmental stage and treatment. So the only different factor I have is the genotype.
>>> 
>>> I am trying to analyze the data with the edgeR package, however I am not sure if I can use the GML function for multiple testings or how should I made my design matrix.
>>> 
>>> To find the differences among the 8 parental lines, I know I can made two ways testing between the different samples, but I was wondering, if I could be able to do a test were the null hypothesis will be there is no difference between the cultivars, and the alternative hypothesis, at least there is difference for one cultivar.
>>> 
>>> To do that, it is possible to use one design matrix like these:
>>> 
>>> Cultivar1 Cultivar2 Cultivar3 cultivar4 Cultivar5 Cultivar6 Cultivar7 Cultivar8
>>> 1-rep1 1 0 0 0 0 0 0 0
>>> 1-rep2 1 0 0 0 0 0 0 0
>>> 2-rep1 0 1 0 0 0 0 0 0
>>> 2-rep2 0 1 0 0 0 0 0 0
>>> 3-rep1 0 0 1 0 0 0 0 0
>>> 3-rep2 0 0 1 0 0 0 0 0
>>> 4-rep1 0 0 0 1 0 0 0 0
>>> 4-rep2 0 0 0 1 0 0 0 0
>>> 5-rep1 0 0 0 0 1 0 0 0
>>> 5-rep2 0 0 0 0 1 0 0 0
>>> 6-rep1 0 0 0 0 0 1 0 0
>>> 6-rep2 0 0 0 0 0 1 0 0
>>> 7-rep1 0 0 0 0 0 0 1 0
>>> 7-rep2 0 0 0 0 0 0 1 0
>>> 8-rep1 0 0 0 0 0 0 0 1
>>> 8-rep2 0 0 0 0 0 0 0 1
>>> 
>>> I have try it, but to calculate the null hypothesis the gmlLRT seams
>>> just to remove the last column??
>> 
>> You could use that design matrix, but it's inconvenient.  If you want to find transcripts that are different between any of the cultivars, you would use:
>> 
>>  design <- model.matrix(~Cultivar)
>>  fit <- glmFit(y,design)
>>  lrt <- glmLRT(y,fit,design,coef=2:8)
>>  topTags(lrt)
>> 
>> This gives you a test on 7 degrees of freedom (one for each coefficient being tested) for differences between the 8 cultivars.
>> 
>> Of course you have to estimate the dispersions as well, but I assume you know how to do that.
>> 
>> 
>>> Besides I also want to test the differences between two parent lines and
>>> the F1 that I have derived from them I have try a design matrix as follows,
>>> 
>>> 
>>> Cultivar1 Cultivar2 F1
>>> 1-rep1 1 0 0
>>> 1-rep2 1 0 0
>>> 2-rep1 0 1 0
>>> 2-rep2 0 1 0
>>> F1-rep1 0 0 1
>>> F1-rep2 0 0 1
>>> 
>>> But I again face the above mentioned problem.
>> 
>> The above example should give you a clue as to how to do this.
>> 
>> 
>>> Besides I was wondering if I can employ a design matrix where I will be
>>> modeling how many alleles of each cultivar have my sample, doing
>>> something like these:
>>> 
>>> Cultivar1 Cultivar2
>>> 1-rep1 2 0
>>> 1-rep2 2 0
>>> 2-rep1 0 2
>>> 2-rep2 0 2
>>> F1-rep1 0 1
>>> F1-rep2 0 1
>>> 
>>> I will appreciate any advice about the proper way to face these analysis.
>> 
>> You can put a covariate for allele number as a column in your design matrix, but you obviously need a column for the intercept term as well. Basically you need to start using the model.matrix() function to build your design matrices.
>> 
>> 
>>> An just an extra question, in the manual it is recommended to eliminate
>>> all the tags with low counts, as they are no going to be sadistically
>>> significance and they will slow the process. First it is said that the
>>> tags with less than 5 counts are not useful, but then all the tags with
>>> lees than one tag per million (TPM) are removed, however in my case
>>> having 50 million sequences that is all the tags with less than 50 counts.
>>> 
>>> In fact when I do it my tags decrease a lot
>>> 
>>> > dim(d)
>>> [1] 2029802 16
>>> 
>>> > d <-d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size,
>>> dim(d)) >1) >= 2, ]
>> 
>> You can just use cpm(d) to compute the counts-per-million.
>> 
>>> > dim (d)
>>> 
>>> [1] 73310 16
>>> 
>>> That is normal? Why did you choose to remove less than 1TPM?
>> 
>> It depends on your library size.  A cutoff of 1 cpm was appropriate for the case study, but if you library sizes are smaller, then you will need to use a smaller cutoff.
>> 
>> Best wishes
>> Gordon
>> 
>>> Thanks in advance
>>> Laura Pascual
>> 
>> ______________________________________________________________________
>> The information in this email is confidential and inte...{{dropped:6}}
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor