[BioC] edgeR design matrix
Gordon K Smyth
smyth at wehi.EDU.AU
Fri Oct 14 07:00:37 CEST 2011
Dear Laura,
> Date: Wed, 12 Oct 2011 14:23:26 +0200
> From: lpascual <Laura.pascual at avignon.inra.fr>
> To: Bioconductor at r-project.org
> Subject: [BioC] edgeR design matrix
>
> Hi all,
>
> I am new with RNA-seq data analysis. I have data from illumina DGE
> experiments from 8 different cultivars (duplicates). I also have 4 extra
> samples, the F1 hybrids (duplicates), obtained from crossing the 8
> cultivars as follows (1x2, 3x4, 5x6, 7x8). In total I have 24 DGE
> illumina reactions, where all the samples came from the same
> developmental stage and treatment. So the only different factor I have
> is the genotype.
>
> I am trying to analyze the data with the edgeR package, however I am not
> sure if I can use the GML function for multiple testings or how should I
> made my design matrix.
>
> To find the differences among the 8 parental lines, I know I can made
> two ways testing between the different samples, but I was wondering, if
> I could be able to do a test were the null hypothesis will be there is
> no difference between the cultivars, and the alternative hypothesis, at
> least there is difference for one cultivar.
>
> To do that, it is possible to use one design matrix like these:
>
> Cultivar1 Cultivar2 Cultivar3 cultivar4 Cultivar5 Cultivar6 Cultivar7 Cultivar8
> 1-rep1 1 0 0 0 0 0 0 0
> 1-rep2 1 0 0 0 0 0 0 0
> 2-rep1 0 1 0 0 0 0 0 0
> 2-rep2 0 1 0 0 0 0 0 0
> 3-rep1 0 0 1 0 0 0 0 0
> 3-rep2 0 0 1 0 0 0 0 0
> 4-rep1 0 0 0 1 0 0 0 0
> 4-rep2 0 0 0 1 0 0 0 0
> 5-rep1 0 0 0 0 1 0 0 0
> 5-rep2 0 0 0 0 1 0 0 0
> 6-rep1 0 0 0 0 0 1 0 0
> 6-rep2 0 0 0 0 0 1 0 0
> 7-rep1 0 0 0 0 0 0 1 0
> 7-rep2 0 0 0 0 0 0 1 0
> 8-rep1 0 0 0 0 0 0 0 1
> 8-rep2 0 0 0 0 0 0 0 1
>
> I have try it, but to calculate the null hypothesis the gmlLRT seams
> just to remove the last column??
You could use that design matrix, but it's inconvenient. If you want to
find transcripts that are different between any of the cultivars, you
would use:
design <- model.matrix(~Cultivar)
fit <- glmFit(y,design)
lrt <- glmLRT(y,fit,design,coef=2:8)
topTags(lrt)
This gives you a test on 7 degrees of freedom (one for each coefficient
being tested) for differences between the 8 cultivars.
Of course you have to estimate the dispersions as well, but I assume you
know how to do that.
> Besides I also want to test the differences between two parent lines and
> the F1 that I have derived from them I have try a design matrix as follows,
>
>
> Cultivar1 Cultivar2 F1
> 1-rep1 1 0 0
> 1-rep2 1 0 0
> 2-rep1 0 1 0
> 2-rep2 0 1 0
> F1-rep1 0 0 1
> F1-rep2 0 0 1
>
> But I again face the above mentioned problem.
The above example should give you a clue as to how to do this.
> Besides I was wondering if I can employ a design matrix where I will be
> modeling how many alleles of each cultivar have my sample, doing
> something like these:
>
> Cultivar1 Cultivar2
> 1-rep1 2 0
> 1-rep2 2 0
> 2-rep1 0 2
> 2-rep2 0 2
> F1-rep1 0 1
> F1-rep2 0 1
>
> I will appreciate any advice about the proper way to face these analysis.
You can put a covariate for allele number as a column in your design
matrix, but you obviously need a column for the intercept term as well.
Basically you need to start using the model.matrix() function to build
your design matrices.
> An just an extra question, in the manual it is recommended to eliminate
> all the tags with low counts, as they are no going to be sadistically
> significance and they will slow the process. First it is said that the
> tags with less than 5 counts are not useful, but then all the tags with
> lees than one tag per million (TPM) are removed, however in my case
> having 50 million sequences that is all the tags with less than 50 counts.
>
> In fact when I do it my tags decrease a lot
>
> > dim(d)
> [1] 2029802 16
>
> > d <-d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size,
> dim(d)) >1) >= 2, ]
You can just use cpm(d) to compute the counts-per-million.
> > dim (d)
>
> [1] 73310 16
>
> That is normal? Why did you choose to remove less than 1TPM?
It depends on your library size. A cutoff of 1 cpm was appropriate for
the case study, but if you library sizes are smaller, then you will need
to use a smaller cutoff.
Best wishes
Gordon
> Thanks in advance
> Laura Pascual
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list