[BioC] edgeR outlier question
Simon Melov
smelov at buckinstitute.org
Tue May 8 03:22:42 CEST 2012
Hi Alessandro,
I don't think this helps me, as I'm not looking to eliminate an entire gene based on a single replicate. I mentioned in my original post that I had applied the filtering discussed at length in the guide (my filtering criterion was to keep genes with at least one read in a minimum of 8 samples). But this doesn't address the problem of a very high number of reads in a single sample. This variance should be incorporated into the analysis, and should not result in genes being listed as significant because of high levels in a single sample.
This sort of problem is not unusual in the genomics world, and I think the microarray literature has numerous solutions to it. I'm surprised it popped up so early in my analysis, as I thought this would have been "solved" by now. As a later poster alluded to, perhaps it's due to a relatively "high" number of biological replicates (N=10 per group). This number of replicates is going to be commonplace as sequencing costs tumble, so some guidance on how to deal with this in edgeR would be very welcome.
thanks
Simon.
On May 7, 2012, at 1:56 PM, Guffanti Alessandro wrote:
That seems to be exactly the same problem I had, so I am including Gordon's answer here.
Actually filtering at the cpm level worked quite nicely to ameliorate the situation - look at the
latest update (May 2011) of the User Manual to see a neat example of the procedure.
Alessandro
--
Dear Alessandro,
You seem to be giving examples of miRs that are expressed at a high level in
just one sample. The easiest way to deal with such miRs, if you really
don't want to detect them, is to filter out miRs that fail to be expressed
to a reasonable degree in at least four samples (since your groups are of
size four). See for example pages 24-25 of the edgeR user's guide, where
this is done for the Dclk1 mouse case study. We often suggest cpm>1 for
at least m samples, where m is the minimum group size.
Another obvious thing to do is to examine an MDS plot to identify outlier
samples.
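
As a rough illustration of the "cpm>1 for at least m samples" rule suggested above, here is a minimal sketch in plain Python (not edgeR code). The three rows are copied from the count table posted in this thread; the library sizes are an assumption (one million reads each), since the thread does not give them, and the function names are made up for this example.

```python
# Sketch of the "cpm > 1 in at least m samples" filter described above.
# ASSUMPTION: library sizes are hypothetical; the thread does not report them.

def counts_per_million(row, lib_sizes):
    """Scale the raw counts of one miR to counts per million (cpm)."""
    return [c * 1e6 / n for c, n in zip(row, lib_sizes)]

def passes_filter(row, lib_sizes, min_cpm=1.0, min_samples=4):
    """True if the row exceeds min_cpm in at least min_samples samples."""
    cpms = counts_per_million(row, lib_sizes)
    return sum(v > min_cpm for v in cpms) >= min_samples

# Three rows copied from the count table posted in this thread.
counts = {
    "hsa-miR-515-3p": [3, 1, 1, 1, 1, 7, 1601, 3],
    "hsa-miR-520f": [0, 0, 0, 0, 0, 0, 163, 0],
    "hsa-miR-517a": [100, 5, 4, 2, 6, 45, 10024, 3],
}
lib_sizes = [1_000_000] * 8  # assumption: one million reads per library

# m = 4 because the groups in this example have four samples each.
kept = {name: passes_filter(row, lib_sizes, min_samples=4)
        for name, row in counts.items()}
print(kept)
```

Under these assumed library sizes the filter drops a miR seen only in sample 7, but keeps miRs that have a low background count in the other samples, so it reduces the problem rather than removing it. In edgeR itself the same rule is usually written along the lines of `keep <- rowSums(cpm(y) > 1) >= 4`.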
--
[BioC] edgeR: effect of 'outlier' tags on differential expression calls
alessandro.guffanti at genomnia.com
Tue Apr 24 12:48:22 CEST 2012
Dear colleagues: I am using edgeR to examine differential expression on
small RNA data.
I also noticed this problem when working with SAGE datasets: when just one
of the samples is clearly an outlier, as you can see below for sample 7
(the comparison is 1-4 versus 5-8), there is a call of significant
differential expression which seems to be inappropriate, or at least
should be re-examined.
How can we diagnose these situations before manually checking the tag
counts for all the significant differential expression calls? Please note
that these are tumour samples, so a high sample-to-sample variability is
expected in principle.
Thanks a lot in advance,
Alessandro
miRNA_ID         1.mirna  2.mirna  3.mirna  4.mirna  5.mirna  6.mirna  7.mirna  8.mirna
hsa-miR-515-3p         3        1        1        1        1        7     1601        3
hsa-miR-518e           4        0        1        0        1        2     1715        2
hsa-miR-520d-3p        0        0        0        0        0        1      243        0
hsa-miR-519c-3p        0        0        0        0        0        1      248        0
hsa-miR-520f           0        0        0        0        0        0      163        0
hsa-miR-519d          12        1        0        1        1        4     1754        1
hsa-miR-520h           0        0        0        0        0        0      189        2
hsa-miR-519c-5p        0        0        0        0        0        0      123        0
hsa-miR-520g          16        1        1        4        2        4     1917        2
hsa-miR-518b           5        0        0        1        1        3      686        1
hsa-miR-517a         100        5        4        2        6       45    10024        3

miRNA_ID           logConc      logFC  P.Value  adj.P.Val
hsa-miR-515-3p   -15.09154   -8.61753  0.00000    0.00082
hsa-miR-518e     -15.30278   -9.22926  0.00000    0.00110
hsa-miR-520d-3p  -18.23592   -9.46747  0.00001    0.00201
hsa-miR-519c-3p  -17.98705   -9.01722  0.00002    0.00338
hsa-miR-520f     -32.04992  -35.93228  0.00002    0.00338
hsa-miR-519d     -14.46073   -7.61177  0.00003    0.00338
hsa-miR-520h     -18.02925   -8.34496  0.00003    0.00338
hsa-miR-519c-5p  -32.25620  -35.51970  0.00004    0.00382
hsa-miR-520g     -14.16219   -7.27220  0.00005    0.00382
hsa-miR-518b     -15.70611   -7.39997  0.00006    0.00382
hsa-miR-517a     -11.74423   -7.21374  0.00006    0.00382
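
One possible way to diagnose these calls without checking every row by eye is to ask, for each significant miR, how much of its total count comes from a single sample. The sketch below is plain Python (not part of edgeR) and uses the counts from the table posted above. The 0.9 threshold and the function name are assumptions chosen for illustration, and raw counts are used without any library-size normalisation.

```python
# Flag miRs whose counts are dominated by one sample.
# ASSUMPTIONS: the 0.9 threshold is arbitrary, and raw counts are compared
# directly (no library-size normalisation), which is only a rough screen.

def dominant_sample(row):
    """Return (1-based sample number, share of the row total) for the
    sample with the largest count; share is 0.0 for an all-zero row."""
    total = sum(row)
    if total == 0:
        return None, 0.0
    idx = max(range(len(row)), key=lambda i: row[i])
    return idx + 1, row[idx] / total

# Counts copied from the table posted in this thread (samples 1..8).
counts = {
    "hsa-miR-515-3p": [3, 1, 1, 1, 1, 7, 1601, 3],
    "hsa-miR-518e": [4, 0, 1, 0, 1, 2, 1715, 2],
    "hsa-miR-520d-3p": [0, 0, 0, 0, 0, 1, 243, 0],
    "hsa-miR-519c-3p": [0, 0, 0, 0, 0, 1, 248, 0],
    "hsa-miR-520f": [0, 0, 0, 0, 0, 0, 163, 0],
    "hsa-miR-519d": [12, 1, 0, 1, 1, 4, 1754, 1],
    "hsa-miR-520h": [0, 0, 0, 0, 0, 0, 189, 2],
    "hsa-miR-519c-5p": [0, 0, 0, 0, 0, 0, 123, 0],
    "hsa-miR-520g": [16, 1, 1, 4, 2, 4, 1917, 2],
    "hsa-miR-518b": [5, 0, 0, 1, 1, 3, 686, 1],
    "hsa-miR-517a": [100, 5, 4, 2, 6, 45, 10024, 3],
}

THRESHOLD = 0.9  # hypothetical cut-off for "dominated by one sample"

flagged = {}
for name, row in counts.items():
    sample, share = dominant_sample(row)
    if share >= THRESHOLD:
        flagged[name] = (sample, round(share, 3))

for name, (sample, share) in flagged.items():
    print(f"{name}: sample {sample} holds {share:.1%} of the counts")
```

For the eleven miRs listed here, every row is flagged and the dominant sample is sample 7 in each case, which matches the description in the post.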
-----------------------------------------------------
Alessandro Guffanti - Bioinformatics, Genomnia srl
Via Nerviano, 31 - 20020 Lainate, Milano, Italy
Ph: +39-0293305.702 Fax: +39-0293305.777
http://www.genomnia.com
"If you can dream it, you can do it" (Walt Disney)
-----Original Message-----
From: Simon Melov <smelov at buckinstitute.org>
To: "bioconductor at r-project.org" <bioconductor at r-project.org>
Date: Mon, 7 May 2012 12:19:19 -0700
Subject: [BioC] edgeR outlier question
I have a reasonable RNASeq data set of 10 biological replicates of a control group versus 10 biological replicates of an experimental group. I've gone through the edgeR workflow, and get a nice list of about 1000 genes differentially expressed due to the experimental manipulation. I input the data based on total reads per gene (I'd like to get to exons too, but first things first). The data is obtained via a paired-end strategy, so it's pretty good quality. The number of reads per sample (library) is about 10 million.
My question is this: as I go through the list of significant genes which are differentially expressed between the two groups (normalized via the workflow), ranked by BH FDR down to 0.05, I see genes being judged as differentially expressed which have very low expression in most samples, yet are thrown off by 1 or 2 values, thereby achieving statistical significance. For example, a gene might have between 1 and 2 counts per million reads in one group, and be basically the same in the other group, but one of the values is at around 1000 counts, which seems to throw off the entire group, so that the gene becomes "significant".
Shouldn't edgeR take into account this sort of biological variation within a group when assessing significance? It's clear that in the above example that sample is an outlier, so the variance is very high and the gene shouldn't be ranked as differentially expressed. I filtered the data by requiring at least 1 count per sample, in at least 8 samples per group. Should there be an additional filtering criterion to exclude these outliers? Or doesn't edgeR take this sort of situation into account (I thought it did)?
Am I doing something wrong here?
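
To make the variance point above concrete, here is a small sketch in plain Python. It uses the standard negative binomial mean-variance relation, var = mu + phi * mu^2, to get a crude method-of-moments dispersion for a single gene within one group. The counts are hypothetical values invented to match the description (1 to 2 counts in most replicates, about 1000 in one), and this is not how edgeR estimates dispersion; it only shows how strongly one outlier inflates the gene's own variance.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Crude per-gene dispersion from var = mu + phi * mu**2,
    truncated at zero. Illustration only, not edgeR's estimator."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance (n - 1 denominator)
    return max(0.0, (var - mu) / (mu * mu))

# HYPOTHETICAL counts for one gene in one group of 10 replicates.
with_outlier = [1, 2, 1, 2, 1, 2, 1, 2, 1, 1000]
without_outlier = with_outlier[:-1]  # same gene, outlier replicate dropped

phi_with = moment_dispersion(with_outlier)
phi_without = moment_dispersion(without_outlier)
print(f"dispersion with outlier:    {phi_with:.2f}")
print(f"dispersion without outlier: {phi_without:.2f}")
```

If the dispersion used for such a gene is much smaller than its own moment estimate (for example because a single common value is shared across all genes), the outlier would look far more surprising to the test than it should, which may be one explanation for the calls described here.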
_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor