[BioC] edgeR outlier question

Robert Castelo robert.castelo at upf.edu
Tue May 8 08:41:07 CEST 2012


Simon, Ann,

If you just need a simple two-group comparison, you may want to try the
tweeDEseq package, which is based on a more flexible discrete count
distribution (the Poisson-Tweedie) that includes the negative binomial as
a special case. The case you describe below could be described as
heavy-tailed behavior. Figure 2 of the tweeDEseq vignette comments on
this: a single sample inflates the fold change, and the Poisson-Tweedie
distribution provides the additional flexibility needed to fit such
cases.
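
To make the effect of a single sample on the fold change concrete, here is a minimal Python sketch (not tweeDEseq or edgeR code) using the raw hsa-miR-517a counts from Alessandro's table quoted further down. It compares a naive log2 ratio of group means computed with all samples and with each sample of the second group left out in turn. The pseudocount of 0.5, the absence of library-size normalization, and the helper names are assumptions made only for illustration.

```python
import math

# Raw counts for hsa-miR-517a, copied from the table quoted later in this thread
# (samples 1-4 form one group, samples 5-8 the other).
group_a = [100, 5, 4, 2]
group_b = [6, 45, 10024, 3]


def log2_fold_change(a, b, pseudocount=0.5):
    """log2 ratio of mean(b) to mean(a); the pseudocount (an assumption) avoids log(0)."""
    mean_a = sum(a) / len(a) + pseudocount
    mean_b = sum(b) / len(b) + pseudocount
    return math.log2(mean_b / mean_a)


def leave_one_out(a, b):
    """Recompute the fold change with each sample of b dropped in turn."""
    return [log2_fold_change(a, b[:i] + b[i + 1:]) for i in range(len(b))]


full = log2_fold_change(group_a, group_b)
loo = leave_one_out(group_a, group_b)

print(f"all samples: {full:.2f}")
for sample, value in enumerate(loo, start=5):
    print(f"without sample {sample}: {value:.2f}")
```

If the fold change changes sign or collapses when one particular sample is left out, that sample alone is driving the call. This is only a rough diagnostic and does not replace a model-based test.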

If you don't want to go through the vignette, just run 'example(tweeDE)'
to see how it works. If you have several CPU cores, load the 'parallel'
library first, as computations with the Poisson-Tweedie model take longer
than with the negative binomial. As mentioned in other posts, filtering
helps, and you can find some of the strategies mentioned there in the
function 'filterCounts()'.
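
The rule of thumb quoted later in this thread (keep a tag if cpm > 1 in at least m samples, where m is the minimum group size) can be sketched in plain Python as follows. This illustrates the rule only; it is not the implementation of 'filterCounts()' or of edgeR, and the counts, library sizes, and function names are hypothetical.

```python
def counts_per_million(counts, lib_sizes):
    """Scale each count by its sample's library size, expressed per million reads."""
    return [[1e6 * c / n for c, n in zip(row, lib_sizes)] for row in counts]


def keep_expressed(counts, lib_sizes, min_cpm=1.0, min_samples=4):
    """Flag rows whose cpm exceeds min_cpm in at least min_samples samples."""
    flags = []
    for row in counts_per_million(counts, lib_sizes):
        n_expressed = sum(1 for value in row if value > min_cpm)
        flags.append(n_expressed >= min_samples)
    return flags


# Hypothetical data: 3 tags x 8 samples (two groups of four), 2 million reads per library.
lib_sizes = [2_000_000] * 8
counts = [
    [0, 0, 0, 0, 0, 0, 500, 0],      # silent except in one sample
    [10, 12, 8, 9, 30, 28, 35, 31],  # expressed in every sample
    [1, 1, 1, 1, 1, 1, 1, 1],        # below 1 cpm everywhere
]

keep = keep_expressed(counts, lib_sizes)
print(keep)
```

A tag that is silent in all but one sample is dropped by this rule, but a tag that sits just above the threshold everywhere and has one extreme sample would still pass, which is the situation Simon describes.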

cheers,
robert.

On 05/08/2012 03:22 AM, Simon Melov wrote:
> Hi Alessandro,
> I don't think this helps me, as I'm not looking to eliminate an entire gene based on
> a single replicate. I mentioned in my original post that I had applied the filtering
> discussed at length in the guide (my filtering criterion was to keep genes with at
> least one read in a minimum of 8 samples). But this doesn't address the problem of a very
> high level of reads in a single sample. This kind of variance should be incorporated
> into the analysis, and should not result in genes being listed as significant due to
> high levels in a single sample. This sort of problem is not unusual in the genomics world,
> and I think the microarray literature had numerous solutions to it.
> I'm surprised it popped up so early in my analysis, as I thought this would have been
> "solved" by now. As a later poster alluded to, perhaps it's due to a relatively "high"
> number of biological replicates (N=10 per group). This number of replicates is going
> to be commonplace going forward as sequencing costs tumble. So some guidance as to
> how to deal with this in edgeR would be very welcome.
>
> thanks
>
> Simon.
>
> On May 7, 2012, at 1:56 PM, Guffanti Alessandro wrote:
>
> That seems to be exactly the problem I had, so I am including Gordon's answer here.
>
> Actually filtering at the cpm level worked quite nicely to ameliorate the situation - look at the
> latest update (May 2011) of the User Manual to see a neat example of the procedure.
>
>
> Alessandro
>
> --
>
> Dear Alessandro,
>
> You seem to be giving examples of miRs that are expressed at a high level in
> just one sample.  The easiest way to deal with such miRs, if you really
> don't want to detect them, is to filter out miRs that fail to be expressed
> to a reasonable degree in at least four samples (since your groups are of
> size four).  See for example pages 24-25 of the edgeR user's guide, where
> this is done for the Dclk1 mouse case study.  We often suggest cpm>1 for
> at least m samples, where m is the minimum group size.
>
> Another obvious thing to do is to examine an MDS plot to identify outlier
> samples.
>
> --
>
> [BioC] edgeR: effect of 'outlier' tags on differential expression calls
> alessandro.guffanti at genomnia.com
> Tue Apr 24 12:48:22 CEST 2012
>
> Dear colleagues: I am using edgeR to examine differential expression in
> small RNA data.
>
> I noticed this problem also when working with SAGE datasets: when just one
> of the samples is clearly an outlier, as you can see below for sample 7
> (the comparison is 1-4 versus 5-8), there is a call of significant
> differential expression which seems to be inappropriate, or at least
> should be re-examined.
>
> How can we diagnose these situations before manually checking the tag
> counts for all the significant differential expression calls? Please note
> that these are tumor samples, so a high sample-to-sample variability is
> expected in principle.
>
> Thanks a lot in advance,
>
> Alessandro
>
>
> miRNA_ID          1.mirna  2.mirna  3.mirna  4.mirna  5.mirna  6.mirna  7.mirna  8.mirna
> hsa-miR-515-3p          3        1        1        1        1        7     1601        3
> hsa-miR-518e            4        0        1        0        1        2     1715        2
> hsa-miR-520d-3p         0        0        0        0        0        1      243        0
> hsa-miR-519c-3p         0        0        0        0        0        1      248        0
> hsa-miR-520f            0        0        0        0        0        0      163        0
> hsa-miR-519d           12        1        0        1        1        4     1754        1
> hsa-miR-520h            0        0        0        0        0        0      189        2
> hsa-miR-519c-5p         0        0        0        0        0        0      123        0
> hsa-miR-520g           16        1        1        4        2        4     1917        2
> hsa-miR-518b            5        0        0        1        1        3      686        1
> hsa-miR-517a          100        5        4        2        6       45    10024        3
>
>
>
> miRNA_ID            logConc      logFC   P.Value  adj.P.Val
> hsa-miR-515-3p    -15.09154   -8.61753   0.00000    0.00082
> hsa-miR-518e      -15.30278   -9.22926   0.00000    0.00110
> hsa-miR-520d-3p   -18.23592   -9.46747   0.00001    0.00201
> hsa-miR-519c-3p   -17.98705   -9.01722   0.00002    0.00338
> hsa-miR-520f      -32.04992  -35.93228   0.00002    0.00338
> hsa-miR-519d      -14.46073   -7.61177   0.00003    0.00338
> hsa-miR-520h      -18.02925   -8.34496   0.00003    0.00338
> hsa-miR-519c-5p   -32.25620  -35.51970   0.00004    0.00382
> hsa-miR-520g      -14.16219   -7.27220   0.00005    0.00382
> hsa-miR-518b      -15.70611   -7.39997   0.00006    0.00382
> hsa-miR-517a      -11.74423   -7.21374   0.00006    0.00382
>
> -----------------------------------------------------
> Alessandro Guffanti - Bioinformatics, Genomnia srl
> Via Nerviano, 31 - 20020 Lainate, Milano, Italy
> Ph: +39-0293305.702 Fax: +39-0293305.777
> http://www.genomnia.com
> "If you can dream it, you can do it" (Walt Disney)
>
> -----Original Message-----
> From: Simon Melov <smelov at buckinstitute.org>
> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
> Date: Mon, 7 May 2012 12:19:19 -0700
> Subject: [BioC] edgeR outlier question
>
> I have a reasonable RNASeq data set of 10 biological replicates of a control group
> versus 10 biological replicates of an experimental group. I've gone through the edgeR
> workflow and get a nice list of about 1000 genes differentially expressed due to the
> experimental manipulation. I input the data based on total reads per gene (I'd like to
> get to exons too, but first things first). The data were obtained via a paired-end
> strategy, so it's pretty good quality. The number of reads per sample (library) is
> about 10 million.
>
> My question is this: as I go through the list of significant genes that are
> differentially expressed between the two groups (normalized via the workflow), ranked
> by BH FDR down to 0.05, I see genes being judged as differentially expressed which have
> very low expression in most samples, yet are thrown off by 1 or 2 values, thereby
> achieving statistical significance. For example, a gene might have between 1 and 2
> counts per million reads in one group, and be basically the same in the other group,
> but one of the values is perhaps at 1000 or so counts, which seems to throw off the
> entire group, so that the gene becomes "significant".
>
> Shouldn't edgeR take this sort of biological variation within a group into account
> when assessing significance? It's clear that in the above example that sample is an
> outlier and the variance is therefore very high, so the gene shouldn't be ranked as
> differentially expressed. I filtered the data by requiring at least 1 count per sample
> in at least 8 samples per group. Should there be an additional filtering criterion to
> exclude these outliers? Or doesn't edgeR take this sort of situation into account
> (I thought it did)?
>
> Am I doing something wrong here?
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


