[BioC] filter high-throughput microarray data with noise

Mon Sep 11 18:31:12 CEST 2006

Hi Weiwei,

I've removed R-help from the recipients, since this question does not
really concern that list (except for the subset who are already on the
BioC list).

To answer your first question: yes, it is somewhat common. The first step
would be to ask yourself why you would be getting different values here.
Could it be that some of the probes are not behaving properly in your
samples? If you have reasons to think that there is one probe which is
more representative, then you might want to only select that one (for
example by variance). If they represent different splice variants,
then you might want to keep all of them around. If you have such
diverging results, I do not think that averaging them would be a good
idea.
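
To illustrate why averaging is risky here, a minimal R sketch using only the
V11 and V13 values from the quoted message below. The object names (fc,
probe_mean, probe_var) are mine and purely illustrative.

```r
# V11 and V13 fold changes for the three probes of gene -2147022884,
# copied from the quoted table.
fc <- data.frame(
  V11 = c(1.481481, -8.953125, -50.875421),
  V13 = c(1.870871, -3.201117, -39.554974)
)

# Per-patient mean across the three probes.
probe_mean <- colMeans(fc)

# Per-patient variance across the three probes; this appears to be
# what the quoted x2.var output reports for V11 and V13.
probe_var <- sapply(fc, var)

print(probe_mean)
print(probe_var)
```

The V11 mean comes out near -19.4, a value that none of the three probes
actually shows, which is why I would not summarize such probes by averaging.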

The strategy that we used at the beginning was to keep all probes, and
see which ones come up during differential expression or other analyses.
Then you can compare the results to see how the different probes are
reacting and which ones make sense based on what you know of your
samples.

In our case, we have good reasons to think that lots of probes are
misbehaving, for example by looking at genes whose behavior is known.
We often select the most variable probe as the representative one.
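
Selecting the most variable probe per gene could be sketched roughly as
follows in R. This is only an illustration on the three rows from the quoted
message, not our actual pipeline: the names fc, sample_cols, probe_var, keep
and representative are made up, and the variance is taken across patients
for each probe (unlike the quoted x2.var, which is across probes for each
patient).

```r
# Fold changes copied from the quoted message: three probes for one gene.
fc <- data.frame(
  V1  = rep(-2147022884, 3),
  V3  = c(3.967828, -4.031250, -2.016835),
  V5  = c(5.010724, -1.441341, -1.568063),
  V7  = c(3.356568, -1.036145, -1.079279),
  V9  = c(1.227882, -3.583333, -1.288172),
  V11 = c(1.481481, -8.953125, -50.875421),
  V13 = c(1.870871, -3.201117, -39.554974)
)

# Every column except the gene ID is treated as a patient.
sample_cols <- setdiff(names(fc), "V1")

# Variance of each probe (row) across patients.
probe_var <- apply(fc[, sample_cols], 1, var)

# For each gene ID, keep the row index of the probe with the largest variance.
keep <- unlist(lapply(
  split(seq_len(nrow(fc)), fc$V1),
  function(idx) idx[which.max(probe_var[idx])]
))

representative <- fc[keep, ]
print(representative)
```

On this data the third probe is kept, because its V11 and V13 values differ
so strongly from the rest of its row. Whether that makes it the most
trustworthy probe is a separate question that needs knowledge of the samples.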

I do not have any references handy for this; maybe other people do.

Francois

On Mon, 2006-09-11 at 12:11 -0400, Weiwei Shi wrote:
> Dear Listers:
> 
> Currently I am doing research using microarray data. I have two
> questions and hope I can get some help here:
> 
> 1. I have a dataset like the following, in which V1 is the gene ID and
> V3 onward are the fold changes of expression levels for different
> patients. There are multiple probes for one gene, so there are multiple
> rows. As you can see in columns V11 and V13, the fold changes are very
> different. Is this common in microarray data analysis? How is it
> generally dealt with? I don't want to use a p-value or a similar
> threshold to discretize them at this step yet.
> 
>          V1        V3        V5        V7        V9        V11        V13
> -2147022884  3.967828  5.010724  3.356568  1.227882   1.481481   1.870871
> -2147022884 -4.031250 -1.441341 -1.036145 -3.583333  -8.953125  -3.201117
> -2147022884 -2.016835 -1.568063 -1.079279 -1.288172 -50.875421 -39.554974
> 
> here is the variance
> > x2.var[2,]
>       Group.1       V3       V5       V7       V9      V11      V13
> -2147022884 17.30989 14.15427 6.495755 5.791014 767.9342 510.5714
> 
> 2. Is there any good reference on this kind of thing, such as online
> materials or a book?
> 
> thanks,