[BioC] filter high-throughput microarray data with noise

Mon Sep 11 18:52:56 CEST 2006

Dear Francois and others:

Thank you. I cc'd r-help only because I was hoping to get more
suggestions, but keeping the thread on Bioconductor is totally fine with
me.

I am trying out an idea for pathway analysis, and the data used here are
real medical data for a disease whose mechanism is unclear. The probes
here correspond to different splice variants of one gene, so I need to
keep all of them for my analysis. Currently I do not have the knowledge
to evaluate the behavior of the probes.

By "We often select the most variables as the representative one.", do
you mean "select the most samples or most probes"?
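
One possible reading is "the most variable probe per gene". A minimal
sketch of that reading in R, using only the three rows quoted further
down; the object names (x, probe.var, keep, x.top) are made up for
illustration, and this is a guess at the procedure, not necessarily what
you do:

```r
# Three probes (rows) for one gene, copied from the table quoted below.
x <- data.frame(
  V1  = rep(-2147022884, 3),
  V3  = c(3.967828, -4.031250, -2.016835),
  V5  = c(5.010724, -1.441341, -1.568063),
  V7  = c(3.356568, -1.036145, -1.079279),
  V9  = c(1.227882, -3.583333, -1.288172),
  V11 = c(1.481481, -8.953125, -50.875421),
  V13 = c(1.870871, -3.201117, -39.554974)
)

# Variance of each probe (row) across the patient columns.
probe.var <- apply(as.matrix(x[, -1]), 1, var)

# For each gene id, keep the row index of the probe with the largest variance.
keep <- tapply(seq_len(nrow(x)), x$V1,
               function(i) i[which.max(probe.var[i])])
x.top <- x[as.vector(keep), ]
```

On these three rows it keeps the third probe, whose V11 and V13 values
dominate the variance.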

I agree with you that using an average is not a good idea. That is why
I need some filtering mechanism or something else. I believe this is a
common situation that people run into when they deal with noisy
high-throughput data. So my second question is a request for general
references or experience.
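
As a starting point for such a filter, here is a rough sketch of a
diagnostic rather than a finished method: compute the between-probe
variance for each gene and patient (the same quantity as x2.var in my
original post) and order the patient columns by how much the probes
disagree. The names x, v, and ranked are just placeholders:

```r
# Same three probe rows for one gene as in the table quoted below.
x <- data.frame(
  V1  = rep(-2147022884, 3),
  V3  = c(3.967828, -4.031250, -2.016835),
  V5  = c(5.010724, -1.441341, -1.568063),
  V7  = c(3.356568, -1.036145, -1.079279),
  V9  = c(1.227882, -3.583333, -1.288172),
  V11 = c(1.481481, -8.953125, -50.875421),
  V13 = c(1.870871, -3.201117, -39.554974)
)

# Variance across the probes of each gene, one value per patient column.
x2.var <- aggregate(x[, -1], by = list(x$V1), FUN = var)

# Patient columns for the first gene, from most to least probe disagreement.
v <- unlist(x2.var[1, -1])
ranked <- sort(v, decreasing = TRUE)
print(round(ranked, 2))
```

For this gene, V11 and V13 come out far ahead of the other columns,
which is the disagreement I was asking about.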

Thanks for any other suggestions,

On 9/11/06, Francois Pepin <fpepin at cs.mcgill.ca> wrote:
> Hi Weiwei,
>
> I've removed the R-Help mailing list; this question does not really
> concern them (except for the subset that is already on the bioc list).
>
> To answer your first question: yes, it is somewhat common. The first step
> would be to ask yourself why you would be getting different values here.
> Could it be that some of the probes are not behaving properly in your
> samples? If you have reasons to think that there is one probe which is
> more representative, then you might want to only select that one (for
> example by variance). If they represented different splice variants,
> then you might want to keep all of them around. If you have such
> diverging results, I do not think that averaging them would be a good
> idea.
>
> The strategy that we used at the beginning was to keep all probes, and
> see which ones come up during differential expression or other analyses.
> Then you can compare the results to see how the different probes are
> reacting and which ones make sense based on what you know of your
> samples.
>
> In our case, we have good reasons to think that lots of probes are
> misbehaving, for example by looking at genes whose behavior is known.
> We often select the most variables as the representative one.
>
> I do not have any references handy for this, maybe other people do.
>
> Francois
>
> On Mon, 2006-09-11 at 12:11 -0400, Weiwei Shi wrote:
> > Dear Listers:
> >
> > Currently I am doing research using microarray data. I have two
> > questions and hope I can get some help here:
> >
> > 1. I have a dataset like the following, in which V1 is the gene id and
> > V3... are the fold changes in expression level for different patients.
> > There are multiple probes for one gene, so there are multiple rows.
> > As you can see in columns V11 and V13, the fold changes are very
> > different. Is this common in microarray data analysis? How is it
> > generally dealt with? I don't want to use a p-value or some other
> > threshold to discretize them at this step yet.
> >
> >          V1        V3        V5        V7        V9        V11        V13
> > -2147022884  3.967828  5.010724  3.356568  1.227882   1.481481   1.870871
> > -2147022884 -4.031250 -1.441341 -1.036145 -3.583333  -8.953125  -3.201117
> > -2147022884 -2.016835 -1.568063 -1.079279 -1.288172 -50.875421 -39.554974
> >
> > Here is the variance across the probes of this gene, per column:
> > > x2.var[2,]
> >       Group.1       V3       V5       V7       V9      V11      V13
> > -2147022884 17.30989 14.15427 6.495755 5.791014 767.9342 510.5714
> >
> > 2. Is there any good reference on this kind of thing, such as online
> > material or a book?
> >
> > thanks,
>
>

-- 
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III