[BioC] finding and averaging replicate gene records

zhihua li lzhtom at hotmail.com
Thu Mar 17 04:19:09 CET 2005

Thanks to all for your replies.

It is true that by averaging expression values for (putatively) the same 
gene we will lose some information. But sometimes reducing the size of the 
data matters more, especially when one is trying to run a computationally 
expensive algorithm on the data. So I think averaging is sometimes 
worthwhile.
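
For anyone who does want to average, here is a minimal sketch in base R, 
assuming the layout described in my original question quoted below 
(column 1 = accession number, column 2 = UniGene ID, remaining columns = 
conditions). The data frame `expr`, its column names and its values are 
made up for illustration only.

```r
# Hypothetical toy data: column 1 = accession, column 2 = UniGene ID,
# remaining columns = expression values per condition (all values invented).
expr <- data.frame(
  accession = c("acc_a", "acc_b", "acc_c", "acc_d"),
  unigene   = c("Hs.1", "Hs.1", "Hs.2", "Hs.3"),
  cond1     = c(2, 4, 5, 1),
  cond2     = c(6, 8, 3, 7),
  stringsAsFactors = FALSE
)

# Average every condition column within each UniGene ID.
# drop = FALSE keeps a data frame even if there is only one condition column.
averaged <- aggregate(
  expr[, -(1:2), drop = FALSE],
  by  = list(unigene = expr$unigene),
  FUN = mean
)

print(averaged)
```

The accession column is dropped because a merged record no longer 
corresponds to a single accession. Note that aggregate() omits rows whose 
grouping value is NA, so records without a UniGene ID would need separate 
handling.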

Thanks again!

>From: "Tomas Radivoyevitch" <radivot at hal.EPBI.cwru.edu>
>To: "Sean Davis" <sdavis2 at mail.nih.gov>, "zhihua li" <lzhtom at hotmail.com>
>CC: <bioconductor at stat.math.ethz.ch>
>Subject: Re: [BioC] finding and averaging replicate gene records
>Date: Wed, 16 Mar 2005 08:31:14 -0500
>Agreeing with Sean here. The last time I had to reduce 
>each gene to a single metric, using Affy data, I found that taking 
>the probe set with the maximum average value across all chips in the 
>dataset worked well [e.g. in two group situations the resulting 
>choices tended to be probe sets with smaller (if not the smallest) P 
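
The selection rule described in the message above (keep, for each gene, 
the probe set with the largest average value across all chips) could be 
sketched in base R roughly as follows. This is only an illustration under 
assumed names: the data frame `expr`, its columns and its values are 
hypothetical, and the layout follows the original question (identifiers in 
columns 1 and 2, expression values from column 3 on).

```r
# Hypothetical toy data (all values invented for illustration).
expr <- data.frame(
  accession = c("acc_a", "acc_b", "acc_c", "acc_d"),
  unigene   = c("Hs.1", "Hs.1", "Hs.2", "Hs.3"),
  cond1     = c(2, 4, 5, 1),
  cond2     = c(6, 8, 3, 7),
  stringsAsFactors = FALSE
)

# Mean expression of each row across all condition columns.
row_mean <- rowMeans(expr[, -(1:2), drop = FALSE])

# For each UniGene ID, find the row index with the largest mean.
rows_by_gene <- split(seq_len(nrow(expr)), expr$unigene)
keep <- unlist(lapply(rows_by_gene, function(i) i[which.max(row_mean[i])]))

# One row per UniGene ID, with its original accession and values intact.
selected <- expr[keep, ]

print(selected)
```

Unlike averaging, this keeps a real measured record for each gene, so the 
accession number of the chosen probe is preserved. Ties are broken by 
which.max(), which returns the first maximum.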
>----- Original Message ----- From: "Sean Davis" 
><sdavis2 at mail.nih.gov>
>To: "zhihua li" <lzhtom at hotmail.com>
>Cc: <bioconductor at stat.math.ethz.ch>
>Sent: Wednesday, March 16, 2005 6:51 AM
>Subject: Re: [BioC] finding and averaging replicate gene records
>>On Mar 16, 2005, at 2:33 AM, zhihua li wrote:
>>>Hi netter!
>>>In most microarray slides a single gene will be represented by 
>>>multiple items. Sometimes this is unforeseeable because they have 
>>>different GenBank accession numbers and you will not find them 
>>>until you get a UniGene list for all your gene items.
>>>Now I have a data frame. The rows are gene records (accession 
>>>number, UniGene ID and expression values in different conditions): 
>>>the 1st column is GenBank accession numbers, the 2nd column is 
>>>UniGene IDs, and from the 3rd column on are the different conditions. 
>>>All the accession numbers are unique, but through the UniGene IDs I 
>>>can find that some items, though with different accession numbers, 
>>>in fact share the same UniGene ID. I would like to find the gene 
>>>records containing replicate UniGene IDs and merge them into one 
>>>record by averaging different expression values in the same 
>>>Could anyone give me a clue about how to write the code? Or are 
>>>there any contributed functions that can do this?
>>I generally do NOT do this.  While it seems that there should be 
>>one gene/one value, we know that this isn't generally true in 
>>practice.  By averaging you gain little (a few fewer genes going 
>>into multiple-testing correction), but you stand to lose a 
>>huge amount.  In the worst-case scenario, you take a 
>>"differentially-expressed" probe and average it with a 
>>poor-performing probe, and end up not finding the gene of interest. 
>>  If you do not merge those probes, you find one probe representing 
>>the gene IS differentially-expressed and the other is not. You, of 
>>course, have to determine why the two probes for the same gene 
>>behave differently, but there are many explanations including 
>>things like probe sequence contamination, transcript variants, 
>>array-specific effects (like non-uniform background, etc.), and 
>>faulty bioinformatics (Unigene may place two sequences for 
>>different genes into the same cluster, for example).
>>In short, you probably agree that you want to find ALL genes of 
>>interest and then use biologic validation where necessary to 
>>determine the relevance of your found genes.  However, averaging 
>>expression values per gene nearly guarantees that you will 
>>sometimes miss genes of interest and so is, in my opinion, not 
>>Bioconductor mailing list
>>Bioconductor at stat.math.ethz.ch
