[BioC] finding and averaging replicate gene records

Wed Mar 16 12:51:49 CET 2005

On Mar 16, 2005, at 2:33 AM, zhihua li wrote:

> Hi netter!
>
> In most microarray slides a single gene will be represented by 
> multiple items. Sometimes this is unforeseeable because they have 
> different GenBank accession numbers, and you will not find them until 
> you get a UniGene list for all your gene items.
>
> Now I have a data frame. The rows are gene records (accession number, 
> UniGene ID and expression values in different conditions): the 1st 
> column is GenBank accession numbers, the 2nd column is UniGene IDs, 
> and from the 3rd column on are different conditions. All the accession 
> numbers are unique, but through unigene IDs i can find that some 
> items, though with different accession numbers, are in fact sharing 
> the same unigene ID. I would like to find the gene records containing 
> replicate unigene IDs and merge them into one record by averaging 
> different expression values in the same condition.
>
> Could anyone give me a clue about how to write the code? Or are there 
> any contributed functions that can do this?
>
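
For reference, the merge described above can be sketched as a group-by 
on the UniGene ID followed by a per-condition mean. This is only a 
minimal illustration in plain Python, not a Bioconductor function; the 
function name average_by_unigene, the record layout, and the accession 
numbers and IDs in the example are all made up for the illustration.

```python
from collections import defaultdict
from statistics import mean


def average_by_unigene(records):
    """Merge records that share a UniGene ID by averaging each condition.

    records: iterable of (accession, unigene_id, [value per condition]).
    Returns a dict mapping unigene_id -> list of per-condition means.
    """
    groups = defaultdict(list)
    for _accession, unigene_id, values in records:
        groups[unigene_id].append(values)
    # zip(*rows) walks the grouped rows condition by condition.
    return {
        uid: [mean(column) for column in zip(*rows)]
        for uid, rows in groups.items()
    }


# Hypothetical data: two accessions fall into the same UniGene cluster.
records = [
    ("AA000001", "Hs.1", [1.0, 2.0, 3.0]),
    ("AA000002", "Hs.1", [3.0, 4.0, 5.0]),
    ("AA000003", "Hs.2", [0.5, 0.5, 0.5]),
]
merged = average_by_unigene(records)
```

Here merged holds one entry per UniGene ID, so the two "Hs.1" rows 
collapse into a single averaged record.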

I generally do NOT do this.  While it seems that there should be one 
value per gene, we know that this isn't generally true in practice.  
Averaging gains you little (a few fewer genes going into 
multiple-testing correction), but you stand to lose a great deal.  In 
the worst-case scenario, you take a "differentially-expressed" probe 
and average it with a poor-performing probe, and end up not finding the 
gene of interest.  If you do not merge those probes, you find one probe 
representing the gene IS differentially expressed and the other is not.  
You, of course, have to determine why the two probes for the same gene 
behave differently, but there are many explanations including things 
like probe sequence contamination, transcript variants, array-specific 
effects (like non-uniform background, etc.), and faulty bioinformatics 
(Unigene may place two sequences for different genes into the same 
cluster, for example).
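
The worst case above can be made concrete with a small sketch.  It keeps 
the probes separate, lists the UniGene IDs that have more than one 
probe, and compares each probe's call with the call on the averaged 
value.  Everything here is illustrative: the log ratios are invented, 
and the fixed cutoff THRESHOLD is only a stand-in for whatever 
differential-expression test you actually use.

```python
from collections import defaultdict

THRESHOLD = 1.0  # hypothetical cutoff on |log2 ratio|, not a real test


def replicate_groups(probes):
    """Return {unigene_id: [(accession, log_ratio), ...]} for IDs
    that are represented by more than one probe."""
    groups = defaultdict(list)
    for accession, unigene_id, log_ratio in probes:
        groups[unigene_id].append((accession, log_ratio))
    return {uid: members for uid, members in groups.items() if len(members) > 1}


# Invented values: one responsive probe, one poor-performing probe.
probes = [
    ("AA000001", "Hs.1", 1.5),
    ("AA000002", "Hs.1", 0.1),
    ("AA000003", "Hs.2", 0.2),
]

report = {}
for uid, members in replicate_groups(probes).items():
    per_probe = {acc: abs(lr) >= THRESHOLD for acc, lr in members}
    averaged = sum(lr for _, lr in members) / len(members)
    report[uid] = {
        "per_probe_call": per_probe,                 # calls with probes kept apart
        "averaged_call": abs(averaged) >= THRESHOLD,  # call after merging
    }
```

With these numbers one "Hs.1" probe passes the cutoff on its own, but 
the averaged value does not, so the merged record would be dropped.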

In short, you probably agree that you want to find ALL genes of 
interest and then use biologic validation where necessary to determine 
the relevance of your found genes.  However, averaging expression values 
per gene nearly guarantees that you will sometimes miss genes of 
interest and so is, in my opinion, not warranted.

Sean