[BioC] finding and averaging replicate gene records
Sean Davis
sdavis2 at mail.nih.gov
Wed Mar 16 12:51:49 CET 2005
On Mar 16, 2005, at 2:33 AM, zhihua li wrote:
> Hi netter!
>
> In most microarray slides a single gene will be represented by
> multiple items. Sometimes it's unforeseeable because they have different
> genbank accession numbers and you will not find them until you get a
> unigene list for all your gene items.
>
> Now I have a dataframe. The rows are gene records (accession number,
> unigene ID and expression values in different conditions): the 1st
> column is genbank accession numbers, the 2nd column is unigene IDs,
> and from the 3rd column on are the different conditions. All the accession
> numbers are unique, but through unigene IDs I can find that some
> items, though with different accession numbers, are in fact sharing
> the same unigene ID. I would like to find the gene records containing
> replicate unigene IDs and merge them into one record by averaging
> different expression values in the same condition.
>
> Could anyone give me a clue about how to write the code? Or are there
> any contributed functions that can do this?
>
I generally do NOT do this. While it seems that there should be one
gene/one value, we know that this isn't generally true in practice.
Averaging gains you little (a few fewer genes going into
multiple-testing correction), but you stand to lose a huge amount. In
the worst-case scenario, you take a "differentially-expressed" probe
and average it with a poor-performing probe, and end up not finding the
gene of interest. If you do not merge those probes, you find one probe
representing the gene IS differentially-expressed and the other is not.
You, of course, have to determine why the two probes for the same gene
behave differently, but there are many explanations including things
like probe sequence contamination, transcript variants, array-specific
effects (like non-uniform background, etc.), and faulty bioinformatics
(Unigene may place two sequences for different genes into the same
cluster, for example).
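
To make the worst-case scenario concrete, here is a minimal sketch using made-up expression values for two hypothetical probes that map to the same gene. The numbers, the probe names, and the choice of a Welch t statistic are all assumptions for illustration only; they are not from any real array or from the analysis described above.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(control, treated):
    """Welch t statistic for treated vs. control (unequal variances)."""
    se = sqrt(variance(control) / len(control) + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / se

# Hypothetical values: one probe responds clearly, the other is noisy and flat.
good_probe = {"control": [1.0, 1.1, 0.9], "treated": [3.0, 3.1, 2.9]}
poor_probe = {"control": [2.0, 2.4, 1.6], "treated": [2.1, 1.7, 2.2]}

# Merge the two probes by averaging them sample by sample.
averaged = {
    group: [(a + b) / 2 for a, b in zip(good_probe[group], poor_probe[group])]
    for group in ("control", "treated")
}

for name, probe in [("good", good_probe), ("poor", poor_probe), ("averaged", averaged)]:
    diff = mean(probe["treated"]) - mean(probe["control"])
    t = welch_t(probe["control"], probe["treated"])
    print(f"{name:9s} mean difference = {diff:.2f}, Welch t = {t:.1f}")
```

With these assumed values the merged record shows about half the mean difference of the good probe and a much smaller test statistic, because the poor probe contributes no signal but does contribute noise.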
In short, you probably agree that you want to find ALL genes of
interest and then use biologic validation where necessary to determine
the relevance of your found genes. However, averaging expression values
per gene nearly guarantees that you will sometimes miss genes of
interest and so is, in my opinion, not warranted.
Sean
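
For the first half of the original question (finding records that share a unigene ID), one possible approach is to group the records by ID and report only the groups with more than one probe, leaving each probe's values separate so that discordant probes can be inspected. This is a minimal sketch in plain Python rather than a Bioconductor function; the accession numbers, unigene IDs, and values are placeholders invented for the example.

```python
from collections import defaultdict

# Hypothetical records: (accession, unigene ID, expression values per condition).
records = [
    ("AA000001", "Hs.1001", [1.0, 3.0]),
    ("AA000002", "Hs.1001", [2.0, 2.0]),
    ("AA000003", "Hs.2002", [0.5, 0.6]),
]

def group_by_unigene(rows):
    """Collect the probes (accession, values) that share each unigene ID."""
    groups = defaultdict(list)
    for accession, unigene_id, values in rows:
        groups[unigene_id].append((accession, values))
    return groups

def replicated(groups):
    """Keep only the unigene IDs represented by more than one probe."""
    return {uid: probes for uid, probes in groups.items() if len(probes) > 1}

groups = group_by_unigene(records)
for uid, probes in replicated(groups).items():
    print(uid)
    for accession, values in probes:
        print("  ", accession, values)
```

Listing the replicate probes side by side, rather than collapsing them, keeps the information needed to work out why two probes for the same gene disagree.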