[BioC] take the average log fc for each gene

Tue Aug 13 17:24:16 CEST 2013

Hi Helen,

Please don't take conversations off-list. We like to think of the list 
archives as a searchable repository of knowledge.

On 8/13/2013 5:39 AM, Helen Smith wrote:
> Hi James,
>
> I used RMA then annotated the probes with entrez IDs, to which I had several probes with different logfc's representing the same gene. Therefore I am now just trying to collapse the gene names and take the average fc.

OK, that's what I thought. So for now we will set aside any concerns 
that the 'duplicate' probesets might in fact be intended to (or actually 
do) measure transcript variants, etc.

For this case, I am a big fan of the list structure in R, because it 
makes fast simple iteration possible, and you don't have to do crazy 
loops and whatnot. The downside is the rather opaque way the various 
*apply functions work, and the fairly obscure help pages. When I was 
first figuring out how these things work it was a bit on the painful side.

So an example. Let's take all your data and create a list in which each 
list item contains the data for a given gene. We can then iterate over 
that list and compute the fold change for each gene. We will then 
'flatten' the list back out into a matrix-like structure.

 From your code below I am going to assume that the Entrez Gene ID is in 
the first column, and the fold change is in the second. I will ignore 
all other columns that might exist.

thelst <- tapply(1:nrow(dat), dat[,1], function(x) dat[x,2])

The tapply function takes three main arguments. The thing you want to 
iterate over, a factor that you want to control the iteration, and a 
function. You could use by() for this as well, but I don't like the 
structure that is returned, so I tend to use tapply(). Basically what we 
are doing is creating a vector of numbers that correspond to the rows of 
your data.frame, then using the second argument to separate these row 
numbers into groups according to the Entrez Gene ID, and then using the 
(anonymous) function to select out the fold changes that correspond to 
each gene. This will give us a list that has the Entrez Gene IDs as the 
names of each list item, and the fold changes in each list.

We can now use sapply() to iterate over each list item and return a 
named vector:

thefcs <- sapply(thelst, mean)

We could also be trickier and do in one fell swoop:

thefcs <- tapply(1:nrow(dat), dat[,1], function(x) mean(dat[x,2]))

Best,

Jim

>
> I hope this is clearer,
> Many thanks,
> Helen
>
> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon at uw.edu]
> Sent: 12 August 2013 20:15
> To: Helen Smith
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] take the average log fc for each gene
>
> Hi Helen,
>
> On 8/12/2013 2:55 PM, Helen Smith wrote:
>> Hi All.
>>
>> Apologies as I'm just getting to grips with R.
>>
>> I have a set of genes and log fold changes.
>> As the genes have been converted from Affymetrix probes, and up to 11 probes represent one gene, I have a range of different log fold changes for each gene. I would like to take the average log fc for each gene when duplicated in the list.
> This part isn't quite clear to me. I can't tell if you are just summarizing the data using RMA or something similar and are then trying to collapse the data over duplicated genes, or if you are doing something at the probe level.
>
> I think you are trying to do the latter, but will you please let us know before we talk about your code below?
>
> Best,
>
> Jim
>
>
>
>> I used the script below but as I'm a bit of a novice it isn't working too well and I get the error message stated at the bottom:
>>
>> dat<-read.table("test.txt")
>> dim(dat)
>> dat[1:664,1:2]
>>
>> Gene<-dat[,1]
>> fc<-dat[,2]
>>
>> LogFC<-matrix(NA,664,1)
>> for(i in 1:664){
>>                   for(j in 1:1){
>>                                   LogFC[i]<-fc[i]
>>                                   }
>>                   }
>> fc[1:664,1]
>>
>> ####Take Average logfc of multiples of the same gene####
>> gid<-unique(Gene)
>> length(gid)
>> mGene<-matrix(NA,640,1)
>> mGene
>> for(i in 1:640){
>>                                   rid<-which(Gene==gid[i])
>>                                   for(j in 1:1){
>>                                                   mGene[i]<-mean(fc[rid,j])
>>                                   }
>>                   }
>> ####I get the error message "Error in fc[rid, j] : incorrect number of dimensions"
>>
>> Any help would be much appriciated,
>>
>> Many thanks everyone!
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099