[BioC] Statistics questions regarding the use of Ambion ExFold ERCC standards with Affymetrix ST arrays.

Tue Jan 14 15:34:52 CET 2014

Hi Matt,

On 1/13/2014 9:47 PM, Thornton, Matthew wrote:
> Hello!
>
> I am processing some data collected with GeneChip Mouse Gene 2.0 ST arrays. I am using the Ambion ExFold ERCC controls (Life Technologies 4456739). These are "spike-in" controls consisting of two 'mixes' with the same set of 92 RNA sequences, which span a 10^6-fold range in concentration. The difference in concentration between the two 'mixes' is well defined.
>
> I have processed the data with the Bioconductor package vsn, following its protocol for normalization with "spike-in" controls. I have pulled out the normalized intensities for the ERCC probes from both groups across my samples (3 treatments and 1 wild-type). When I graph log2 concentration versus log2 intensity, I get a sigmoid curve with a linear region between log2 intensities of 6.5 and 10.5. Is it correct to assume that this is the 'dynamic range' of the GeneChip for my experiment? If I have data within this range, what would be the most statistically (and scientifically) satisfying statistics to obtain from the dispersion of the controls in order to make inferences about my data?

I'm not sure what you are asking here. Are you asking if you should just 
restrict to the data that are in the linear range? Or are you asking if 
there is some statistical method that you can use to infer something 
about your data based on the controls?

I will assume the former. Basically what you are seeing is that there is 
a good linear relationship between starting mRNA concentrations and 
expression levels between 2^6 and 2^11 or so. You could then argue that 
data beyond those values are less reliable, and I don't think it would 
be completely crazy to restrict your analysis based on that observation. 
You could do so using something like the kOverA function from 
genefilter, but modified somewhat.

There are two issues here. First, you don't want to use the sample types 
when filtering (e.g., when you filter the data you want to ignore 
everything you know about the samples except for the expression values), 
because incorporating any phenotypic information will bias your 
results. Second, there are certain patterns of expression that you 
clearly don't want to exclude. For instance, if you have a gene where 
half of the samples have expression values < 2^6, and the other half are 
 > 2^11, you don't necessarily want to exclude that gene. You may well 
have all treated samples > 2^11, and the wild type < 2^6, in which case 
you have a clear difference in expression. So really you want to exclude 
only those genes for which most or all of the samples are < 2^6 or > 2^11.
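
As a rough illustration of that filtering rule, here is a minimal sketch in
plain Python (not the genefilter API). The function names, the 0.75 cutoff
for "most", and the example values are all assumptions made up for
illustration; the inputs are taken to be log2-scale intensities, and no
sample labels are used anywhere.

```python
def outside_linear_range(log2_values, low=6.0, high=11.0, min_fraction=0.75):
    """Return True if most samples sit below `low` or most sit above `high`.

    `min_fraction` is a hypothetical definition of "most or all"; sample
    labels are deliberately not an argument, so the filter stays unbiased.
    """
    n = len(log2_values)
    if n == 0:
        raise ValueError("need at least one expression value")
    n_below = sum(v < low for v in log2_values)
    n_above = sum(v > high for v in log2_values)
    return n_below / n >= min_fraction or n_above / n >= min_fraction


def filter_genes(expr, **kwargs):
    """Keep only genes that are not flagged by outside_linear_range()."""
    return {
        gene: values
        for gene, values in expr.items()
        if not outside_linear_range(values, **kwargs)
    }


# Hypothetical log2 intensities for four samples per gene.
expr = {
    "geneA": [5.1, 5.4, 5.0, 5.8],     # all below 6: dropped
    "geneB": [5.2, 5.5, 11.4, 11.9],   # half below, half above: kept
    "geneC": [7.3, 8.1, 7.9, 8.4],     # inside the linear range: kept
    "geneD": [11.2, 11.6, 11.3, 10.8], # three of four above 11: dropped
}

kept = filter_genes(expr)
print(sorted(kept))
```

Note that geneB survives because neither the low group nor the high group
reaches the "most" cutoff on its own, which is the pattern you don't want
to throw away.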

>
> Additionally, there is an expected fold change between the 'mixes', which can be compared to the fold change obtained from data processing using the average intensity across all samples. In my case, an expected 2-fold change is observed as a 1.1-fold change. What would be the best way to use this information to make inferences?

This is a well known phenomenon with microarrays, where the observed 
fold changes are compressed downwards. I don't think there is anything 
to be done with this information, except to acknowledge that this 
phenomenon exists. Certainly if you are using limma to make comparisons 
you could incorporate the fold change into the test, using the treat 
function instead of eBayes, but selecting an lfc value suitably small, 
given the fold change compression.
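
One way to get a feel for what "suitably small" might mean is to work out
the arithmetic from the spike-in numbers. The sketch below is plain Python,
not limma, and it rests on a strong assumption: that compression is a
single constant factor on the log2 scale, which the sigmoid curve suggests
is only approximately true within the linear region. The function names and
the 1.5-fold example are hypothetical.

```python
import math


def compression_factor(expected_fold, observed_fold):
    """Ratio of observed to expected log2 fold change for the spike-ins."""
    if expected_fold <= 0 or observed_fold <= 0 or expected_fold == 1:
        raise ValueError("folds must be positive and expected_fold must not be 1")
    return math.log2(observed_fold) / math.log2(expected_fold)


def compressed_lfc(true_fold_of_interest, factor):
    """log2 fold change you would expect to observe for a given true fold.

    Assumes (hypothetically) that the same factor applies to every gene.
    """
    if true_fold_of_interest <= 0:
        raise ValueError("fold change must be positive")
    return factor * math.log2(true_fold_of_interest)


# Numbers from this thread: expected 2-fold, observed 1.1-fold.
factor = compression_factor(2.0, 1.1)

# Hypothetical choice: a true 1.5-fold change is the smallest of interest.
lfc = compressed_lfc(1.5, factor)

print("compression factor:", round(factor, 4))
print("candidate lfc on the log2 scale:", round(lfc, 4))
```

Under those assumptions, the resulting value is the sort of number you
might pass as the lfc argument, but treat it as a starting point rather
than a calibrated threshold.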

Best,

Jim

>
> Is there a forum like Stack Exchange Biology or Biostars that Bioconductor list patrons prefer? The reason I am asking is that I have graphs, which are easier to post on a page than in list format.
>
> Any feedback or commentary is greatly appreciated.
>
> Thank you!
>
> Sincerely,
>
> Matt
>
> Matthew E. Thornton
>
> Research Lab Specialist
> Saban Research Institute
>
> USC/Children’s Hospital Los Angeles
> 513X,  Mail Stop 35
> 4661 W. Sunset Blvd.
> Los Angeles, CA 90027-6020
>
> matthew.thornton at med.usc.edu
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099