[BioC] Multifactorial affy data

Tue Apr 26 12:41:15 CEST 2005

On Apr 26, 2005, at 1:56 AM, Jonathan Arthur wrote:

> Hello all,
>
> I have a data set of the form
>
> Gene   S1  S2   S3 .... Sn
>
> where each column is the expression of the gene labelled by the first 
> column in a different sample. The expression data come from Affymetrix 
> arrays. I am new to BioConductor (and microarray analysis in general), 
> so I have a few questions I hope people may be able to help me with:

To use Bioconductor, you will need a basic understanding of R.  I 
strongly suggest spending a couple of hours with the "An Introduction 
to R" manual (available from the R website), learning the basics of 
data manipulation and how to find help (of which there is a HUGE 
amount in R).

>
> 1)  My expression data is the *aggregate* measure as put out by the 
> Affymetrix software. The affy package appears to only deal with the 
> lower level .cel files. Is there a particular reason for this? Is 
> there a package capable of working with the aggregate data?

The affy package (and its associates) deals with .CEL files because 
the .CEL file is pretty close to "raw" data (image extraction has, of 
course, already been done).  Normalization is a particularly important 
aspect of dealing with microarray data and is best performed on raw 
data.  The question is still open for discussion, but for many folks 
the normalization and summarization methods available via Bioconductor 
are good alternatives to those offered by Affymetrix directly, so .CEL 
files are the best starting point for the normalization/summarization 
process.  There are also practical reasons to use .CEL files: they are 
standard and available.

As for using aggregate data, most of the methods for microarray 
analysis work on a numeric "matrix" of values (genes in rows, samples 
in columns), which is exactly what you have with aggregate data, so 
those methods can be applied to it directly.
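
As a rough sketch of getting from a table like yours to that matrix in 
base R (the gene IDs, sample names and values below are invented for 
illustration; with a real file you would pass your own file name to 
read.delim() instead of text=):

```r
# Toy stand-in for an aggregate table: one "Gene" column plus one
# numeric column per sample. All names and values are invented.
txt <- "Gene\tS1\tS2\tS3
g1\t7.1\t6.9\t8.2
g2\t5.5\t5.7\t5.4
g3\t9.0\t9.3\t8.8"

# read.delim() reads tab-separated text; for a real file, pass the
# file name instead of text=.
dat <- read.delim(text = txt, stringsAsFactors = FALSE)

# Keep the gene identifiers as row names and drop that column, so the
# result is a purely numeric genes-by-samples matrix.
expr <- as.matrix(dat[, -1])
rownames(expr) <- dat$Gene

dim(expr)          # genes x samples
is.numeric(expr)
```

Keeping the identifiers as row names means they travel with the 
numbers through later steps without breaking the "numeric only" 
requirement.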

>
> 2) The various samples divide into two sets (disease and control), but 
> also have clinical co-variables (e.g. male and female). I want to find 
> the set of genes differentially expressed between disease and control 
> while at the same time confirming those differences are specifically 
> due to disease status and not to any of the other co-variables 
> (gender, age, etc.)

If you have covariates, I suggest looking into limma.  It will work 
just fine with your aggregate data (although you will have to remove 
the "genes" column so that you have only numeric data).  There is also 
an excellent user guide (>70 pages of how-to, examples, etc.).  The 
mail archives for R and Bioconductor can be quite helpful, too.  Try 
searching them for answers, as folks have often put quite a lot of 
energy into answering beginners' questions.

1)  Searchable Bioconductor archives
http://files.protsuggest.org/cgi-bin/biocond.cgi
2)  R site search (and archive search)
http://finzi.psych.upenn.edu/search.html
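
For the covariate question, a minimal limma sketch might look like the 
following.  The sample annotation (status, sex, age) and the simulated 
expression values are made up; substitute your own.  limma expects 
log-scale values, so Affymetrix signal values would typically need a 
log2 transform first.  The limma calls are only run if the package is 
installed.

```r
# Hypothetical sample annotation for six arrays; replace with your own.
pheno <- data.frame(
  status = factor(c("control", "control", "control",
                    "disease", "disease", "disease"),
                  levels = c("control", "disease")),
  sex    = factor(c("M", "F", "M", "F", "M", "F")),
  age    = c(34, 51, 46, 39, 58, 43)
)

# Simulated log-scale expression values (100 genes x 6 samples),
# standing in for the numeric matrix built from the aggregate data.
set.seed(1)
expr <- matrix(rnorm(100 * 6, mean = 8), nrow = 100,
               dimnames = list(paste0("g", 1:100), paste0("S", 1:6)))

# Additive model: the "statusdisease" coefficient is the disease vs.
# control difference after adjusting for sex and age.
design <- model.matrix(~ status + sex + age, data = pheno)
colnames(design)

if (requireNamespace("limma", quietly = TRUE)) {
  fit <- limma::lmFit(expr, design)   # one linear model per gene
  fit <- limma::eBayes(fit)           # moderated t-statistics
  print(limma::topTable(fit, coef = "statusdisease", number = 5))
}
```

Putting sex and age in the same model adjusts the disease effect for 
them; it does not by itself prove the differences are caused by 
disease status, but it does guard against the covariates you include.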

Sean