[BioC] Multifactorial affy data
Sean Davis
sdavis2 at mail.nih.gov
Tue Apr 26 12:41:15 CEST 2005
On Apr 26, 2005, at 1:56 AM, Jonathan Arthur wrote:
> Hello all,
>
> I have a data set of the form
>
> Gene S1 S2 S3 .... Sn
>
> where each column is the expression of the gene labelled by the first
> column in a different sample. The expression data come from Affymetrix
> arrays. I am new to BioConductor (and microarray analysis in general),
> so I have a few questions I hope people may be able to help me with:
To use Bioconductor, you will need a basic understanding of R. I
strongly suggest spending a couple of hours with the "An Introduction
to R" manual (available from the R website) to learn the basics of
data manipulation and of finding help (of which there is a HUGE amount
in R).
>
> 1) My expression data is the *aggregate* measure as put out by the
> Affymetrix software. The affy package appears to only deal with the
> lower level .cel files. Is there a particular reason for this? Is
> there a package capable of working with the aggregate data?
The affy package (and its associates) deals with .CEL files because the
.CEL file is pretty close to "raw" data (image extraction has, of
course, already been done). Normalization is a particularly important
aspect of dealing with microarray data and is best performed on raw
data. The question is still open for discussion, but for many folks the
normalization and summarization methods available via Bioconductor
offer good alternatives to those offered by Affymetrix directly, so
.CEL files are the best starting point for the
normalization/summarization process. There are, of course, practical
reasons to use .CEL files as well: they are standard and available.
As for using aggregate data, most of the methods for microarray
analysis work on a numeric "matrix" of values (genes in rows, samples
in columns), which is exactly what you have with aggregate data, so
those methods can be applied to it directly.
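To make the "matrix" idea concrete, here is a minimal sketch of splitting
a table like yours into gene labels and a purely numeric matrix. It is
written in Python (standard library only) purely for illustration; in R
you would do the same thing with read.table and as.matrix. The gene
names, sample names, values, and the read_expression_table helper are
all made up for the example, and I am assuming a tab-delimited file.

```python
import csv
import io

# Hypothetical aggregate table: the first column is the gene label and the
# remaining columns are per-sample expression values (made-up numbers).
raw = """Gene\tS1\tS2\tS3
geneA\t7.1\t6.9\t8.3
geneB\t5.0\t5.2\t4.8
"""


def read_expression_table(handle):
    """Split a tab-delimited table into sample names, gene labels, and a numeric matrix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]          # drop the "Gene" header cell
    genes, matrix = [], []
    for row in reader:
        if not row:               # skip blank lines
            continue
        genes.append(row[0])      # keep the label separately
        matrix.append([float(value) for value in row[1:]])
    return samples, genes, matrix


samples, genes, matrix = read_expression_table(io.StringIO(raw))
print(samples)
print(genes)
print(matrix)
```

The point is only that the gene labels are kept alongside the numbers
rather than inside them, so each row of the matrix can still be traced
back to its gene.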
>
> 2) The various samples divide into two sets (disease and control), but
> also have clinical co-variables (e.g. male and female). I want to find
> the set of genes differentially expressed between disease and control
> while at the same time confirming those differences are specifically
> due to disease status and not to any of the other co-variables
> (gender, age, etc.)
If you have covariates, I suggest looking into limma. It will work
just fine with your aggregate data (although you will have to remove
the "Gene" column so that you have only numeric data). There is also
an excellent user guide (>70 pages of how-to, examples, etc.).
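To give a feel for what "adjusting for covariates" means, here is a
rough sketch of the underlying idea: fit one linear model per gene with
both disease status and sex in the design matrix, so the disease
coefficient is estimated with sex held fixed. This is plain ordinary
least squares in Python (standard library only), not limma itself;
limma fits the same kind of per-gene linear model in R (via lmFit and a
design matrix) and then adds moderated statistics on top, which this
sketch does not attempt. The sample annotation, expression values, and
the solve and fit_gene helpers are all hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def fit_gene(y, design):
    """Ordinary least squares for one gene: solve (X'X) beta = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical annotation for 8 samples: 1 = disease / male, 0 = control / female.
disease = [0, 0, 0, 0, 1, 1, 1, 1]
male = [0, 1, 0, 1, 0, 1, 0, 1]

# Design matrix columns: intercept, disease status, sex.
design = [[1.0, float(d), float(m)] for d, m in zip(disease, male)]

# Made-up expression values: geneA shifts with disease, geneB with sex only.
expression = {
    "geneA": [5.0, 5.2, 4.9, 5.1, 7.0, 7.1, 6.9, 7.2],
    "geneB": [5.0, 7.0, 5.1, 7.1, 4.9, 6.9, 5.0, 7.0],
}

coefficients = {gene: fit_gene(y, design) for gene, y in expression.items()}
for gene, (intercept, disease_effect, sex_effect) in coefficients.items():
    print(gene, "disease:", round(disease_effect, 3), "sex:", round(sex_effect, 3))
```

In this toy data geneA gets a large disease coefficient, while geneB's
difference is picked up by the sex coefficient instead, which is the
kind of separation you are asking about. For real data you would want
limma's tests rather than raw coefficients.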
The mail archives for R and Bioconductor can be quite helpful, too.
Try searching them for answers, as folks have often put quite a lot of
energy into answering beginners' questions.
1) Searchable Bioconductor archives
http://files.protsuggest.org/cgi-bin/biocond.cgi
2) R site search (and archive search)
http://finzi.psych.upenn.edu/search.html
Sean