[BioC] help with estrogen dataset in factdesign package

Vincent Carey stvjc at channing.harvard.edu
Wed Jun 10 13:21:19 CEST 2009


On Wed, Jun 10, 2009 at 2:27 AM, Alberto Goldoni <alberto.goldoni1975 at gmail.com> wrote:
> Dear Gentleman,
>
> I'm not confused about my datasets, and I have read all the documentation and
> all the vignettes for factDesign, but I have found nothing at all.
>
> In the factDesign vignette there is only one example, and the explanation
> says that the dataset called "estrogen" contains gene expression levels for 500
> genes from Affymetrix HGU95av2 chips for eight samples from a breast cancer
> cell line.
>

It is true that the factDesign package includes a dataset called 'estrogen'
that has 500 genes.  I will go out on a limb and guess that the selection of
500 genes was made for illustrative purposes, to avoid having a package that
is too big and to avoid conflict with an impending publication at the time of
package release.  The details of selecting 500 genes for the illustrative
dataset are not provided in the documentation I have seen.  However, in the
experimental data archive from Bioconductor, there is a _package_ called
'estrogen' that provides the CEL files underlying this dataset.  From this you
can get all 12625 probe sets, or can work at the probe level if you like.

> In my experiment I have 8 samples in a similar two-factor layout: ES
> (ST and PUFA) and TYPE (WK and SHR), so I think (but you can correct me)
> factDesign is the right package for analyzing a factorial designed
> microarray experiment.
>

factDesign gives lots of relevant information in the vignette and has some
software that will help do an effective analysis.  It is not _the_ 'right'
package for this, though, because other linear modeling packages could be used
in the same way.
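
To make the point about linear modeling concrete, here is a minimal sketch of
a per-gene two-factor model with interaction for an 8-array layout like the
one described in this thread.  It uses only base R, and the expression matrix
is simulated: the values, the gene names, and the effect added to gene1 are
all assumptions for illustration, standing in for exprs(data.rma).  It is not
the factDesign method, just one way such a model could be set up.

```r
# Simulated stand-in for exprs(data.rma); all values here are made up.
set.seed(1)
n_genes <- 100

# Factors in the same sample order as the pData table in this thread.
ES   <- factor(rep(c("PUFA", "PUFA", "ST", "ST"), times = 2),
               levels = c("ST", "PUFA"))
TYPE <- factor(rep(c("SHR", "WK"), each = 4), levels = c("WK", "SHR"))

expr <- matrix(rnorm(n_genes * 8, mean = 8, sd = 0.2), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Give gene1 a PUFA effect of 3 log2 units so there is something to detect.
expr["gene1", ES == "PUFA"] <- expr["gene1", ES == "PUFA"] + 3

# lm() accepts a matrix response (one column per gene), so transpose.
fit   <- lm(t(expr) ~ ES * TYPE)
coefs <- coef(fit)   # rows: intercept, ES, TYPE, ES:TYPE; columns: genes

round(coefs[, "gene1"], 2)
```

With only two replicates per cell, the per-gene variance estimates from a fit
like this are very unstable, so the coefficients are easier to trust than any
p-values derived from them.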

> WK = normal rats
> SHR = rats that appeared frankly hypertensive at the beginning of the
> study
> PUFA = n-3 polyunsaturated fatty acids (PUFAs)
> ST = standard rat with no dietary treatment
>
> So in total 8 samples... I know this is a very small number of arrays, but
> I'm not the experiment designer! I only have to analyze this dataset, if that
> is possible from a statistical point of view.
>
> This is my dataset after normalizing all the samples with RMA.
>>> pData(data.rma)
>>
>>>                   ES     TYPE
>>> SHR-PUFA5.CEL     PUFA   SHR
>>> SHR-PUFA6.CEL     PUFA   SHR
>>> SHR-st7.CEL       ST     SHR
>>> SHR-st8.CEL       ST     SHR
>>> WK-PUFA3.CEL      PUFA   WK
>>> WK-PUFA4.CEL      PUFA   WK
>>> WK-st1.CEL        ST     WK
>>> WK-st2.CEL        ST     WK
>
>
> So my question is if I have to filter these samples together (you can see
> data.rma above and then perform IQR) or for example rma for all the samples
> together and then filter by IQR WK-st VS WK-PUFA and SHR-PUFA VS SHR-st
> separately. In the second step I can add what I obtain from the first group
> to the second in order to obtain only one list of genes.

In the paragraph above it is hard to understand what you are talking about.
In the first phrase you talk about filtering samples (possibly using IQR),
but IQR is used in some cases to filter _genes_ nonspecifically.  Then you
mention rma -- so perhaps you are talking about preprocessing.
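
For readers unsure what nonspecific IQR filtering of genes looks like, the
following is a rough sketch in base R.  The matrix is simulated and stands in
for a jointly normalized expression matrix such as exprs(data.rma); the cutoff
at the median IQR is an arbitrary choice made for this illustration, not a
recommendation.  The filter is nonspecific because it looks at each gene's
spread across all 8 arrays and never uses the ES or TYPE labels.

```r
# Simulated stand-in for a jointly normalized 200-gene x 8-array matrix.
set.seed(2)
n_genes <- 200
expr <- matrix(rnorm(n_genes * 8, mean = 8, sd = 0.2), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Make the first 20 genes vary across arrays so the filter has a signal.
expr[1:20, 1:4] <- expr[1:20, 1:4] + 2

# One IQR per gene, computed over all arrays at once (no group labels used).
gene_iqr <- apply(expr, 1, IQR)

# Keep genes whose IQR exceeds the median IQR (arbitrary cutoff).
keep     <- gene_iqr > quantile(gene_iqr, 0.5)
filtered <- expr[keep, , drop = FALSE]

dim(filtered)
```

Because the labels are not used, a filter of this kind is applied once to the
whole matrix rather than separately within WK and SHR.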

>
> So I have read the results of the analysis of the full data set (12,625
> probes, 32 samples) as discussed in Scholtens, et al. Analyzing
> Factorial Designed Microarray Experiments. Journal of Multivariate Analysis,
> where the expression estimates were calculated using the rma method after
> quantile normalization from the affy package, but the paper doesn't explain
> how the 500 genes were obtained.
> Did the microarray expert obtain the "estrogen" dataset (500 genes, 8
> samples) from the 12,625 probes and 32 samples by filtering all the samples
> together, or by adding together many different datasets (with the function
> "combine" or something else) from different sub-groups?

Whatever was done with the estrogen CEL files doesn't have much connection
to the detailed conduct of the factorial analysis.  Preprocessing steps are
undertaken in an attempt to remove nonbiologic sources of variation from our
expression data.  If you read the vignette from the estrogen package in the
experimental data archive, you will see that expresso with vsn was employed
to preprocess.  I don't know if anyone has looked at the impact of
preprocessing method on inference for this dataset, but the vignette proposes
some investigation of this question.

>
> If i know the right procedure perhaps i can analyze my dataset in the right
> way.

There is no _right_ way -- the best you can do is make informed choices that
are defensible in scientific arguments.  The documentation of the packages
mentioned can help you to make an informed choice -- but there are evidently
some gaps.  Your questions about filtering have some basis because you are
curious about the selection of the 500 genes that are in the factDesign
estrogen data object, but I believe the selection of 500 is immaterial to the
statistical analysis -- it was probably mostly for convenience.  Although the
choice of 500 may have had some other motivation, it has nothing to do with
how you should analyze your data.
>
> That's  all.
>
> I hope to be clear now, and sorry for the inconvenience.
>
>
> 2009/6/10 Robert Gentleman <rgentlem at fhcrc.org>
>
>> Hi Alberto,
>>
>>
>> Alberto Goldoni wrote:
>> > Hello to everybody
>> >
>> > I'm writing this email because I need some explanation about the
>> > "estrogen" dataset analyzed in the "factDesign" package.
>> > I have to perform the same analysis on 8 samples (affychip):
>> >
>> >> pData(data.rma)
>> >
>> >>                   ES     TYPE
>> >> SHR-PUFA5.CEL     PUFA   SHR
>> >> SHR-PUFA6.CEL     PUFA   SHR
>> >> SHR-st7.CEL       ST     SHR
>> >> SHR-st8.CEL       ST     SHR
>> >> WK-PUFA3.CEL      PUFA   WK
>> >> WK-PUFA4.CEL      PUFA   WK
>> >> WK-st1.CEL        ST     WK
>> >> WK-st2.CEL        ST     WK
>> >>
>> >
>> >
>> >> data.rma
>> > ExpressionSet (storageMode: lockedEnvironment)
>> > assayData: 31099 features, 8 samples
>> >   element names: exprs
>> > phenoData
>> >   sampleNames: SHR-PUFA5.CEL, SHR-PUFA6.CEL, ..., WK-st2.CEL  (8 total)
>> >   varLabels and varMetadata description:
>> >     sample: arbitrary numbering
>> > featureData
>> >   featureNames: 1367452_at, 1367453_at, ..., AFFX-TrpnX-M_at  (31099 total)
>> >   fvarLabels and fvarMetadata description: none
>> > experimentData: use 'experimentData(object)'
>> > Annotation: rat2302
>> >
>> >
>> > What I need to know is whether I have to analyze everything together
>> > (normalization with rma, filtering with IQR, and then the factDesign
>> > technique), or whether I have to treat the two groups (1:4) and (5:8)
>> > separately and then rebuild an exprset at the end.
>>
>>  You *must* jointly normalize, and that is what we did.
>> There is no such thing as an exprset anymore (they were deprecated a long
>> time ago).
>>
>> >
>> > So my curiosity is to understand how the "estrogen" dataset has been
>> > analyzed in order to obtain the 500 genes listed in pData(estrogen).
>>
>>  You seem very confused. pData accesses the phenotypic data. I have no idea
>> where you are getting 500 genes from. Perhaps you have a script or something?
>> Perhaps you are reading the vignette? If it is the vignette, then you have
>> access to all the code and can easily answer these questions.
>>  I think you will need to be more explicit about where you are getting 500
>> genes from (but I don't see how it has anything to do with pData(estrogen)).
>>
>>  best wishes
>>   Robert
>>
>> >
>> > that's all
>> > best regards
>> >
>> >
>>
>> --
>> Robert Gentleman, PhD
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> PO Box 19024
>> Seattle, Washington 98109-1024
>> 206-667-7700
>> rgentlem at fhcrc.org
>>
>
>
>
> --
> -----------------------------------------------------
> Dr. Alberto Goldoni
> Bologna, Italy
> -----------------------------------------------------
>
>



-- 
Vincent Carey, PhD
Biostatistics, Channing Lab
617 525 2265


