[BioC] Reading ArrayExpress data [was Re: questions on the ImaGene data using limma package]
Gordon K Smyth
smyth at wehi.EDU.AU
Thu Oct 30 22:40:37 CET 2008
See http://www.bioconductor.org/docs/postingGuide.html.
Note that attachments are not permitted.
On Thu, 30 Oct 2008, Ming YI [Contr] wrote:
> Dear Gordon:
>
> Thanks a lot for your comments and suggestions. Following your suggestion, I
> have successfully read all the data into limma objects with the generic
> method, using the attached targets file that I edited from the annotation
> file I sent you earlier. I assumed that the Cy3 channel is the common
> reference, as you guessed.
>
> But the issue remains, as you mentioned, of how they actually did the
> experiment. Their E-NCMF-8.idf.txt file from ArrayExpress describes a
> dye_swap_design, which is exactly what you guessed. So the data appear to
> have been collated by ArrayExpress into data matrices with the Cy3 and Cy5
> intensities in the same file for each sample. My concern is the "Label"
> column in the file E-NCMF-8_sdrf.txt that I sent you in my last email: what
> do Cy3 and Cy5 mean for each sample? It looks as if this column gives, for
> each sample (and corresponding raw data file), the dye used for that sample,
> with the other dye used for the common reference, which is not mentioned in
> their annotation file. What do you think? If this is true, I may need to
> change my targets file accordingly. This assumption at least explains the
> repeated samples in the dataset, which would be the dye-swap arrays.
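
One way to act on the reading of the "Label" column proposed above is sketched here in base R. The SDRF rows, sample names and file names are invented for illustration and are not taken from E-NCMF-8. The rule that the unlabelled channel always carries the common reference is an assumption that the submitters would need to confirm.

```r
# Hypothetical SDRF excerpt: one row per array. "Label" is taken to be the
# dye used for the named sample. All values are invented for illustration.
sdrf <- data.frame(
  ExtractName = c("sampleA", "sampleA", "sampleB"),
  Label       = c("Cy5", "Cy3", "Cy5"),
  FileName    = c("array1.txt", "array2.txt", "array3.txt"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: the channel not named in "Label" holds the common reference.
targets <- data.frame(
  FileName = sdrf$FileName,
  Cy3 = ifelse(sdrf$Label == "Cy3", sdrf$ExtractName, "Reference"),
  Cy5 = ifelse(sdrf$Label == "Cy5", sdrf$ExtractName, "Reference"),
  stringsAsFactors = FALSE
)
print(targets)
```

Under this reading, the two rows for sampleA become a dye-swap pair: the sample is on Cy5 for one array and on Cy3 for the other.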
>
> I have tried to contact them for details of the experimental design, which
> should help to sort this out.
>
> By the way, I am not sure why my post did not go to the mailing list. I
> changed the address a bit this time; I hope it works.
>
> Thanks again for your help. Any additional suggestion would be appreciated as
> well.
>
> Best regards,
>
> Ming
>
>
> At 09:25 PM 10/29/2008, Gordon K Smyth wrote:
>> Dear Ming,
>>
>> Thank you for mailing me example data sets and the annotation spreadsheet
>> from ArrayExpress.
>>
>> You are assuming that the data from ArrayExpress are in ImaGene format.
>> This is incorrect. The reason that limma gives a special treatment to
>> ImaGene files is that, unlike other image analysis software, ImaGene writes
>> the Cy3 and Cy5 channels into separate files. However ArrayExpress has
>> collated the original data into data matrices with the Cy3 and Cy5
>> intensities in the same file for each sample. Therefore you should ignore
>> all references to ImaGene in the limma manual, and instead use the
>> instructions for generic two-color platforms.
>>
>> The data sets you sent me can easily be read into limma using the
>> instructions in the limma User's Guide starting page 14 "What should you do
>> if your image analysis program is not in the above list?" I demonstrate
>> this below.
>>
>> Your emails suggest that you have not yet read any two-color data into
>> limma. It is essential that you try some simple examples before trying a
>> large dataset from ArrayExpress, which will have a complex structure you
>> might not fully understand.
>>
>> I don't fully understand the sample annotation file from ArrayExpress that
>> you sent me, but I doubt that you are interpreting it correctly. It is
>> not in the format you need for a limma targets file. My guess is that each
>> row of the file corresponds to one array, and that each array has been
>> hybridized with a common reference that is not mentioned in the annotation
>> file. This means that the repeated sample names you have noted do not
>> represent matched Cy3 and Cy5 channels, but rather represent dye-swap
>> technical replicates. That is, they are separate arrays.
>>
>> If my guess is correct, then a targets file would be something like below.
>>
>> Let me emphasize that I do not offer a plug-in service to read experimental
>> data posted to ArrayExpress. It is your responsibility to figure out the
>> experimental design and the ArrayExpress data formats. I am just
>> guessing.
>>
>> Best wishes
>> Gordon
>>
>>
>> READING YOUR DATA FILES
>>
>>> f
>> [1] "E-NCMF-8-raw-data-1363346838.txt" "E-NCMF-8-raw-data-1363346856.txt"
>>
>>> ann <- c("Database NCMF:DB:omadhuman",
>>          "Database ebi.ac.uk:Database:ens_trscrpt_id",
>>          "Feature coordinates: metaColumn","metaRow","column","row",
>>          "Reporter identifier","Reporter sequence type")
>>
>>> columns <- list(Rf="ImaGene:Signal Mean_Cy5",
>>                 Rb="ImaGene:Background Median_Cy5",
>>                 Gf="ImaGene:Signal Mean_Cy3",
>>                 Gb="ImaGene:Background Median_Cy3")
>>
>>> RG <- read.maimages(files=f,annotation=ann,columns=columns)
>> Read E-NCMF-8-raw-data-1363346838.txt
>> Read E-NCMF-8-raw-data-1363346856.txt
>>
>>> dim(RG)
>> [1] 37632 2
>>
>>
>> A POSSIBLE TARGETS FILE
>>
>>> targets <- readTargets()
>>> targets
>>   Source                     DiseaseState            ArrayDataMatrixFile              Cy3       Cy5
>> 1 3560                       Squamous Cell Carcinoma E-NCMF-8-raw-data-1363346838.txt Reference SCC3560
>> 2 reference pool of 61 HNSCC Squamous Cell Carcinoma E-NCMF-8-raw-data-1363346856.txt Reference PoolHNSCC
>>
>>
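
If the repeated sample names really are dye-swap replicates against a common reference, as guessed above, the orientation of each array matters when log-ratios are combined. The base R sketch below uses a hypothetical targets frame and made-up log-ratios to show the sign convention. In a real analysis, limma's modelMatrix() can build the corresponding design matrix from the targets frame.

```r
# Hypothetical targets frame: one sample hybridized twice against a common
# reference, with the dyes swapped on the second array.
targets <- data.frame(
  Cy3 = c("Reference", "sampleA"),
  Cy5 = c("sampleA", "Reference"),
  stringsAsFactors = FALSE
)

# With M = log2(Cy5 / Cy3), the sample-vs-reference log-ratio enters with
# +1 when the sample is on Cy5 and -1 when the sample is on Cy3.
dye_sign <- ifelse(targets$Cy5 == "Reference", -1, 1)

# Made-up log-ratios for a single probe on the two arrays.
M <- c(1.2, -0.8)

# Dye-swap average of the sample-vs-reference log-ratio for that probe.
estimate <- mean(dye_sign * M)
print(estimate)
```

Averaging without the sign flip would largely cancel the two arrays against each other, which is why the Cy3 and Cy5 columns of the targets file need to reflect the actual labelling.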
>> On Wed, 29 Oct 2008, Ming YI [Contr] wrote:
>>
>>> Hi, Dear Gordon:
>>>
>>> I am trying to use limma on an ImaGene dataset that I downloaded from
>>> ArrayExpress. I have never dealt with ImaGene data before and am not
>>> familiar with the format, except for knowing that the Cy5 and Cy3 signals
>>> are stored in two separate files for the same sample. I tried to read the
>>> data into limma and normalize them there, but I keep running into issues
>>> and errors, and I hope you can help me in this regard:
>>>
>>> I have attached a file (E-NCMF-8_sdrf.txt), downloaded from ArrayExpress,
>>> that could potentially be used to make the targets file, and also two raw
>>> data files from the ImaGene dataset as examples. The thing bothering me is
>>> as follows:
>>>
>>> Extract 3538 and Extract 3526 (see column "Extract Name" of the
>>> E-NCMF-8_sdrf.txt file) each have one Cy5 file and one matched Cy3 file,
>>> so that's fine with me. But for "Extract reference pool of 61 HNSCC" in
>>> particular (see the E-NCMF-8_sdrf.txt file), there are multiple Cy3 and Cy5
>>> files. How should we incorporate that into the targets file?
>>>
>>> I intended to use the following code to deal with this ImaGene data:
>>>
>>> targets<-readTargets()
>>> files<-targets[,c("FileNameCy3", "FileNameCy5")]
>>> RG<-read.maimages(files, source="imagene")
>>>
>>> but I need the right targets file to start with, particularly given the
>>> issue I mentioned above.
>>>
>>> Also, for normalization, is
>>> RG<-backgroundCorrect(RG, method="normexp", offset=50) still appropriate
>>> for ImaGene data?
>>>
>>> Thanks so much for your help!
>>>
>>> Ming Yi
>>> ABCC
>>> P.O.Box B, Bldg 430
>>> National Cancer Institute/SAIC-Frederick, Inc
>>> Frederick,Maryland
>>> USA
>