[BioC] Reading ArrayExpress data [was Re: questions on the ImaGene data using limma package]

Fri Oct 31 10:57:53 CET 2008

Dear Ming,

I would like to mention that there is now a package named ArrayExpress 
that downloads data from the ArrayExpress repository and automatically 
builds a Bioconductor object (AffyBatch, ExpressionSet or 
NChannelSet) containing the expression data, the experiment data 
(MIAME), the sample annotation when compatible and the feature annotation.

By calling:
obj = ArrayExpress("E-NCMF-8")
You will obtain an NChannelSet with the expression values.
In this particular case, as you have seen, instead of having 2 lines per 
array in the sdrf file (one for each dye), there is only one. Therefore, 
the sample annotation does not match the expression matrix and the 
phenoData (sample annotation) will not be filled automatically. If you 
know how to correct the sdrf file, you can call:
data = getAE("E-NCMF-8")
This downloads all the files needed to build an object for this dataset. 
Then you can manually correct the sdrf file so that there are two lines 
per array (but as you said, you need to understand what information to 
fill in here) and call:
obj = magetab2bioc(rawfiles = data$rawfiles,
      sdrf = data$sdrf,
      idf = data$idf)
Here, obj is an NChannelSet with the sample annotation in the phenoData.
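
As an illustration only, the manual correction of the sdrf file (going 
from one line per array to two, one for each dye) could be sketched in 
base R roughly as below. The toy table and its column names other than 
"Label" are assumptions made up for the example, not the real contents 
of the E-NCMF-8 sdrf file, and the annotation of the added channel still 
has to be filled in by someone who knows the experimental design.

```r
# Toy stand-in for the sdrf table: one row per array.
# Column names other than "Label" are hypothetical.
sdrf <- data.frame(
  Array.Data.File = c("array1.txt", "array2.txt"),
  Source.Name     = c("3560", "reference pool of 61 HNSCC"),
  Label           = c("Cy5", "Cy3"),
  stringsAsFactors = FALSE
)

# Repeat every row twice so that each array gets one row per dye.
expanded <- sdrf[rep(seq_len(nrow(sdrf)), each = 2), ]
rownames(expanded) <- NULL

# Mark the first copy of each array as Cy3 and the second as Cy5.
expanded$Label <- rep(c("Cy3", "Cy5"), times = nrow(sdrf))

# The sample hybridized in the channel that was missing from the
# original row is unknown, so blank it out for manual curation.
added <- expanded$Label != rep(sdrf$Label, each = 2)
expanded$Source.Name[added] <- NA

print(expanded)
```

For the real file, the table could be read with read.delim(..., 
check.names = FALSE) and written back with write.table(..., sep = "\t", 
quote = FALSE, row.names = FALSE), since sdrf files are tab-delimited 
text. Whether data$sdrf holds a full path or only a file name is not 
stated here, so that part would need checking.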

In this case, it seems you have done the work manually, but if you need 
to import more data from ArrayExpress, hopefully, this automatic way can 
help.

Best wishes,
Audrey

Gordon K Smyth wrote:
> See http://www.bioconductor.org/docs/postingGuide.html.
> Note that attachments are not permitted.
>
>
> On Thu, 30 Oct 2008, Ming YI [Contr] wrote:
>
>> Dear Gordon:
>>
>> Thanks a lot for your comments and suggestions. I already 
>> successfully read all the data into limma objects based on your 
>> suggestion using the generic method by using the attached target file 
>> I edited from their annotation file as I sent to you earlier. I did 
>> assume that the Cy3 channel is the common reference as you guessed.
>>
>> But the issue remains, as you mentioned, of how they actually did the 
>> experiment. Based on their E-NCMF-8.idf.txt file from ArrayExpress, 
>> it appears to be a dye_swap_design, which is exactly what you guessed. 
>> So the data appear to be collated by ArrayExpress into data matrices 
>> with the Cy3 and Cy5 intensities in the same file for each sample. 
>> But my concern is the "Label" column in the file 
>> E-NCMF-8_sdrf.txt that I sent you in my last email: what do those Cy3 
>> and Cy5 values mean for each sample? It looks like this column may 
>> give, for each sample (and corresponding raw data file), the dye used 
>> for the sample, with the other dye used for the common reference, 
>> which was not mentioned in their annotation file. What do you think? 
>> If this is true, I may need to change my target file accordingly to 
>> accommodate this information. This assumption makes more sense, at 
>> least to explain the repeated samples in the dataset, which should be 
>> the dye-swapping data.
>>
>> I have tried to contact them for details of the experimental design, 
>> which should help to sort this out.
>>
>> By the way, I am not sure why my post did not go to the mailing list. I 
>> changed the address a bit this time; I hope it works.
>>
>> Thanks again for your help. Any additional suggestion would be 
>> appreciated as well.
>>
>> Best regards,
>>
>> Ming
>>
>>
>> At 09:25 PM 10/29/2008, Gordon K Smyth wrote:
>>> Dear Ming,
>>>
>>> Thank you for mailing me example data sets and the annotation 
>>> spreadsheet from ArrayExpress.
>>>
>>> You are assuming that the data from ArrayExpress are in ImaGene 
>>> format. This is incorrect.  The reason that limma gives a special 
>>> treatment to ImaGene files is that, unlike other image analysis 
>>> software, ImaGene writes the Cy3 and Cy5 channels into separate 
>>> files.  However ArrayExpress has collated the original data into 
>>> data matrices with the Cy3 and Cy5 intensities in the same file for 
>>> each sample.  Therefore you should ignore all references to ImaGene 
>>> in the limma manual, and instead use the instructions for generic 
>>> two-color platforms.
>>>
>>> The data sets you sent me can easily be read into limma using the 
>>> instructions in the limma User's Guide starting page 14 "What should 
>>> you do if your image analysis program is not in the above list?"  I 
>>> demonstrate this below.
>>>
>>> Your emails suggest that you have not yet read any two-color data 
>>> into limma.  It is essential that you try some simple examples 
>>> before trying a large dataset from ArrayExpress, which will have a 
>>> complex structure you might not fully understand.
>>>
>>> I don't fully understand the sample annotation file from 
>>> ArrayExpress that you sent me, but I doubt that you are 
>>> interpreting it correctly.  It is not in the format you need for a 
>>> limma targets file.  My guess is that each row of the file 
>>> corresponds to one array, and that each array has been hybridized 
>>> with a common reference that is not mentioned in the annotation 
>>> file.  This means that the repeated sample names you have noted do 
>>> not represent matched Cy3 and Cy5 channels, but rather represent 
>>> dye-swap technical replicates.  That is, they are separate arrays.
>>>
>>> If my guess is correct, then a targets file would be something like 
>>> below.
>>>
>>> Let me emphasize that I do not offer a plug-in service to read 
>>> experimental data posted to ArrayExpress.  It is your responsibility 
>>> to figure out the experimental design and the ArrayExpress data 
>>> formats. I am just guessing.
>>>
>>> Best wishes
>>> Gordon
>>>
>>>
>>> READING YOUR DATA FILES
>>>
>>>> f
>>> [1] "E-NCMF-8-raw-data-1363346838.txt" "E-NCMF-8-raw-data-1363346856.txt"
>>>
>>>> ann <- c("Database NCMF:DB:omadhuman",
>>> "Database ebi.ac.uk:Database:ens_trscrpt_id",
>>> "Feature coordinates: metaColumn","metaRow","column","row",
>>> "Reporter identifier","Reporter sequence type")
>>>
>>>> columns <- list(Rf="ImaGene:Signal Mean_Cy5",
>>> Rb="ImaGene:Background Median_Cy5",
>>> Gf="ImaGene:Signal Mean_Cy3",
>>> Gb="ImaGene:Background Median_Cy3")
>>>
>>>> RG <- read.maimages(files=f,annotation=ann,columns=columns)
>>> Read E-NCMF-8-raw-data-1363346838.txt
>>> Read E-NCMF-8-raw-data-1363346856.txt
>>>
>>>> dim(RG)
>>> [1] 37632     2
>>>
>>>
>>> A POSSIBLE TARGETS FILE
>>>
>>>> targets <- readTargets()
>>>> targets
>>>   Source                     DiseaseState            ArrayDataMatrixFile              Cy3       Cy5
>>> 1 3560                       Squamous Cell Carcinoma E-NCMF-8-raw-data-1363346838.txt Reference SCC3560
>>> 2 reference pool of 61 HNSCC Squamous Cell Carcinoma E-NCMF-8-raw-data-1363346856.txt Reference PoolHNSCC
>>>
>>>
>>> On Wed, 29 Oct 2008, Ming YI [Contr] wrote:
>>>
>>>> Hi, Dear Gordon:
>>>>
>>>> I tried to use limma to deal with an ImaGene dataset I downloaded from 
>>>> ArrayExpress. I have never dealt with ImaGene data before and am not 
>>>> familiar with the ImaGene data format, except for knowing that the 
>>>> Cy5 and Cy3 signals are stored in two separate files for the same 
>>>> sample. I tried to read the data into limma and normalize them in 
>>>> the context of limma, and I keep running into issues and errors. I 
>>>> hope you can help me in this regard:
>>>>
>>>> I attached a file (E-NCMF-8_sdrf.txt) that was downloaded from 
>>>> ArrayExpress and can potentially be used for making the target file, 
>>>> and I also attached two raw data files of the ImaGene dataset as 
>>>> examples. The thing bothering me is as follows:
>>>>
>>>> Extract 3538  and Extract 3526 (see column "Extract Name" of 
>>>> E-NCMF-8_sdrf.txt file) , they do have one Cy5 and one matched Cy3 
>>>> files, so that's fine with me. but in particular, for "Extract 
>>>> reference pool of 61 HNSCC" (see E-NCMF-8_sdrf.txt file), there are 
>>>> multiple Cy3 and Cy5 for such samples, how should we incorporate 
>>>> that into the target file?
>>>>
>>>> I intended to use the following code to deal with this ImaGene data
>>>>
>>>> targets<-readTargets()
>>>> files<-targets[,c("FileNameCy3", "FileNameCy5")]
>>>> RG<-read.maimages(files, source="imagene")
>>>>
>>>> but I need the right target file to start with particularly with 
>>>> the issue I mentioned above.
>>>>
>>>> Also for normalization, is
>>>> RG<-backgroundCorrect(RG, method="normexp", offset=50) still 
>>>> appropriate for ImaGene data?
>>>>
>>>> Thanks so much for your help!
>>>>
>>>> Ming Yi
>>>> ABCC
>>>> P.O.Box B, Bldg 430
>>>> National Cancer Institute/SAIC-Frederick, Inc
>>>> Frederick,Maryland
>>>> USA
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Audrey Kauffmann
EMBL - EBI
Cambridge UK
http://www.ebi.ac.uk/~audrey