[BioC] Adding annotations to GSE datasets
Sean Davis
sdavis2 at mail.nih.gov
Thu May 8 17:27:54 CEST 2014
Hi, Marcelo.
Please keep things on the list so everyone learns from your questions.
http://www.bioconductor.org/packages/release/bioc/html/Biobase.html
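
While reading those docs, it may help to keep in mind that an ExpressionSet holds three aligned tables: the expression matrix, a per-feature annotation table, and a per-sample table. The sketch below mimics that layout with plain base R objects. The names expr, features and samples are made up for illustration, the values are copied from the output quoted below, and the "title" column is an assumption about what the sample table looks like.

```r
# Toy stand-ins for the three aligned tables inside an ExpressionSet.
# With a real object `eset` (e.g. gset[[1]]) these would come from
# exprs(eset), fData(eset) and pData(eset) in Biobase.
expr <- matrix(
  c(5.459950, 6.728919, 6.861095,   # sample GSM278765
    5.548725, 6.329578, 7.005730),  # sample GSM278766
  nrow = 3,
  dimnames = list(c("1", "10", "100"), c("GSM278765", "GSM278766"))
)
features <- data.frame(
  ID = c("1", "10", "100"),
  Gene = c("A1BG", "NAT2", "ADA"),
  row.names = c("1", "10", "100"),
  stringsAsFactors = FALSE
)
samples <- data.frame(
  title = c("CC_KIDNEY_1", "CC_KIDNEY_2"),  # assumed column name and values
  row.names = c("GSM278765", "GSM278766"),
  stringsAsFactors = FALSE
)

# Rows of the expression matrix line up with rows of the feature table,
# and columns line up with rows of the sample table.
rows_aligned <- identical(rownames(expr), rownames(features))
cols_aligned <- identical(colnames(expr), rownames(samples))

# Because of that alignment, a logical index built from the feature
# table selects the matching rows of the expression matrix.
ada_values <- expr[features$Gene == "ADA", ]
```

The same indexing idea should carry over to the real object, with exprs(), fData() and pData() in place of the toy tables.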
Sean
On Thu, May 8, 2014 at 11:23 AM, Marcelo Pereira <marcelops at gmail.com> wrote:
> Also, where can I find the documentation for the ExpressionSet object from
> the BioConductor library?
>
> Thanks again,
> Marcelo
>
>
> On Thu, May 8, 2014 at 11:22 AM, Marcelo Pereira <marcelops at gmail.com>
> wrote:
>>
>> One last question:
>>
>> GSM278765 GSM278766 GSM278767 ...
>> A1BG 5.459950 5.548725 5.477436 ...
>> NAT2 6.728919 6.329578 6.570104 ...
>> ADA 6.861095 7.005730 7.235361 ...
>> CDH2 9.660035 9.189507 9.740223 ...
>> ... 5.644313 5.898675 5.475838 ...
>> ... 7.838040 7.564335 8.397569 ...
>>
>> Each CEL file has a description, telling which kind of tissue that sample
>> is related to.
>>
>> Is there a direct way of translating the column names from (GSM278765,
>> GSM278766, ...) to the description of the tissue (CC_KIDNEY_1, CC_KIDNEY_2,
>> CC_KIDNEY_3, ...) ?
>>
>> CC_KIDNEY_1 CC_KIDNEY_2 CC_KIDNEY_3 ...
>> A1BG 5.459950 5.548725 5.477436 ...
>> NAT2 6.728919 6.329578 6.570104 ...
>> ADA 6.861095 7.005730 7.235361 ...
>> CDH2 9.660035 9.189507 9.740223 ...
>> ... 5.644313 5.898675 5.475838 ...
>> ... 7.838040 7.564335 8.397569 ...
>>
>> Thanks,
>> Marcelo
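
One possible way to do this renaming, sketched on toy data: look up each column's label in the sample table by its GSM identifier and assign the result as the new column names. The table samples and its "title" column are assumptions here; with the real data the lookup table would presumably be pData(gset[[1]]), and it is worth checking which of its columns actually holds the tissue label.

```r
# Toy expression matrix with GSM identifiers as column names
# (values copied from the example above).
expr <- matrix(
  c(5.459950, 5.548725, 5.477436,
    6.728919, 6.329578, 6.570104),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("A1BG", "NAT2"),
                  c("GSM278765", "GSM278766", "GSM278767"))
)

# Hypothetical sample table, deliberately in a different order than the
# matrix columns to show that the lookup goes by identifier, not position.
samples <- data.frame(
  title = c("CC_KIDNEY_3", "CC_KIDNEY_1", "CC_KIDNEY_2"),
  row.names = c("GSM278767", "GSM278765", "GSM278766"),
  stringsAsFactors = FALSE
)

# Look up the label for each column by its GSM identifier.
titles <- samples[colnames(expr), "title"]
stopifnot(!anyNA(titles))  # every column must have a label

# make.unique() appends a suffix if two samples share the same label,
# so the column names stay distinct.
colnames(expr) <- make.unique(titles)
```

Keeping the original GSM identifiers somewhere (for example in a separate vector) makes it easier to trace a column back to GEO later.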
>>
>>
>> On Thu, May 8, 2014 at 10:21 AM, Marcelo Pereira <marcelops at gmail.com>
>> wrote:
>>>
>>> Thanks Sean,
>>>
>>> That is exactly what I was looking for!
>>>
>>> Cheers,
>>> Marcelo
>>>
>>>
>>> On Thu, May 8, 2014 at 10:15 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>>>
>>>> On Thu, May 8, 2014 at 8:21 AM, Marcelo Pereira <marcelops at gmail.com>
>>>> wrote:
>>>> > That is all because I am interested in the expression values for some
>>>> > pairs of genes.
>>>> >
>>>> > If I had something like this:
>>>> >
>>>> > GSM278765 GSM278766 GSM278767 ...
>>>> > A1BG 5.459950 5.548725 5.477436 ...
>>>> > NAT2 6.728919 6.329578 6.570104 ...
>>>> > ADA 6.861095 7.005730 7.235361 ...
>>>> > CDH2 9.660035 9.189507 9.740223 ...
>>>> > ... 5.644313 5.898675 5.475838 ...
>>>> > ... 7.838040 7.564335 8.397569 ...
>>>> >
>>>> > Then I could extract lines for the genes of interest (for example,
>>>> > 'A1BG' and 'ADA'), and then plot scatterplots, compute correlation
>>>> > coefficients, etc...
>>>>
>>>> Something like this might work:
>>>>
>>>> plot(exprs(gset[[1]])[fData(gset[[1]])$Gene=='A1BG',])
>>>>
>>>> Sean
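
For the gene-pair case described above, a minimal sketch on toy data might look like the following. The vector genes stands in for fData(gset[[1]])$Gene and is assumed to be in the same row order as the expression matrix; row_of is a hypothetical helper, not part of any package.

```r
# Toy expression matrix: rows are features, columns are samples
# (values copied from the example above).
expr <- matrix(
  c(5.459950, 5.548725, 5.477436,
    6.728919, 6.329578, 6.570104,
    6.861095, 7.005730, 7.235361),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "10", "100"),
                  c("GSM278765", "GSM278766", "GSM278767"))
)

# Stand-in for the gene symbol column, in the same row order as expr.
genes <- c("A1BG", "NAT2", "ADA")

# Hypothetical helper: index of the first row annotated with `symbol`.
# which() skips NA comparisons; [1] picks one row if several match.
row_of <- function(symbol) which(genes == symbol)[1]

x <- expr[row_of("A1BG"), ]
y <- expr[row_of("ADA"), ]

# Scatterplot of one gene against the other, one point per sample.
plot(x, y, xlab = "A1BG", ylab = "ADA")

# Pearson correlation across samples.
r <- cor(x, y)
```

Taking only the first matching row is a simplification; if a symbol maps to several rows, averaging or choosing among them is a separate decision.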
>>>>
>>>>
>>>> > The gene name for each row is the only detail that is not present
>>>> > in my dataset.
>>>> >
>>>> > What am I missing here?
>>>> >
>>>> > Thanks,
>>>> > Marcelo
>>>> >
>>>> >
>>>> >
>>>> > On Thu, May 8, 2014 at 7:42 AM, Marcelo Pereira <marcelops at gmail.com>
>>>> > wrote:
>>>> >>
>>>> >> Hello Sean,
>>>> >>
>>>> >> Thanks for your replies.
>>>> >>
>>>> >> I used to download all the CEL files, and then load, normalize and
>>>> >> generate the ExpressionSet output. All manually, and it was working
>>>> >> fine!
>>>> >>
>>>> >> Then I found out about doing it automatically using the GEOquery
>>>> >> library. And this is what has been taking up my hours lately.
>>>> >>
>>>> >> The output of exprs(gset[[1]]) is the initial point where I got stuck
>>>> >> after a few minutes using the GEOquery library, because I have the
>>>> >> expression values, but not the gene names.
>>>> >>
>>>> >> GSM278765 GSM278766 GSM278767 ...
>>>> >> 1 5.459950 5.548725 5.477436 ...
>>>> >> 10 6.728919 6.329578 6.570104 ...
>>>> >> 100 6.861095 7.005730 7.235361 ...
>>>> >> 1000 9.660035 9.189507 9.740223 ...
>>>> >> 10000 5.644313 5.898675 5.475838 ...
>>>> >> 10001 7.838040 7.564335 8.397569 ...
>>>> >>
>>>> >> After that, I tried to manipulate the output in order to translate 1,
>>>> >> 10, 100, 1000, to the actual names of the genes. And my last resort
>>>> >> was to ask here at the forum.
>>>> >>
>>>> >> It is looking good already. I only need to have an extra column, with
>>>> >> the names of the genes.
>>>> >>
>>>> >> Thanks,
>>>> >> Marcelo
>>>> >>
>>>> >>
>>>> >> On Thu, May 8, 2014 at 7:14 AM, Sean Davis <sdavis2 at mail.nih.gov>
>>>> >> wrote:
>>>> >>>
>>>> >>> On Thu, May 8, 2014 at 6:58 AM, Marcelo Pereira
>>>> >>> <marcelops at gmail.com>
>>>> >>> wrote:
>>>> >>> > Hi Sean,
>>>> >>> >
>>>> >>> > Thanks for your answer!
>>>> >>> >
>>>> >>> > That is great already.
>>>> >>> >
>>>> >>> > I can see the gene names now:
>>>> >>> >
>>>> >>> >> library(GEOquery)
>>>> >>> >> gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)
>>>> >>> >> head(fData(gset[[1]]))$Gene
>>>> >>> > [1] A1BG NAT2 ADA CDH2 AKT3 MED6
>>>> >>> > 17098 Levels: A1BG ABCB6 ABCC5 ABCC9 ABCF2 ABI1 ACOT8 ACTR2 ACTR3 ADA ADAM8 AKT3 ... ZNF254
>>>> >>> >
>>>> >>> > But the data frame only contains these columns.
>>>> >>> >
>>>> >>> >> names(fData(gset[[1]]))
>>>> >>> > [1] "ID" "Gene" "UniGene" "Description" "Ensembl* Chr" "Start (bp)"
>>>> >>> > [7] "End (bp)" "Strand" "ORF" "SPOT_ID"
>>>> >>> >
>>>> >>> > Where is the expression information for each gene?
>>>> >>>
>>>> >>> exprs(gset[[1]])
>>>> >>>
>>>> >>> gset[[1]] is an ExpressionSet, so you should read a bit about
>>>> >>> ExpressionSets in the Biobase vignette as well as the help page.
>>>> >>>
>>>> >>> Sean
>>>> >>>
>>>> >>>
>>>> >>> >
>>>> >>> > Thanks,
>>>> >>> > Marcelo
>>>> >>> >
>>>> >>> >
>>>> >>> >
>>>> >>> > On Thu, May 8, 2014 at 6:24 AM, Sean Davis <sdavis2 at mail.nih.gov>
>>>> >>> > wrote:
>>>> >>> >
>>>> >>> >> Hi, Marcelo.
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> On Wed, May 7, 2014 at 8:01 PM, Marcelo Pereira
>>>> >>> >> <marcelops at gmail.com>
>>>> >>> >> wrote:
>>>> >>> >> > Quick question:
>>>> >>> >> >
>>>> >>> >> > I am trying to import some GEO datasets, and having some issues with
>>>> >>> >> > the annotations:
>>>> >>> >> >
>>>> >>> >> > I can download the GSE dataset using:
>>>> >>> >> >
>>>> >>> >> > gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> > However, it returns an ExpressionSet with the following format:
>>>> >>> >> >
>>>> >>> >> > X1 X10 X100 X1000 ...
>>>> >>> >> > GSM278765
>>>> >>> >> > GSM278766
>>>> >>> >> > GSM278767
>>>> >>> >> > GSM278768
>>>> >>> >> > GSM278769
>>>> >>> >> > ...
>>>> >>> >>
>>>> >>> >> This is not what is returned by GEOquery, so it seems you have
>>>> >>> >> manipulated the result (it looks like you transposed the
>>>> >>> >> expression matrix).
>>>> >>> >>
>>>> >>> >> > This is pretty much what I need, but I still need to translate (X1,
>>>> >>> >> > X10, X100, X1000, etc...) to the actual names of the genes.
>>>> >>> >>
>>>> >>> >> library(GEOquery)
>>>> >>> >> gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)[[1]]
>>>> >>> >> head(fData(gset))
>>>> >>> >>
>>>> >>> >> The gene symbols are in the "Gene" column:
>>>> >>> >>
>>>> >>> >> genesymbols = fData(gset)$Gene
>>>> >>> >>
>>>> >>> >> Sean
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> >
>>>> >>> >> > Any suggestions?
>>>> >>> >> >
>>>> >>> >> > Thanks,
>>>> >>> >> > Marcelo
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> > _______________________________________________
>>>> >>> >> > Bioconductor mailing list
>>>> >>> >> > Bioconductor at r-project.org
>>>> >>> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> >>> >> > Search the archives:
>>>> >>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> >>> >>