[BioC] Adding annotations to GSE datasets
Sean Davis
sdavis2 at mail.nih.gov
Thu May 8 17:30:28 CEST 2014
On Thu, May 8, 2014 at 11:22 AM, Marcelo Pereira <marcelops at gmail.com> wrote:
> One last question:
>
> GSM278765 GSM278766 GSM278767 ...
> A1BG 5.459950 5.548725 5.477436 ...
> NAT2 6.728919 6.329578 6.570104 ...
> ADA 6.861095 7.005730 7.235361 ...
> CDH2 9.660035 9.189507 9.740223 ...
> ... 5.644313 5.898675 5.475838 ...
> ... 7.838040 7.564335 8.397569 ...
>
> Each CEL file has a description telling which kind of tissue that sample
> comes from.
>
> Is there a direct way of translating the column names (GSM278765,
> GSM278766, ...) into the tissue descriptions (CC_KIDNEY_1, CC_KIDNEY_2,
> CC_KIDNEY_3, ...)?
>
> CC_KIDNEY_1 CC_KIDNEY_2 CC_KIDNEY_3 ...
> A1BG 5.459950 5.548725 5.477436 ...
> NAT2 6.728919 6.329578 6.570104 ...
> ADA 6.861095 7.005730 7.235361 ...
> CDH2 9.660035 9.189507 9.740223 ...
> ... 5.644313 5.898675 5.475838 ...
> ... 7.838040 7.564335 8.397569 ...
>
> Thanks,
> Marcelo
You'll need to do a little work using sub(), but this information is
typically in one of the columns of:
pData(gset[[1]])
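
As a minimal sketch of that renaming step: the toy matrix and data frame below stand in for exprs(gset[[1]]) and pData(gset[[1]]). The column name "title" and the free-text values in it are assumptions made for illustration; names(pData(gset[[1]])) shows which column actually holds the tissue description for a given series. gsub() is used rather than sub() because sub() only replaces the first match.

```r
# Toy stand-ins (hypothetical values) for exprs(gset[[1]]) and pData(gset[[1]]).
e <- matrix(c(5.459950, 6.728919, 5.548725, 6.329578, 5.477436, 6.570104),
            nrow = 2,
            dimnames = list(c("A1BG", "NAT2"),
                            c("GSM278765", "GSM278766", "GSM278767")))
pd <- data.frame(title = c("CC KIDNEY 1", "CC KIDNEY 2", "CC KIDNEY 3"),
                 row.names = c("GSM278765", "GSM278766", "GSM278767"),
                 stringsAsFactors = FALSE)

# Look up each sample's description by its GSM id, so the labels follow
# the column order of the expression matrix rather than the row order of pd.
labels <- pd[colnames(e), "title"]

# Tidy the free-text description into a label without spaces
# (each run of whitespace becomes a single underscore).
labels <- gsub("[[:space:]]+", "_", labels)

# Only rename if the labels are unique; duplicated column names are ambiguous.
stopifnot(!anyDuplicated(labels))
colnames(e) <- labels
print(e)
```

On the real object, the same labels could presumably be assigned with sampleNames(gset[[1]]) <- labels, which keeps the expression matrix and the sample table in sync.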
This blog post by Rafa Irizarry might be helpful to understand how an
ExpressionSet works:
http://simplystatistics.org/2014/02/03/the-three-tables-for-genomics-collaborations/
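
A rough sketch of the three-table idea in base R, with no Bioconductor dependency: expr, features, and samples are hypothetical stand-ins for what exprs(), fData(), and pData() return, and the "description" column name is an assumption. The rows of the expression table line up with the rows of the feature table, and its columns line up with the rows of the sample table.

```r
# Expression table: one row per feature, one column per sample.
expr <- matrix(c(5.459950, 6.861095, 5.548725, 7.005730, 5.477436, 7.235361),
               nrow = 2,
               dimnames = list(c("1", "100"),
                               c("GSM278765", "GSM278766", "GSM278767")))

# Feature table: one row per feature, keyed by the same ids as rownames(expr).
features <- data.frame(ID = c("1", "100"), Gene = c("A1BG", "ADA"),
                       row.names = c("1", "100"), stringsAsFactors = FALSE)

# Sample table: one row per sample, keyed by the same ids as colnames(expr).
samples <- data.frame(description = c("CC_KIDNEY_1", "CC_KIDNEY_2", "CC_KIDNEY_3"),
                      row.names = colnames(expr), stringsAsFactors = FALSE)

# The three tables must agree on their keys before anything else is done.
stopifnot(identical(rownames(expr), rownames(features)),
          identical(colnames(expr), rownames(samples)))

# Pull one gene's values across samples by symbol instead of by feature id.
gene_values <- function(symbol) {
  hit <- which(features$Gene == symbol)
  stopifnot(length(hit) == 1)  # this sketch assumes one row per symbol
  expr[hit, ]
}

a1bg <- gene_values("A1BG")
ada  <- gene_values("ADA")

# Correlation of the two genes across the samples.
r <- cor(a1bg, ada)
print(r)
```

With the two vectors in hand, plot(a1bg, ada) gives the scatterplot for the gene pair. With only three samples, the correlation here is purely illustrative.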
Sean
>
> On Thu, May 8, 2014 at 10:21 AM, Marcelo Pereira <marcelops at gmail.com>
> wrote:
>>
>> Thanks Sean,
>>
>> That is exactly what I was looking for!
>>
>> Cheers,
>> Marcelo
>>
>>
>> On Thu, May 8, 2014 at 10:15 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>>
>>> On Thu, May 8, 2014 at 8:21 AM, Marcelo Pereira <marcelops at gmail.com>
>>> wrote:
>>> > That is all because I am interested in the expression values for some
>>> > pairs of genes.
>>> >
>>> > If I had something like this:
>>> >
>>> > GSM278765 GSM278766 GSM278767 ...
>>> > A1BG 5.459950 5.548725 5.477436 ...
>>> > NAT2 6.728919 6.329578 6.570104 ...
>>> > ADA 6.861095 7.005730 7.235361 ...
>>> > CDH2 9.660035 9.189507 9.740223 ...
>>> > ... 5.644313 5.898675 5.475838 ...
>>> > ... 7.838040 7.564335 8.397569 ...
>>> >
>>> > Then I could extract lines for the genes of interest (for example, 'A1BG'
>>> > and 'ADA'), and then plot scatterplots, compute correlation coefficients,
>>> > etc...
>>>
>>> Something like this might work:
>>>
>>> plot(exprs(gset[[1]])[fData(gset[[1]])$Gene=='A1BG',])
>>>
>>> Sean
>>>
>>>
>>> > The gene name for each row is the only detail that is not present
>>> > in my dataset.
>>> >
>>> > What am I missing here?
>>> >
>>> > Thanks,
>>> > Marcelo
>>> >
>>> >
>>> >
>>> > On Thu, May 8, 2014 at 7:42 AM, Marcelo Pereira <marcelops at gmail.com>
>>> > wrote:
>>> >>
>>> >> Hello Sean,
>>> >>
>>> >> Thanks for your replies.
>>> >>
>>> >> I used to download all the CEL files, and then load, normalize and
>>> >> generate the ExpressionSet output. All manually, and it was working
>>> >> fine!
>>> >>
>>> >> Then I found out about doing it automatically using the GEOquery library,
>>> >> and this is what has been taking up my hours lately.
>>> >>
>>> >> The output of exprs(gset[[1]]) is the initial point where I got stuck
>>> >> after a few minutes using the GEOquery library, because I have the
>>> >> expression values, but not the gene names.
>>> >>
>>> >> GSM278765 GSM278766 GSM278767 ...
>>> >> 1 5.459950 5.548725 5.477436 ...
>>> >> 10 6.728919 6.329578 6.570104 ...
>>> >> 100 6.861095 7.005730 7.235361 ...
>>> >> 1000 9.660035 9.189507 9.740223 ...
>>> >> 10000 5.644313 5.898675 5.475838 ...
>>> >> 10001 7.838040 7.564335 8.397569 ...
>>> >>
>>> >> After that, I tried to manipulate the output in order to translate 1, 10,
>>> >> 100, 1000 to the actual names of the genes. My last resort was to ask
>>> >> here at the forum.
>>> >>
>>> >> It is looking good already. I only need to have an extra column with the
>>> >> names of the genes.
>>> >>
>>> >> Thanks,
>>> >> Marcelo
>>> >>
>>> >>
>>> >> On Thu, May 8, 2014 at 7:14 AM, Sean Davis <sdavis2 at mail.nih.gov>
>>> >> wrote:
>>> >>>
>>> >>> On Thu, May 8, 2014 at 6:58 AM, Marcelo Pereira <marcelops at gmail.com>
>>> >>> wrote:
>>> >>> > Hi Sean,
>>> >>> >
>>> >>> > Thanks for your answer!
>>> >>> >
>>> >>> > That is great already.
>>> >>> >
>>> >>> > I can see the gene names now:
>>> >>> >
>>> >>> >> library(GEOquery)
>>> >>> >> gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)
>>> >>> >> head(fData(gset[[1]]))$Gene
>>> >>> > [1] A1BG NAT2 ADA CDH2 AKT3 MED6
>>> >>> > 17098 Levels: A1BG ABCB6 ABCC5 ABCC9 ABCF2 ABI1 ACOT8 ACTR2 ACTR3 ADA
>>> >>> > ADAM8 AKT3 ... ZNF254
>>> >>> >
>>> >>> > But the data frame only contains these columns.
>>> >>> >
>>> >>> >> names(fData(gset[[1]]))
>>> >>> > [1] "ID" "Gene" "UniGene" "Description"
>>> >>> > "Ensembl*
>>> >>> > Chr" "Start (bp)"
>>> >>> > [7] "End (bp)" "Strand" "ORF" "SPOT_ID"
>>> >>> >
>>> >>> > Where is the expression information for each gene?
>>> >>>
>>> >>> exprs(gset[[1]])
>>> >>>
>>> >>> gset is an ExpressionSet, so you should read a bit about
>>> >>> ExpressionSets in the Biobase vignette as well as the help page.
>>> >>>
>>> >>> Sean
>>> >>>
>>> >>>
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Marcelo
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Thu, May 8, 2014 at 6:24 AM, Sean Davis <sdavis2 at mail.nih.gov>
>>> >>> > wrote:
>>> >>> >
>>> >>> >> Hi, Marcelo.
>>> >>> >>
>>> >>> >>
>>> >>> >> On Wed, May 7, 2014 at 8:01 PM, Marcelo Pereira
>>> >>> >> <marcelops at gmail.com>
>>> >>> >> wrote:
>>> >>> >> > Quick question:
>>> >>> >> >
>>> >>> >> > I am trying to import some GEO datasets, and having some issues with the
>>> >>> >> > annotations:
>>> >>> >> >
>>> >>> >> > I can download the GSE dataset using:
>>> >>> >> >
>>> >>> >> > gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > However, it returns an ExpressionSet with the following format:
>>> >>> >> >
>>> >>> >> > X1 X10 X100 X1000 ...
>>> >>> >> > GSM278765
>>> >>> >> > GSM278766
>>> >>> >> > GSM278767
>>> >>> >> > GSM278768
>>> >>> >> > GSM278769
>>> >>> >> > ...
>>> >>> >>
>>> >>> >> This is not what GEOquery returns, so it seems you have done some
>>> >>> >> manipulation (it looks like you transposed the expression matrix).
>>> >>> >>
>>> >>> >> > This is pretty much what I need, but I still need to translate (X1, X10,
>>> >>> >> > X100, X1000, etc...) to the actual names of the genes.
>>> >>> >>
>>> >>> >> library(GEOquery)
>>> >>> >> gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)[[1]]
>>> >>> >> head(fData(gset))
>>> >>> >>
>>> >>> >> The gene symbols are in the "Gene" column:
>>> >>> >>
>>> >>> >> genesymbols = fData(gset)$Gene
>>> >>> >>
>>> >>> >> Sean
>>> >>> >>
>>> >>> >>
>>> >>> >> >
>>> >>> >> > Any suggestions?
>>> >>> >> >
>>> >>> >> > Thanks,
>>> >>> >> > Marcelo
>>> >>> >> >
>>> >>> >> > _______________________________________________
>>> >>> >> > Bioconductor mailing list
>>> >>> >> > Bioconductor at r-project.org
>>> >>> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> >>> >> > Search the archives:
>>> >>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> >>> >>
>>> >>> >
>>> >>
>>> >>
>>> >
>>
>>
>