[BioC] creating GSEA files using biomart
Juliet Hannah
juliet.hannah at gmail.com
Thu Sep 13 18:08:32 CEST 2012
Thanks Steffen for the helpful answers. "description", how embarrassing!
On Thu, Sep 13, 2012 at 11:42 AM, Steffen Durinck
<durinck.steffen at gene.com> wrote:
> Hi Juliet,
>
> The third attribute you're looking for is 'description':
>
> idens <- getBM(attributes = c("affy_hg_u133a", "hgnc_symbol", "description"),
>                filters = "affy_hg_u133a", values = probeSets, mart = ensembl)
>
> Gives:
>
>   affy_hg_u133a hgnc_symbol description
> 1 219666_at     MS4A6A      membrane-spanning 4-domains, subfamily A, member 6A [Source:HGNC Symbol;Acc:13375]
> 2 220547_s_at   FAM35B      family with sequence similarity 35, member B [Source:HGNC Symbol;Acc:31425]
> 3 218034_at     FIS1        fission 1 (mitochondrial outer membrane) homolog (S. cerevisiae) [Source:HGNC Symbol;Acc:21689]
> 4 220547_s_at   FAM35B2     family with sequence similarity 35, member B2 (pseudogene) [Source:HGNC Symbol;Acc:34038]
> 5 220547_s_at   FAM35A      family with sequence similarity 35, member A [Source:HGNC Symbol;Acc:28773]
>
>
> There is no systematic way to figure out which attribute name you need to
> use; all you have is the attribute name and a description of the attribute.
> The more you get used to looking at those, the easier it gets to figure out
> which one you need, and once you know the attributes you need, you'll often
> be using a similar set of attributes most of the time.
>
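
A minimal sketch of searching the attribute table by keyword, which is about as systematic as it gets. It assumes the table has 'name' and 'description' columns, as the data.frame returned by listAttributes(ensembl) does. The rows below are made-up stand-ins so the snippet runs without a mart connection, and find_attributes() is an illustrative helper, not a biomaRt function.

```r
# Stand-in for: attrs <- listAttributes(ensembl)
# (illustrative rows only; the real table is much longer and its
# description text may differ)
attrs <- data.frame(
  name = c("ensembl_gene_id", "description", "hgnc_symbol", "affy_hg_u133a"),
  description = c("Ensembl Gene ID", "Description", "HGNC symbol", "Affy HG U133A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose name or description matches the pattern
find_attributes <- function(attrs, pattern) {
  hit <- grepl(pattern, attrs$name, ignore.case = TRUE) |
    grepl(pattern, attrs$description, ignore.case = TRUE)
  attrs[hit, , drop = FALSE]
}

hits <- find_attributes(attrs, "descr")
affy_hits <- find_attributes(attrs, "affy")
print(hits)
```

With a live mart, the same helper would be called as find_attributes(listAttributes(ensembl), "descr").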
>
> It is interesting to see in your example that one probeset maps to three
> different but closely related genes. In the past I thought Ensembl would
> remove such ambiguous mappers. I think the best thing to do in this case is
> to remove all probes that map to multiple genes, as there is no way to tell
> which gene you'll be measuring. I'll report this example to the Ensembl
> team, as they used to filter these out for us.
>
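
One possible way to drop probes that map to more than one gene, sketched on the five rows shown above (typed in by hand here, so no mart connection is needed). The names n_genes, unique_probes and chip_rows are illustrative. The sketch assumes every row carries a non-empty hgnc_symbol; rows with empty symbols would need separate handling.

```r
# The probe-to-symbol pairs from the getBM() result above
idens <- data.frame(
  affy_hg_u133a = c("219666_at", "220547_s_at", "218034_at",
                    "220547_s_at", "220547_s_at"),
  hgnc_symbol = c("MS4A6A", "FAM35B", "FIS1", "FAM35B2", "FAM35A"),
  stringsAsFactors = FALSE
)

# Count the distinct gene symbols each probe set maps to
n_genes <- tapply(idens$hgnc_symbol, idens$affy_hg_u133a,
                  function(x) length(unique(x)))

# Keep only probe sets that map to exactly one gene
unique_probes <- names(n_genes)[n_genes == 1]
chip_rows <- idens[idens$affy_hg_u133a %in% unique_probes, ]
print(chip_rows)
```

Here 220547_s_at is removed entirely, leaving 219666_at and 218034_at.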
> Cheers,
> Steffen
>
> On Thu, Sep 13, 2012 at 8:29 AM, Juliet Hannah <juliet.hannah at gmail.com>
> wrote:
>>
>> All,
>>
>> I am trying to create the GSEA chip file. This example uses Affy data,
>> and the chip file is already available. I'm
>> doing this as an exercise in preparation for other platforms.
>>
>> The chip file should look like:
>>
>>
>> Probe Set ID   Gene Symbol   Gene Title
>> 244901_at      ORF25         hypothetical protein
>> 244902_at      NAD4L         NADH dehydrogenase subunit 4L
>> 244912_at      CCB382        cytochrome c biogenesis orf382
>> 244919_at      CCB203        cytochrome c biogenesis orf203
>> 244925_at      NAD7          NADH dehydrogenase subunit 7
>>
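
A rough sketch of writing a three-column annotation table in the layout shown above, assuming the chip file is plain tab-delimited text with that header row. The two rows are copied from the getBM() output in this thread. Stripping the trailing "[Source:...]" tag from the description is an optional assumption about what the Gene Title column should hold, and the output path is a temporary file chosen for illustration.

```r
# Two annotated probe sets, as returned by getBM() with the
# "description" attribute included
idens <- data.frame(
  affy_hg_u133a = c("219666_at", "218034_at"),
  hgnc_symbol = c("MS4A6A", "FIS1"),
  description = c(
    "membrane-spanning 4-domains, subfamily A, member 6A [Source:HGNC Symbol;Acc:13375]",
    "fission 1 (mitochondrial outer membrane) homolog (S. cerevisiae) [Source:HGNC Symbol;Acc:21689]"
  ),
  stringsAsFactors = FALSE
)

# Rename the columns to the chip file header
chip <- idens
names(chip) <- c("Probe Set ID", "Gene Symbol", "Gene Title")

# Optional: drop the trailing "[Source:...]" tag from the title
chip[["Gene Title"]] <- sub("\\s*\\[Source:[^]]*\\]$", "", chip[["Gene Title"]])

# Write tab-delimited text without quotes or row names
chip_file <- tempfile(fileext = ".chip")
write.table(chip, file = chip_file, sep = "\t", quote = FALSE, row.names = FALSE)

chip_lines <- readLines(chip_file)
print(chip_lines)
```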
>> How can I obtain the third column from biomart? I tried searching the
>> attributes, but couldn't find the right name. Is it a matter of trial
>> and error to find the correct attribute, or
>> are there systematic ways to find it? Here is what I have so far:
>>
>> library("biomaRt")
>> probeSets <- c("219666_at", "220547_s_at", "218034_at")
>>
>> ensembl = useMart("ensembl")
>> ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
>>
>> idens <- getBM(attributes = c("affy_hg_u133a", "hgnc_symbol"),
>>                filters = "affy_hg_u133a", values = probeSets, mart = ensembl)
>>
>>
>> Also, does anyone have any suggestions regarding how to handle the
>> duplicates (seen in this example) with respect to GSEA?
>>
>> Thanks,
>>
>> Juliet Hannah
>>