[BioC] How to get a unique line of annotation for each specific genomic position by using biomaRt package
Steve Lianoglou
mailinglist.honeypot at gmail.com
Tue Feb 8 15:59:35 CET 2011
Hi,
On Tue, Feb 8, 2011 at 9:24 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
> Dear Steve,
>
> Thanks for your kindness. Could you please give me more directions on
> this annotation problem?
>
> #########################
> (1)
> #########################
> I want each of my SNPs to have just one line of annotation, in separate
> columns. If there are multiple terms for the same attribute (for
> example, multiple GO terms shared at that location), I would like
> to include them in the same column, separated by a symbol (such as
> ; : |).
>
> for example I have SNPs like this:
> # SNPs,chr,start,end
> SNP_1,1,43,43
> SNP_2,2,56,56
>
> I would have annotations like this:
> # SNPs,chr,start,end,go_term
> SNP_1,1,43,43,go_1:go_3
> SNP_2,2,56,56,go_100:go_1000
I'll give you this one ... continuing from my previous example: say
the getBM call stores its return value in `result`:
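  library(plyr)
  summary <- ddply(result, .(chromosome_name, start_position), function(x) {
    # Keep the first row for this position, then overwrite its GO column
    # with all of the position's GO ids joined by "|"
    new.x <- x[1,]
    new.x$go_biological_process_id <- paste(x$go_biological_process_id,
                                            collapse="|")
    new.x
  })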
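
If you want to try the same idea without a BioMart connection or plyr, here is a minimal sketch in base R. The `result` data frame below is made up for illustration (it only mimics the column names of the getBM output), and `collapsed` is just a name I picked. It also drops duplicate GO ids before joining them, which the ddply version above does not do.

```r
# Toy stand-in for a getBM() result; all values are made up for illustration.
result <- data.frame(
  chromosome_name = c(1, 1, 1, 2, 2),
  start_position = c(33055, 33055, 33055, 56000, 56000),
  ensembl_gene_id = c("gene_A", "gene_A", "gene_A", "gene_B", "gene_B"),
  go_biological_process_id = c("GO:0006355", "GO:0006886", "GO:0006355",
                               "GO:0007165", "GO:0007264"),
  stringsAsFactors = FALSE
)

# One row per chromosome/start/gene, with the unique GO ids joined by "|".
collapsed <- aggregate(
  go_biological_process_id ~ chromosome_name + start_position + ensembl_gene_id,
  data = result,
  FUN = function(ids) paste(unique(ids), collapse = "|")
)

print(collapsed)
```

Change the `collapse` string if you prefer ":" or ";" as the separator.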
I'll leave the rest as an exercise for you.
-steve
>
> #########################
> (2)
> #########################
> Alternatively, I would like to have the SNP positions combined with
> their annotation results, so as to know which SNP each annotation line
> corresponds to. I do not know how to do that using Bioconductor
> packages. See the following example:
>
> for example I have SNPs like this:
> # SNPs,chr,start,end
> SNP_1,1,43,43
> SNP_2,2,56,56
>
> I would have annotations like this:
> # SNPs,chr,start,end,go_term
> SNP_1,1,43,43,go_1
> SNP_1,1,43,43,go_3
> SNP_2,2,56,56,go_100
> SNP_2,2,56,56,go_1000
>
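
For (2), one possible approach (a rough sketch, not something tested against a real mart) is to query one SNP at a time and attach that SNP's own columns to whatever rows come back. In the sketch below, `lookup_go` is a hypothetical stand-in for a single getBM call and returns made-up terms taken from your example; `snps` and `annotated` are likewise names chosen for illustration.

```r
# Hypothetical SNP table, matching the layout in the question.
snps <- data.frame(
  snp = c("SNP_1", "SNP_2"),
  chr = c(1, 2),
  start = c(43, 56),
  end = c(43, 56),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for one getBM() call per SNP. A real version would
# pass chr/start/end as filter values and return the attributes you asked for.
lookup_go <- function(chr, start, end) {
  fake <- list("1:43" = c("go_1", "go_3"), "2:56" = c("go_100", "go_1000"))
  data.frame(go_term = fake[[paste(chr, start, sep = ":")]],
             stringsAsFactors = FALSE)
}

# Query each SNP separately and tag every returned row with that SNP's columns.
annotated <- do.call(rbind, lapply(seq_len(nrow(snps)), function(i) {
  hits <- lookup_go(snps$chr[i], snps$start[i], snps$end[i])
  if (nrow(hits) == 0) return(NULL)  # SNPs with no annotation are dropped
  cbind(snps[rep(i, nrow(hits)), ], hits, row.names = NULL)
}))

print(annotated)
```

Looping like this costs one query per SNP, so it may be slow for large SNP sets, but it keeps the link between each SNP and its annotation rows explicit.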
> Jian-Feng,
>
> 2011/2/8 Steve Lianoglou <mailinglist.honeypot at gmail.com>:
>> Hi,
>>
>> On Tue, Feb 8, 2011 at 5:49 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
>>> Dear listers,
>>>
>>> I am new to bioconductor.
>>>
>>> I have genomic variations (SNP, indel, CNV) with coordinates given as
>>> chromosome:start:end in GFF/BED/VCF format. Each genomic variation is
>>> defined by a specific genomic position (in base pairs).
>>>
>>> for example:
>>> # SNPs,chr,start,end
>>> SNP_1,1,43,43
>>> SNP_2,2,56,56
>>>
>>> I would like to get such genomic variations annotated with various
>>> gene/protein/pathway-centric annotations (as listed in BioMart
>>> databases). I tried the R/Bioconductor biomaRt package, but I failed to
>>> get a unique line of annotation for a specific genomic position. Could
>>> you please give any directions on that?
>>
>> Could you explain a bit more about what you mean when you say "get a
>> unique line of annotation"?
>>
>> The only informative info your `getBM` query is returning is the gene id
>> for the location, and the GO term evidence code
>> (go_biological_process_linkage_type). If you add, say,
>> "go_biological_process_id", you get the biological process GO terms
>> associated with the position, i.e.:
>>
>>   result <- getBM(attributes = c("chromosome_name", "start_position",
>>                                  "ensembl_gene_id",
>>                                  "go_biological_process_linkage_type",
>>                                  "go_biological_process_id"),
>>                   filters = c("chromosome_name", "start", "end"),
>>                   values = list(chr, start, end), mart = alyr,
>>                   uniqueRows = TRUE)
>>
>> If your problem is that some positions have more than one row, like so:
>>
>>   chromosome_name start_position ensembl_gene_id   ... go_biological_process_id
>>   1               33055          scaffold_100013.1 ... GO:0006355
>>   1               33055          scaffold_100013.1 ... GO:0006886
>>   1               33055          scaffold_100013.1 ... GO:0006913
>>   1               33055          scaffold_100013.1 ... GO:0007165
>>   1               33055          scaffold_100013.1 ... GO:0007264
>>
>> this happens because multiple GO terms are shared at that location. You
>> can just pick one, but you'll have to decide how you want to
>> do that.
>>
>> If you want to somehow summarize each chromosome/start_position into
>> one row, you can iterate over the data by this combination easily
>> with, say, the ddply function from the plyr package:
>>
>>   library(plyr)
>>   summary <- ddply(result, .(chromosome_name, start_position), function(x) {
>>     # x will have all of the rows for a given chromosome_name / start_position
>>     # combo. We can arbitrarily just return the first row, but you'll likely
>>     # want to do something smarter:
>>     x[1,]
>>   })
>>
>> If you look at `summary`, you'll have one row per position.
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> | Memorial Sloan-Kettering Cancer Center
>> | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>
>
>
> --
> Jian-Feng, Mao
>
> the Institute of Botany,
> Chinese Academy of Botany,
>
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact