[BioC] How to retrieve gene information
Martin Morgan
mtmorgan at fhcrc.org
Thu Dec 27 19:35:16 CET 2007
Hi Allen --
affy snp wrote:
> Hi Martin,
>
> Thanks for brining this to my attention.
> I am wondering:
>
> (1) Should the location information only be specified
> in a way using 'e'? Will sth like 1234567 work?
There's nothing magical going on. 'filt' just compares numeric values in
'loc' to another numeric value. Any numeric comparison is possible.
> (2) While the result return the gene function, can it also include
> the gene symbol?
The function returns a list, with names of the list elements being the
Entrez ids. There are other maps in the org.* packages to get at other
types of information, for instance org.Hs.egSYMBOL to get 'symbol'
values associated with each Entrez id. List available maps with
> library(org.Hs.eg.db)
> ls("package:org.Hs.eg.db")
and find out about each with
> ?org.Hs.egSYMBOL
> (3) In case I want to look up genes of multiple chromosomal regions,
> how easily can I do that if I supply with a table of regions such as:
>
> chr Start End
> 1 2171559 245522847
> 1 77465510 82255496
> 2 1 243018229
> 2 145329129 187974870
> 3 1 59538720
Pretty easily, though now you need to specify your problem a little more
fully, and write some R code to do what you want. One possibility is
that you would like the gene symbols satisfying all conditions within a
row (chr=1 & loc>s2711559 & end<245522847), but for any of the rows. A
first approach might start by generalizing 'filt'
> filt <- function(eid, czome, start, end) {
+ loc <- abs(eid) # either strand
+ any(names(eid)==czome & loc>start & loc<end)
+ }
and putting the code to find Entrez ids into a function
> eids <- function(crit, map) {
+ found <- unlist(eapply(map, filt,
+ czome=crit[["chr"]],
+ start=crit[["start"]],
+ end=crit[["end"]]))
+ ls(map[found])
+ }
'crit' is the criterion we'll use to select entrez ides from 'map', it
will be a named vector with elements 'chr', 'start', 'end'.
We'll 'apply' our 'eids' function to each row of a data frame
representing your criteria:
> df <- data.frame(chr=c(1,1,2,2,3),
+ start=c(2171559, 77465510, 1, 145329129, 1),
+ end=c(245522847, 82255496, 245522847, 187974870, 59538720))
> res <- apply(df, 1, eids, map=org.Hs.egCHRLOC)
> str(res)
List of 5
$ : chr [1:1833] "24" "27" "34" "58" ...
$ : chr [1:13] "5737" "8880" "10561" "10964" ...
$ : chr [1:1165] "14" "33" "52" "72" ...
$ : chr [1:162] "90" "92" "390" "518" ...
$ : chr [1:425] "30" "93" "95" "211" ...
The result is a list, with each element corresponding to the Entrez ids
found in the corresponding row of 'df'.
It turns out that there are
> sum(sapply(res, length))
[1] 3598
Entrez IDs but some Entrez ids occur in more than one region
> length(unique(unlist(res)))
[1] 3423
You could find the GENENAMEs for all of these with
> nms <- lapply(res, lookUp, "org.Hs.eg.db", "GENENAME")
Depending on what you want to do in the future, you might also think
about making these into GeneSets, e.g.,
> library(GSEABase)
> setNames <- apply(df, 1, paste, collapse="_")
> gss <- mapply(GeneSet, res, setName=setNames,
+ MoreArgs=list(
+ geneIdType=EntrezIdentifier(),
+ collectionType=ComputedCollection()))
> gsc <- GeneSetCollection(gss)
> gsc
GeneSetCollection
names: 1_2171559_245522847, 1_77465510_82255496, ..., 3_1_59538720 (5
total)
unique identifiers: 24, 27, ..., 727811 (3423 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: ComputedCollection (1 total)
Martin
>
> Thank you and happy holiday!
>
> Allen
>
> On Dec 26, 2007 12:26 PM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
> Hi Allen --
>
> Also the 'org.*' annotation packages provide organism-centric
> annotations. These packages have environment-like key-value structures
> that typically map Entrez identifiers to other information. The
> CHRLOC
> variables map Entrez ids to named integer vectors. Names on the vector
> are chromosome locations, values are the strand (+ or -) and base
> start position. Thus
>
> > library(annotate)
> > library(org.Hs.eg.db )
> >
> > filt <- function(eid) {
> + loc <- abs(eid) # either strand
> + any(names(eid)=="3" & loc>1e8 & loc<1.1e8)
> + }
> > found <- unlist(eapply(org.Hs.egCHRLOC , filt))
> > sum(found) # named genes found
> [1] 33
> >
> > eids <- ls(org.Hs.egCHRLOC[found]) # subset; extract entrez ids
> > head(lookUp(eids, "org.Hs.eg.db", "GENENAME")) # first 6
> $`214`
> [1] "activated leukocyte cell adhesion molecule"
>
> $`868`
> [1] "Cas-Br-M (murine) ecotropic retroviral transforming sequence b"
>
> $`961`
> [1] "CD47 molecule"
>
> $`1295`
> [1] "collagen, type VIII, alpha 1"
>
> $`6152`
> [1] "ribosomal protein L24"
>
> $`9666`
> [1] "zinc finger DAZ interacting protein 3"
>
> The annotation packages are from data sources taken from a
> snapshot of
> data sources (see org.Hs.eg_dbInfo()) that might differ from the data
> sources used for biomaRt; the merits of this include (a)
> reproduciblilty and (b) relative speed of computation (no internet
> download required).
>
> Martin
>
>
> toedling at ebi.ac.uk <mailto:toedling at ebi.ac.uk> writes:
>
> > Hi Allen,
> >
> > have a look at the biomaRt package:
> > http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html
> > and see the vignette section 4.5 Task 5.
> > You may want to select additional attributes returned, such as
> > "description" etc. Use the "listAttributes" function to obtain
> of list of
> > possible attributes.
> >
> > Best and Merry Christmas,
> > Joern
> >
> >> Dear list,
> >>
> >> I am wondering if I supply with chromosome information, start
> and end
> >> position
> >> of a region, can any Bioconductor package could return a list
> of genes
> >> with
> >> short descriptions within those regions?
> >>
> >> Thanks and Merry Christmas!
> >>
> >> Allen
> >>
> >>
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> <mailto:Bioconductor at stat.math.ethz.ch>
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (206) 667-2793
>
>
More information about the Bioconductor
mailing list