[BioC] How to retrieve gene information

Fri Dec 28 01:36:21 CET 2007

Hi Allen --

> syms <- lapply(res, lookUp, "org.Hs.eg.db", "SYMBOL")

The way to understand this is to start with the help page for lookUp

> ?lookUp

and to experiment with a single identifier, e.g.,

> id <- res[[1]][[1]]

> lookUp(id, "org.Hs.eg.db", "GENENAME")
$`24`
[1] "ATP-binding cassette, sub-family A (ABC1), member 4"

What I want to change is 'SYMBOL', i.e., the 'what' argument of
'lookUp', rather than the the data package (org.Hs.eg.db) that I want
to look up the information in, so I try

> lookUp(id, "org.Hs.eg.db", "SYMBOL")
$`24`
[1] "ABCA4"

Looks good, so I'll move on to trying to look up all entries from a
single element of my list, lets choose a short one, and use unlist in
the final line to print out more compactly...

> sapply(res, length)
[1] 1833   13 1165  162  425
> ids <- res[[2]]
> unlist(lookUp(ids, "org.Hs.eg.db", "SYMBOL"))
    5737     8880    10561    10964    11080    23032    23266    26009 
 "PTGFR"  "FUBP1"  "IFI44" "IFI44L" "DNAJB4"  "USP33"  "LPHN2"   "ZZZ3" 
   26289    54810    64123    91624   374986 
   "AK5"  "GIPC2"  "ELTD1"   "NEXN" "FAM73A" 

Note that 'lookUp' automatically looking up a vector of
ids. Nice. Finally, we want to look up each element of 'nms'. We use
lapply, which can take a list as its first argument, and a function as
its second argument. Each element of the list becomes the first
arugment of the function. Subsequent arguments to the function can be
applied after the function is given (I'm getting this from looking at
the help page for lapply). This gets me to

> syms <- lapply(res, lookUp, "org.Hs.eg.db", "SYMBOL")

If I wanted to replace the 'names' of one of the list elements in res,
I might

> names(nms[[1]]) <- syms[[1]]

or more cleverly for all elements of nms

> nms <- mapply("names<-", nms, sysms)

Can you see how this works?

Not sure how you know that some genes are 'hypothetical'. If it's by
looking at their description, then something like

> grep("^hypothetical", nms[[1]])
 [1] 1079 1464 1466 1555 1584 1614 1710 1748 1781 1807 1810 1828

will tell you the index of names you'd like to remove. You could do
something like

> n1 <- nms[[1]]
> idx <- grep("^hypothetical", n1)
> nms[[1]] <- n1[-idx]

How would you write that to apply to all the elements in your 'nms'
list? What if a list element has no hypothetical gene names?

Sorry about ComputedCollection; it's in the development version of
GSEABase; just drop it from the argument list.

Martin

"affy snp" <affysnp at gmail.com> writes:

> Hi Martin,
> Thank you!
> I tried the code below as you suggested:
> ####################################################################
> library(annotate)
> library(org.Hs.eg.db)
> df <- data.frame(chr=c(1,1,2,2,3),
>         start=c(2171559, 77465510, 1, 145329129, 1),
>         end=c(245522847, 82255496, 245522847, 187974870, 59538720))
> filt <- function(eid, czome, start, end) {
>      loc <- abs(eid) # either strand
>      any(names(eid)==czome & loc>start & loc<end)
>  }
> eids <- function(crit, map) {
>      found <- unlist(eapply(map, filt,
>                             czome=crit[["chr"]],
>                             start=crit[["start"]],
>                             end=crit[["end"]]))
>      ls(map[found])
>  }
> res <- apply(df, 1, eids, map=org.Hs.egCHRLOC)
> nms <- lapply(res, lookUp, "org.Hs.eg.db", "GENENAME")
> #############################################################
> But I have no idea how to apply 'org.Hs.egSYMBOL' correctly. I tried
> to replace 'org.Hs.eg.db' with 'org.Hs.egSYMBOL' but it was not the
> case.
> The several of last lines in nms are like:
> [[5]]$`643853`
> [1] "similar to F40B5.2b"
> [[5]]$`646424`
> [1] "serine peptidase inhibitor, Kazal type 8 (putative)"
> [[5]]$`646498`
> [1] "hypothetical LOC646498"
> [[5]]$`727811`
> [1] "similar to chemokine (C-C motif) receptor-like 2"
> Again, I would like to have the ID replaced by Gene Symbol.  But just don't know
> how to use 'org.Hs.egSYMBOL'. I noticed there are some genes assigned as 'hypothetical'
> predicted genes. Is there a way to filter out those genes?
> With regard to 'GSEABase' package, I installed the package and tried
> the code:
> gss <- mapply(GeneSet, res, setName=setNames,
>                MoreArgs=list(
>                  geneIdType=EntrezIdentifier(),
>                  collectionType=ComputedCollection()))
> But an error popped up as:
>> library(GSEABase)
>> gss <- mapply(GeneSet, res, setName=setNames,
> +                MoreArgs=list(
> +                  geneIdType=EntrezIdentifier(),
> +                  collectionType=ComputedCollection()))
> Error in mapply(GeneSet, res, setName = setNames, MoreArgs = list(geneIdType = EntrezIdentifier(),  :
>   could not find function "ComputedCollection"
> I appreciate!
> Good evening!
> Allen
>
>
> On Dec 27, 2007 1:35 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>           Hi Allen --     
>
>                affy snp wrote:     > Hi Martin,     >     > Thanks for brining this to my attention.     > I am wondering:     >     > (1) Should the location information only be specified     > in a way using 'e'?
>      Will sth like 1234567 work?     
>
>
>      There's nothing magical going on. 'filt' just compares numeric values in     'loc' to another numeric value. Any numeric comparison is possible.     
>
>                > (2) While the result return the gene function, can it also include     > the gene symbol?     
>      
>
>
>      The function returns a list, with names of the list elements being the     Entrez ids. There are other maps in the org.* packages to get at other     types of information, for instance org.Hs.egSYMBOL to get
>      symbol'     values associated with each Entrez id. List available maps with     
>
>                 > library(org.Hs.eg.db)     
>
>
>       > ls("package: org.Hs.eg.db")     
>      and find out about each with     
>       > ?org.Hs.egSYMBOL     
>
>           > (3) In case I want to look up genes of multiple chromosomal regions,     > how easily can I do that if I supply with a table of regions such as:     >     > chr      Start                End     > 1    
>      2171559         245522847     > 1     77465510       82255496     > 2     1                   243018229     > 2     145329129     187974870     > 3     1                    59538720     
>      
>
>
>      Pretty easily, though now you need to specify your problem a little more     fully, and write some R code to do what you want. One possibility is     that you would like the gene symbols satisfying all conditions
>      within a     row (chr=1 & loc>s2711559 & end<245522847), but for any of the rows. A     first approach might start by generalizing 'filt'     
>       > filt <- function(eid, czome, start, end) {     
>
>           +     loc <- abs(eid) # either strand     
>
>
>      +     any(names(eid)==czome & loc>start & loc<end)     + }     
>      and putting the code to find Entrez ids into a function     
>       > eids <- function(crit, map) {     +     found <- unlist(eapply(map, filt,     +                            czome=crit[["chr"]],     +                            start=crit[["start"]],     +                    
>             end=crit[["end"]]))     +     ls(map[found])     + }     
>      'crit' is the criterion we'll use to select entrez ides from 'map', it     will be a named vector with elements 'chr', 'start', 'end'.     
>      We'll 'apply' our 'eids' function to each row of a data frame     representing your criteria:     
>       > df <- data.frame(chr=c(1,1,2,2,3),     +         start=c(2171559, 77465510, 1, 145329129, 1),     +         end=c(245522847, 82255496, 245522847, 187974870, 59538720))      > res <- apply(df, 1, eids,
>      map=org.Hs.egCHRLOC)      > str(res)     List of 5      $ : chr [1:1833] "24" "27" "34" "58" ...      $ : chr [1:13] "5737" "8880" "10561" "10964" ...      $ : chr [1:1165] "14" "33" "52" "72" ...      $ : chr
>      [1:162] "90" "92" "390" "518" ...      $ : chr [1:425] "30" "93" "95" "211" ...     
>      The result is a list, with each element corresponding to the Entrez ids     found in the corresponding row of 'df'.     
>      It turns out that there are     
>       > sum(sapply(res, length))     [1] 3598     
>      Entrez IDs but some Entrez ids occur in more than one region     
>       > length(unique(unlist(res)))     [1] 3423     
>      You could find the GENENAMEs for all of these with     
>       > nms <- lapply(res, lookUp, "org.Hs.eg.db", "GENENAME")     
>      Depending on what you want to do in the future, you might also think     about making these into GeneSets, e.g.,     
>       > library(GSEABase)      > setNames <- apply(df, 1, paste, collapse="_")      > gss <- mapply(GeneSet, res, setName=setNames,     +               MoreArgs=list(     +                
>      geneIdType=EntrezIdentifier(),     +                 collectionType=ComputedCollection()))      > gsc <- GeneSetCollection(gss)      > gsc     GeneSetCollection      names: 1_2171559_245522847,
>      1_77465510_82255496, ..., 3_1_59538720 (5     total)      unique identifiers: 24, 27, ..., 727811 (3423 total)      types in collection:        geneIdType: EntrezIdentifier (1 total)        collectionType:
>      ComputedCollection (1 total)     
>      Martin     
>
>                >     > Thank you and happy holiday!     >     > Allen     >     > On Dec 26, 2007 12:26 PM, Martin Morgan <mtmorgan at fhcrc.org     
>
>           
>           > <mailto:mtmorgan at fhcrc.org>> wrote:     >     >     Hi Allen --     >     >     Also the 'org.*' annotation packages provide organism-centric     >     annotations. These packages have environment-like
>      key-value structures     >     that typically map Entrez identifiers to other information. The     >     CHRLOC     >     variables map Entrez ids to named integer vectors. Names on the vector     >     are
>      chromosome locations, values are the strand (+ or -) and base     >     start position. Thus     >     >     > library(annotate)     >     > library(org.Hs.eg.db )     >     >     >     > filt <- function(eid) {     >
>          +     loc <- abs(eid) # either strand     >     +     any(names(eid)=="3" & loc>1e8 & loc<1.1e8)     >     + }     >     > found <- unlist(eapply(org.Hs.egCHRLOC , filt))     >     > sum(found) # named genes
>      found     >     [1] 33     >     >     >     > eids <- ls(org.Hs.egCHRLOC[found]) # subset; extract entrez ids     >     > head(lookUp(eids, "org.Hs.eg.db", "GENENAME")) # first 6     >     $`214`     >     [1]
>      "activated leukocyte cell adhesion molecule"     >     >     $`868`     >     [1] "Cas-Br-M (murine) ecotropic retroviral transforming sequence b"     >     >     $`961`     >     [1] "CD47 molecule"     >     >  
>        $`1295`     >     [1] "collagen, type VIII, alpha 1"     >     >     $`6152`     >     [1] "ribosomal protein L24"     >     >     $`9666`     >     [1] "zinc finger DAZ interacting protein 3"     >     >    
>      The annotation packages are from data sources taken from a     >     snapshot of     >     data sources (see org.Hs.eg_dbInfo()) that might differ from the data     >     sources used for biomaRt; the merits of
>      this include (a)     >     reproduciblilty and (b) relative speed of computation (no internet     >     download required).     >     >     Martin     >     >     
>
>           >     toedling at ebi.ac.uk <mailto:toedling at ebi.ac.uk> writes:     >     >     > Hi Allen,     >     >     >     > have a look at the biomaRt package:     >     >
>      http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html     >     > and see the vignette section 4.5 Task 5.     >     > You may want to select additional attributes returned, such as     >     >
>      "description" etc. Use the "listAttributes" function to obtain     >     of list of     >     > possible attributes.     >     >     >     > Best and Merry Christmas,     >     > Joern     >     >     >     >>
>      Dear list,     >     >>     >     >> I am wondering if I supply with chromosome information, start     >     and end     >     >> position     >     >> of a region, can any Bioconductor package could return a list     >
>          of genes     >     >> with     >     >> short descriptions within those regions?     >     >>     >     >> Thanks and Merry Christmas!     >     >>     >     >> Allen     >     >>     >     >>     >     >     >
>          > _______________________________________________     >     > Bioconductor mailing list     >     > Bioconductor at stat.math.ethz.ch     
>
>
>      >     <mailto:Bioconductor at stat.math.ethz.ch>     
>
>           
>           >     > https://stat.ethz.ch/mailman/listinfo/bioconductor     >     > Search the archives:     >     http://news.gmane.org/gmane.science.biology.informatics.conductor     >     >     --     >     Martin
>      Morgan     >     Computational Biology / Fred Hutchinson Cancer Research Center     >     1100 Fairview Ave. N.     >     PO Box 19024 Seattle, WA 98109     >     >     Location: Arnold Building M2 B169     >    
>      Phone: (206) 667-2793     >     >     
>      
>
>
>
>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793