[BioC] Using biomaRt to check probe locations

Mon Aug 11 18:05:29 CEST 2008

Section 5 in the main vignette for the 'altcdfenvs' packages shows
how to quickly check what probesets of interest are "really" made of.
It uses Biostrings, and the code can be edited to suit custom needs.

I have written quickly an example with the probeset mentioned (this is
going through the refseq ID; doing a full-genome match, as James mentions
it, could be interesting to check for possible cross-matches):

library(hgu133plus2.db)

get("1552422_at", hgu133plus2GENENAME)
# [1] "chromosome 10 open reading frame 25"
## Let's see with refseq

refseq_ids <- get("1552422_at", hgu133plus2REFSEQ)

library(biomaRt)
mart <- useMart("ensembl",dataset="hsapiens_gene_ensembl")

library(altcdfenvs)

getSeq <- function(name) {
     seq <- getSequence(id = name, type = "refseq_dna", seqType = "cdna",
         mart = mart)
     targets <- seq$cdna
     if (is.null(targets))
       return(character(0))
   names(targets) <- paste(seq$refseq_dna, 1:nrow(seq), sep = "-")
   return(targets)
}
targets <- unlist(lapply(refseq_ids,
                         getSeq))

library(hgu133plus2probe)

## taking code from altcdfenvs' matchAffyProbes()
## (only consider PM, this is a demo)

probes <- subset(hgu133plus2probe, Probe.Set.Name == "1552422_at")
stringset <- DNAStringSet(probes$sequence)
pmdict <- PDict(stringset)
targets <- as.list(targets)
for (ii in seq(along = targets)) {
  if (is.na(targets[[ii]])) {
    stop(paste("Target", ii, "is NA."))
  }
  targets[[ii]] <- DNAString(targets[[ii]])
}

m_pm <- vector("list", length = length(targets))
for (ii in seq(along = targets)) {
  md <- matchPDict(pmdict, targets[[ii]])
  m_pm[[ii]] <- md
}

# cheap plot
i <- 1
plot(0, 0, xlim=c(0, targets[[i]]@length), ylim=c(0, length(m_pm[[i]])),
     main=names(targets)[i], type="n")
points(unlist(startIndex(m_pm[[i]])), seq(length(m_pm[[i]])))

> On Mon, Aug 11, 2008 at 9:26 AM, James W. MacDonald
> <jmacdon at med.umich.edu> wrote:
>> Hi Nathan,
>>
>> Nathan Harmston wrote:
>>>
>>> Hi,
>>>
>>> I have just found a few of the probes which I find to be significantly
>>> diff expression in one of my analysis, have no attached genomic
>>> location in the annotation package I am using. So I thought a way
>>> around this would be to query ensembl using biomaRt and retrieve
>>> probe_set locations from ensembl (since I trust Ensembl infinitely
>>> more than Affy), however in the listAttributes(ensembl) I have been
>>> unable to determine the attribute that I require to determine the
>>> start and end location and strand of the actual probe involved,
>>> however all I can get back are transcript locations and gene
>>> locations.
>>>
>>> For example probe: 1552422_at from the hgu133plus2 array.
>>>
>>> Is there an attribute that I am missing that can be used?
>>
>> Not that I know of. Ensembl is pretty gene/transcript centric. If you
>> want
>> the start/stop coordinates (or if you just want to see where the probes
>> map
>> to), there are two ways that I can think of offhand.
>>
>> First, you could use the probe package to output the probe sequences in
>> FASTA format and then upload to Blat at UCSC. You will then get where
>> the
>> probes map to, as well as the ability to look at the locations in the
>> genome
>> browser. This is the easier of the two, especially if I give you the
>> code
>> ;-D
>
>> Second, you could take the probe sequences and use the Biostrings and
>> BSgenome.Hsapiens.NCBI.b36v3 (or the UCSC.hg18, your choice) to map the
>> probes to the genome to get the start and stop coordinates. You could
>> then
>> use the rtracklayer package to put the locations on the genome browser.
>> This
>> will take more work, but will possibly pay dividends in the future if
>> you
>> need to do much mapping of things to the genome.
>>
>> I won't give you code for this, as the Biostrings package is in a state
>> of
>> rapid evolution and what I wrote 6 months ago may not work at all
>> anymore.
>> However, the GenomeSearching vignette (in the BSgenome package) contains
>> the
>> code I used as a template, and should be a good resource.
>
> There is a third method, as Ensembl DOES store the information about
> probe locations.  They are not available via the Biomart interface,
> but they are available for querying directly from the "core"
> databases.  Assuming you want to use the most recent human build, you
> can do this to get all probe mappings for all arrays:
>
>> library(RMySQL)
> Loading required package: DBI
>> con <-
>> dbConnect('MySQL',db='homo_sapiens_core_47_36i',host='ensembldb.ensembl.org',user='anonymous')
>> dat <- dbGetQuery(con,"select op.*,of.*,oa.name from oligo_probe op join
>> oligo_feature of on op.oligo_probe_id=of.oligo_probe_id join oligo_array
>> oa on oa.oligo_array_id=op.oligo_array_id limit 10;") #used limit 10
>> here for speed.
> # you could also specify oa.name = 'HGU_U133A_2', or some other
> #   array type if you want to limit things
>> dat
>    oligo_probe_id oligo_array_id                probeset     name
> description
> 1         6234295             25                89898_at 224:361;
> <NA>
> 2         6234619             24                40196_at 579:607;
> <NA>
> 3         6234619             32                40196_at 579:607;
> <NA>
> 4         6234817             28    Hs.137418.0.A1_3p_at 878:461;
> <NA>
> 5         6234443             31               230040_at  58:459;
> <NA>
> 6         6234443             37               230040_at 590:747;
> <NA>
> 7         6234567             28     Hs.87134.0.A1_3p_at 210:371;
> <NA>
> 8         6235060             28 Hs2.213065.1.S1_3p_s_at 376:253;
> <NA>
> 9         6234839             25                74854_at 396:307;
> <NA>
> 10        6234776             26               218630_at 551:581;
> <NA>
>    length oligo_feature_id seq_region_id seq_region_start seq_region_end
> 1      25         31760350        226060          1783255        1783279
> 2      25         31760351        226053         38000872       38000896
> 3      25         31760351        226053         38000872       38000896
> 4      25         31760352        226056         21841523       21841547
> 5      25         31760353        226036         75873860       75873884
> 6      25         31760353        226036         75873860       75873884
> 7      25         31760354        226034        162916270      162916294
> 8      25         31760355        226055         95978760       95978784
> 9      25         31760356        226043        134008271      134008295
> 10     25         31760357        226033         53638348       53638372
>    seq_region_strand mismatches oligo_probe_id analysis_id           name
> 1                 -1          0        6234295        5112        HG_U95E
> 2                  1          0        6234619        5112        HG_U95A
> 3                  1          0        6234619        5112      HG_U95Av2
> 4                 -1          0        6234817        5112       U133_X3P
> 5                 -1          0        6234443        5112       HG_U133B
> 6                 -1          0        6234443        5112 HG_U133_Plus_2
> 7                  1          0        6234567        5112       U133_X3P
> 8                 -1          0        6235060        5112       U133_X3P
> 9                  1          0        6234839        5112        HG_U95E
> 10                -1          0        6234776        5112     HG_U133A_2
>
> Note that this will leave you with a mess, potentially, as not all of
> the probes from the same probe set necessarily map to the same
> location in the genome.
>
> Hope that helps.
>
> Sean
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>