[BioC] Using biomaRt to check probe locations

James W. MacDonald jmacdon at med.umich.edu
Mon Aug 11 15:26:10 CEST 2008


Hi Nathan,

Nathan Harmston wrote:
> Hi,
> 
> I have just found a few of the probes which I find to be significantly
> diff expression in one of my analysis, have no attached genomic
> location in the annotation package I am using. So I thought a way
> around this would be to query ensembl using biomaRt and retrieve
> probe_set locations from ensembl (since I trust Ensembl infinitely
> more than Affy), however in the listAttributes(ensembl) I have been
> unable to determine the attribute that I require to determine the
> start and end location and strand of the actual probe involved,
> however all I can get back are transcript locations and gene
> locations.
> 
> For example probe: 1552422_at from the hgu133plus2 array.
> 
> Is there an attribute that I am missing that can be used?

Not that I know of. Ensembl is pretty gene/transcript centric. If you 
want the start/stop coordinates (or if you just want to see where the 
probes map to), there are two ways that I can think of offhand.

First, you could use the probe package to output the probe sequences in 
FASTA format and then upload to Blat at UCSC. You will then get where 
the probes map to, as well as the ability to look at the locations in 
the genome browser. This is the easier of the two, especially if I give 
you the code ;-D

blatGene <- function(affyid, probe, filename){
     ## affyid == Affy probeset ID
     ## probe == BioC probe package name
     ## filename == output file name
     require(probe, quietly = TRUE, character.only = TRUE)
     tmp <- data.frame(get(probe))
     if(length(affyid) > 1){
         seqnc <- vector()
         for(i in seq(along = affyid))
             seqnc <- c(seqnc, tmp[tmp$Probe.Set.Name == affyid[i], 1])
     }else{
         seqnc <- tmp[tmp$Probe.Set.Name == affyid,1]
     }
     out <- vector()
     if(length(seqnc) > 25) warning("Blat will only return values for 25 
or fewer sequences!",
                                    call. = FALSE)
     for(i in seq(along = seqnc)) out <- rbind(out, rbind(paste("> 
Probe", i, sep=""), seqnc[i]))
     write.table(out, filename, sep="\t", quote=FALSE, row.names=FALSE, 
col.names=FALSE)
}

Second, you could take the probe sequences and use the Biostrings and 
BSgenome.Hsapiens.NCBI.b36v3 (or the UCSC.hg18, your choice) to map the 
probes to the genome to get the start and stop coordinates. You could 
then use the rtracklayer package to put the locations on the genome 
browser. This will take more work, but will possibly pay dividends in 
the future if you need to do much mapping of things to the genome.

I won't give you code for this, as the Biostrings package is in a state 
of rapid evolution and what I wrote 6 months ago may not work at all 
anymore. However, the GenomeSearching vignette (in the BSgenome package) 
contains the code I used as a template, and should be a good resource.

Best,

Jim


> 
> Also just out of interest, does getBM have the 3 second break
> automatically or not? or does it use one big query? I ve only been
> using 1 test probe, and would like to make sure I am not going to get
> myself and my university blocked.
> 
> Many thanks in advance,
> 
> 
> Nathan
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-0646
734-936-8662



More information about the Bioconductor mailing list