[BioC] Using biomaRt to check probe locations
James W. MacDonald
jmacdon at med.umich.edu
Mon Aug 11 15:26:10 CEST 2008
Hi Nathan,
Nathan Harmston wrote:
> Hi,
>
> I have just found a few of the probes which I find to be significantly
> diff expression in one of my analysis, have no attached genomic
> location in the annotation package I am using. So I thought a way
> around this would be to query ensembl using biomaRt and retrieve
> probe_set locations from ensembl (since I trust Ensembl infinitely
> more than Affy), however in the listAttributes(ensembl) I have been
> unable to determine the attribute that I require to determine the
> start and end location and strand of the actual probe involved,
> however all I can get back are transcript locations and gene
> locations.
>
> For example probe: 1552422_at from the hgu133plus2 array.
>
> Is there an attribute that I am missing that can be used?
Not that I know of. Ensembl is pretty gene/transcript centric. If you
want the start/stop coordinates (or if you just want to see where the
probes map to), there are two ways that I can think of offhand.
First, you could use the probe package to output the probe sequences in
FASTA format and then upload to Blat at UCSC. You will then get where
the probes map to, as well as the ability to look at the locations in
the genome browser. This is the easier of the two, especially if I give
you the code ;-D
blatGene <- function(affyid, probe, filename){
## affyid == Affy probeset ID
## probe == BioC probe package name
## filename == output file name
require(probe, quietly = TRUE, character.only = TRUE)
tmp <- data.frame(get(probe))
if(length(affyid) > 1){
seqnc <- vector()
for(i in seq(along = affyid))
seqnc <- c(seqnc, tmp[tmp$Probe.Set.Name == affyid[i], 1])
}else{
seqnc <- tmp[tmp$Probe.Set.Name == affyid,1]
}
out <- vector()
if(length(seqnc) > 25) warning("Blat will only return values for 25
or fewer sequences!",
call. = FALSE)
for(i in seq(along = seqnc)) out <- rbind(out, rbind(paste(">
Probe", i, sep=""), seqnc[i]))
write.table(out, filename, sep="\t", quote=FALSE, row.names=FALSE,
col.names=FALSE)
}
Second, you could take the probe sequences and use the Biostrings and
BSgenome.Hsapiens.NCBI.b36v3 (or the UCSC.hg18, your choice) to map the
probes to the genome to get the start and stop coordinates. You could
then use the rtracklayer package to put the locations on the genome
browser. This will take more work, but will possibly pay dividends in
the future if you need to do much mapping of things to the genome.
I won't give you code for this, as the Biostrings package is in a state
of rapid evolution and what I wrote 6 months ago may not work at all
anymore. However, the GenomeSearching vignette (in the BSgenome package)
contains the code I used as a template, and should be a good resource.
Best,
Jim
>
> Also just out of interest, does getBM have the 3 second break
> automatically or not? or does it use one big query? I ve only been
> using 1 test probe, and would like to make sure I am not going to get
> myself and my university blocked.
>
> Many thanks in advance,
>
>
> Nathan
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-0646
734-936-8662
More information about the Bioconductor
mailing list