[BioC] Using biomaRt to check probe locations
Sean Davis
sdavis2 at mail.nih.gov
Mon Aug 11 15:54:00 CEST 2008
On Mon, Aug 11, 2008 at 9:26 AM, James W. MacDonald
<jmacdon at med.umich.edu> wrote:
> Hi Nathan,
>
> Nathan Harmston wrote:
>>
>> Hi,
>>
>> I have just found a few of the probes which I find to be significantly
>> diff expression in one of my analysis, have no attached genomic
>> location in the annotation package I am using. So I thought a way
>> around this would be to query ensembl using biomaRt and retrieve
>> probe_set locations from ensembl (since I trust Ensembl infinitely
>> more than Affy), however in the listAttributes(ensembl) I have been
>> unable to determine the attribute that I require to determine the
>> start and end location and strand of the actual probe involved,
>> however all I can get back are transcript locations and gene
>> locations.
>>
>> For example probe: 1552422_at from the hgu133plus2 array.
>>
>> Is there an attribute that I am missing that can be used?
>
> Not that I know of. Ensembl is pretty gene/transcript centric. If you want
> the start/stop coordinates (or if you just want to see where the probes map
> to), there are two ways that I can think of offhand.
>
> First, you could use the probe package to output the probe sequences in
> FASTA format and then upload to Blat at UCSC. You will then get where the
> probes map to, as well as the ability to look at the locations in the genome
> browser. This is the easier of the two, especially if I give you the code
> ;-D
> Second, you could take the probe sequences and use the Biostrings and
> BSgenome.Hsapiens.NCBI.b36v3 (or the UCSC.hg18, your choice) to map the
> probes to the genome to get the start and stop coordinates. You could then
> use the rtracklayer package to put the locations on the genome browser. This
> will take more work, but will possibly pay dividends in the future if you
> need to do much mapping of things to the genome.
>
> I won't give you code for this, as the Biostrings package is in a state of
> rapid evolution and what I wrote 6 months ago may not work at all anymore.
> However, the GenomeSearching vignette (in the BSgenome package) contains the
> code I used as a template, and should be a good resource.
There is a third method, as Ensembl DOES store the information about
probe locations. They are not available via the Biomart interface,
but they are available for querying directly from the "core"
databases. Assuming you want to use the most recent human build, you
can do this to get all probe mappings for all arrays:
> library(RMySQL)
Loading required package: DBI
> con <- dbConnect('MySQL',db='homo_sapiens_core_47_36i',host='ensembldb.ensembl.org',user='anonymous')
> dat <- dbGetQuery(con,"select op.*,of.*,oa.name from oligo_probe op join oligo_feature of on op.oligo_probe_id=of.oligo_probe_id join oligo_array oa on oa.oligo_array_id=op.oligo_array_id limit 10;") #used limit 10 here for speed.
# you could also specify oa.name = 'HGU_U133A_2', or some other
# array type if you want to limit things
> dat
oligo_probe_id oligo_array_id probeset name description
1 6234295 25 89898_at 224:361; <NA>
2 6234619 24 40196_at 579:607; <NA>
3 6234619 32 40196_at 579:607; <NA>
4 6234817 28 Hs.137418.0.A1_3p_at 878:461; <NA>
5 6234443 31 230040_at 58:459; <NA>
6 6234443 37 230040_at 590:747; <NA>
7 6234567 28 Hs.87134.0.A1_3p_at 210:371; <NA>
8 6235060 28 Hs2.213065.1.S1_3p_s_at 376:253; <NA>
9 6234839 25 74854_at 396:307; <NA>
10 6234776 26 218630_at 551:581; <NA>
length oligo_feature_id seq_region_id seq_region_start seq_region_end
1 25 31760350 226060 1783255 1783279
2 25 31760351 226053 38000872 38000896
3 25 31760351 226053 38000872 38000896
4 25 31760352 226056 21841523 21841547
5 25 31760353 226036 75873860 75873884
6 25 31760353 226036 75873860 75873884
7 25 31760354 226034 162916270 162916294
8 25 31760355 226055 95978760 95978784
9 25 31760356 226043 134008271 134008295
10 25 31760357 226033 53638348 53638372
seq_region_strand mismatches oligo_probe_id analysis_id name
1 -1 0 6234295 5112 HG_U95E
2 1 0 6234619 5112 HG_U95A
3 1 0 6234619 5112 HG_U95Av2
4 -1 0 6234817 5112 U133_X3P
5 -1 0 6234443 5112 HG_U133B
6 -1 0 6234443 5112 HG_U133_Plus_2
7 1 0 6234567 5112 U133_X3P
8 -1 0 6235060 5112 U133_X3P
9 1 0 6234839 5112 HG_U95E
10 -1 0 6234776 5112 HG_U133A_2
Note that this will leave you with a mess, potentially, as not all of
the probes from the same probe set necessarily map to the same
location in the genome.
Hope that helps.
Sean
More information about the Bioconductor
mailing list