[Bioc-devel] how to get genomic sequences?
Sean Davis
sdavis2 at mail.nih.gov
Wed Mar 28 03:28:20 CEST 2007
Sean Davis wrote:
> Roger Liu wrote:
>
>> Hi,
>>
>> I have a set of data with chromosome number and coordinates of the sequences
>> such as,chr10, start 1000, end 2000.
>> I have tried using biomart to retrieve the genomic sequences for my dataset,
>> but I didn't get success, I used seqType argument as:
>> seqType="genomic", it reported error as"The type of sequence specified with
>> seqType is not available. Please select from: cdna, peptide, 3utr, 5utr",
>> but I have seen this "genomic" argument for seqType in the help file. So
>> what's going on there?
>>
>> Or anyone can recommend a package which can help me retrieve the genomic
>> sequences from hg18 with known chromosome number and sequences
>> coordinates(start and end).
>>
>>
> URLs like so:
>
> http://genome.ucsc.edu/cgi-bin/das/hg18/dna?segment=chr1:1000,2000;segment=chr2:1000,2000
>
> will get you sequences of interest. These come as simple well-formed
> XML, so you can use the XML package to parse them pretty easily. You
> can do one-at-a-time, or string together a bunch.
>
Just a quick followup and a skeleton function... The function takes a
dataframe with three columns, chrom, start, end and a single character
argument, genome, which needs to match the UCSC genome browser notation
for the genome build. It fetches the das DNA one at a time and returns
the original dataframe and a fourth column with the sequence. There
isn't any error checking, etc., and it fetches one sequence at a time.
Both of these issues could be eliminated with a little work.
getSequences <- function(locs,genome='hg18') {
require(XML)
chrom <- locs$chrom
starts <- locs$start
ends <- locs$end
seqs <- vector('character',length=length(chrom))
for(i in 1:length(chrom)) {
url <-
sprintf('http://genome.ucsc.edu/cgi-bin/das/%s/dna?segment=%s:%d,%d',genome,chrom[i],starts[i],ends[i])
results <- xmlRoot(xmlTreeParse(url))
seqs[i] <- gsub('\n','',xmlValue(results[[1]][[1]]))
}
return(data.frame(chrom=chrom,start=starts,end=ends,sequence=seqs))
}
And you could use it like so:
mySeqs <-
getSequences(data.frame(chrom=c('chr1','chr2'),starts=c(1000,1000),ends=c(2000,1500)))
Herve Page's solution is undoubtedly the better one, but if memory is a
limitation, this solution might be of use.
Sean
More information about the Bioc-devel
mailing list