[Bioc-sig-seq] getSeq and Btaurus$chrUn.scaffolds - ambiguous name error
jayoung at fhcrc.org
Thu Nov 25 01:01:29 CET 2010
Looks like BSgenome 1.18.2 went live a couple of hours ago - all looks
good for me now.
Thanks very much,
On Nov 23, 2010, at 1:46 PM, Janet Young wrote:
> Hi Herve,
> Thanks very much for this fix, and the other one from the same day
> (the one about space names as factors vs characters in getSeq).
> I haven't been able to test it yet with my real data, as I'm not
> seeing BSgenome 1.18.2 up yet on the bioc website (I'm using
> release, not dev), and I'm also not yet getting it when I use
> update.packages - should it be up already? If there's some kind of
> timelag I should wait for, that's fine, but I'd thought things were
> usually pretty instantaneous (or perhaps overnight), so thought I'd
> better check back with you whether the new version somehow failed to
> I can understand the problem with getSeq - for me, I think your
> workaround #2 will be totally fine (I'd actually started with seqs
> specified in RangedData objects, including start and end coords - I
> only started using names without coords to isolate the problem).
> The option to disable or enable the grep feature sounds like a good
> fix. For me, it seems more natural that the default should be to
> disable grep, but perhaps there too many other users who've got used
> to expecting the grep. Either way, it'd be great to explain it in
> the getSeq help page - I imagine that'll be the case in 1.18.2.
> On Nov 18, 2010, at 10:15 PM, Hervé Pagès wrote:
>> On 11/15/2010 06:51 PM, Janet Young wrote:
>>> Hi again,
>>> I'm interested in some sequences on the cow chrUn scaffolds, and am
>>> having a bit of bother getting them. I think I might have
>>> uncovered a
>>> bug, although I might just be doing something wrong. The code and
>>> below should explain all. Any suggestions?
>>> thanks (again!),
>> Oops, you found another one. More below...
>>> Janet Young
>>> Dr. Janet Young (Trask lab)
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Avenue N., C3-168,
>>> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>>> tel: (206) 667 1471 fax: (206) 667 6524
>>> email: jayoung ...at... fhcrc.org
>>> ###### this works, gives me a 50bp sequence
>>>  "TACCCCACTCACACTTATGGATAGATCAACTAAACAGAAAATTAACAAGG"
>>> ####### for some scaffolds in the chrUn.scaffolds pile, I don't
>>> get an
>>> error message, but getSeqs seems to ignore the start and end
>>> requested - e.g. the sequence returned here is the whole scaffold,
>>> just the first 50bp
>> Yes, the new code was ignoring the 'end'. This is now fixed.
>>> #### for other scaffolds I just get an error message, although the
>>> scaffold definitely exists (is it doing a partial match on the
>>> name, not
>>> an exact match?)
>> Yes it's doing partial matching (it uses grep()). This feature was
>> added a long time ago on a user's request. Note that the BSgenome
>> package for Cow is peculiar with its chrUn.scaffolds multiple
>> object. Multiple sequences objects in the BSgenome packages are
>> generally the upstream* objects which contain long/ugly sequence
>> names like
>> > names(Btaurus$upstream1000)[1:5]
>>  "BC118489_up_1000_chr1_142608595_r chr1:142608595-142609594"
>>  "BC151644_up_1000_chr1_44930841_f chr1:44930841-44931840"
>>  "BC116005_up_1000_chr1_46052279_f chr1:46052279-46053278"
>>  "BC133294_up_1000_chr1_44020590_f chr1:44020590-44021589"
>>  "BC109604_up_1000_chr1_65024340_r chr1:65024340-65025339"
>> so it seemed convenient to be able to specify only part of a name,
>> at least when working interactively.
>> Currently Cow and Zebrafish are the only BSgenome packages with
>> multiple sequences objects other than the upstream000 objects.
>> Thanks to your feedback, it is now clear that the "grep feature" is
>> not always desirable. Here are 2 workarounds:
>> 1) Modify the regular expression so that it will match the whole
>> > getSeq(Btaurus,"^chrUn.004.1022$", as.character=FALSE)
>> 54139-letter "DNAString" instance
>> ('as.character=FALSE' used here only to produce a compact output)
>> Note that this is not completely safe because the dots in
>> "chrUn.004.1022" could match any character. Using "^chrUn\\.004\\.
>> would be safer. But, performance aside, having to escape all special
>> characters in the regex just to have it behave like if fixed=TRUE
>> was passed to the call to grep() doesn't sound very appealing.
>> 2) So here is another one. With BSgenome 1.18.2, the "grep feature"
>> will be disable when you use a high level object like GRanges or
>> RangedData. So you will be able to do something like:
>> > getSeq(Btaurus, GRanges("chrUn.004.1022", IRanges(start=1,
>> end=10)), as.character=FALSE)
>> A DNAStringSet instance of length 1
>> width seq
>>  10 AAATCTAGAG
>> But this workaround has its issues too (e.g. if I want to get the
>> entire sequence I need to know its end in advance).
>> OK none of those workarounds is totally satisfying so I'll try to
>> come up with something better for BSgenome 1.18.3 (maybe add an
>> 'exact.match' argument to getSeq() that would be FALSE by default?)
>> Suggestions are welcome.
>> Thanks for your feedback,
>>> Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i],
>>> start[i], :
>>> sequence chrUn.004.1022 found more than once, please use a non-
>>> which ( names(Btaurus$chrUn.scaffolds) == "chrUn.004.1022" )
>>>  1022
>>> grep ( "chrUn.004.1022" , names(Btaurus$chrUn.scaffolds) )
>>>  1022 10220 10221 10222 10223 10224 10225 10226 10227 10228 10229
>>> R version 2.12.0 (2010-10-15)
>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>  en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>> attached base packages:
>>>  stats graphics grDevices utils datasets methods base
>>> other attached packages:
>>>  BSgenome.Btaurus.UCSC.bosTau4_1.3.16 BSgenome_1.18.1
>>>  Biostrings_2.18.0 GenomicRanges_1.2.1
>>>  IRanges_1.8.2
>>> loaded via a namespace (and not attached):
>>>  Biobase_2.10.0
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>> Hervé Pagès
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
More information about the Bioc-sig-sequencing