[Bioc-sig-seq] getSeq and Btaurus$chrUn.scaffolds - ambiguous name error

Thu Nov 25 01:01:29 CET 2010

Hi,

Looks like BSgenome 1.18.2 went live a couple of hours ago - all looks  
good for me now.

Thanks very much,

Janet

On Nov 23, 2010, at 1:46 PM, Janet Young wrote:

> Hi Herve,
>
> Thanks very much for this fix, and the other one from the same day  
> (the one about space names as factors vs characters in getSeq).
>
> I haven't been able to test it yet with my real data, as I'm not  
> seeing BSgenome 1.18.2 up yet on the bioc website (I'm using  
> release, not dev), and I'm also not yet getting it when I use  
> update.packages - should it be up already? If there's some kind of  
> timelag I should wait for, that's fine, but I'd thought things were  
> usually pretty instantaneous (or perhaps overnight), so thought I'd  
> better check back with you whether the new version somehow failed to  
> post.
>
> I can understand the problem with getSeq - for me, I think your  
> workaround #2 will be totally fine (I'd actually started with seqs  
> specified in RangedData objects, including start and end coords - I  
> only started using names without coords to isolate the problem).
>
> The option to disable or enable the grep feature sounds like a good  
> fix. For me, it seems more natural that the default should be to  
> disable grep, but perhaps there too many other users who've got used  
> to expecting the grep. Either way, it'd be great to explain it in  
> the getSeq help page - I imagine that'll be the case in 1.18.2.
>
> Janet
>
>
>
>
> On Nov 18, 2010, at 10:15 PM, Hervé Pagès wrote:
>
>> On 11/15/2010 06:51 PM, Janet Young wrote:
>>> Hi again,
>>>
>>> I'm interested in some sequences on the cow chrUn scaffolds, and am
>>> having a bit of bother getting them. I think I might have  
>>> uncovered a
>>> bug, although I might just be doing something wrong. The code and  
>>> output
>>> below should explain all. Any suggestions?
>>>
>>> thanks (again!),
>>
>> Oops, you found another one. More below...
>>
>>>
>>> Janet Young
>>>
>>> -------------------------------------------------------------------
>>>
>>> Dr. Janet Young (Trask lab)
>>>
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Avenue N., C3-168,
>>> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>>>
>>> tel: (206) 667 1471 fax: (206) 667 6524
>>> email: jayoung ...at... fhcrc.org
>>>
>>> http://www.fhcrc.org/labs/trask/
>>>
>>> -------------------------------------------------------------------
>>>
>>>
>>> library(BSgenome.Btaurus.UCSC.bosTau4)
>>>
>>> ###### this works, gives me a 50bp sequence
>>> getSeq(Btaurus,"chr1",start=1,end=50)
>>> [1] "TACCCCACTCACACTTATGGATAGATCAACTAAACAGAAAATTAACAAGG"
>>>
>>> ####### for some scaffolds in the chrUn.scaffolds pile, I don't  
>>> get an
>>> error message, but getSeqs seems to ignore the start and end  
>>> coordinates
>>> requested - e.g. the sequence returned here is the whole scaffold,  
>>> not
>>> just the first 50bp
>>> getSeq(Btaurus,"chrUn.004.11829",start=1,end=50)
>>> [1]
>>> "TCATGTGTTTCTTCCAGTCCAGCATTTCTCATGATGTACTCTGCATATAAGTTAAATAAACAGGGTG 
>>> ACAATATACAGCCTTGATGAACTCCTTTTCCTATTTGGAACCAGTCTGTTGTTCCATGTCCAGTTCTAACTGTTGCTTCCTGACCTGCATACAGATTTCTCAAGAGGCAGATCAGGTGTTCTCATCTCCTGAGAATTGAAGGTACAAATTGTAGTGTTTCAATTGGCACCATGCTAATTTATCTTGGCCTAAAATAGTGAATGGGCTTCCCTGGTGGCTCAGGTGGTAAAGAATCTGCCTGCAATGCTGGAGACCTGGGTTCAATATCTGGGTTGGGAAGATTACCTGGAGGAGGGCATGGAGGCTTACTCGAATATTCTTGCCTGGAAAATCTCCATGGACAGAGAAGCTGGGTGGGTTACTGTCCATGGGGTCGCAAAGAGTCAGACGTGACTGAGCAACTAAGCACAGCACAACACAAAATAGTGAATACTGAGCAAGTAAAGGAAAAACCTCTTCCTCTCAGAAATTGGTCTTCATTTTTTCATGAGAATTGCTAGTCTTCCTCCCAAAGCCAAAACCATAAATTTGTTAGTGTTTGACCTCAATATATTTTCTCTTAACTCAGCTTTTAAACCTTCTCTGCCTCCTGCTACCATTCACTTTCTAGTACATTTGAAATCTGTCCAAGCCATTCCTGGGGTTCAGGTGTCTGAGACCTGATTTATTTCATTGATATATTAAAACACCCTTGAATCCAGCCAACGTATGTGGCCAGTTTTACTTGCTTTGCTCCCATACTGGTAATGGAATTTTTATGGCTGTAAAATATCTGGGTCATGTGGCATTTTCATCTTCTGTTGTCTTGAGCTGGTATAGTTTTACCAACGTGCCATTAAGGGATGGTTCCTTTACCATCATTGTGCTTCCTGGGGCCTTGCCCACTTTGCACTGTAAGTCAGAACAAGAGACCCTCCAAG
>> TATTTAATTTCC"
>>
>> Yes, the new code was ignoring the 'end'. This is now fixed.
>>
>>>
>>>
>>> #### for other scaffolds I just get an error message, although the  
>>> named
>>> scaffold definitely exists (is it doing a partial match on the  
>>> name, not
>>> an exact match?)
>>
>> Yes it's doing partial matching (it uses grep()). This feature was
>> added a long time ago on a user's request. Note that the BSgenome
>> package for Cow is peculiar with its chrUn.scaffolds multiple  
>> sequence
>> object. Multiple sequences objects in the BSgenome packages are
>> generally the upstream* objects which contain long/ugly sequence
>> names like
>>
>> > names(Btaurus$upstream1000)[1:5]
>> [1] "BC118489_up_1000_chr1_142608595_r chr1:142608595-142609594"
>> [2] "BC151644_up_1000_chr1_44930841_f chr1:44930841-44931840"
>> [3] "BC116005_up_1000_chr1_46052279_f chr1:46052279-46053278"
>> [4] "BC133294_up_1000_chr1_44020590_f chr1:44020590-44021589"
>> [5] "BC109604_up_1000_chr1_65024340_r chr1:65024340-65025339"
>>
>> so it seemed convenient to be able to specify only part of a name,
>> at least when working interactively.
>> Currently Cow and Zebrafish are the only BSgenome packages with
>> multiple sequences objects other than the upstream[125]000 objects.
>>
>> Thanks to your feedback, it is now clear that the "grep feature" is
>> not always desirable. Here are 2 workarounds:
>>
>> 1) Modify the regular expression so that it will match the whole  
>> thing:
>>
>> > getSeq(Btaurus,"^chrUn.004.1022$", as.character=FALSE)
>>   54139-letter "DNAString" instance
>> seq:  
>> AAATCTAGAGTAGAGTTTGGAATTTCAGATATACAA 
>> ...GTGCTGCAATTCATGGGGTCACAAAGAGTCAGACAT
>>
>> ('as.character=FALSE' used here only to produce a compact output)
>> Note that this is not completely safe because the dots in
>> "chrUn.004.1022" could match any character. Using "^chrUn\\.004\\. 
>> 1022$"
>> would be safer. But, performance aside, having to escape all special
>> characters in the regex just to have it behave like if fixed=TRUE
>> was passed to the call to grep() doesn't sound very appealing.
>>
>> 2) So here is another one. With BSgenome 1.18.2, the "grep feature"
>> will be disable when you use a high level object like GRanges or
>> RangedData. So you will be able to do something like:
>>
>> > getSeq(Btaurus, GRanges("chrUn.004.1022", IRanges(start=1,  
>> end=10)), as.character=FALSE)
>>   A DNAStringSet instance of length 1
>>     width seq
>> [1]    10 AAATCTAGAG
>>
>> But this workaround has its issues too (e.g. if I want to get the
>> entire sequence I need to know its end in advance).
>>
>> OK none of those workarounds is totally satisfying so I'll try to
>> come up with something better for BSgenome 1.18.3 (maybe add an
>> 'exact.match' argument to getSeq() that would be FALSE by default?)
>> Suggestions are welcome.
>>
>> Thanks for your feedback,
>> H.
>>
>>>
>>> getSeq(Btaurus,"chrUn.004.1022")
>>> Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i],  
>>> start[i], :
>>> sequence chrUn.004.1022 found more than once, please use a non- 
>>> ambiguous
>>> name
>>>
>>> which ( names(Btaurus$chrUn.scaffolds) == "chrUn.004.1022" )
>>> [1] 1022
>>>
>>> grep ( "chrUn.004.1022" , names(Btaurus$chrUn.scaffolds) )
>>> [1] 1022 10220 10221 10222 10223 10224 10225 10226 10227 10228 10229
>>>
>>> sessionInfo()
>>>
>>> R version 2.12.0 (2010-10-15)
>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] BSgenome.Btaurus.UCSC.bosTau4_1.3.16 BSgenome_1.18.1
>>> [3] Biostrings_2.18.0 GenomicRanges_1.2.1
>>> [5] IRanges_1.8.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.10.0
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>> -- 
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>