[Bioc-sig-seq] getSeq and Btaurus$chrUn.scaffolds - ambiguous name error

Tue Nov 23 22:46:20 CET 2010

Hi Herve,

Thanks very much for this fix, and the other one from the same day  
(the one about space names as factors vs characters in getSeq).

I haven't been able to test it yet with my real data, as I'm not  
seeing BSgenome 1.18.2 up yet on the bioc website (I'm using release,  
not dev), and I'm also not yet getting it when I use update.packages -  
should it be up already? If there's some kind of timelag I should wait  
for, that's fine, but I'd thought things were usually pretty  
instantaneous (or perhaps overnight), so thought I'd better check back  
with you whether the new version somehow failed to post.

I can understand the problem with getSeq - for me, I think your  
workaround #2 will be totally fine (I'd actually started with seqs  
specified in RangedData objects, including start and end coords - I  
only started using names without coords to isolate the problem).

The option to disable or enable the grep feature sounds like a good  
fix. For me, it seems more natural that the default should be to  
disable grep, but perhaps there too many other users who've got used  
to expecting the grep. Either way, it'd be great to explain it in the  
getSeq help page - I imagine that'll be the case in 1.18.2.

Janet

On Nov 18, 2010, at 10:15 PM, Hervé Pagès wrote:

> On 11/15/2010 06:51 PM, Janet Young wrote:
>> Hi again,
>>
>> I'm interested in some sequences on the cow chrUn scaffolds, and am
>> having a bit of bother getting them. I think I might have uncovered a
>> bug, although I might just be doing something wrong. The code and  
>> output
>> below should explain all. Any suggestions?
>>
>> thanks (again!),
>
> Oops, you found another one. More below...
>
>>
>> Janet Young
>>
>> -------------------------------------------------------------------
>>
>> Dr. Janet Young (Trask lab)
>>
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Avenue N., C3-168,
>> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>>
>> tel: (206) 667 1471 fax: (206) 667 6524
>> email: jayoung ...at... fhcrc.org
>>
>> http://www.fhcrc.org/labs/trask/
>>
>> -------------------------------------------------------------------
>>
>>
>> library(BSgenome.Btaurus.UCSC.bosTau4)
>>
>> ###### this works, gives me a 50bp sequence
>> getSeq(Btaurus,"chr1",start=1,end=50)
>> [1] "TACCCCACTCACACTTATGGATAGATCAACTAAACAGAAAATTAACAAGG"
>>
>> ####### for some scaffolds in the chrUn.scaffolds pile, I don't get  
>> an
>> error message, but getSeqs seems to ignore the start and end  
>> coordinates
>> requested - e.g. the sequence returned here is the whole scaffold,  
>> not
>> just the first 50bp
>> getSeq(Btaurus,"chrUn.004.11829",start=1,end=50)
>> [1]
>> "TCATGTGTTTCTTCCAGTCCAGCATTTCTCATGATGTACTCTGCATATAAGTTAAATAAACAGGGTGA 
>> CAATATACAGCCTTGATGAACTCCTTTTCCTATTTGGAACCAGTCTGTTGTTCCATGTCCAGTTCTAACTGTTGCTTCCTGACCTGCATACAGATTTCTCAAGAGGCAGATCAGGTGTTCTCATCTCCTGAGAATTGAAGGTACAAATTGTAGTGTTTCAATTGGCACCATGCTAATTTATCTTGGCCTAAAATAGTGAATGGGCTTCCCTGGTGGCTCAGGTGGTAAAGAATCTGCCTGCAATGCTGGAGACCTGGGTTCAATATCTGGGTTGGGAAGATTACCTGGAGGAGGGCATGGAGGCTTACTCGAATATTCTTGCCTGGAAAATCTCCATGGACAGAGAAGCTGGGTGGGTTACTGTCCATGGGGTCGCAAAGAGTCAGACGTGACTGAGCAACTAAGCACAGCACAACACAAAATAGTGAATACTGAGCAAGTAAAGGAAAAACCTCTTCCTCTCAGAAATTGGTCTTCATTTTTTCATGAGAATTGCTAGTCTTCCTCCCAAAGCCAAAACCATAAATTTGTTAGTGTTTGACCTCAATATATTTTCTCTTAACTCAGCTTTTAAACCTTCTCTGCCTCCTGCTACCATTCACTTTCTAGTACATTTGAAATCTGTCCAAGCCATTCCTGGGGTTCAGGTGTCTGAGACCTGATTTATTTCATTGATATATTAAAACACCCTTGAATCCAGCCAACGTATGTGGCCAGTTTTACTTGCTTTGCTCCCATACTGGTAATGGAATTTTTATGGCTGTAAAATATCTGGGTCATGTGGCATTTTCATCTTCTGTTGTCTTGAGCTGGTATAGTTTTACCAACGTGCCATTAAGGGATGGTTCCTTTACCATCATTGTGCTTCCTGGGGCCTTGCCCACTTTGCACTGTAAGTCAGAACAAGAGACCCTCCAAG
> TATTTAATTTCC"
>
> Yes, the new code was ignoring the 'end'. This is now fixed.
>
>>
>>
>> #### for other scaffolds I just get an error message, although the  
>> named
>> scaffold definitely exists (is it doing a partial match on the  
>> name, not
>> an exact match?)
>
> Yes it's doing partial matching (it uses grep()). This feature was
> added a long time ago on a user's request. Note that the BSgenome
> package for Cow is peculiar with its chrUn.scaffolds multiple sequence
> object. Multiple sequences objects in the BSgenome packages are
> generally the upstream* objects which contain long/ugly sequence
> names like
>
>  > names(Btaurus$upstream1000)[1:5]
>  [1] "BC118489_up_1000_chr1_142608595_r chr1:142608595-142609594"
>  [2] "BC151644_up_1000_chr1_44930841_f chr1:44930841-44931840"
>  [3] "BC116005_up_1000_chr1_46052279_f chr1:46052279-46053278"
>  [4] "BC133294_up_1000_chr1_44020590_f chr1:44020590-44021589"
>  [5] "BC109604_up_1000_chr1_65024340_r chr1:65024340-65025339"
>
> so it seemed convenient to be able to specify only part of a name,
> at least when working interactively.
> Currently Cow and Zebrafish are the only BSgenome packages with
> multiple sequences objects other than the upstream[125]000 objects.
>
> Thanks to your feedback, it is now clear that the "grep feature" is
> not always desirable. Here are 2 workarounds:
>
> 1) Modify the regular expression so that it will match the whole  
> thing:
>
>  > getSeq(Btaurus,"^chrUn.004.1022$", as.character=FALSE)
>    54139-letter "DNAString" instance
>  seq:  
> AAATCTAGAGTAGAGTTTGGAATTTCAGATATACAA 
> ...GTGCTGCAATTCATGGGGTCACAAAGAGTCAGACAT
>
> ('as.character=FALSE' used here only to produce a compact output)
> Note that this is not completely safe because the dots in
> "chrUn.004.1022" could match any character. Using "^chrUn\\.004\\. 
> 1022$"
> would be safer. But, performance aside, having to escape all special
> characters in the regex just to have it behave like if fixed=TRUE
> was passed to the call to grep() doesn't sound very appealing.
>
> 2) So here is another one. With BSgenome 1.18.2, the "grep feature"
> will be disable when you use a high level object like GRanges or
> RangedData. So you will be able to do something like:
>
>  > getSeq(Btaurus, GRanges("chrUn.004.1022", IRanges(start=1,  
> end=10)), as.character=FALSE)
>    A DNAStringSet instance of length 1
>      width seq
>  [1]    10 AAATCTAGAG
>
> But this workaround has its issues too (e.g. if I want to get the
> entire sequence I need to know its end in advance).
>
> OK none of those workarounds is totally satisfying so I'll try to
> come up with something better for BSgenome 1.18.3 (maybe add an
> 'exact.match' argument to getSeq() that would be FALSE by default?)
> Suggestions are welcome.
>
> Thanks for your feedback,
> H.
>
>>
>> getSeq(Btaurus,"chrUn.004.1022")
>> Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i],  
>> start[i], :
>> sequence chrUn.004.1022 found more than once, please use a non- 
>> ambiguous
>> name
>>
>> which ( names(Btaurus$chrUn.scaffolds) == "chrUn.004.1022" )
>> [1] 1022
>>
>> grep ( "chrUn.004.1022" , names(Btaurus$chrUn.scaffolds) )
>> [1] 1022 10220 10221 10222 10223 10224 10225 10226 10227 10228 10229
>>
>> sessionInfo()
>>
>> R version 2.12.0 (2010-10-15)
>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] BSgenome.Btaurus.UCSC.bosTau4_1.3.16 BSgenome_1.18.1
>> [3] Biostrings_2.18.0 GenomicRanges_1.2.1
>> [5] IRanges_1.8.2
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.10.0
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
> -- 
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319