[Bioc-sig-seq] getSeq and Btaurus$chrUn.scaffolds - ambiguous name error

Fri Nov 19 07:15:32 CET 2010

On 11/15/2010 06:51 PM, Janet Young wrote:
> Hi again,
>
> I'm interested in some sequences on the cow chrUn scaffolds, and am
> having a bit of bother getting them. I think I might have uncovered a
> bug, although I might just be doing something wrong. The code and output
> below should explain all. Any suggestions?
>
> thanks (again!),

Oops, you found another one. More below...

>
> Janet Young
>
> -------------------------------------------------------------------
>
> Dr. Janet Young (Trask lab)
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Avenue N., C3-168,
> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>
> tel: (206) 667 1471 fax: (206) 667 6524
> email: jayoung ...at... fhcrc.org
>
> http://www.fhcrc.org/labs/trask/
>
> -------------------------------------------------------------------
>
>
> library(BSgenome.Btaurus.UCSC.bosTau4)
>
> ###### this works, gives me a 50bp sequence
> getSeq(Btaurus,"chr1",start=1,end=50)
> [1] "TACCCCACTCACACTTATGGATAGATCAACTAAACAGAAAATTAACAAGG"
>
> ####### for some scaffolds in the chrUn.scaffolds pile, I don't get an
> error message, but getSeqs seems to ignore the start and end coordinates
> requested - e.g. the sequence returned here is the whole scaffold, not
> just the first 50bp
> getSeq(Btaurus,"chrUn.004.11829",start=1,end=50)
> [1]
> "TCATGTGTTTCTTCCAGTCCAGCATTTCTCATGATGTACTCTGCATATAAGTTAAATAAACAGGGTGACAATATACAGCCTTGATGAACTCCTTTTCCTATTTGGAACCAGTCTGTTGTTCCATGTCCAGTTCTAACTGTTGCTTCCTGACCTGCATACAGATTTCTCAAGAGGCAGATCAGGTGTTCTCATCTCCTGAGAATTGAAGGTACAAATTGTAGTGTTTCAATTGGCACCATGCTAATTTATCTTGGCCTAAAATAGTGAATGGGCTTCCCTGGTGGCTCAGGTGGTAAAGAATCTGCCTGCAATGCTGGAGACCTGGGTTCAATATCTGGGTTGGGAAGATTACCTGGAGGAGGGCATGGAGGCTTACTCGAATATTCTTGCCTGGAAAATCTCCATGGACAGAGAAGCTGGGTGGGTTACTGTCCATGGGGTCGCAAAGAGTCAGACGTGACTGAGCAACTAAGCACAGCACAACACAAAATAGTGAATACTGAGCAAGTAAAGGAAAAACCTCTTCCTCTCAGAAATTGGTCTTCATTTTTTCATGAGAATTGCTAGTCTTCCTCCCAAAGCCAAAACCATAAATTTGTTAGTGTTTGACCTCAATATATTTTCTCTTAACTCAGCTTTTAAACCTTCTCTGCCTCCTGCTACCATTCACTTTCTAGTACATTTGAAATCTGTCCAAGCCATTCCTGGGGTTCAGGTGTCTGAGACCTGATTTATTTCATTGATATATTAAAACACCCTTGAATCCAGCCAACGTATGTGGCCAGTTTTACTTGCTTTGCTCCCATACTGGTAATGGAATTTTTATGGCTGTAAAATATCTGGGTCATGTGGCATTTTCATCTTCTGTTGTCTTGAGCTGGTATAGTTTTACCAACGTGCCATTAAGGGATGGTTCCTTTACCATCATTGTGCTTCCTGGGGCCTTGCCCACTTTGCACTGTAAGTCAGAACAAGAGACCCTCCAAG
TATTTAATTTCC"

Yes, the new code was ignoring the 'end'. This is now fixed.

>
>
> #### for other scaffolds I just get an error message, although the named
> scaffold definitely exists (is it doing a partial match on the name, not
> an exact match?)

Yes it's doing partial matching (it uses grep()). This feature was
added a long time ago on a user's request. Note that the BSgenome
package for Cow is peculiar with its chrUn.scaffolds multiple sequence
object. Multiple sequences objects in the BSgenome packages are
generally the upstream* objects which contain long/ugly sequence
names like

   > names(Btaurus$upstream1000)[1:5]
   [1] "BC118489_up_1000_chr1_142608595_r chr1:142608595-142609594"
   [2] "BC151644_up_1000_chr1_44930841_f chr1:44930841-44931840"
   [3] "BC116005_up_1000_chr1_46052279_f chr1:46052279-46053278"
   [4] "BC133294_up_1000_chr1_44020590_f chr1:44020590-44021589"
   [5] "BC109604_up_1000_chr1_65024340_r chr1:65024340-65025339"

so it seemed convenient to be able to specify only part of a name,
at least when working interactively.
Currently Cow and Zebrafish are the only BSgenome packages with
multiple sequences objects other than the upstream[125]000 objects.

Thanks to your feedback, it is now clear that the "grep feature" is
not always desirable. Here are 2 workarounds:

1) Modify the regular expression so that it will match the whole thing:

   > getSeq(Btaurus,"^chrUn.004.1022$", as.character=FALSE)
     54139-letter "DNAString" instance
   seq: 
AAATCTAGAGTAGAGTTTGGAATTTCAGATATACAA...GTGCTGCAATTCATGGGGTCACAAAGAGTCAGACAT

('as.character=FALSE' used here only to produce a compact output)
Note that this is not completely safe because the dots in
"chrUn.004.1022" could match any character. Using "^chrUn\\.004\\.1022$"
would be safer. But, performance aside, having to escape all special
characters in the regex just to have it behave like if fixed=TRUE
was passed to the call to grep() doesn't sound very appealing.

2) So here is another one. With BSgenome 1.18.2, the "grep feature"
will be disable when you use a high level object like GRanges or
RangedData. So you will be able to do something like:

   > getSeq(Btaurus, GRanges("chrUn.004.1022", IRanges(start=1, 
end=10)), as.character=FALSE)
     A DNAStringSet instance of length 1
       width seq
   [1]    10 AAATCTAGAG

But this workaround has its issues too (e.g. if I want to get the
entire sequence I need to know its end in advance).

OK none of those workarounds is totally satisfying so I'll try to
come up with something better for BSgenome 1.18.3 (maybe add an
'exact.match' argument to getSeq() that would be FALSE by default?)
Suggestions are welcome.

Thanks for your feedback,
H.

>
> getSeq(Btaurus,"chrUn.004.1022")
> Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], start[i], :
> sequence chrUn.004.1022 found more than once, please use a non-ambiguous
> name
>
> which ( names(Btaurus$chrUn.scaffolds) == "chrUn.004.1022" )
> [1] 1022
>
> grep ( "chrUn.004.1022" , names(Btaurus$chrUn.scaffolds) )
> [1] 1022 10220 10221 10222 10223 10224 10225 10226 10227 10228 10229
>
> sessionInfo()
>
> R version 2.12.0 (2010-10-15)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] BSgenome.Btaurus.UCSC.bosTau4_1.3.16 BSgenome_1.18.1
> [3] Biostrings_2.18.0 GenomicRanges_1.2.1
> [5] IRanges_1.8.2
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.10.0
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319