[Bioc-sig-seq] rtracklayer BUG!? import.bed pastes/prefixes 'chr' during import of SOME records

Cook, Malcolm MEC at stowers.org
Thu Jun 9 14:44:30 CEST 2011


Thanks for those changes.  They work for me now as hoped/expected.  Great!  

I am using R version 2.13.0, biocLite version 2.8.4 which installs Bioconductor version 2.8 packages, which of course does not include rtracklayer 1.13.3

The following worked to install, but I wonder if you advise against it:

> svn co https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/rtracklayer
> export  _R_CHECK_FORCE_SUGGESTS_=false
# so as to avoid: Packages required but not available:  humanStemCell BSgenome.Hsapiens.UCSC.hg19
> R CMD CHECK rtracklayer
> R CMD INSTALL rtracklayer

Being new to R/Bioconductor, I am not (yet?) in the practice of running the development R to pick up changes to development Bioconductor packages such as this, and am a bit of a novice at R package management.

Thanks for any advice, and again, thanks for rtracklayer.

By the way, I noticed others on the mailing list who were stymied in trying to reproduce an older vignette.  Drat for them - I feel their pain.  However, I do think you made the right decision for the tool.  I hope you can stick to it!



From: Michael Lawrence [mailto:lawrence.michael at gene.com] 
Sent: Friday, June 03, 2011 1:31 PM
To: Cook, Malcolm
Cc: Michael com>; bioc-sig-sequencing at r-project.org
Subject: Re: rtracklayer BUG!? import.bed pastes/prefixes 'chr' during import of SOME records

On Fri, Jun 3, 2011 at 10:17 AM, Cook, Malcolm <MEC at stowers.org> wrote:
Has this been fixed in devel (the removal of the renaming feature)?

Yes, I think I checked that in.
In general, can you recommend a "best way" of staying abreast with changes to your (and others) bioconductor . 
 I've been playing around with cobbling an RSS feed of bioconductor submits (http://research.stowers.org/mec/bioc-changeLog/log-rss.xml)  (not yet on cron).
But the granularity and specificity don't feel quite right....

It would be nice to have something like this. One like it used to exist. This one is looking good. Obvious things would be package-specific filtering, and showing only an abbreviated commit message at the top-level, allowing drill-down for details.

From: Cook, Malcolm 
Sent: Wednesday, June 01, 2011 12:21 AM
To: Michael com>

Subject: Re: rtracklayer BUG!? import.bed pastes/prefixes 'chr' during import of SOME records
On Tue, May 31, 2011 at 2:33 PM, Cook, Malcolm <MEC at stowers.org> wrote:
Hi Michael,
Regarding your workarounds:
I find that passing trackLine=FALSE does not work as you suggest when the file contains a track line.  I get this error
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'an integer', got 'name=B52_merged.peak.ES <http://B52_merged.peak.ES> '

Thanks, fixed in devel.
However, I find that importing with asRangedData=FALSE does work.  I can convert to RangedData after the import.
Regarding the issue of the presence of track lines being UCSC specific:
I can give you one example where track lines are NOT specific to UCSC, namely IGV, which reads .bed files.  IGV also interprets the track lines (to some degree).  See http://www.broadinstitute.org/software/igv/TrackLine 
IGVmakes no requirements that the chromosome names follow the naming convention wherein the chromosomes must begin with 'chr'.
Does this convince you?

Yea, I guess. IGB is the same way. I just wish there were more formal definitions of the file formats.
How about this argument: There are genomes hosted at ucsc that are not assembled to the chromosome level, and whose identifiers don't begin with 'chr', such as elephant.  http://genome.ucsc.edu/cgi-bin/hgGateway?hgsid=196975635&clade=mammal&org=Elephant&db=0

This helps and would argue against any sort of coercion of the names. It seems reasonable to just leave it up to the user.
Regarding rtracklayer code:
I'm pretty sure I've been using import.bed with track files having tracklines and using asRangedData=TRUE without experiencing a renaming of chromosomes.  
Can you please confirm my suspicion that renaming the chromosomes on import is a new behavior of rtracklayer since the new bioconductor release?  

Maybe. The coercion of the names has been there forever when coercing to UCSCData (for upload to the UCSC browser). Maybe it has somehow come to apply to the import of tracks, as well, but I can't track down when that happened, due to the complexity of the code. The logic involved in importing a track is far more complex than one might imagine. Not sure how it got that way.
If the behavior is indeed new, would you consider recoding to be backward compatible by providing a new option, something like "recodeToUCSCStandards" (or something), with a default of FALSE.  
I don't have allot of code that depends on backward compatibility and indeed your second workaround will work for all my cases.  But, I might not be alone, and it might save you some grief.
Also, I think you might argue yourself into agreeing with me that renaming like this should really by default be OFF (if done at all).

I think I'm convinced and will go ahead and disable the renaming in all cases. It's hard to think of situations where the user cannot simply perform the renaming (it's hardly ever necessary anyway), and making it optional would just increase the complexity further.
At the very least, the renaming should be documented somewhere.  Excuse me if I missed it.... I did look for it.
Thanks for everything!
I have a related issue regarding how to alter the seqnames in a BSgenome.... I swear I had a working approach for this that longer works with the recent bioC update, giving me "  unable to find an inherited method for function "seqinfo<-", for signature "BSgenome"" .  I'll open another thread....

From: Michael Lawrence [mailto:lawrence.michael at gene.com] 
Sent: Tuesday, May 31, 2011 3:54 PM
To: Cook, Malcolm
Cc: bioc-sig-sequencing at r-project.org; Michael Lawrence
Subject: Re: rtracklayer BUG!? import.bed pastes/prefixes 'chr' during import of SOME records


Your BED file contains a "track line" which is something that is defined by the UCSC Genome Browser. I view (and others might disagree, in fact I often disagree with myself about this) that the track lines are specific to UCSC and not in the core BED format. Thus when rtracklayer sees a track line, it by default attempts to make the chromosomes conform to the UCSC convention. 

This can be avoided in a number of ways. First, passing trackLine=FALSE will cause the track line to be ignored. Second, passing asRangedData=FALSE will cause the return value to be a GRanges, in which case it cannot hold the track line data (at least not in any formal way -- it could go into the metadata) and thus the track line is ignored.

So those are workarounds. I could be convinced though that presence of a track line does NOT imply conformance to UCSC conventions. 

I fixed the incomplete conversion in devel. Please let me know if it needs to be back-ported.


On Tue, May 31, 2011 at 12:53 PM, Cook, Malcolm <MEC at stowers.org> wrote:
I find that rtracklayer's import.bed function is pasting the string 'chr' to SOME of my chromosome names in my bed file ( f, attached)

It is as if it is trying to rename (some) of them to (partially) agree with ucsc's naming convention.

I would expect import.bed  to preserve column 1 as the name of the 'space', but it does not.

But... some of the spaces have had a 'chr' prefix added!

Look at the session transcipt below!

I _think_ this behavior is NEW since I recently upgraded.

I hope not to have to revert to previous version of R / bioconductor to address this issue.

Any suggestions?


Malcolm Cook
Computational Biology - Stowers Institute for Medical Research

> library(rtracklayer)
Loading required package: RCurl
Loading required package: bitops
> unique(read.table('f.bed',sep="\t",skip=1)[,1])
 [1] YHet                      dmel_mitochondrion_genome
 [3] 2L                        X
 [5] 3L                        4
 [7] 2R                        3R
 [9] Uextra                    2RHet
[11] 2LHet                     3LHet
[13] 3RHet                     U
[15] XHet
15 Levels: 2L 2LHet 2R 2RHet 3L 3LHet 3R 3RHet 4 U Uextra X XHet ... dmel_mitochondrion_genome
> unique(space(import.bed ('f.bed')))
 [1] chr2L                     2LHet
 [3] chr2R                     2RHet
 [5] chr3L                     3LHet
 [7] chr3R                     3RHet
 [9] chr4                      chrU
[11] Uextra                    chrX
[13] XHet                      YHet
[15] dmel_mitochondrion_genome
15 Levels: chr2L 2LHet chr2R 2RHet chr3L 3LHet chr3R 3RHet chr4 chrU ... dmel_mitochondrion_genome
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rtracklayer_1.12.2 RCurl_1.5-0        bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] BSgenome_1.20.0     Biostrings_2.20.1   GenomicRanges_1.4.6
[4] IRanges_1.10.4      XML_3.4-0


More information about the Bioc-sig-sequencing mailing list