[Bioc-devel] seqnames missing in headerTabix()

Anita Lerch anita.lerch at fmi.ch
Wed Aug 24 10:26:56 CEST 2011


Hi,
thanks for the answer. The download worked fine (sorry, I deleted the
output of the download).

In the meantime I figured out the problem: the file was not compressed
with bgzip.

$ ./tabix -p gff ../Drosophila_melanogaster.BDGP5.25.62.gtf.gz
[tabix] was bgzip used to compress this file? ../Drosophila_melanogaster.BDGP5.25.62.gtf.gz
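
A quick way to tell a plain gzip file from a bgzip (BGZF) one before indexing is to look at the first bytes of the file. The sketch below is only a heuristic: the function name is_bgzf is made up for illustration, and it assumes the 'BC' extra subfield comes first in the gzip header, which is how bgzip writes it.

```shell
# Hypothetical helper: succeed if the file starts with a BGZF block header.
# BGZF is gzip with the FEXTRA flag set (bytes 1f 8b 08 04) and an extra
# subfield tagged 'B','C' (hex 42 43), assumed here to sit at offsets 12-13.
is_bgzf() {
    # Dump the first 16 bytes as one lowercase hex string.
    hdr=$(head -c 16 "$1" | od -An -tx1 | tr -d ' \n')
    case "$hdr" in
        # 8 hex digits of magic/flags, 16 digits we ignore, then "BC".
        1f8b0804????????????????4243*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_bgzf ../Drosophila_melanogaster.BDGP5.25.62.gtf.gz || echo "not bgzip" would have flagged the downloaded file right away.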

I fixed it by sorting the records (keeping the comment lines first) and
recompressing with bgzip:

$ (grep '^#' ../Drosophila_melanogaster.BDGP5.25.62.gtf; \
   grep -v '^#' ../Drosophila_melanogaster.BDGP5.25.62.gtf | sort -k1,1 -k4,4n) \
  | ./bgzip > ../Drosophila_melanogaster.BDGP5.25.62.gtf.bgz

$ ./tabix -p gff ../Drosophila_melanogaster.BDGP5.25.62.gtf.bgz
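
The sorting step can be wrapped in a small function for reuse. This is a sketch, not a tested recipe for every GFF/GTF file: the name sort_gff is made up, and it differs from the command I ran in that it sets an explicit tab separator and the C locale, so that fields containing spaces do not shift the sort keys.

```shell
# Hypothetical helper: print a GFF/GTF file with its comment lines first,
# followed by the records sorted by sequence name (column 1) and then
# numerically by start position (column 4), as tabix expects.
sort_gff() {
    tab=$(printf '\t')
    # Comment/header lines keep their original order.
    grep '^#' "$1"
    # Records are split on tabs only and compared bytewise.
    grep -v '^#' "$1" | LC_ALL=C sort -t "$tab" -k1,1 -k4,4n
}
```

It would be used as sort_gff in.gtf | ./bgzip > in.gtf.bgz, followed by ./tabix -p gff in.gtf.bgz, where in.gtf is a placeholder file name.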

The ?TabixFile manual page says nothing about bgzip, and the .gz
extension of example.gtf.gz is misleading. I had not looked at the
?indexTabix manual page, which states clearly that the file has to be
compressed with bgzip.

Would it be possible to forward the tabix error message to the R user
in future?

I have to mention that Rsamtools is just great. Thanks a lot for it.
Greetings
Anita

On Tue, 2011-08-23 at 09:14 -0700, Valerie Obenchain wrote:
> Hi Anita,
> 
> It looks like the download may not have worked. Check your gtfFn file to 
> see if the data are really there,
> 
>      less Drosophila_melanogaster.BDGP5.25.62.gtf.gz
> 
> Once you are sure of the download you may want to check the file for the 
> usual things -
> (1) no comment lines starting with #
> (2) the file is tab separated, not space separated
> 
> Coming from Ensembl these should not be a problem.
> 
> Valerie
> 
> 
> On 08/23/2011 07:02 AM, Anita Lerch wrote:
> > Hi,
> >
> > I tried to stream a 'gtf' file from Ensembl with the Tabix methods.
> > Creating the index file seems to work, but when I checked it with
> > headerTabix(tbx)$seqnames I got character(0).
> > Of course scanTabix() did not work then either.
> > I do not have this problem with the example file in the Rsamtools
> > package.
> > Does anybody have an explanation for this?
> >
> > Thanks in advance,
> > Anita
> >
> >> library(Rsamtools)
> >> url<- "ftp://ftp.ensembl.org/pub/release-62/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.25.62.gtf.gz"
> >> gtfFn<- "Drosophila_melanogaster.BDGP5.25.62.gtf.gz"
> >> download.file(url, gtfFn, "wget")
> >> indexTabix(gtfFn, format="gff")
> > [1] "Drosophila_melanogaster.BDGP5.25.62.gtf.gz.tbi"
> >> tbx<- open(TabixFile(gtfFn))
> >> headerTabix(tbx)
> > $seqnames
> > character(0)
> >
> > $indexColumns
> >    seq start   end
> >      1     4     5
> >
> > $skip
> > [1] 0
> >
> > $comment
> > [1] "#"
> >
> > $header
> > character(0)
> >
> >> seqnamesTabix(tbx)
> > character(0)
> >> cat(yieldTabix(tbx, yieldSize=1L))
> >> param<- GRanges(c("3L", "3R"), IRanges(c(1, 1), width=100000))
> >> scanTabix(tbx, param=param)
> > Error: scanTabix: '3L' not present in tabix index
> >    path: /home_fmi/01/lerchani/workspace/Drosophila_melanogaster.BDGP5.25.62.gtf.gz
> >
> >> sessionInfo()
> > R Under development (unstable) (2011-08-23 r56776)
> > Platform: x86_64-unknown-linux-gnu (64-bit)
> >
> > locale:
> >   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C
> >   [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >
> > other attached packages:
> > [1] Rsamtools_1.5.51     Biostrings_2.21.9    GenomicRanges_1.5.28 IRanges_1.11.24
> >
> > loaded via a namespace (and not attached):
> > [1] BSgenome_1.21.3     RCurl_1.6-9         rtracklayer_1.13.11 tools_2.14.0        XML_3.4-2           zlibbioc_0.1.7
> >
> 

-- 
Anita Lerch
Friedrich Miescher Institute
Maulbeerstrasse 66
WRO-1066.P22
4058 Basel
Phone: +41 (0)61 697 5172
Email: anita.lerch at fmi.ch


