[Bioc-devel] seqnames missing in headerTabix()

Martin Morgan mtmorgan at fhcrc.org
Wed Aug 24 16:35:09 CEST 2011


On 08/24/2011 01:26 AM, Anita Lerch wrote:
> Hi,
> thanks for the answer. The download worked fine (sorry, I deleted the
> output of the download).
>
> In the meantime I figured out the problem: the file was not compressed
> with bgzip.
>
> $ ./tabix -p gff ../Drosophila_melanogaster.BDGP5.25.62.gtf.gz
> [tabix] was bgzip used to compress this
> file? ../Drosophila_melanogaster.BDGP5.25.62.gtf.gz
>
> I fixed it with
>
> $ (grep ^"#" ../Drosophila_melanogaster.BDGP5.25.62.gtf; \
>    grep -v ^"#" ../Drosophila_melanogaster.BDGP5.25.62.gtf | sort -k1,1 -k4,4n) \
>    | ./bgzip > ../Drosophila_melanogaster.BDGP5.25.62.gtf.bgz
>
> $ ./tabix ../Drosophila_melanogaster.BDGP5.25.62.gtf.bgz -p gff
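
The sorting step above can be wrapped in a small shell function. This is only a sketch: `sort_gff` is a hypothetical helper name, not part of tabix or Rsamtools, and it assumes an uncompressed, tab-separated GFF/GTF file whose comment lines start with `#`.

```shell
# sort_gff FILE: print comment lines first, then the records sorted by
# sequence name (column 1) and numeric start position (column 4).
# Hypothetical helper for illustration only.
sort_gff() {
    # Comment/header lines keep their original order; "|| true" keeps
    # going when the file has no comment lines at all.
    grep '^#' "$1" || true
    # Split on tabs explicitly, and use the C locale so the ordering of
    # sequence names does not depend on the user's locale.
    grep -v '^#' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k4,4n
}
```

Assuming bgzip and tabix are on the PATH, it could be used as `sort_gff in.gtf | bgzip > in.gtf.bgz && tabix -p gff in.gtf.bgz`, where `in.gtf` is a placeholder file name.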
>
> The ?TabixFile manual page does not mention bgzip, and the .gz file
> extension of example.gtf.gz is misleading.
> I had not looked at the ?indexTabix manual page, which clearly states
> that the file has to be compressed with bgzip.
>
> Would it be possible to pass the tabix error message on to the R user
> in future?

Hi Anita -- thanks, yes, the error is now (version 1.5.54) reported when 
indexTabix is applied to a non-bgzip'd file. Martin
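
For files prepared outside R, a similar check can be made in the shell before indexing. The sketch below relies on the BGZF block header as bgzip writes it: a gzip member with the FEXTRA flag set (bytes 1f 8b 08 04) whose first extra subfield is tagged "BC" (bytes 42 43 at offset 12). `is_bgzf` is a hypothetical helper name, not part of tabix or Rsamtools, and it assumes `head -c` and `od` are available.

```shell
# is_bgzf FILE: succeed if FILE starts with a BGZF (bgzip) block header.
# Only the first block is inspected, and the "BC" subfield is assumed to
# be the first extra subfield, as in files written by bgzip.
is_bgzf() {
    # Dump the first 14 bytes as one lowercase hex string.
    case "$(head -c 14 "$1" | od -An -tx1 | tr -d ' \n')" in
        # magic + CM + FLG, 8 bytes we ignore (MTIME, XFL, OS, XLEN), "BC"
        1f8b0804????????????????4243*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A plain gzip file fails the check, so something like `is_bgzf file.gtf.gz || echo "not bgzip-compressed"` flags the problem before tabix is run (`file.gtf.gz` is a placeholder).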

>
> I have to mention that Rsamtools is just great. Thanks a lot for it.
> Greetings
> Anita
>
> On Tue, 2011-08-23 at 09:14 -0700, Valerie Obenchain wrote:
>> Hi Anita,
>>
>> It looks like the download may not have worked. Check your gtfFn file to
>> see if the data are really there,
>>
>>       less Drosophila_melanogaster.BDGP5.25.62.gtf.gz
>>
>> Once you are sure of the download you may want to check the file for the
>> usual things -
>> (1) no comment lines starting with #
>> (2) the file is tab separated, not space separated
>>
>> Coming from Ensembl, these should not be a problem.
>>
>> Valerie
>>
>>
>> On 08/23/2011 07:02 AM, Anita Lerch wrote:
>>> Hi,
>>>
>>> I tried to stream a 'gtf' file from Ensembl with the Tabix methods.
>>> Creating the index file seems to work, but when I checked it with
>>> headerTabix(tbx)$seqnames I got character(0).
>>> Of course scanTabix() did not work either.
>>> I do not have this problem with the example file in the Rsamtools
>>> package.
>>> Does anybody have an explanation for this?
>>>
>>> Thanks in advance,
>>> Anita
>>>
>>>> library(Rsamtools)
>>>> url<- "ftp://ftp.ensembl.org/pub/release-62/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.25.62.gtf.gz"
>>>> gtfFn<- "Drosophila_melanogaster.BDGP5.25.62.gtf.gz"
>>>> download.file(url, gtfFn, "wget")
>>>> indexTabix(gtfFn, format="gff")
>>> [1] "Drosophila_melanogaster.BDGP5.25.62.gtf.gz.tbi"
>>>> tbx<- open(TabixFile(gtfFn))
>>>> headerTabix(tbx)
>>> $seqnames
>>> character(0)
>>>
>>> $indexColumns
>>>     seq start   end
>>>       1     4     5
>>>
>>> $skip
>>> [1] 0
>>>
>>> $comment
>>> [1] "#"
>>>
>>> $header
>>> character(0)
>>>
>>>> seqnamesTabix(tbx)
>>> character(0)
>>>> cat(yieldTabix(tbx, yieldSize=1L))
>>>> param<- GRanges(c("3L", "3R"), IRanges(c(1, 1), width=100000))
>>>> scanTabix(tbx, param=param)
>>> Error: scanTabix: '3L' not present in tabix index
>>>     path: /home_fmi/01/lerchani/workspace/Drosophila_melanogaster.BDGP5.25.62.gtf.gz
>>>
>>>> sessionInfo()
>>> R Under development (unstable) (2011-08-23 r56776)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C
>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] Rsamtools_1.5.51     Biostrings_2.21.9    GenomicRanges_1.5.28 IRanges_1.11.24
>>>
>>> loaded via a namespace (and not attached):
>>> [1] BSgenome_1.21.3     RCurl_1.6-9         rtracklayer_1.13.11 tools_2.14.0        XML_3.4-2           zlibbioc_0.1.7
>>>
>>
>


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


