[Bioc-devel] Rsamtools Reading TabixFile URL

Martin Morgan mtmorgan at fhcrc.org
Fri Dec 23 00:38:30 CET 2011


On 12/22/2011 03:00 PM, Dario Strbenac wrote:
> Hi,
>
> yieldTabix won't work when the file is a URL. Here is an example.
>
> library(Rsamtools)
> txTabix<- open(TabixFile("http://savantbrowser.com/data/hg18/hg18.refGene.gz"))

Specifically, I get

 > tbx = TabixFile("http://savantbrowser.com/data/hg18/hg18.refGene.gz")
 > open(tbx)
Error in open.TabixFile(tbx) : failed to open file
In addition: Warning message:
In open.TabixFile(tbx) :
   [khttp_connect_file] fail to open file (HTTP code: 301).

HTTP code 301 is 'Moved Permanently', i.e., a redirect rather than a
missing file, and open() evidently does not follow it. If I open the URL
in a browser I'm redirected to

   url = "http://genomesavant.com/savant//data/hg18/hg18.refGene.gz"

and things work out

   tbx = open(TabixFile(url))
   res <- yieldTabix(tbx)
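
To find the redirect target without a browser, something along these lines
might work. It is only a sketch: it assumes base R's curlGetHeaders() is
available in your R (it is not part of Rsamtools), final_location() is a
hypothetical helper written for this illustration, and the header lines in
the offline example are made up to mimic a 301 response.

```r
## Hypothetical helper (not part of Rsamtools): pull the target of the last
## 'Location:' header out of a vector of HTTP response header lines.
final_location <- function(headers) {
    loc <- grep("^location:", headers, ignore.case = TRUE, value = TRUE)
    if (length(loc) == 0L)
        return(NA_character_)              # no redirect reported
    last <- loc[length(loc)]
    trimws(sub("^location:[[:space:]]*", "", last, ignore.case = TRUE))
}

## Needs network access, so left commented out:
## hdrs <- curlGetHeaders("http://savantbrowser.com/data/hg18/hg18.refGene.gz",
##                        redirect = FALSE)
## tbx <- open(TabixFile(final_location(hdrs)))

## Offline illustration with made-up header lines
hdrs <- c("HTTP/1.1 301 Moved Permanently\r\n",
          "Location: http://genomesavant.com/savant//data/hg18/hg18.refGene.gz\r\n",
          "\r\n")
new_url <- final_location(hdrs)
```

A server may also send a relative Location, which this sketch does not
resolve against the original URL.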

> Also, I thought yieldSize was a maximum, not that it had to be that big :
>
>> txTabix<- TabixFile(system.file("extdata", "example.gtf.gz", package="Rsamtools"))
>> txGR<- yieldTabix(txTabix, yieldSize = Inf)
> Error: yield: negative length vectors are not allowed

The error message isn't very helpful, but R is trying to allocate an
infinite amount of space for the result. This causes an integer
overflow, reported as 'negative length vectors are not allowed'.
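
The coercion can be seen in isolation. This assumes yieldSize is coerced to
an integer before the allocation, which is an assumption about the internals
rather than something confirmed here; it only shows why Inf cannot serve as
a vector length.

```r
## Inf has no integer representation, so coercion gives NA (with a warning).
## In C, NA_integer_ is stored as the most negative int, which would be
## consistent with a 'negative length' complaint.
n <- suppressWarnings(as.integer(Inf))
inf_is_na <- is.na(n)

## A finite request coerces cleanly
ok <- as.integer(1e6)
```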

yieldSize is the maximum number of records read by each call to
yieldTabix; a file with fewer than yieldSize records is read in full.

>> txGR<- yieldTabix(txTabix, yieldSize = .Machine$integer.max)
> Error: yield: cannot allocate vector of size 16.0 Gb
>
> What is the most efficient way to read in all records, without over-allocating RAM ?

yieldSize is the number of lines parsed, so it is equivalent to an
allocation of character(yieldSize). The maximum size allowed by R is
.Machine$integer.max.
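
The 16.0 Gb in the error message can be reproduced with a little
arithmetic, assuming a 64-bit build of R where each element of a character
vector is an 8-byte pointer:

```r
## Assumes a 64-bit build: each element of a character vector is an
## 8-byte pointer; the strings themselves need additional memory.
bytes_per_element <- 8
gb <- .Machine$integer.max * bytes_per_element / 2^30
round(gb, 1)
```

So the allocation fails before a single record has been read.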

I'm not sure what a good rule of thumb is for VCF files. Each record
could easily be 1000 characters, and you'd need memory to manipulate the
result, so I'd say a yieldSize of at most mem.size / 1000 / 10.
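
As a rough sketch of that rule of thumb, reading mem.size as the memory you
are willing to use, in bytes (that reading, the helper name, and its
argument names are assumptions made for illustration):

```r
## Hypothetical helper: mem_bytes is the memory you are willing to use (bytes),
## bytes_per_record the guessed record size, overhead the factor of 10
## left for manipulating the result.
suggest_yield_size <- function(mem_bytes, bytes_per_record = 1000,
                               overhead = 10) {
    n <- floor(mem_bytes / bytes_per_record / overhead)
    ## never exceed the largest length R's integer type can express
    as.integer(min(n, .Machine$integer.max))
}

n_records <- suggest_yield_size(8e9)   # 8e9 bytes of memory
```

With 8e9 bytes this suggests 800000 records per chunk.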

But I'm not sure you gain a lot by having very large input chunks. The
paradigm for processing the whole file is

   tbx = open(TabixFile(url))
   while (length(res <- yieldTabix(tbx))) {
       ## work on res, one chunk of records at a time
   }
   close(tbx)
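
The same loop shape can be tried without Rsamtools or a tabix file. The
sketch below uses base R's readLines() on a text connection as a stand-in
for yieldTabix(); process_in_chunks() and the three records are invented
for illustration. Only a running summary is kept, so memory use is bounded
by the chunk size rather than the file size.

```r
## Hypothetical helper: apply 'fun' to successive chunks of at most
## 'chunk_size' lines and add up the per-chunk results.
process_in_chunks <- function(con, chunk_size, fun) {
    total <- 0L
    ## readLines() returns character(0) at end of input, ending the loop
    while (length(chunk <- readLines(con, n = chunk_size))) {
        total <- total + fun(chunk)
    }
    total
}

## Made-up records standing in for the lines of a tabix-indexed file
con <- textConnection(c("chr1\t1\t10", "chr1\t20\t30", "chr2\t5\t15"))
n_chr1 <- process_in_chunks(con, 2L,
                            function(x) sum(startsWith(x, "chr1\t")))
close(con)
```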

Martin

>
> --------------------------------------
> Dario Strbenac
> Research Assistant
> Cancer Epigenetics
> Garvan Institute of Medical Research
> Darlinghurst NSW 2010
> Australia
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


