[Bioc-devel] scanTabix coercion to data.frame

Thu Apr 12 18:07:00 CEST 2012

On 04/12/2012 06:54 AM, Michael Lawrence wrote:
> You can use rtracklayer to import tabix files directly. If it's GFF or
> BED, you can just use import(). For arbitrary tabular files, first cast
> the path to a TabixFile, then pass it to import(). That last one is not
> well tested. It uses the header information to know the starts, ends, etc.

A bit of a thread hijack, but I've been toying with adding
yieldSize to RsamtoolsFile, so that

   tbx = open(TabixFile("foo.tbx", yieldSize=10000L))
   while (length(res <- scanTabix(tbx, param=GRanges()))) {
      ## do stuff
   }
   close(tbx)

would iterate through the file returning yieldSize records at a time, 
and similarly for BamFile, etc.
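
To make the iteration pattern concrete, here is a minimal sketch in base R
only. It does not use the proposed yieldSize argument or any Rsamtools call;
instead it assumes a plain tab-delimited text file and uses readLines() on an
open connection as a stand-in for "return yieldSize records per call". The
file contents, column names and column classes are made up for illustration.

```r
## Base-R stand-in for the proposed yieldSize iteration: read a
## tab-delimited text file in chunks of `yieldSize` records and coerce
## each chunk to a data.frame. No Rsamtools calls are used here.
path <- tempfile(fileext = ".txt")                 # hypothetical example file
writeLines(sprintf("chr1\t%d\t%d\t%.1f", 1:25, 1:25 + 10L, (1:25) / 2), path)

yieldSize  <- 10L
colClasses <- c("character", "integer", "integer", "numeric")
colNames   <- c("chrom", "start", "end", "score")  # illustrative names

con <- file(path, open = "r")
chunks <- list()
## readLines() returns character(0) once the connection is exhausted,
## which ends the loop, just as an empty scanTabix() result would.
while (length(recs <- readLines(con, n = yieldSize))) {
    ## "do stuff": here, parse the chunk with explicit column classes
    chunks[[length(chunks) + 1L]] <-
        read.table(text = recs, sep = "\t", colClasses = colClasses,
                   col.names = colNames, stringsAsFactors = FALSE)
}
close(con)

df <- do.call(rbind, chunks)
```

Supplying colClasses up front avoids type guessing on every chunk, which
matters when the same parse is repeated many times.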

My two reservations are:

(a) yieldSize only sort of makes sense when there are multiple GRanges,
because a particular range would be filled across different calls. I don't
really want to get into alternative ways of iterating (e.g.,
yieldBy="range"); this gets too messy, and it's all in C.

(b) yieldSize seems naturally to be a property of TabixFile, but the
existing convention is to have a separate param object. One alternative
would be to create a ScanTabixParam with ranges and yieldSize; another
would be to move param info into TabixFile (and similarly for other
*File classes).

Martin

>
> Michael
>
> On Thu, Apr 12, 2012 at 6:10 AM, Hahne, Florian
> <florian.hahne at novartis.com> wrote:
>
>     Sean, Martin, thanks for the suggestions. I guess a combination of the
>     two would work well for me. I create my own tabix files and could
>     certainly stick the type information in the header. And I wasn't aware
>     of textConnection(), which seems to be performant enough to do what I
>     want. At least it is much better than my manual parsing...
>     One problem remains, though: the tabix files are being created from
>     within R, and I don't think there is any support to add arbitrary
>     header lines available yet. Or is there?
>
>     Florian
>
>
>     Florian Hahne
>     Novartis Institute For Biomedical Research
>     Translational Sciences / Preclinical Safety / PCS Informatics
>     Expert Data Integration and Modeling Bioinformatics
>     CHBS, WKL-135.2.26
>     Novartis Institute For Biomedical Research, Werk Klybeck
>     Klybeckstrasse 141
>     CH-4057 Basel
>     Switzerland
>     Phone: +41 61 6967127
>     Email : florian.hahne at novartis.com
>
>     On 4/12/12 2:08 PM, "Sean Davis" <sdavis2 at mail.nih.gov> wrote:
>
>      >On Thu, Apr 12, 2012 at 7:57 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>      >> On 04/12/2012 01:19 AM, Hahne, Florian wrote:
>      >>>
>      >>> Hi all,
>      >>> I frequently get into the situation that I import data from a Tabix
>      >>> file using scanTabix and get a list of character vectors which I
>      >>> first need to split back into columns using strsplit, followed by
>      >>> some type coercion and lapply/sapply to actually get a list of
>      >>> data.frames, which is what I'd really want out in the first place.
>      >>> I may be missing something here, but wouldn't it be possible to ask
>      >>> scanTabix for a list of data.frames directly, and maybe even
>      >>> provide a vector of data types to coerce into, a la 'colClasses' in
>      >>> read.table? It just seems to me that these operations could be done
>      >>> much more efficiently on the C level.
>      >>
>      >>
>      >> It's definitely poorly developed but one doesn't really want to
>      >> re-invent too much of the parsing wheel. Does
>      >>
>      >>  res <- scanTabix("/foo.tbx")
>      >>  read.table(textConnection(res[[1]]), header=TRUE, sep="\t")
>      >>
>      >> do the trick in a reasonably performant way? Obviously less than
>      >> ideal, with the data represented as character vectors and then as
>      >> data.frame. A better solution (colClasses ==> data.frame) wouldn't
>      >> be impossible, but guessing column types would be a lot of
>      >> redundant work.
>      >
>      >Since tabix allows arbitrary header lines, one could store metadata in
>      >the first few lines and use that to store column info and classes.
>      >One can get at the header using Rsamtools
>      >headerTabix(TabixFile('foo.tbx')).  This is getting more toward
>      >developer-land than end-user, though, since the tabix file would need
>      >to be created with these uses in mind.
>      >
>      >Sean
>      >
>      >
>      >>> Thanks,
>      >>> Florian
>      >>>
>      >>>
>      >>> _______________________________________________
>      >>> Bioc-devel at r-project.org mailing list
>      >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>      >>
>

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793