[Bioc-devel] scanTabix coercion to data.frame

Sean Davis sdavis2 at mail.nih.gov
Thu Apr 12 14:08:01 CEST 2012

On Thu, Apr 12, 2012 at 7:57 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 04/12/2012 01:19 AM, Hahne, Florian wrote:
>> Hi all,
>> I frequently get into the situation that I import data from a Tabix file
>> using scanTabix and get a list of character vectors which I first need to
>> split back into columns using strsplit, followed by some type coercion and
>> lapply/sapply to actually get a list of data.frames which is what I'd
>> really want out in the first place. I may be missing something here, but
>> wouldn't it be possible to ask scanTabix for a list of data.frames
>> directly, and maybe even providing a vector of data types to coerce into,
>> a la 'colClasses' in read.table? It just seems to me that these operations
>> could be done much more efficiently on the C level.
> It's definitely poorly developed but one doesn't really want to re-invent
> too much of the parsing wheel. Does
>  res <- scanTabix("/foo.tbx")
>  read.table(textConnection(res[[1]]), header=TRUE, sep="\t")
> do the trick in a reasonably performant way? Obviously less than ideal, with
> the data represented as character vectors and then as data.frame. A better
> solution (colClasses ==> data.frame) wouldn't be impossible, but guessing
> column types would be a lot of redundant work.
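
To make the coercion concrete, here is a minimal sketch of going from
the list of character vectors that scanTabix returns to a list of
data.frames with declared column types. It uses only base R: the
records are simulated rather than read with scanTabix, and the column
names, classes, and the helper toDataFrame are made up for
illustration.

```r
## Simulated stand-in for a scanTabix() result: one character vector
## per queried range, one tab-delimited record per element.
res <- list(
    "chr1:1-1000" = c("chr1\t100\t200\t0.5", "chr1\t300\t400\t1.25"),
    "chr2:1-1000" = "chr2\t50\t80\t3"
)

## Hypothetical column names (names) and classes (values) for this file.
cols <- c(chrom = "character", start = "integer",
          end = "integer", score = "numeric")

## Hypothetical helper: parse one range's records into a data.frame.
toDataFrame <- function(lines, colClasses) {
    if (length(lines) == 0L) {
        ## A range with no records: build a zero-row data.frame
        ## that still has the declared columns and types.
        empty <- lapply(colClasses, vector, length = 0L)
        return(as.data.frame(empty, stringsAsFactors = FALSE))
    }
    read.table(text = lines, sep = "\t", header = FALSE,
               col.names = names(colClasses),
               colClasses = unname(colClasses),
               quote = "", comment.char = "",
               stringsAsFactors = FALSE)
}

dfs <- lapply(res, toDataFrame, colClasses = cols)
```

Supplying colClasses up front avoids read.table's type guessing, but
the data still pass through character vectors first, so this does not
address the efficiency concern.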

Since tabix allows arbitrary header lines, one could record column
names and classes as metadata in the first few lines of the file.
The header is available through Rsamtools:
headerTabix(TabixFile('foo.tbx')).  This is getting more toward
developer-land than end-user, though, since the tabix file would need
to be created with these uses in mind.
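
As a rough sketch of that idea, assume a made-up convention in which
the file's header carries '#colNames=' and '#colClasses=' lines. This
is not an established tabix or Rsamtools format, and headerField is a
hypothetical helper. The header and records are simulated in base R
here; in practice they would come from headerTabix and scanTabix.

```r
## Stand-in for headerTabix(TabixFile("foo.tbx"))$header, using a
## hypothetical '#key=value' convention chosen for this example.
hdr <- c("#colNames=chrom,start,end,score",
         "#colClasses=character,integer,integer,numeric")

## Hypothetical helper: return the comma-separated values of the
## single header line that starts with '#<key>='.
headerField <- function(header, key) {
    prefix <- paste0("#", key, "=")
    hit <- header[startsWith(header, prefix)]
    if (length(hit) != 1L)
        stop("expected exactly one '", prefix, "' header line")
    strsplit(substring(hit, nchar(prefix) + 1L), ",", fixed = TRUE)[[1]]
}

colNames <- headerField(hdr, "colNames")
colClasses <- headerField(hdr, "colClasses")

## Stand-in for one element of a scanTabix() result.
records <- c("chr1\t100\t200\t0.5", "chr1\t300\t400\t1.25")

df <- read.table(text = records, sep = "\t", header = FALSE,
                 col.names = colNames, colClasses = colClasses,
                 quote = "", comment.char = "")
```

The file's creator has to write those header lines when building the
file, which is exactly the developer-side burden mentioned above.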


>> Thanks,
>> Florian
>> Florian Hahne
>> Novartis Institute For Biomedical Research
>> Translational Sciences / Preclinical Safety / PCS Informatics
>> Expert Data Integration and Modeling Bioinformatics
>> CHBS, WKL-135.2.26
>> Novartis Institute For Biomedical Research, Werk Klybeck
>> Klybeckstrasse 141
>> CH-4057 Basel
>> Switzerland
>> Phone: +41 61 6967127
>> Email : florian.hahne at novartis.com
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> Location: M1-B861
> Telephone: 206 667-2793
