[Bioc-devel] scanTabix coercion to data.frame

Thu Apr 12 15:10:55 CEST 2012

Sean, Martin, thanks for the suggestions. I guess a combination of the two
would work well for me. I create my own tabix files and could certainly
stick the type information in the header. And I wasn't aware of
textConnection(), which seems to be performant enough to do what I want.
At least it is much better than my manual parsing...
One problem remains, though: the tabix files are created from within
R, and I don't think there is any support yet for adding arbitrary header
lines. Or is there?
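
For the archive, a minimal sketch of how the two suggestions might be
combined. It uses only base R: `res` is a hand-written stand-in for what
scanTabix() returns (a list with one character vector of records per range),
and the "#cols=" header line is a hypothetical convention of my own, not
anything tabix or Rsamtools defines. In practice the header line would
presumably be retrieved with headerTabix(), as Sean describes.

```r
# Stand-in for scanTabix() output: one character vector of tab-delimited
# records per queried range (values here are made up for illustration).
res <- list(
  "chr1:1-100" = c("chr1\t10\t0.5", "chr1\t20\t1.5"),
  "chr2:1-100" = c("chr2\t5\t2.0")
)

# Hypothetical header convention (an assumption, not a tabix standard):
# a single comment line holding comma-separated "name:class" pairs.
hdr <- "#cols=chrom:character,pos:integer,score:numeric"

# Split the header line into column names and column classes.
spec <- strsplit(sub("^#cols=", "", hdr), ",", fixed = TRUE)[[1]]
parts <- strsplit(spec, ":", fixed = TRUE)
colNames <- vapply(parts, `[`, character(1), 1)
colClasses <- vapply(parts, `[`, character(1), 2)

# Parse the records of one range into a typed data.frame.
toDataFrame <- function(lines) {
  con <- textConnection(lines)
  on.exit(close(con))
  read.table(con, header = FALSE, sep = "\t", col.names = colNames,
             colClasses = colClasses, stringsAsFactors = FALSE)
}

# One data.frame per range, with names carried over from the input list.
dfs <- lapply(res, toDataFrame)
```

A range with no records yields an empty character vector, which read.table()
rejects, so real code would need to handle that case separately.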

Florian

Florian Hahne
Novartis Institute For Biomedical Research
Translational Sciences / Preclinical Safety / PCS Informatics
Expert Data Integration and Modeling Bioinformatics
CHBS, WKL-135.2.26
Novartis Institute For Biomedical Research, Werk Klybeck
Klybeckstrasse 141
CH-4057 Basel
Switzerland
Phone: +41 61 6967127
Email : florian.hahne at novartis.com

On 4/12/12 2:08 PM, "Sean Davis" <sdavis2 at mail.nih.gov> wrote:

>On Thu, Apr 12, 2012 at 7:57 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 04/12/2012 01:19 AM, Hahne, Florian wrote:
>>>
>>> Hi all,
>>> I frequently get into the situation that I import data from a Tabix file
>>> using scanTabix and get a list of character vectors which I first need to
>>> split back into columns using strsplit, followed by some type coercion and
>>> lapply/sapply to actually get a list of data.frames which is what I'd
>>> really want out in the first place. I may be missing something here, but
>>> wouldn't it be possible to ask scanTabix for a list of data.frames
>>> directly, and maybe even providing a vector of data types to coerce into,
>>> a la 'colClasses' in read.table? It just seems to me that these operations
>>> could be done much more efficiently on the C level.
>>
>>
>> It's definitely poorly developed but one doesn't really want to re-invent
>> too much of the parsing wheel. Does
>>
>>  res <- scanTabix("/foo.tbx")
>>  # scanTabix returns a list of character vectors; parse one element
>>  read.table(textConnection(res[[1]]), header=TRUE, sep="\t")
>>
>> do the trick in a reasonably performant way? Obviously less than ideal, with
>> the data represented as character vectors and then as data.frame. A better
>> solution (colClasses ==> data.frame) wouldn't be impossible, but guessing
>> column types would be a lot of redundant work.
>
>Since tabix allows arbitrary header lines, one could store metadata in
>the first few lines and use that to store column info and classes.
>One can get at the header using Rsamtools
>headerTabix(TabixFile('foo.tbx')).  This is getting more toward
>developer-land than end-user, though, since the tabix file would need
>to be created with these uses in mind.
>
>Sean
>
>
>>> Thanks,
>>> Florian
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793