[Bioc-devel] scanTabix coercion to data.frame

Sean Davis sdavis2 at mail.nih.gov
Thu Apr 12 16:49:11 CEST 2012


On Thu, Apr 12, 2012 at 9:54 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
> You can use rtracklayer to import tabix files directly. If it's GFF or BED,
> you can just use import(). For arbitrary tabular files, first cast the path
> to a TabixFile, then pass it to import(). That last one is not well tested.
> It uses the header information to know the starts, ends, etc.

Thanks, Michael.  I had forgotten to mention the detail that, more
generally, the chromosome, start, and end are known and stored with
the tabix index.

Sean
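
As a rough illustration of the point above: since the index already records which columns hold the sequence name, start and end, those three columns need no extra metadata. The sketch below assumes the list shape documented for Rsamtools' headerTabix(), namely an `indexColumns` element with entries named seq, start and end. The helper `name_index_columns`, the mock header and the mock records are made up for illustration only.

```r
# Hypothetical helper: label the indexed columns of a parsed record table.
# `hdr` is assumed to look like the list returned by Rsamtools::headerTabix(),
# i.e. to carry an `indexColumns` element with entries "seq", "start", "end".
name_index_columns <- function(df, hdr) {
  idx <- hdr$indexColumns
  names(df)[idx[c("seq", "start", "end")]] <- c("chrom", "start", "end")
  df
}

# Mock header and records, standing in for headerTabix()/scanTabix() output.
mock_hdr <- list(indexColumns = c(seq = 1L, start = 2L, end = 3L))
mock_records <- c("chr1\t100\t200\t0.5", "chr1\t300\t400\t1.5")

# Parse the tab-separated records, then name the three indexed columns.
df <- read.table(text = mock_records, sep = "\t", stringsAsFactors = FALSE)
df <- name_index_columns(df, mock_hdr)
print(names(df))

# With a real file (the path is a placeholder) one would instead start from:
#   hdr <- Rsamtools::headerTabix(Rsamtools::TabixFile("foo.tbx"))
```

Any remaining columns keep read.table's default names (V4, ...), so their names and types would still have to come from somewhere else, such as a header line.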


> Michael
>
> On Thu, Apr 12, 2012 at 6:10 AM, Hahne, Florian
> <florian.hahne at novartis.com> wrote:
>
>> Sean, Martin, thanks for the suggestions. I guess a combination of the two
>> would work well for me. I create my own tabix files and could certainly
>> stick the type information in the header. And I wasn't aware of
>> textConnection(), which seems to be performant enough to do what I want.
>> At least it is much better than my manual parsing...
>> One problem remains, though: the tabix files are being created from within
>> R, and I don't think there is any support to add arbitrary header lines
>> available yet. Or is there?
>>
>> Florian
>>
>>
>> Florian Hahne
>> Novartis Institute For Biomedical Research
>> Translational Sciences / Preclinical Safety / PCS Informatics
>> Expert Data Integration and Modeling Bioinformatics
>> CHBS, WKL-135.2.26
>> Novartis Institute For Biomedical Research, Werk Klybeck
>> Klybeckstrasse 141
>> CH-4057 Basel
>> Switzerland
>> Phone: +41 61 6967127
>> Email : florian.hahne at novartis.com
>>
>> On 4/12/12 2:08 PM, "Sean Davis" <sdavis2 at mail.nih.gov> wrote:
>>
>> >On Thu, Apr 12, 2012 at 7:57 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> >> On 04/12/2012 01:19 AM, Hahne, Florian wrote:
>> >>>
>> >>> Hi all,
>> >>> I frequently get into the situation that I import data from a Tabix file
>> >>> using scanTabix and get a list of character vectors which I first need to
>> >>> split back into columns using strsplit, followed by some type coercion and
>> >>> lapply/sapply to actually get a list of data.frames which is what I'd
>> >>> really want out in the first place. I may be missing something here, but
>> >>> wouldn't it be possible to ask scanTabix for a list of data.frames
>> >>> directly, and maybe even providing a vector of data types to coerce into,
>> >>> a la 'colClasses' in read.table? It just seems to me that these operations
>> >>> could be done much more efficiently on the C level.
>> >>
>> >>
>> >> It's definitely poorly developed but one doesn't really want to re-invent
>> >> too much of the parsing wheel. Does
>> >>
>> >>  res <- scanTabix("/foo.tbx")
>> >>  read.table(textConnection(res[[1]]), header=TRUE, sep="\t")
>> >>
>> >> do the trick in a reasonably performant way? Obviously less than ideal, with
>> >> the data represented as character vectors and then as data.frame. A better
>> >> solution (colClasses ==> data.frame) wouldn't be impossible, but guessing
>> >> column types would be a lot of redundant work.
>> >
>> >Since tabix allows arbitrary header lines, one could store metadata in
>> >the first few lines and use that to store column info and classes.
>> >One can get at the header using Rsamtools
>> >headerTabix(TabixFile('foo.tbx')).  This is getting more toward
>> >developer-land than end-user, though, since the tabix file would need
>> >to be created with these uses in mind.
>> >
>> >Sean
>> >
>> >
>> >>> Thanks,
>> >>> Florian
>> >>>
>> >>> _______________________________________________
>> >>> Bioc-devel at r-project.org mailing list
>> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >>
>> >>
>> >>
>> >> --
>> >> Computational Biology
>> >> Fred Hutchinson Cancer Research Center
>> >> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>> >>
>> >> Location: M1-B861
>> >> Telephone: 206 667-2793
>> >>
>>

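The two suggestions in the thread (parse the records with read.table, and keep the column classes in a header line) could be combined roughly as follows. This is a base-R sketch on mock data, not a description of what Rsamtools does: the `#classes=` header convention and the helper `records_to_df` are invented here, and `mock_scan` merely stands in for the list of character vectors that scanTabix returns.

```r
# Hypothetical header line of the kind one might write when creating the file.
mock_header <- "#classes=character,integer,integer,numeric"

# Mock stand-in for scanTabix() output: one character vector per region.
mock_scan <- list(
  "chr1:1-1000" = c("chr1\t100\t200\t0.5", "chr1\t300\t400\t1.5"),
  "chr2:1-1000" = c("chr2\t10\t20\t2.5")
)

# Recover the colClasses vector from the (assumed) header convention.
col_classes <- strsplit(sub("^#classes=", "", mock_header), ",", fixed = TRUE)[[1]]

# Parse one region's records; an empty region yields NULL rather than an error.
records_to_df <- function(records, col_classes) {
  if (length(records) == 0L) return(NULL)
  read.table(text = records, sep = "\t", header = FALSE,
             colClasses = col_classes, stringsAsFactors = FALSE)
}

# One data.frame per region, with column types fixed up front.
dfs <- lapply(mock_scan, records_to_df, col_classes = col_classes)
str(dfs[["chr1:1-1000"]])
```

Supplying colClasses up front avoids the type guessing mentioned in the thread, at the cost of requiring the file's creator to write the header line in an agreed format.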

