[Bioc-devel] scanTabix coercion to data.frame

Dan Tenenbaum dtenenba at fhcrc.org
Mon Apr 16 20:41:54 CEST 2012


Hi Florian,

On Mon, Apr 16, 2012 at 12:31 AM, Hahne, Florian
<florian.hahne at novartis.com> wrote:
>
> My bad, I updated all packages before trying this and never checked what
> actually happened.
> The odd thing is that I am running R-devel and have the latest
> BiocInstaller 1.5.6 installed, but I still only get the Bioc release
> packages:
> > sessionInfo()
> R Under development (unstable) (2012-04-16 r59045)
> Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] rtracklayer_1.16.1  GenomicRanges_1.8.3 IRanges_1.14.2
> [4] BiocGenerics_0.2.0  BiocInstaller_1.5.6
>
> loaded via a namespace (and not attached):
> [1] Biostrings_2.24.1 bitops_1.0-4.1    BSgenome_1.24.0   RCurl_1.91-1
> [5] Rsamtools_1.9.2   stats4_2.16.0     tools_2.16.0      XML_3.9-4
> [9] zlibbioc_1.2.0
>
> I think I followed Dan's instructions carefully. Any idea why this is not
> working for me?
>

This is fixed in BiocInstaller 1.5.7.

Dan


> The little bit of debugging I tried revealed that biocinstallRepos() does
> not give me the right repository path:
> > biocinstallRepos()
>                                                    BioCsoft
>            "http://www.bioconductor.org/packages/2.10/bioc"
>                                                        CRAN
>                                     "http://cran.fhcrc.org"
>                                                     BioCann
> "http://www.bioconductor.org/packages/2.10/data/annotation"
>                                                     BioCexp
> "http://www.bioconductor.org/packages/2.10/data/experiment"
>                                                   BioCextra
>           "http://www.bioconductor.org/packages/2.10/extra"
>
> Now in there I find:
> > BiocInstaller:::biocinstallRepos
> function (siteRepos = character())
> {
>     .biocinstallRepos(siteRepos = siteRepos, devel = .isDevel())
> }
> <environment: namespace:BiocInstaller>
>
> And .isDevel is defined as
>
> > BiocInstaller:::.isDevel
> function ()
> {
>     isOdd <- (packageVersion("BiocInstaller")$minor%%2L) == 1L
>     isOdd && (R.version$status == "" || R.version$status == "Patched")
> }
> <environment: namespace:BiocInstaller>
>
> I may be wrong here, but how can I ever get TRUE unless I am running R
> Patched or whatever R.version$status=="" refers to?  Since I am running R
> devel built from svn I have
> > R.version$status
> [1] "Under development (unstable)"
>
> So I will always and for all eternity get .isDevel()==FALSE…
>
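The logic of the quoted check can be reproduced in a few lines of standalone R. This is only a sketch: `is_devel_sketch()` is a hypothetical helper, not BiocInstaller code, and it takes the minor version and the status string as arguments so that it runs without BiocInstaller installed. The second helper shows one possible way to also accept an R-devel build; it is an assumption for illustration and not necessarily what BiocInstaller 1.5.7 does.

```r
# Hypothetical restatement of the quoted .isDevel() logic; the minor
# version and the R status string are passed in explicitly.
is_devel_sketch <- function(minor, status) {
    is_odd <- (minor %% 2L) == 1L
    is_odd && (status == "" || status == "Patched")
}

# Assumed variant (not the actual BiocInstaller fix): also accept the
# status string reported by an R-devel build.
is_devel_variant <- function(minor, status) {
    is_odd <- (minor %% 2L) == 1L
    is_odd && status %in% c("", "Patched", "Under development (unstable)")
}

# BiocInstaller 1.5.6 has minor version 5; R-devel reports this status.
is_devel_sketch(5L, "Under development (unstable)")   # FALSE
is_devel_sketch(5L, "Patched")                        # TRUE
is_devel_variant(5L, "Under development (unstable)")  # TRUE
```

With an odd minor version, the original condition only passes for a release or patched R, which matches the behaviour reported above.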
> Florian
>
> Florian Hahne
> Novartis Institute For Biomedical Research
> Translational Sciences / Preclinical Safety / PCS Informatics
> Expert Data Integration and Modeling Bioinformatics
> CHBS, WKL-135.2.26
> Novartis Institute For Biomedical Research, Werk Klybeck
> Klybeckstrasse 141
> CH-4057 Basel
> Switzerland
> Phone: +41 61 6967127
> Email : florian.hahne at novartis.com
>
>
> From: Michael Lawrence <lawrence.michael at gene.com>
> Date: Fri, 13 Apr 2012 09:19:20 -0700
>
> To: NIBR <florian.hahne at novartis.com>
> Cc: Michael Lawrence <lawrence.michael at gene.com>, Sean Davis
> <sdavis2 at mail.nih.gov>, Martin Morgan <mtmorgan at fhcrc.org>,
> "bioc-devel at r-project.org" <bioc-devel at r-project.org>
> Subject: Re: [Bioc-devel] scanTabix coercion to data.frame
>
>
>
> On Fri, Apr 13, 2012 at 8:15 AM, Hahne, Florian
> <florian.hahne at novartis.com> wrote:
>>
>> Yes, I tried this:
>> ff <-
>> TabixFile("/CHBS/apps/itox/data/project_data_repository/1/1/project.tbx")
>> foo <- import(ff)
>> Error: evaluation nested too deeply: infinite recursion /
>> options(expressions=)?
>>
>> And this:
>>
>>
>> foo <- import(ff, which=GRanges(seqnames="chrX", ranges=IRanges(start=1,
>> end=1e8)))
>> Error: evaluation nested too deeply: infinite recursion /
>> options(expressions=)?
>>
>> And then I gave up :-)
>>
>
>
> Ok, well I said the devel version, i.e., 1.17.1, not 1.16.1.
>
>>
>> > sessionInfo()
>> R Under development (unstable) (2012-04-03 r58904)
>> Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] rtracklayer_1.16.1  Rsamtools_1.9.2     Biostrings_2.24.1
>> [4] GenomicRanges_1.8.3 IRanges_1.14.2      BiocGenerics_0.2.0
>> [7] BiocInstaller_1.4.3
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-4.1  BSgenome_1.24.0 RCurl_1.91-1    stats4_2.16.0
>> [5] tools_2.16.0    XML_3.9-4       zlibbioc_1.2.0
>>
>> From:  Michael Lawrence <lawrence.michael at gene.com>
>> Date:  Thu, 12 Apr 2012 10:07:31 -0700
>> To:  NIBR <florian.hahne at novartis.com>
>> Cc:  Michael Lawrence <lawrence.michael at gene.com>, Sean Davis
>> <sdavis2 at mail.nih.gov>, Martin Morgan <mtmorgan at fhcrc.org>,
>> "bioc-devel at r-project.org" <bioc-devel at r-project.org>
>> Subject:  Re: [Bioc-devel] scanTabix coercion to data.frame
>>
>>
>> Did you try the latest devel version?
>>
>> On Thu, Apr 12, 2012 at 9:29 AM, Hahne, Florian
>> <florian.hahne at novartis.com> wrote:
>>
>> Thanks, I gave it a shot and got this:
>> Error: evaluation nested too deeply: infinite recursion /
>> options(expressions=)?
>>
>>
>> Guess I'll stick with scanTabix for now  :-)
>> Florian
>>
>> From: Michael Lawrence <lawrence.michael at gene.com>
>> Date: Thu, 12 Apr 2012 06:54:19 -0700
>> To: NIBR <florian.hahne at novartis.com>
>> Cc: Sean Davis <sdavis2 at mail.nih.gov>, Martin Morgan
>> <mtmorgan at fhcrc.org>, "bioc-devel at r-project.org"
>> <bioc-devel at r-project.org>
>> Subject: Re: [Bioc-devel] scanTabix coercion to data.frame
>>
>>
>> You can use rtracklayer to import tabix files directly. If it's GFF or
>> BED, you can just use import(). For arbitrary tabular files, first cast
>> the path to a TabixFile, then pass it to import(). That last one is not
>> well tested. It uses the header information to know the starts, ends, etc.
>>
>> Michael
>>
>> On Thu, Apr 12, 2012 at 6:10 AM, Hahne, Florian
>> <florian.hahne at novartis.com> wrote:
>>
>> Sean, Martin, thanks for the suggestions. I guess a combination of the two
>> would work well for me. I create my own tabix files and could certainly
>> stick the type information in the header. And I wasn't aware of
>> textConnection(), which seems to be performant enough to do what I want.
>> At least it is much better than my manual parsing...
>> One problem remains, though: the tabix files are being created from within
>> R, and I don't think there is any support to add arbitrary header lines
>> available yet. Or is there?
>>
>> Florian
>>
>> On 4/12/12 2:08 PM, "Sean Davis" <sdavis2 at mail.nih.gov> wrote:
>>
>> >On Thu, Apr 12, 2012 at 7:57 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> >> On 04/12/2012 01:19 AM, Hahne, Florian wrote:
>> >>>
>> >>> Hi all,
>> >>> I frequently get into the situation that I import data from a Tabix file
>> >>> using scanTabix and get a list of character vectors which I first need to
>> >>> split back into columns using strsplit, followed by some type coercion and
>> >>> lapply/sapply to actually get a list of data.frames, which is what I'd
>> >>> really want in the first place. I may be missing something here, but
>> >>> wouldn't it be possible to ask scanTabix for a list of data.frames
>> >>> directly, and maybe even provide a vector of data types to coerce into,
>> >>> a la 'colClasses' in read.table? It just seems to me that these operations
>> >>> could be done much more efficiently on the C level.
>> >>
>> >>
>> >> It's definitely poorly developed but one doesn't really want to re-invent
>> >> too much of the parsing wheel. Does
>> >>
>> >>  res <- scanTabix("/foo.tbx")
>> >>  read.table(textConnection(res), header=TRUE, sep="\t")
>> >>
>> >> do the trick in a reasonably performant way? Obviously less than ideal, with
>> >> the data represented as character vectors and then as data.frame. A better
>> >> solution (colClasses ==> data.frame) wouldn't be impossible, but guessing
>> >> column types would be a lot of redundant work.
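A minimal base-R sketch of the textConnection() route with explicit column classes. The `res` list below is made-up data standing in for what scanTabix returns (per the original post, a list of character vectors, one per range); the column names and classes are assumptions for illustration. The records are assumed to carry no header line, so header = FALSE is used, and each list element is parsed separately.

```r
# Made-up stand-in for scanTabix() output: one character vector of
# tab-delimited records per queried range.
res <- list(
    "chr1:1-100" = c("chr1\t10\t20\t0.5", "chr1\t30\t40\t1.5"),
    "chr2:1-100" = c("chr2\t5\t15\t2.0")
)

# Assumed column layout; adapt to the actual file.
col_names   <- c("chrom", "start", "end", "score")
col_classes <- c("character", "integer", "integer", "numeric")

# Parse one list element; supplying colClasses avoids type guessing.
parse_records <- function(lines) {
    con <- textConnection(lines)
    on.exit(close(con))
    read.table(con, header = FALSE, sep = "\t",
               col.names = col_names, colClasses = col_classes,
               stringsAsFactors = FALSE)
}

# One data.frame per range, keeping the range names.
dfs <- lapply(res, parse_records)
str(dfs[[1]])
```

The data still pass through character vectors first, so this only removes the manual strsplit and coercion steps, not the intermediate copy.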
>> >
>> >Since tabix allows arbitrary header lines, one could store metadata in
>> >the first few lines and use that to store column info and classes.
>> >One can get at the header using Rsamtools
>> >headerTabix(TabixFile('foo.tbx')).  This is getting more toward
>> >developer-land than end-user, though, since the tabix file would need
>> >to be created with these uses in mind.
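One way to act on this idea, sketched in base R. The `#colNames=` and `#colClasses=` header keys are a hypothetical convention invented for this example (tabix does not define them), and `hdr` is made-up data standing in for the header lines one would retrieve with headerTabix(); only the parsing step is shown.

```r
# Made-up header lines; in practice these would come from the tabix
# file's header. The "#colNames=" / "#colClasses=" keys are an invented
# convention for this sketch.
hdr <- c("#colNames=chrom,start,end,score",
         "#colClasses=character,integer,integer,numeric")

# Extract the comma-separated values stored under a given key.
header_field <- function(header, key) {
    prefix <- paste0("#", key, "=")
    line <- header[substr(header, 1L, nchar(prefix)) == prefix]
    if (length(line) != 1L)
        stop("expected exactly one header line for key: ", key)
    strsplit(substring(line, nchar(prefix) + 1L), ",", fixed = TRUE)[[1]]
}

col_names   <- header_field(hdr, "colNames")
col_classes <- header_field(hdr, "colClasses")
```

The resulting vectors could then be passed to read.table() as `col.names` and `colClasses`.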
>> >
>> >Sean
>> >
>> >
>> >>> Thanks,
>> >>> Florian
>> >>>
>> >>> _______________________________________________
>> >>> Bioc-devel at r-project.org mailing list
>> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >>
>> >>
>> >>
>> >> --
>> >> Computational Biology
>> >> Fred Hutchinson Cancer Research Center
>> >> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>> >>
>> >> Location: M1-B861
>> >> Telephone: 206 667-2793