[Bioc-devel] a day in the life of gwascat

Vincent Carey stvjc using channing.harvard.edu
Thu Apr 30 20:48:28 CEST 2020


Thanks for checking this out.  I am leaning towards readr::read_tsv, which
is very explicit about untoward content:

Browse[2]>
debug: tab = readr::read_tsv(tf)
Browse[2]>
Parsed with column specification:
cols(
  .default = col_character(),
  `DATE ADDED TO CATALOG` = col_date(format = ""),
  PUBMEDID = col_double(),
  DATE = col_date(format = ""),
  UPSTREAM_GENE_DISTANCE = col_double(),
  DOWNSTREAM_GENE_DISTANCE = col_double(),
  MERGED = col_double(),
  SNP_ID_CURRENT = col_double(),
  INTERGENIC = col_double(),
  `P-VALUE` = col_double(),
  PVALUE_MLOG = col_double(),
  `OR or BETA` = col_double()
)
See spec(...) for full column specifications.
|=================================================================| 100%  114 MB

Warning: 13 parsing failures.
  row            col               expected actual file
21021 SNP_ID_CURRENT no trailing characters        '/var/folders/5_/14ld0y7s0vbg_z0g2c9l8v300000gr/T//Rtmpi3B4HE/filecb946948e8fb'
25725 SNP_ID_CURRENT no trailing characters      d '/var/folders/5_/14ld0y7s0vbg_z0g2c9l8v300000gr/T//Rtmpi3B4HE/filecb946948e8fb'
45770 SNP_ID_CURRENT no trailing characters      b '/var/folders/5_/14ld0y7s0vbg_z0g2c9l8v300000gr/T//Rtmpi3B4HE/filecb946948e8fb'
54548 SNP_ID_CURRENT no trailing characters        '/var/folders/5_/14ld0y7s0vbg_z0g2c9l8v300000gr/T//Rtmpi3B4HE/filecb946948e8fb'
54594 SNP_ID_CURRENT no trailing characters        '/var/folders/5_/14ld0y7s0vbg_z0g2c9l8v300000gr/T//Rtmpi3B4HE/filecb946948e8fb'
..... .............. ...................... ...... ...............................................................................
See problems(...) for more details.
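
One way to act on failures like these, sketched on a toy file rather than the
real download (the file contents and the variable name `bad` are made up for
illustration): read every column as character with quoting disabled, so
nothing is coerced, then flag the SNP_ID_CURRENT entries that are not purely
numeric.

```r
library(readr)

# Toy stand-in for the catalog file (hypothetical values, not real records).
tf <- tempfile(fileext = ".tsv")
writeLines(c("SNPS\tSNP_ID_CURRENT",
             "rs1\t1001",
             "rs2\t2002d",
             "rs3\t3003"), tf)

# Read all columns as character and treat quote characters as ordinary text.
tab <- read_tsv(tf, col_types = cols(.default = col_character()), quote = "")

# Indices of SNP_ID_CURRENT values with anything other than digits in them.
bad <- which(!is.na(tab$SNP_ID_CURRENT) &
             !grepl("^[0-9]+$", tab$SNP_ID_CURRENT))
tab[bad, ]
```

Whether the stray trailing characters should then be stripped or the records
set aside is a curation decision; the sketch only locates them.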

On Thu, Apr 30, 2020 at 2:29 PM Hervé Pagès <hpages using fredhutch.org> wrote:

> Everything works fine for me with quote="":
>
>  > system.time(gwas <- read.delim("gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv", quote=""))
>     user  system elapsed
>    4.435   0.052   4.487
>
>  > dim(gwas)
> [1] 179364     38
>
>  > sessionInfo()
> R version 4.0.0 Patched (2020-04-27 r78316)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 16.04.6 LTS
>
> Matrix products: default
> BLAS:   /home/hpages/R/R-4.0.r78316/lib/libRblas.so
> LAPACK: /home/hpages/R/R-4.0.r78316/lib/libRlapack.so
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.0.0
>
>
>
> On 4/30/20 04:48, Vincent Carey wrote:
> > This file trips up fread around record 170349, inconsistently ... I haven't
> > figured that out yet.
> > readLines, strsplit may be the ultimate solution.
> >
> > On Thu, Apr 30, 2020 at 7:15 AM Vincent Carey <stvjc using channing.harvard.edu>
> > wrote:
> >
> >> right, line 35265 of
> >> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.ebi.ac.uk_gwas_api_search_downloads_alternative&d=DwIFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=oM6e8C3QAbH860EUSfLCLlCa2Q2xqXbeOojfJo_0GDg&s=sJ8FryxOQ9eoMTUfGAbArTqR9f5L51ynwMntfimjbpQ&e=
> >> has an unclosed quote in a field.
> >>
> >>   35265 2019-04-10      30804558        Grove J 2019-02-25      Nat Genet
> >> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.ncbi.nlm.nih.gov_pubmed_30804558&d=DwIFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=oM6e8C3QAbH860EUSfLCLlCa2Q2xqXbeOojfJo_0GDg&s=3yK9fsZtA_2bCHWktLA1ny1Wr7RRciU2QTOoE1xaWyg&e=
> >>    Identification of
> >> common genetic risk variants for autism spectrum disorder.    Autism
> >> spectrum disorder        18,381 European ancestry cases, 27,969
> >> European ancestry controls       2,119 European ancestry cases, 142,379
> >> European ancestry controls       Intergenic
> >>
> >> chr11:102751102"-?      chr11:102751102 0                       1        0.037
> >>    8E-6    5.096910013008056                      1.1641443       [NR]       Illumina
> >> [9112387] (imputed)    N       autism spectrum disorder        http:/
> >>    /
> >> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.ebi.ac.uk_efo_EFO-5F0003756&d=DwIFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=oM6e8C3QAbH860EUSfLCLlCa2Q2xqXbeOojfJo_0GDg&s=wWA7LPEZrntrqx5SpL9Y1q5_Kzo-w1L2Ymz6P_6jf00&e=
> >>    GCST007556      Genome-wide
> >> genotyping array
> >>
> >> On Thu, Apr 30, 2020 at 6:59 AM Martin Morgan <mtmorgan.bioc using gmail.com>
> >> wrote:
> >>
> >>> I'd look instead at or around line 35264 for use of quotes, e.g., "3'
> >>> DNA", and change the argument read.delim(quote = "") (though I never get
> >>> that right so probably wrong again...). A comment character might also be a
> >>> problem.
> >>>
> >>> If you point to the location of the file I could investigate further...
> >>>
> >>> Martin
> >>>
> >>> On 4/30/20, 6:55 AM, "Bioc-devel on behalf of Vincent Carey" <
> >>> bioc-devel-bounces using r-project.org on behalf of
> >>> stvjc using channing.harvard.edu>
> >>> wrote:
> >>>
> >>>      The EBI GWAS catalog is large -- now the download is over 100MB for 179K
> >>>      associations.  A "bug" in the
> >>>      package was reported, so I acquired the file by hand.
> >>>
> >>>      > nn = read.delim("gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv",
> >>>      sep="\t")
> >>>
> >>>      Warning message:
> >>>      In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
> >>>        EOF within quoted string
> >>>
> >>>      > dim(nn)
> >>>
> >>>      [1] 35264    38
> >>>
> >>>
> >>>      The "bug" is the number 35264 ...
> >>>
> >>>
> >>>      >
> >>>
> >>>      [1]+  Stopped                 R
> >>>
> >>>      %vjcair> wc gwas_cat*tsv
> >>>
> >>>        179365 13243516 120140148
> >>>      gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv
> >>>
> >>>      %vjcair> vi gwas_cat*tsv
> >>>
> >>>      %vjcair> fg
> >>>
> >>>      R
> >>>
> >>>
> >>>      > tail(nn)
> >>>
> >>>      Error: C stack usage  98161262 is too close to the limit
> >>>
> >>>
> >>>      Maybe my R needs to be updated.
> >>>
> >>>
> >>>      If I use data.table::fread to consume the tsv over HTTP all seems well,
> >>>      and perhaps I will switch to that.
> >>>
> >>>      --
> >>>      The information in this e-mail is intended only for the
> >>> ...{{dropped:18}}
> >>>
> >>>      _______________________________________________
> >>>      Bioc-devel using r-project.org mailing list
> >>>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=oM6e8C3QAbH860EUSfLCLlCa2Q2xqXbeOojfJo_0GDg&s=mnmrbhNqYbx1zpyO1DBuCFg14rcd8ZVFEKuCgPqfQAQ&e=
> >>>
> >>
> >
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages using fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
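
For reference, the quoting problem discussed in the thread above can be
reproduced on a toy file. This is only a sketch under stated assumptions: the
ten rows and the column names REGION and SNPS are invented, and only the stray
quote in row 7 imitates the catalog record. It compares read.delim with its
default quoting against quote="", then checks field counts with count.fields
and with the readLines/strsplit route.

```r
# Toy TSV with 10 data rows; row 7 carries a stray, unclosed double quote.
tf <- tempfile(fileext = ".tsv")
rows <- sprintf("region%d\trs%d", 1:10, 1:10)
rows[7] <- "chr11:102751102\"-?\trs7"
writeLines(c("REGION\tSNPS", rows), tf)

# Default quoting: the unclosed quote swallows the remainder of the file
# ("EOF within quoted string"), so the result is truncated.
with_quotes <- suppressWarnings(read.delim(tf))

# quote = "": quote characters are read as ordinary text.
no_quotes <- read.delim(tf, quote = "")

nrow(with_quotes)
nrow(no_quotes)

# Fields per line with quoting and comment handling both switched off.
n_fields <- count.fields(tf, sep = "\t", quote = "", comment.char = "")
table(n_fields)

# readLines + strsplit ignores quoting altogether. Caveat: strsplit drops a
# trailing empty field, so lines ending in a tab would come up one short.
n_split <- lengths(strsplit(readLines(tf), "\t", fixed = TRUE))
table(n_split)
```

If count.fields reports more than one distinct field count on the real file,
the offending line numbers can be recovered with which() on the same vector.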
