[Bioc-devel] a day in the life of gwascat
Vincent Carey
stvjc using channing.harvard.edu
Thu Apr 30 20:48:28 CEST 2020
Thanks for checking this out. I am leaning towards readr::read_tsv, which
is very explicit about untoward content:
Browse[2]>
debug: tab = readr::read_tsv(tf)
Browse[2]>
Parsed with column specification:
cols(
  .default = col_character(),
  `DATE ADDED TO CATALOG` = col_date(format = ""),
  PUBMEDID = col_double(),
  DATE = col_date(format = ""),
  UPSTREAM_GENE_DISTANCE = col_double(),
  DOWNSTREAM_GENE_DISTANCE = col_double(),
  MERGED = col_double(),
  SNP_ID_CURRENT = col_double(),
  INTERGENIC = col_double(),
  `P-VALUE` = col_double(),
  PVALUE_MLOG = col_double(),
  `OR or BETA` = col_double()
)
See spec(...) for full column specifications.
|=================================================================| 100% 114 MB
Warning: 13 parsing failures.
  row            col               expected actual file
21021 SNP_ID_CURRENT no trailing characters        '/var/folders/5_/14ld0y7s0vbg_z0g2c9l8v300000gr/T//Rtmpi3B4HE/filecb946948e8fb'
25725 SNP_ID_CURRENT no trailing characters d      '/var/folders/5_/14ld0y7s0vbg_z0g2c9l8v300000gr/T//Rtmpi3B4HE/filecb946948e8fb'
45770 SNP_ID_CURRENT no trailing characters b      '/var/folders/5_/14ld0y7s0vbg_z0g2c9l8v300000gr/T//Rtmpi3B4HE/filecb946948e8fb'
54548 SNP_ID_CURRENT no trailing characters        '/var/folders/5_/14ld0y7s0vbg_z0g2c9l8v300000gr/T//Rtmpi3B4HE/filecb946948e8fb'
54594 SNP_ID_CURRENT no trailing characters        '/var/folders/5_/14ld0y7s0vbg_z0g2c9l8v300000gr/T//Rtmpi3B4HE/filecb946948e8fb'
..... .............. ...................... ...... ...............................................................................
See problems(...) for more details.
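
The "no trailing characters" failures say that a column guessed as numeric holds values that start like a number and then continue with something else (the "d" and "b" reported above). Here is a minimal base-R sketch of how such entries might be flagged before coercion. It runs on made-up values, not on the catalog itself; `snp_id` and its contents are hypothetical.

```r
# Hypothetical stand-in for the SNP_ID_CURRENT column, held as character.
snp_id <- c("7329174", "4846914d", "", "12345b", NA, "9939609")

# An entry is acceptable if it is missing/empty or consists only of digits.
is_blank <- is.na(snp_id) | snp_id == ""
is_clean <- is_blank | grepl("^[0-9]+$", snp_id)

# Positions and values that a strict numeric parser would complain about.
bad_idx <- which(!is_clean)
bad_val <- snp_id[bad_idx]
print(data.frame(row = bad_idx, actual = bad_val))

# Coerce only the digit-only entries; blanks and offenders become NA.
snp_num <- as.numeric(ifelse(is_clean & !is_blank, snp_id, NA_character_))
```

With readr itself, one option is probably to read everything as text first, e.g. `col_types = readr::cols(.default = readr::col_character())`, and convert columns afterwards once the odd values have been inspected.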
On Thu, Apr 30, 2020 at 2:29 PM Hervé Pagès <hpages using fredhutch.org> wrote:
> Everything works fine for me with quote="":
>
> > system.time(gwas <- read.delim("gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv", quote=""))
>    user  system elapsed
>   4.435   0.052   4.487
>
> > dim(gwas)
> [1] 179364 38
>
> > sessionInfo()
> R version 4.0.0 Patched (2020-04-27 r78316)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 16.04.6 LTS
>
> Matrix products: default
> BLAS: /home/hpages/R/R-4.0.r78316/lib/libRblas.so
> LAPACK: /home/hpages/R/R-4.0.r78316/lib/libRlapack.so
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.0.0
>
>
>
> On 4/30/20 04:48, Vincent Carey wrote:
> > This file trips up fread around record 170349, inconsistently ... I haven't
> > figured that out yet.
> > readLines, strsplit may be the ultimate solution.
> >
> > On Thu, Apr 30, 2020 at 7:15 AM Vincent Carey <stvjc using channing.harvard.edu>
> > wrote:
> >
> >> right, line 35265 of
> >> http://www.ebi.ac.uk/gwas/api/search/downloads/alternative has an
> >> unclosed quote in a field.
> >>
> >> 35265 2019-04-10 30804558 Grove J 2019-02-25 Nat Genet
> >> http://www.ncbi.nlm.nih.gov/pubmed/30804558 Identification of
> >> common genetic risk variants for autism spectrum disorder. Autism
> >> spectrum disorder 18,381 European ancestry cases, 27,969
> >> European ancestry controls 2,119 European ancestry cases, 142,379
> >> European ancestry controls Intergenic
> >> chr11:102751102"-? chr11:102751102 0 1 0.037
> >> 8E-6 5.096910013008056 1.1641443 [NR] Illumina
> >> [9112387] (imputed) N autism spectrum disorder
> >> http://www.ebi.ac.uk/efo/EFO_0003756 GCST007556 Genome-wide
> >> genotyping array
> >>
> >> On Thu, Apr 30, 2020 at 6:59 AM Martin Morgan <mtmorgan.bioc using gmail.com>
> >> wrote:
> >>
> >>> I'd look instead at or around line 35264 for use of quotes, e.g., "3'
> >>> DNA", and change the argument read.delim(quote = "") (though I never get
> >>> that right so probably wrong again...). A comment character might also be a
> >>> problem.
> >>>
> >>> If you point to the location of the file I could investigate further...
> >>>
> >>> Martin
> >>>
> >>> On 4/30/20, 6:55 AM, "Bioc-devel on behalf of Vincent Carey" <
> >>> bioc-devel-bounces using r-project.org on behalf of
> >>> stvjc using channing.harvard.edu> wrote:
> >>>
> >>> The EBI GWAS catalog is large -- now the download is over 100MB for 179K
> >>> associations. A "bug" in the
> >>> package was reported, so I acquired the file by hand.
> >>>
> >>> > nn = read.delim("gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv",
> >>> +                 sep="\t")
> >>>
> >>> Warning message:
> >>> In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
> >>>   EOF within quoted string
> >>>
> >>> > dim(nn)
> >>>
> >>> [1] 35264 38
> >>>
> >>>
> >>> The "bug" is the number 35264 ...
> >>>
> >>>
> >>> >
> >>>
> >>> [1]+ Stopped R
> >>>
> >>> %vjcair> wc gwas_cat*tsv
> >>>
> >>> 179365 13243516 120140148
> >>> gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv
> >>>
> >>> %vjcair> vi gwas_cat*tsv
> >>>
> >>> %vjcair> fg
> >>>
> >>> R
> >>>
> >>>
> >>> > tail(nn)
> >>>
> >>> Error: C stack usage 98161262 is too close to the limit
> >>>
> >>>
> >>> Maybe my R needs to be updated.
> >>>
> >>>
> >>> If I use data.table::fread to consume the tsv over HTTP all seems well,
> >>> and perhaps I will switch to that.
> >>>
> >>> --
> >>> The information in this e-mail is intended only for the
> >>> ...{{dropped:18}}
> >>>
> >>> _______________________________________________
> >>> Bioc-devel using r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>>
> >>
> >
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages using fredhutch.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
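
For reference, the three ideas raised in the thread (look for the stray quote, read with quote="", fall back to readLines plus strsplit) can be tried on a toy file. This is only a sketch under stated assumptions: the file contents below are invented, loosely modelled on the `chr11:102751102"-?` field, and the odd-quote-count heuristic is my own illustration, not something anyone in the thread proposed as a general rule.

```r
# Toy tab-delimited file; its third line carries a stray double quote.
tf <- tempfile(fileext = ".tsv")
writeLines(c(
  "ID\tSNPS\tTRAIT",
  "1\trs1\theight",
  "2\tchr11:102751102\"-?\tautism spectrum disorder",
  "3\trs3\tasthma",
  "4\trs4\tdiabetes"
), tf)

lines <- readLines(tf)

# Lines with an odd number of double quotes are candidates for an unclosed quote.
n_quotes <- lengths(regmatches(lines, gregexpr("\"", lines, fixed = TRUE)))
suspect <- which(n_quotes %% 2 == 1)
print(suspect)

# With quoting disabled, each physical line should become one record.
no_quotes <- read.delim(tf, quote = "")

# readLines + strsplit route: no quote or comment processing at all.
# Caveat: strsplit drops a trailing empty field, so real data may need padding.
fields <- strsplit(lines, "\t", fixed = TRUE)
manual <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(manual) <- fields[[1]]

# Sanity check in the spirit of comparing dim() with wc: records = lines - header.
stopifnot(nrow(no_quotes) == length(lines) - 1L,
          nrow(manual) == length(lines) - 1L)
```

Comparing the row count against the line count, as the final check does, is the same signal that exposed the original problem (35264 rows read against 179365 lines reported by wc).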