[Bioc-devel] a day in the life of gwascat

Hervé Pagès hpages using fredhutch.org
Thu Apr 30 20:29:17 CEST 2020


Everything works fine for me with quote="":

 > system.time(gwas <- read.delim("gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv", quote=""))
    user  system elapsed
   4.435   0.052   4.487

 > dim(gwas)
[1] 179364     38

 > sessionInfo()
R version 4.0.0 Patched (2020-04-27 r78316)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.r78316/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r78316/lib/libRlapack.so

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.0
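
To see the effect without downloading the full file, here is a minimal
sketch on a made-up 20-row table. The column names, the rs ids and the
"trait" values are assumptions for illustration; only the stray-quote
field is copied from the record quoted below. read.delim() defaults to
quote="\"" and comment.char="", so a single unmatched double quote opens
a string that runs to the end of the file, while quote="" treats the
character as ordinary data.

```r
# Build a toy TSV: 20 data rows, with an unmatched double quote in row 8.
n <- 20
snp <- paste0("rs", seq_len(n))
snp[8] <- "chr11:102751102\"-?"
lines <- c("id\tsnp\ttrait", paste(seq_len(n), snp, "trait", sep = "\t"))

tmp <- tempfile(fileext = ".tsv")
writeLines(lines, tmp)

# Default quoting: everything after the stray quote is swallowed into one
# field, so rows are lost (R warns "EOF within quoted string").
bad <- suppressWarnings(read.delim(tmp))

# Quoting disabled: every line becomes one record.
good <- read.delim(tmp, quote = "")

# File line numbers (header included) that contain a double quote at all.
quote_lines <- grep("\"", readLines(tmp), fixed = TRUE)

nrow(bad) < nrow(good)   # rows were lost with the default quote setting
nrow(good)               # 20
quote_lines              # 9
```

The grep() on readLines() is just a cheap way to find candidate lines to
inspect by hand before deciding that quote="" is safe for the whole file.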



On 4/30/20 04:48, Vincent Carey wrote:
> This file trips up fread around record 170349, inconsistently ... I haven't
> figured that out yet.
> readLines, strsplit may be the ultimate solution.
> 
> On Thu, Apr 30, 2020 at 7:15 AM Vincent Carey <stvjc using channing.harvard.edu>
> wrote:
> 
>> right, line 35265 of
>> http://www.ebi.ac.uk/gwas/api/search/downloads/alternative has an
>> unclosed quote in a field.
>>
>>   35265 2019-04-10      30804558        Grove J 2019-02-25      Nat Genet
>>      http://www.ncbi.nlm.nih.gov/pubmed/30804558     Identification of
>> common genetic risk variants for autism spectrum disorder.    Autism
>> spectrum disorder        18,381 European ancestry cases, 27,969
>> European ancestry controls       2,119 European ancestry cases, 142,379
>> European ancestry controls                               Intergenic
>>
>> chr11:102751102"-?      chr11:102751102 0                       1       0.037
>>    8E-6    5.096910013008056                      1.1641443       [NR]    Illumina
>> [9112387] (imputed)    N       autism spectrum disorder
>>    http://www.ebi.ac.uk/efo/EFO_0003756     GCST007556      Genome-wide
>> genotyping array
>>
>> On Thu, Apr 30, 2020 at 6:59 AM Martin Morgan <mtmorgan.bioc using gmail.com>
>> wrote:
>>
>>> I'd look instead at or around line 35264 for use of quotes, e.g., "3'
>>> DNA", and change the argument read.delim(quote = "") (though I never get
>>> that right so probably wrong again...). A comment character might also be a
>>> problem.
>>>
>>> If you point to the location of the file I could investigate further...
>>>
>>> Martin
>>>
>>> On 4/30/20, 6:55 AM, "Bioc-devel on behalf of Vincent Carey" <
>>> bioc-devel-bounces using r-project.org on behalf of stvjc using channing.harvard.edu>
>>> wrote:
>>>
>>>      The EBI GWAS catalog is large -- now the download is over 100MB for
>>>      179K associations.  A "bug" in the package was reported, so I
>>>      acquired the file by hand.
>>>
>>>      > nn = read.delim("gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv", sep="\t")
>>>
>>>      Warning message:
>>>      In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
>>>        EOF within quoted string
>>>
>>>      > dim(nn)
>>>
>>>      [1] 35264    38
>>>
>>>
>>>      The "bug" is the number 35264 ...
>>>
>>>
>>>      >
>>>
>>>      [1]+  Stopped                 R
>>>
>>>      %vjcair> wc gwas_cat*tsv
>>>
>>>        179365 13243516 120140148 gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv
>>>
>>>      %vjcair> vi gwas_cat*tsv
>>>
>>>      %vjcair> fg
>>>
>>>      R
>>>
>>>
>>>      > tail(nn)
>>>
>>>      Error: C stack usage  98161262 is too close to the limit
>>>
>>>
>>>      Maybe my R needs to be updated.
>>>
>>>      If I use data.table::fread to consume the tsv over HTTP all seems well,
>>>      and perhaps I will switch to that.
>>>
>>
> 
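
Regarding the readLines/strsplit route mentioned in the quoted message,
a rough sketch of what that could look like is below. read_tsv_raw() is a
hypothetical helper name, not something in gwascat or base R, and the toy
input is made up apart from the stray-quote field. It does no quote
handling at all, which is the point, but it also assumes that no field
contains an embedded tab or newline.

```r
# Hypothetical helper: parse a tab-separated file with no quote handling.
read_tsv_raw <- function(path) {
  lines  <- readLines(path)
  lines  <- lines[nzchar(lines)]                 # drop empty lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  header <- fields[[1]]
  n      <- length(header)
  body   <- fields[-1]

  # Refuse records that have more fields than the header.
  too_long <- which(lengths(body) > n)
  if (length(too_long) > 0)
    stop("records with too many fields at data rows: ",
         paste(too_long, collapse = ", "))

  # strsplit() drops a trailing empty field, so pad short records with "".
  body <- lapply(body, function(x) c(x, rep("", n - length(x))))

  df <- as.data.frame(do.call(rbind, body), stringsAsFactors = FALSE)
  names(df) <- header

  # Everything is character at this point; convert columns where possible.
  df[] <- lapply(df, type.convert, as.is = TRUE)
  df
}

# Toy input: one field with an unmatched quote, one record with an empty
# last field.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tsnp\ttrait",
             "1\trs1\theight",
             "2\tchr11:102751102\"-?\tautism",
             "3\trs3\t"), tmp)

df <- read_tsv_raw(tmp)
dim(df)      # 3 rows, 3 columns
df$snp[2]    # the stray quote is kept as data
```

For a file this size read.delim(quote="") is simpler and already fast
enough, so something like this would only be a fallback if other
irregularities turn up.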

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


