[Bioc-devel] a day in the life of gwascat
Hervé Pagès
hpages using fredhutch.org
Thu Apr 30 20:29:17 CEST 2020
Everything works fine for me with quote="":
> system.time(gwas <-
    read.delim("gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv", quote=""))
   user  system elapsed
  4.435   0.052   4.487
> dim(gwas)
[1] 179364 38
> sessionInfo()
R version 4.0.0 Patched (2020-04-27 r78316)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /home/hpages/R/R-4.0.r78316/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r78316/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.0
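
For anyone who wants to see the effect without downloading the 120 MB catalog, here is a minimal sketch on a toy file. The file contents and the names `toy`, `rows`, `default_read` and `raw_read` are made up for illustration; only `read.delim()` and its `quote` argument come from this thread, and the behaviour of the default read is assumed to match what was reported for the real file.

```r
# Toy TSV with 20 data rows. Row 11 carries a lone double quote, loosely
# mimicking the unclosed quote reported in the catalog (contents invented).
toy <- tempfile(fileext = ".tsv")
rows <- sprintf("%d\trs%d\t0.05", 1:20, 1:20)
rows[11] <- "11\tchr11:102751102\"-?\t0.05"
writeLines(c("id\tsnp\tp", rows), toy)

# Default for read.delim is quote = "\"": the lone quote opens a quoted
# string that runs to the end of the file, so the remaining rows are
# swallowed (R warns about "EOF within quoted string").
default_read <- suppressWarnings(read.delim(toy))

# quote = "": quote characters are treated as ordinary data.
raw_read <- read.delim(toy, quote = "")

nrow(default_read)  # fewer rows than the file holds
nrow(raw_read)      # all 20 data rows
```

With the default quoting, the last row that comes back can hold the rest of the file in a single field, which would be consistent with `tail(nn)` misbehaving in the original report.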
On 4/30/20 04:48, Vincent Carey wrote:
> This file trips up fread around record 170349, inconsistently ... I haven't
> figured that out yet.
> readLines, strsplit may be the ultimate solution.
>
> On Thu, Apr 30, 2020 at 7:15 AM Vincent Carey <stvjc using channing.harvard.edu>
> wrote:
>
>> right, line 35265 of
>> http://www.ebi.ac.uk/gwas/api/search/downloads/alternative has an
>> unclosed quote in a field.
>>
>> 35265 2019-04-10 30804558 Grove J 2019-02-25 Nat Genet
>> http://www.ncbi.nlm.nih.gov/pubmed/30804558 Identification of
>> common genetic risk variants for autism spectrum disorder. Autism
>> spectrum disorder 18,381 European ancestry cases, 27,969
>> European ancestry controls 2,119 European ancestry cases, 142,379
>> European ancestry controls Intergenic
>>
>> chr11:102751102"-? chr11:102751102 0 1 0.037
>> 8E-6 5.096910013008056 1.1641443 [NR] Illumina
>> [9112387] (imputed) N autism spectrum disorder
>> http://www.ebi.ac.uk/efo/EFO_0003756 GCST007556 Genome-wide
>> genotyping array
>>
>> On Thu, Apr 30, 2020 at 6:59 AM Martin Morgan <mtmorgan.bioc using gmail.com>
>> wrote:
>>
>>> I'd look instead at or around line 35264 for use of quotes, e.g., "3'
>>> DNA", and change the argument read.delim(quote = "") (though I never get
>>> that right so probably wrong again...). A comment character might also be a
>>> problem.
>>>
>>> If you point to the location of the file I could investigate further...
>>>
>>> Martin
>>>
>>> On 4/30/20, 6:55 AM, "Bioc-devel on behalf of Vincent Carey" <
>>> bioc-devel-bounces using r-project.org on behalf of stvjc using channing.harvard.edu>
>>> wrote:
>>>
>>> The EBI GWAS catalog is large -- now the download is over 100MB for 179K
>>> associations. A "bug" in the package was reported, so I acquired the
>>> file by hand.
>>>
>>> > nn = read.delim("gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv", sep="\t")
>>>
>>> *Warning message:*
>>>
>>> *In scan(file = file, what = what, sep = sep, quote = quote, dec =
>>> dec, :*
>>>
>>> * EOF within quoted string*
>>>
>>> > dim(nn)
>>>
>>> [1] 35264 38
>>>
>>>
>>> The "bug" is the number 35264 ...
>>>
>>>
>>> >
>>>
>>> [1]+ Stopped R
>>>
>>> %vjcair> wc gwas_cat*tsv
>>>
>>> 179365 13243516 120140148
>>> gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv
>>>
>>> %vjcair> vi gwas_cat*tsv
>>>
>>> %vjcair> fg
>>>
>>> R
>>>
>>>
>>> > tail(nn)
>>>
>>> *Error: C stack usage 98161262 is too close to the limit*
>>>
>>>
>>> *Maybe my R needs to be updated.*
>>>
>>>
>>> *If I use data.table::fread to consume the tsv over HTTP all seems well,
>>> and perhaps I will switch to that.*
>>>
>>> --
>>> The information in this e-mail is intended only for the
>>> ...{{dropped:18}}
>>>
>>> _______________________________________________
>>> Bioc-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>
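
The readLines/strsplit route mentioned in the quoted message, and the idea of hunting for stray quotes near the failing line, could look roughly like the sketch below. The helper names `read_tsv_plain` and `odd_quote_lines` and the toy file are hypothetical. It assumes that no field contains a tab or a newline, and it returns every column as character.

```r
# Hypothetical helper: parse a TSV with readLines() + strsplit(), treating
# quote characters as plain data. Assumes no embedded tabs or newlines.
read_tsv_plain <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(lines)]                  # drop blank lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  header <- fields[[1]]
  body <- fields[-1]
  if (any(lengths(body) > length(header)))
    stop("found a row with more fields than the header")
  # strsplit() drops a trailing empty field, so pad short rows with "".
  body <- lapply(body, function(x) {
    length(x) <- length(header)                  # pads with NA
    x[is.na(x)] <- ""
    x
  })
  mat <- matrix(as.character(unlist(body)),
                ncol = length(header), byrow = TRUE)
  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(df) <- header
  df
}

# Hypothetical helper: file line numbers (header included) that hold an odd
# number of double quotes, i.e. candidates for an unclosed quote.
odd_quote_lines <- function(path) {
  lines <- readLines(path, warn = FALSE)
  n_quotes <- lengths(regmatches(lines, gregexpr("\"", lines, fixed = TRUE)))
  which(n_quotes %% 2L == 1L)
}

# Toy file (contents invented): line 3 has a lone double quote and a
# trailing empty field.
toy <- tempfile(fileext = ".tsv")
writeLines(c("id\tsnp\tnote",
             "1\trs1\tok",
             "2\tchr11:102751102\"-?\t",
             "3\trs3\t3' DNA"),
           toy)

parsed <- read_tsv_plain(toy)
dim(parsed)
odd_quote_lines(toy)
```

Since `odd_quote_lines()` counts the header as line 1, its result is comparable to line numbers shown by an editor or by `wc -l`, not to data frame row numbers.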
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages using fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319