[Bioc-devel] a day in the life of gwascat

Vincent Carey stvjc using channing.harvard.edu
Thu Apr 30 13:48:22 CEST 2020


This file trips up fread, inconsistently, around record 170349 ... I haven't
figured that out yet.
readLines followed by strsplit may be the ultimate solution.
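
A minimal sketch of the readLines/strsplit route, assuming the file is plain tab-delimited with a header line and no field spans more than one line. The helper name read_tsv_raw and the toy file are hypothetical, not part of gwascat. Because nothing is treated as a quote character, a stray quote is kept as ordinary text.

```r
# Hypothetical helper: parse a tab-delimited file with no quote handling.
read_tsv_raw <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(lines)]                  # drop empty lines
  fields <- strsplit(lines, "\t", fixed = TRUE)  # split on literal tabs
  header <- fields[[1]]
  n <- length(header)
  body <- fields[-1]
  too_long <- lengths(body) > n
  if (any(too_long)) {
    warning(sum(too_long), " record(s) have more than ", n, " fields")
  }
  # strsplit drops a trailing empty field, so pad short records with "";
  # records that are too long are truncated to n fields (warned about above).
  rows <- lapply(body, function(x) c(x, rep("", n))[seq_len(n)])
  m <- matrix(unlist(rows), ncol = n, byrow = TRUE)
  out <- as.data.frame(m, stringsAsFactors = FALSE)
  names(out) <- header
  out
}

# Toy stand-in for the catalog file (not the real data).
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tsnp\tnote",
             "1\trs1\tok",
             "2\tchr11:102751102\"-?\tunclosed quote",
             "3\trs3\t3' DNA"), tmp)
demo <- read_tsv_raw(tmp)
dim(demo)
```

Every column comes back as character, so numeric columns would still need an explicit conversion (for example with type.convert) afterwards.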

On Thu, Apr 30, 2020 at 7:15 AM Vincent Carey <stvjc using channing.harvard.edu>
wrote:

> right, line 35265 of
> http://www.ebi.ac.uk/gwas/api/search/downloads/alternative has an
> unclosed quote in a field.
>
>  35265 2019-04-10  30804558  Grove J  2019-02-25  Nat Genet
>  www.ncbi.nlm.nih.gov/pubmed/30804558
>  Identification of common genetic risk variants for autism spectrum disorder.
>  Autism spectrum disorder
>  18,381 European ancestry cases, 27,969 European ancestry controls
>  2,119 European ancestry cases, 142,379 European ancestry controls
>  Intergenic
>  chr11:102751102"-?  chr11:102751102  0  1  0.037  8E-6  5.096910013008056
>  1.1641443  [NR]  Illumina [9112387] (imputed)  N
>  autism spectrum disorder  http://www.ebi.ac.uk/efo/EFO_0003756
>  GCST007556  Genome-wide genotyping array
>
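
One way to locate a record like the one above without opening the file in an editor is to count double quotes per line and flag the lines with an odd count. This is only a sketch: count_quotes is a hypothetical helper and the toy vector stands in for readLines() applied to the real download.

```r
# Hypothetical helper: number of double-quote characters on each line.
count_quotes <- function(lines) {
  lengths(regmatches(lines, gregexpr("\"", lines, fixed = TRUE)))
}

# Toy stand-in for the catalog file (not the real data).
toy <- c("id\tsnp\tnote",
         "1\trs1\tok",
         "2\tchr11:102751102\"-?\tstray quote",
         "3\trs3\t\"balanced\"")

# Lines with an odd number of double quotes are the suspects.
suspects <- which(count_quotes(toy) %% 2 == 1)
suspects
```

The indices count the header as line 1, so they match line numbers as shown by an editor rather than data-frame row numbers.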
> On Thu, Apr 30, 2020 at 6:59 AM Martin Morgan <mtmorgan.bioc using gmail.com>
> wrote:
>
>> I'd look instead at or around line 35264 for use of quotes, e.g., "3'
>> DNA", and change the argument read.delim(quote = "") (though I never get
>> that right so probably wrong again...). A comment character might also be a
>> problem.
>>
>> If you point to the location of the file I could investigate further...
>>
>> Martin
>>
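
A small sketch of the quote = "" suggestion, run on a toy file rather than the real catalog (the file contents and object names are made up for illustration). With quote processing switched off, read.delim treats the stray double quote as ordinary text and keeps the records that follow it.

```r
# Toy file: ten clean records, one with a stray double quote, five more.
good <- sprintf("%d\trs%d\tok", 1:10, 1:10)
bad  <- "11\tchr11:102751102\"-?\tstray quote"
more <- sprintf("%d\trs%d\tok", 12:16, 12:16)
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tsnp\tnote", good, bad, more), tmp)

# Default quote = "\"": the stray quote opens a quoted string that runs to
# the end of the file, so the later records are lost (with a warning).
with_quotes <- suppressWarnings(read.delim(tmp, sep = "\t"))

# quote = "" disables quote processing entirely.
no_quotes <- read.delim(tmp, sep = "\t", quote = "", stringsAsFactors = FALSE)

c(default = nrow(with_quotes), quote_off = nrow(no_quotes))
```

Whether quote = "" is safe for the real file depends on no field legitimately containing an embedded tab or newline inside quotes, which is an assumption here.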
>> On 4/30/20, 6:55 AM, "Bioc-devel on behalf of Vincent Carey" <
>> bioc-devel-bounces using r-project.org on behalf of stvjc using channing.harvard.edu>
>> wrote:
>>
>>     The EBI GWAS catalog is large -- now the download is over 100MB for
>>     179K associations.  A "bug" in the package was reported, so I
>>     acquired the file by hand.
>>
>>     > nn = read.delim("gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv", sep="\t")
>>
>>     Warning message:
>>     In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
>>       EOF within quoted string
>>
>>     > dim(nn)
>>
>>     [1] 35264    38
>>
>>
>>     The "bug" is the number 35264 ...
>>
>>
>>     >
>>
>>     [1]+  Stopped                 R
>>
>>     %vjcair> wc gwas_cat*tsv
>>
>>       179365 13243516 120140148 gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv
>>
>>     %vjcair> vi gwas_cat*tsv
>>
>>     %vjcair> fg
>>
>>     R
>>
>>
>>     > tail(nn)
>>
>>     Error: C stack usage  98161262 is too close to the limit
>>
>>
>>     Maybe my R needs to be updated.
>>
>>
>>     If I use data.table::fread to consume the tsv over HTTP all seems
>>     well, and perhaps I will switch to that.
>>
>>     --
>>     The information in this e-mail is intended only for the
>> ...{{dropped:18}}
>>
>>     _______________________________________________
>>     Bioc-devel using r-project.org mailing list
>>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
