[Rd] download.file does not process gz files correctly (truncates them?)
Martin Morgan
martin.morgan at roswellpark.org
Thu May 3 15:40:02 CEST 2018
On 05/03/2018 05:48 AM, Joris Meys wrote:
> Dear all,
>
> I've been diving a bit deeper into this at the request of Tomas Kalibera, and
> found the following:
>
> - the lock on the file only appears after trying to read it using oligo, so
> that's not an R problem in itself. The problem is independent of external
> packages.
>
> - using Windows' fc utility and cygwin's cmp utility I found out that every
> so often the download.file() function inserts an extra byte. There's no
> real obvious pattern in how these bytes are added, but the file downloaded
> using download.file() is actually larger (in this case by about 8 kb). The
> file xxx_inR.CEL.gz is read in using:
I believe the difference between mode = "w" and mode = "wb", and the reason
this is restricted to Windows downloads, comes down to text file line
endings: with mode = "w", download.file (like many other utilities outside
R) writes "foo\n" as "foo\r\n". Obviously this corrupts binary files.
I guess the CEL.gz file contains about 8k "\n" bytes, one for each of the
extra bytes in the downloaded copy.
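To make the arithmetic concrete, here is a minimal sketch that simulates the translation on a raw vector. It is not what download.file does internally; to_crlf is a hypothetical helper that only mimics the effect of a text-mode write on Windows, and the payload bytes are made up.

```r
# Simulate a Windows text-mode write: every LF (0x0a) byte is written
# as CR LF (0x0d 0x0a). `to_crlf` is a hypothetical helper, not an R API.
to_crlf <- function(bytes) {
  stopifnot(is.raw(bytes))
  is_lf <- bytes == as.raw(0x0a)
  # Each LF contributes two output bytes, every other byte one.
  out <- raw(length(bytes) + sum(is_lf))
  pos <- seq_along(bytes) + cumsum(is_lf)  # where each input byte lands
  out[pos] <- bytes
  out[pos[is_lf] - 1L] <- as.raw(0x0d)     # CR goes just before each LF
  out
}

# Made-up binary payload containing two LF bytes.
payload <- as.raw(c(0x1f, 0x8b, 0x0a, 0x00, 0x0a))
growth <- length(to_crlf(payload)) - length(payload)
growth == sum(payload == as.raw(0x0a))
```

The output grows by exactly one byte per 0x0a in the input, which is why the size difference would equal the number of "\n" bytes in the original file.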
Henrik's suggestion (default = "wb") would introduce the complementary
problem -- text files would have incorrect line endings.
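Short of changing the default, a caller can choose the mode per file. A rough sketch, assuming the destination file name is a good enough hint; choose_mode, fetch and the extension list are hypothetical names made up for illustration, not part of utils:

```r
# Hypothetical helper: pick a download.file() mode from the destination
# file name. The extension list is an assumption; extend it as needed.
binary_ext <- c("gz", "bz2", "xz", "zip", "tgz", "rds", "rda", "RData", "CEL")

choose_mode <- function(destfile, binary = binary_ext) {
  ext <- tools::file_ext(destfile)
  if (tolower(ext) %in% tolower(binary)) "wb" else "w"
}

# Thin wrapper around download.file() that passes the chosen mode.
fetch <- function(url, destfile, ...) {
  utils::download.file(url, destfile = destfile,
                       mode = choose_mode(destfile), ...)
}

choose_mode("GSM907854_inR.CEL.gz")  # binary extension, so "wb"
choose_mode("notes.txt")             # anything else stays in text mode
```

For the file in this thread, passing mode = "wb" to download.file() directly has the same effect.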
Martin
>
> setwd("E:/Temp/genexpr/Compare")
> id <- "GSM907854"
> flink <- paste0("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM907854&format=file&file=GSM907854%2ECEL%2Egz")
> fname <- paste0(id,"_inR.CEL.gz")
> download.file(flink,
> destfile = fname)
>
> The file xxx_direct.CEL.gz is downloaded from
> https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM907854 (download link
> at the bottom of the page).
>
> Output of dir in CMD:
>
> 05/03/2018 11:02 AM 4,529,547 GSM907854_direct.CEL.gz
> 05/03/2018 11:17 AM 4,537,668 GSM907854_inR.CEL.gz
>
> or from R :
>
>> diff(file.size(dir())) # contains both CEL files.
> [1] 8121
>
> Strangely enough I get the following message from download.file() :
>
> Content type 'application/octet-stream' length 4529547 bytes (4.3 MB)
> downloaded 4.3 MB
>
> So the reported length is exactly the same as if I would download the file
> directly, but the file on disk itself is larger. So it seems
> download.file() is adding bytes when saving the data on disk. This
> behaviour is independent of antivirus and/or firewalls turned on or off.
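One way to check whether the added bytes are carriage returns placed in front of line feeds is sketched below. The helper names (read_bytes, drop_cr_before_lf, explained_by_crlf) are hypothetical, and the check assumes that LF to CR LF translation is the only change made to the file.

```r
# Read a whole file as raw bytes.
read_bytes <- function(path) readBin(path, what = "raw", n = file.size(path))

# Remove every CR (0x0d) that is immediately followed by LF (0x0a).
# If the only change was LF -> CR LF, this undoes it exactly.
drop_cr_before_lf <- function(bytes) {
  n <- length(bytes)
  if (n < 2L) return(bytes)
  is_cr <- bytes[-n] == as.raw(0x0d) & bytes[-1L] == as.raw(0x0a)
  bytes[!c(is_cr, FALSE)]
}

# Compare a directly downloaded file with the download.file() copy.
explained_by_crlf <- function(direct, in_r) {
  a <- read_bytes(direct)
  b <- read_bytes(in_r)
  list(size_diff = length(b) - length(a),
       lf_in_direct = sum(a == as.raw(0x0a)),
       identical_after_strip = identical(drop_cr_before_lf(b), a))
}

# Usage with the two files from the dir listing above (not run here):
# explained_by_crlf("GSM907854_direct.CEL.gz", "GSM907854_inR.CEL.gz")
```

If the line-ending explanation holds, size_diff and lf_in_direct should agree and identical_after_strip should be TRUE.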
>
> Also keep in mind that these are NOT standard gzipped files. These files
> are a specific format for Affymetrix Human Gene 1.0 ST Arrays.
>
> If I need to run other tests, please let me know.
> Kind regards
>
> Joris
>
> On Wed, May 2, 2018 at 9:21 PM, Joris Meys <jorismeys at gmail.com> wrote:
>
>> Dear all,
>>
>> I've noticed a problem when trying to download gz files from here:
>> https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM907811
>>
>> At the bottom one can download GSM907811.CEL.gz . If I download this
>> manually and try
>>
>> oligo::read.celfiles("GSM907811.CEL.gz")
>>
>> everything works fine. (oligo is a Bioconductor package)
>>
>> However, if I download using
>>
>> download.file("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM907811&format=file&file=GSM907811%2ECEL%2Egz",
>> destfile = "GSM907811.CEL.gz")
>>
>> The file is downloaded, but oligo::read.celfiles() returns the following
>> error:
>>
>> Error in checkChipTypes(filenames, verbose, "affymetrix", TRUE) :
>> End of gz file reached unexpectedly. Perhaps this file is truncated.
>>
>> Moreover, if I try to delete it after using download.file(), I get a
>> warning that permission is denied. I can only remove it using Windows file
>> explorer after I closed the R session, indicating that the connection is
>> still open. Yet, showConnections() doesn't show any open connections either.
>>
>> Session info below. Note that I started from a completely fresh R session.
>> oligo is needed due to the specific file format of these gz files. They're
>> not standard tarred files.
>>
>> Cheers
>> Joris
>>
>> Session Info
>> -------------------------------------------------------------------------------------
>>
>> R version 3.5.0 (2018-04-23)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>> Running under: Windows >= 8 x64 (build 9200)
>>
>> Matrix products: default
>>
>> locale:
>> [1] LC_COLLATE=English_United Kingdom.1252
>> [2] LC_CTYPE=English_United Kingdom.1252
>> [3] LC_MONETARY=English_United Kingdom.1252
>> [4] LC_NUMERIC=C
>> [5] LC_TIME=English_United Kingdom.1252
>>
>> attached base packages:
>> [1] stats4 parallel stats graphics grDevices utils datasets methods
>> [9] base
>>
>> other attached packages:
>> [1] pd.hugene.1.0.st.v1_3.14.1 DBI_0.8 oligo_1.44.0
>> [4] Biobase_2.39.2 oligoClasses_1.42.0 RSQLite_2.1.0
>> [7] Biostrings_2.48.0 XVector_0.19.9 IRanges_2.13.28
>> [10] S4Vectors_0.17.42 BiocGenerics_0.25.3
>>
>> loaded via a namespace (and not attached):
>> [1] Rcpp_0.12.16 compiler_3.5.0
>> [3] BiocInstaller_1.30.0 GenomeInfoDb_1.15.5
>> [5] bitops_1.0-6 iterators_1.0.9
>> [7] tools_3.5.0 zlibbioc_1.25.0
>> [9] digest_0.6.15 bit_1.1-12
>> [11] memoise_1.1.0 preprocessCore_1.41.0
>> [13] lattice_0.20-35 ff_2.2-13
>> [15] pkgconfig_2.0.1 Matrix_1.2-14
>> [17] foreach_1.4.4 DelayedArray_0.5.31
>> [19] yaml_2.1.18 GenomeInfoDbData_1.1.0
>> [21] affxparser_1.52.0 bit64_0.9-7
>> [23] grid_3.5.0 BiocParallel_1.13.3
>> [25] blob_1.1.1 codetools_0.2-15
>> [27] matrixStats_0.53.1 GenomicRanges_1.31.23
>> [29] splines_3.5.0 SummarizedExperiment_1.9.17
>> [31] RCurl_1.95-4.10 affyio_1.49.2
>>
>>
>> --
>> Joris Meys
>> Statistical consultant
>>
>> Department of Data Analysis and Mathematical Modelling
>> Ghent University
>> Coupure Links 653, B-9000 Gent (Belgium)
>>
>> -----------
>> Biowiskundedagen 2017-2018
>> http://www.biowiskundedagen.ugent.be/
>>
>> -------------------------------
>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>>