[BioC] GEOquery Error : Retrieved files corrupted?
axel.klenk at actelion.com
Wed Feb 8 09:56:53 CET 2012
Dear Ying and Sean,
a wild guess based on a problem description that sounds all too familiar:
corruption of binary files is likely to occur if they are transferred via
FTP in text mode instead of binary mode from Linux/UNIX to Windows.
Hmmm, but then getGEOSuppFiles() would never have worked on Windows...
maybe something has changed recently in GEOquery or the underlying code
for file transfer?
Cheers,
- axel
Axel Klenk
Research Informatician
Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil /
Switzerland
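
A minimal sketch of how the text-mode guess above could be checked in R, assuming the corruption is the usual LF to CRLF translation: in that case the corrupted copy is larger than the intact one by exactly the number of line-feed bytes (0x0A) in the intact file. The sizes reported in the thread (27,433 KB from GEOquery versus 27,350 KB from the GEO website) are at least consistent with inserted bytes. The helper names `count_lf` and `looks_like_text_mode` are hypothetical, and the demo uses small temporary files rather than the real GSE10046_RAW.tar.

```r
# Count line-feed bytes (0x0A) in a file read as raw binary.
count_lf <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  sum(bytes == as.raw(0x0a))
}

# TRUE if `suspect` is larger than `intact` by exactly one byte per LF,
# which is what an LF -> CRLF text-mode translation would produce.
looks_like_text_mode <- function(intact, suspect) {
  extra <- file.info(suspect)$size - file.info(intact)$size
  extra > 0 && extra == count_lf(intact)
}

# Demo on temporary files: simulate the translation by hand.
intact  <- tempfile(fileext = ".bin")
suspect <- tempfile(fileext = ".bin")
payload <- as.raw(c(0x1f, 0x8b, 0x0a, 0x00, 0x0a, 0xff))
writeBin(payload, intact)

# Replace every 0x0A with 0x0D 0x0A, as a text-mode transfer would.
translated <- unlist(lapply(payload, function(b) {
  if (b == as.raw(0x0a)) as.raw(c(0x0d, 0x0a)) else b
}))
writeBin(translated, suspect)

result <- looks_like_text_mode(intact, suspect)
```

If such a check holds on a real download, passing `mode = "wb"` to base R's `download.file()` is the usual way to request a binary transfer on Windows; whether GEOquery 2.20.8 does so internally is not established in this thread.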
From: ying chen <ying_chen at live.com>
To: <sdavis2 at mail.nih.gov>
Cc: bioconductor at r-project.org
Date: 07.02.2012 20:18
Subject: Re: [BioC] GEOquery Error : Retrieved files corrupted?
Sent by: bioconductor-bounces at r-project.org
Hi Sean,

Thanks a lot for the help. I switched to Ubuntu on VirtualBox and now have
no problem with raw data retrieved through GEOquery. But I just repeated
the download in Windows 7 with R 2.14, and my problem is still there. At
least I can stick with Ubuntu for now.

Thanks,
Ying
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] GEOquery_2.20.8 Biobase_2.14.0

loaded via a namespace (and not attached):
[1] RCurl_1.9-5 XML_3.9-4
>
> Date: Tue, 7 Feb 2012 14:00:10 -0500
> Subject: Re: [BioC] GEOquery Error : Retrieved files corrupted?
> From: sdavis2 at mail.nih.gov
> To: ying_chen at live.com
> CC: bioconductor at r-project.org
>
> On Tue, Feb 7, 2012 at 11:42 AM, ying chen <ying_chen at live.com> wrote:
> >
> > Hi, I tried to retrieve a GEO dataset with the GEOquery package as
> > follows:
> > file <- getGEOSuppFiles('GSE10046')
> > But it seems that every raw data file I get by this method is
> > corrupted. For example, when I tried to extract GSE10046_RAW.tar, I
> > got the following error message: Can not open file
> > "H:\...\GSE10046_RAW.tar" as archive. The GSE10046_RAW.tar I got
> > through GEOquery is 27,433 KB. The same dataset retrieved from the GEO
> > website is 27,350 KB and I can extract it with no problem. I have
> > retrieved raw files for more than 70 datasets with GEOquery and all
> > have the same problem. Does anyone have a suggestion as to what went
> > wrong? Thanks a lot for the help! Ying
>
>
> Hi, Ying.
>
> I am not able to reproduce your error on either Mac or two flavors of
> linux. I don't have access to a Windows version of R, but I'll see if
> I can get access in the next few days to check.
>
> Sorry I can't be more helpful right now.
> Sean
>
>
>
> > > sessionInfo()
> > R version 2.14.0 (2011-10-31)
> > Platform: x86_64-pc-mingw32/x64 (64-bit)
> >
> > locale:
> > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
> > [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> > [5] LC_TIME=English_United States.1252
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > other attached packages:
> > [1] GEOquery_2.20.8 Biobase_2.14.0
> >
> > loaded via a namespace (and not attached):
> > [1] RCurl_1.9-5.1 XML_3.9-4.1
> >>
> >> From: ying_chen at live.com
> >> To: sdavis2 at mail.nih.gov
> >> Date: Mon, 6 Feb 2012 11:35:07 -0500
> >> CC: bioconductor at r-project.org
> >> Subject: Re: [BioC] GEOquery Error
> >>
> >>
> >> Hi Sean, Thanks a lot for the help. I checked my computer and I still
> >> have 253 GB of space left on my hard drive. I tried to retrieve the
> >> data over the weekend, but always had the same problem. I just ran it
> >> again as a test on 10 GSE IDs. At first it gave an error message, but
> >> it finished the first dataset. Then the program complained about
> >> failing to open the destfile, which seems odd to me, as this is the
> >> file the program is supposed to download. It now seems that I can
> >> download datasets one by one using getGEOSuppFiles, but it always
> >> fails if I use sapply with getGEOSuppFiles to download a list of
> >> datasets. Any suggestion? Thanks a lot for the help! Ying
> >> > files <- sapply(gseids[1:10],getGEOSuppFiles)
> >> Error in dir.create(GEO) : invalid 'path' argument
> >> [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010/"
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010//GSE30010_RAW.tar'
> >> ftp data connection made, file length 605009920 bytes
> >> opened URL
> >> downloaded 577.0 Mb
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010//GSE30010_discovery_clinical_info.txt.gz'
> >> ftp data connection made, file length 1785 bytes
> >> opened URL
> >> downloaded 1785 bytes
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010//GSE30010_validation_clinical_info.txt.gz'
> >> ftp data connection made, file length 1681 bytes
> >> opened URL
> >> downloaded 1681 bytes
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010//filelist.txt'
> >> ftp data connection made, file length 5871 bytes
> >> opened URL
> >> downloaded 5871 bytes
> >> Error in dir.create(GEO) : invalid 'path' argument
> >> [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE12790/"
> >> Error in download.file(file.path(url, i), destfile = file.path(storedir, :
> >> cannot open destfile 'H:/My_DataSets/BreastCancerDataSet/GSE12790/GSE12790_RAW.tar', reason 'No such file or directory'
> >> > sessionInfo()
> >> R version 2.14.0 (2011-10-31)
> >> Platform: x86_64-pc-mingw32/x64 (64-bit)
> >>
> >> locale:
> >> [1] LC_COLLATE=English_United States.1252
> >> [2] LC_CTYPE=English_United States.1252
> >> [3] LC_MONETARY=English_United States.1252
> >> [4] LC_NUMERIC=C
> >> [5] LC_TIME=English_United States.1252
> >>
> >> attached base packages:
> >> [1] stats graphics grDevices utils datasets methods base
> >>
> >> other attached packages:
> >> [1] GEOquery_2.20.8 Biobase_2.14.0 BiocInstaller_1.2.1
> >>
> >> loaded via a namespace (and not attached):
> >> [1] RCurl_1.9-5.1 tools_2.14.0 XML_3.9-4.1
> >> > files <- sapply(gseids[4:10],getGEOSuppFiles)
> >> Error in dir.create(GEO) : invalid 'path' argument
> >> [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195/"
> >> Error in function (type, msg, asError = TRUE) :
> >> Server denied you to change to the given directory
> >> > files <- getGEOSuppFiles('GSE9195')
> >> [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195/"
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195//GSE9195_RAW.tar'
> >> ftp data connection made, file length 658708480 bytes
> >> opened URL
> >> downloaded 628.2 Mb
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195//GSE9195_TAMVALIDATION.RData'
> >> ftp data connection made, file length 59288200 bytes
> >> opened URL
> >> downloaded 56.5 Mb
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195//GSE9195_TAMVALIDATION_README.txt'
> >> Error in download.file(file.path(url, i), destfile = file.path(storedir, :
> >> cannot open URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195//GSE9195_TAMVALIDATION_README.txt'
> >> >
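
One possible explanation for `Error in dir.create(GEO) : invalid 'path' argument`, offered as an assumption rather than a confirmed diagnosis: the printout of `gseids` quoted further down ends in `120 Levels: ...`, so it is a factor, and `sapply()` over a factor hands each element to the function as a one-element factor rather than a character string. Base R's `dir.create()` rejects a non-character path. The sketch below reproduces that with a stand-in function (`make_dir`, hypothetical) instead of `getGEOSuppFiles()`, so it runs without network access.

```r
# Two IDs stored as a factor, as the thread's printout suggests.
ids <- factor(c("GSE17907", "GSE30010"))

# sapply()/lapply() pass one-element factors, not strings.
classes <- sapply(ids, class)

# dir.create() on a factor fails; capture the message instead of stopping.
msg <- tryCatch(
  {
    dir.create(ids[1])
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Converting to character first avoids the problem.
base <- tempfile("geo_demo_")
dir.create(base)
make_dir <- function(id) dir.create(file.path(base, id))  # stand-in, not GEOquery code
ok <- sapply(as.character(ids), make_dir)
```

Under this assumption, the equivalent change in the thread's own call would be `sapply(as.character(gseids), getGEOSuppFiles)`; it would also fit the observation that single calls with a quoted ID succeed.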
> >> > Date: Thu, 2 Feb 2012 23:46:59 -0500
> >> > Subject: Re: [BioC] GEOquery Error
> >> > From: sdavis2 at mail.nih.gov
> >> > To: ying_chen at live.com
> >> > CC: bioconductor at r-project.org
> >> >
> >> > On Thu, Feb 2, 2012 at 11:37 PM, ying chen <ying_chen at live.com> wrote:
> >> > > Hi Sean,
> >> > >
> >> > > Thanks a lot for the suggestion. I just tried a simple test
> >> > > (> files <- getGEOSuppFiles("GSE23720")) and the problem is gone.
> >> > >
> >> > > But when I tried to get a lot of files at once, I got the following
> >> > > error message:
> >> > >
> >> > >> gseids
> >> > > [1] GSE17907 GSE30010 GSE12790 GSE20711 GSE28821 GSE18864 GSE9195 GSE29431
> >> > > [9] GSE14020 GSE7904 GSE18728 GSE15181 GSE16391 GSE12777 GSE23593 GSE22035
> >> > > [17] GSE19383 GSE10281 GSE21217 GSE29672 GSE14986 GSE15026 GSE12763 GSE11001
> >> > > [25] GSE14017 GSE22513 GSE7515 GSE28796 GSE26910 GSE23994 GSE19639 GSE19697
> >> > > [33] GSE15477 GSE10270 GSE3893 GSE13787 GSE11078 GSE8977 GSE21834 GSE6885
> >> > > [41] GSE24468 GSE20266 GSE21422 GSE3156 GSE22250 GSE18571 GSE11352 GSE7382
> >> > > [49] GSE13806 GSE8565 GSE15619 GSE8597 GSE29832 GSE11791 GSE5102 GSE28645
> >> > > [57] GSE32160 GSE28789 GSE18331 GSE23640 GSE23399 GSE9086 GSE22865 GSE26298
> >> > > [65] GSE15893 GSE20086 GSE11324 GSE5116 GSE10879 GSE25407 GSE7700 GSE18912
> >> > > [73] GSE15043 GSE27515 GSE19777 GSE21832 GSE18070 GSE11506 GSE23921 GSE23905
> >> > > [81] GSE32158 GSE28305 GSE25162 GSE28415 GSE9015 GSE6800 GSE6548 GSE32161
> >> > > [89] GSE24249 GSE30775 GSE26884 GSE24473 GSE20719 GSE17636 GSE18773 GSE18931
> >> > > [97] GSE18146 GSE16070 GSE16080 GSE11683 GSE10046 GSE9747 GSE15749 GSE22664
> >> > > [105] GSE21066 GSE9586 GSE17832 GSE11330 GSE17889 GSE12199 GSE28089 GSE31448
> >> > > [113] GSE10810 GSE9196 GSE22840 GSE33658 GSE25487 GSE22544 GSE27220 GSE11581
> >> > > 120 Levels: GSE10046 GSE10270 GSE10281 GSE10810 GSE10879 GSE11001 ... GSE9747
> >> > >> files <- sapply(gseids,getGEOSuppFiles,makeDirectory = TRUE, baseDir = getwd())
> >> > > Error in dir.create(GEO) : invalid 'path' argument
> >> > > [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE17907/"
> >> > > % Total % Received % Xferd Average Speed Time Time Time Current
> >> > > Dload Upload Total Spent Left Speed
> >> > > 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
> >> > > Warning: Failed to create the file
> >> > > Warning: /media/Passport01/My_DataSets/BreastCancerDataSet/GSE17907/GSE17907_RA
> >> > > Warning: W.tar: No such file or directory
> >> > > 0 328M 0 2896 0 0 3027 0 31:34:35 --:--:-- 31:34:35 3415
> >> > > curl: (23) Failed writing body (0 != 2896)
> >> > > % Total % Received % Xferd Average Speed Time Time Time Current
> >> > > Dload Upload Total Spent Left Speed
> >> > > 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
> >> > > Warning: Failed to create the file
> >> > > Warning: /media/Passport01/My_DataSets/BreastCancerDataSet/GSE17907/filelist.tx
> >> > > Warning: t: No such file or directory
> >> > > 24 5979 24 1448 0 0 2495 0 0:00:02 --:--:-- 0:00:02 3061
> >> > > curl: (23) Failed writing body (0 != 1448)
> >> > > Error in dir.create(GEO) : invalid 'path' argument
> >> > > In addition: Warning messages:
> >> > > 1: In download.file(file.path(url, i), destfile = file.path(storedir, :
> >> > > download had nonzero exit status
> >> > > 2: In download.file(file.path(url, i), destfile = file.path(storedir, :
> >> > > download had nonzero exit status
> >> > > [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010/"
> >> > > % Total % Received % Xferd Average Speed Time Time Time Current
> >> > > Dload Upload Total Spent Left Speed
> >> > > 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
> >> > > Warning: Failed to create the file
> >> > > Warning: /media/Passport01/My_DataSets/BreastCancerDataSet/GSE30010/GSE30010_RA
> >> > > Warning: W.tar: No such file or directory
> >> > > 0 576M 0 2896 0 0 5191 0 32:22:29 --:--:-- 32:22:29 6464
> >> > > curl: (23) Failed writing body (0 != 2896)
> >> > > % Total % Received % Xferd Average Speed Time Time Time Current
> >> > > Dload Upload Total Spent Left Speed
> >> > > 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
> >> > > Warning: Failed to create the file
> >> > > Warning: /media/Passport01/My_DataSets/BreastCancerDataSet/GSE30010/GSE30010_di
> >> > > Warning: scovery_clinical_info.txt.gz: No such file or directory
> >> > > 81 1785 81 1448 0 0 3009 0 --:--:-- --:--:-- --:--:-- 3506
> >> > > 81 1785 81 1448 0 0 1978 0 --:--:-- --:--:-- --:--:-- 1978
> >> > > curl: (23) Failed writing body (0 != 1448)
> >> >
> >> > It is hard to tell for sure, but I think you might be out of disk
> >> > space locally. When you get the error, check to see if you have space
> >> > left on the device to which you are saving. GEOquery should work fine
> >> > in a loop like this.
> >> >
> >> > Sean
> >> >
> >> >
> >> > > After I killed this job and tried:
> >> > >
> >> > >> file <- getGEOSuppFiles("GSE17907")
> >> > >
> >> > > I had no problem at all.
> >> > >
> >> > > I really do not know what's wrong with the sapply() setting.
> >> > >
> >> > > Any suggestion?
> >> > >
> >> > > Thanks a lot for the help!
> >> > >
> >> > > Ying
> >> > >
> >> > >> Date: Thu, 2 Feb 2012 12:48:56 -0500
> >> > >> Subject: Re: [BioC] GEOquery Error
> >> > >> From: sdavis2 at mail.nih.gov
> >> > >> To: ying_chen at live.com
> >> > >> CC: bioconductor at r-project.org
> >> > >
> >> > >>
> >> > >> On Thu, Feb 2, 2012 at 12:38 PM, ying chen <ying_chen at live.com> wrote:
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >> > Hi,
> >> > >> >
> >> > >> > I want to use the GEOquery package to get the raw files of a lot
> >> > >> > of GEO datasets at once
> >> > >> > ( > files <- sapply(mydata$GSE_ID, getGEOSuppFiles) ), but
> >> > >> > I got the following error message when I did a simple test run.
> >> > >> > Any suggestion?
> >> > >> >
> >> > >>
> >> > >> Hi, Ying.
> >> > >>
> >> > >> This is not a GEOquery issue. The directory housing the data is not
> >> > >> on the FTP site. NCBI GEO periodically rebuilds stuff on the site.
> >> > >> That might be occurring now. I'd suggest emailing NCBI GEO directly
> >> > >> if you are in a hurry. Alternatively, wait an hour or two to see if
> >> > >> the problem is resolved.
> >> > >>
> >> > >> Sean
> >> > >>
> >> > >>
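
Sean's advice to wait and try again can be automated. A minimal sketch, assuming a generic wrapper (`with_retry`, a hypothetical helper that is not part of GEOquery) that re-runs a function after a pause when it throws an error; the demo uses a deliberately flaky stand-in function so it runs offline.

```r
# Call f(); on error, wait `wait` seconds and try again, up to `times` attempts.
with_retry <- function(f, times = 3, wait = 0) {
  for (attempt in seq_len(times)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < times) Sys.sleep(wait)
  }
  stop("all ", times, " attempts failed")
}

# Flaky stand-in for a download: fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Server denied you to change to the given directory")
  "downloaded"
}

out <- with_retry(flaky, times = 5, wait = 0)
```

For a real run one might call `with_retry(function() getGEOSuppFiles("GSE23720"), times = 3, wait = 3600)`; the hour-long wait mirrors the suggestion above and is only an illustration.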
> >> > >> >> library(GEOquery)
> >> > >> > Loading required package: Biobase
> >> > >> > Welcome to Bioconductor
> >> > >> > Vignettes contain introductory material. To view, type
> >> > >> > 'browseVignettes()'. To cite Bioconductor, see
> >> > >> > 'citation("Biobase")' and for packages 'citation("pkgname")'.
> >> > >> > Setting options('download.file.method.GEOquery'='curl')
> >> > >> >> files <- getGEOSuppFiles("GSE23720")
> >> > >> > [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE23720/"
> >> > >> > Error in function (type, msg, asError = TRUE) :
> >> > >> > Server denied you to change to the given directory
> >> > >> >> sessionInfo()
> >> > >> > R version 2.14.1 (2011-12-22)
> >> > >> > Platform: x86_64-pc-linux-gnu (64-bit)
> >> > >> > locale:
> >> > >> > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> >> > >> > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> >> > >> > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> >> > >> > [7] LC_PAPER=C LC_NAME=C
> >> > >> > [9] LC_ADDRESS=C LC_TELEPHONE=C
> >> > >> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >> > >> > attached base packages:
> >> > >> > [1] stats graphics grDevices utils datasets methods base
> >> > >> > other attached packages:
> >> > >> > [1] GEOquery_2.20.8 Biobase_2.14.0
> >> > >> > loaded via a namespace (and not attached):
> >> > >> > [1] RCurl_1.9-5 XML_3.9-4
> >> > >> >>
> >> > >> >
> >> > >> >
> >> > >> > [[alternative HTML version deleted]]
> >> > >> >
> >> > >> > _______________________________________________
> >> > >> > Bioconductor mailing list
> >> > >> > Bioconductor at r-project.org
> >> > >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> > >> > Search the archives:
> >> > >> > http://news.gmane.org/gmane.science.biology.informatics.conductor