[BioC] GEOquery error

James W. MacDonald jmacdon at uw.edu
Fri May 2 20:00:51 CEST 2014


After some further testing, this doesn't appear to be a general FTP
problem; it comes down to the getURL() step in getDirListing():

 > GEOquery:::getDirListing("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE9nnn/GSE9514/matrix/")
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE9nnn/GSE9514/matrix/
Error in function (type, msg, asError = TRUE)  : couldn't connect to host

But this works with other FTP sites, such as the one used in R's internet test file:

 > GEOquery:::getDirListing("ftp://ftp.stats.ox.ac.uk/pub/datasets/csb/")
ftp://ftp.stats.ox.ac.uk/pub/datasets/csb/
 [1] "HEADER.html"  "ch10.dat"     "ch10.sas"     "ch10.txt"
 [5] "ch11a.dat"    "ch11a.sas"    "ch11a.txt"    "ch11b.dat"
 [9] "ch11b.sas"    "ch11b.txt"    "ch12.dat.gz"  "ch12.sas"
[13] "ch12.txt"     "ch13.dat.gz"  "ch13.sas"     "ch13.txt"
[17] "ch14.dat"     "ch14.sas"     "ch14.txt"     "ch15.dat.gz"
[21] "ch15.sas"     "ch15.txt"     "ch16a.dat"    "ch16a.sas"
[25] "ch16a.txt"    "ch16b.dat"    "ch16b.sas"    "ch16b.txt"
[29] "ch17.dat"     "ch17.sas"     "ch17.txt"     "ch18a.dat"
[33] "ch18a.sas"    "ch18a.txt"    "ch18b.dat.gz" "ch18b.sas"
[37] "ch18b.txt"    "ch19.sas"     "ch19.txt"     "ch19a.dat.gz"
[41] "ch19b.dat.gz" "ch19c.dat.gz" "ch19d.dat.gz" "ch19e.dat.gz"
[45] "ch19f.dat.gz" "ch19g.dat.gz" "ch1a.dat"     "ch1a.sas"
[49] "ch1a.txt"     "ch1b.dat"     "ch1b.sas"     "ch1b.txt"
[53] "ch2.dat"      "ch2.sas"      "ch2.txt"      "ch20.dat.gz"
[57] "ch20.sas"     "ch20.txt"     "ch21a.dat.gz" "ch21a.sas"
[61] "ch21a.txt"    "ch21b.dat.gz" "ch21b.sas"    "ch21b.txt"
[65] "ch3a.dat"     "ch3a.sas"     "ch3a.txt"     "ch3b.dat"
[69] "ch3b.sas"     "ch3b.txt"     "ch4a.dat"     "ch4a.sas"
[73] "ch4a.txt"     "ch4b.dat"     "ch4b.sas"     "ch4b.txt"
[77] "ch5.dat.gz"   "ch5.sas"      "ch5.txt"      "ch6.dat"
[81] "ch6.sas"      "ch6.txt"      "ch7.dat.gz"   "ch7.sas"
[85] "ch7.txt"      "ch8.dat"      "ch8.sas"      "ch8.txt"
[89] "ch9.dat.gz"   "ch9.sas"      "ch9.txt"      "index.html"

or Ensembl:

 > GEOquery:::getDirListing("ftp://ftp.ensembl.org")
ftp://ftp.ensembl.org
[1] "ls-lR.gz"         "ls-lR.Z"               "pub"
[4] "quota.group"      "quota.user"            "update-sym-links"
[7] "update-sym-links_orig"

or other random US government ftp sites:

 > GEOquery:::getDirListing("ftp://ftp.wcc.nrcs.usda.gov")
ftp://ftp.wcc.nrcs.usda.gov
 [1] "BB_Test"      "data"         "downloads"    "fieldops"
 [5] "gis"          "images"       "pub"          "publications"
 [9] "snowschool"   "states"       "support"      "tmp"
[13] "watershed"    "wcs_info"     "welcome.msg"  "wntsc"

So I wonder if it is a change at NCBI?
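
The site-by-site comparison above could also be run in one call with a small helper that tries each URL and records either "ok" or the error message. This is only a sketch: probe_listings() is a hypothetical helper, not part of GEOquery, and its lister argument is assumed to be any function that takes a URL and either returns a listing or throws an error (GEOquery:::getDirListing, as used above, would be one candidate). The fake_lister() stand-in is there only so the example runs without a network connection.

```r
# Hypothetical helper (not part of GEOquery): call `lister` on each URL and
# record "ok" on success or the error message on failure.
probe_listings <- function(urls, lister) {
  vapply(urls, function(u) {
    tryCatch({
      lister(u)
      "ok"
    }, error = function(e) conditionMessage(e))
  }, character(1))
}

# Offline stand-in for a real lister, so the sketch runs without a network:
# it fails for one host and succeeds for everything else.
fake_lister <- function(u) {
  if (grepl("ncbi", u, fixed = TRUE)) stop("couldn't connect to host")
  c("file1", "file2")
}

res <- probe_listings(
  c("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE9nnn/GSE9514/matrix/",
    "ftp://ftp.ensembl.org"),
  fake_lister
)
print(res)
```

With a working connection, passing GEOquery:::getDirListing in place of fake_lister should give the same per-site picture as the individual calls above, assuming that function keeps signalling failures as R errors.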

Best,

Jim




On 5/2/2014 1:15 PM, James W. MacDonald wrote:
> Hi Sean,
>
> This all works on Linux, and obviously on MacOS for you, but on 
> Windows 7, not so much:
>
> > gpl <- getGEO("GPL90")
> File stored at:
> C:\Users\BIOINF~1\AppData\Local\Temp\Rtmp4UPr1i/GPL90.soft
> Warning message:
> In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = 
> getOption("download.file.method.GEOquery")) :
>   downloaded length 9476281 != reported length 200
>
> But the gpl object looks OK, so I guess the reported length is wrong.
>
> > geoq <- getGEO("GSE9514", GSEMatrix = FALSE)
> File stored at:
> C:\Users\BIOINF~1\AppData\Local\Temp\Rtmp4UPr1i/GSE9514.soft.gz
> Parsing....
> Found 9 entities...
> GPL90 (1 of 9 entities)
> GSM241146 (2 of 9 entities)
> GSM241147 (3 of 9 entities)
> GSM241148 (4 of 9 entities)
> GSM241149 (5 of 9 entities)
> GSM241150 (6 of 9 entities)
> GSM241151 (7 of 9 entities)
> GSM241152 (8 of 9 entities)
> GSM241153 (9 of 9 entities)
> There were 50 or more warnings (use warnings() to see the first 50)
>
> > geoq <- getGEO("GSE9514")
> ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE9nnn/GSE9514/matrix/
> Error in function (type, msg, asError = TRUE)  : couldn't connect to host
>
> > setInternet2(use=FALSE)
> > geoq <- getGEO("GSE9514")
> ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE9nnn/GSE9514/matrix/
> Error in function (type, msg, asError = TRUE)  :
>   Server denied you to change to the given directory
>
> Any suggestions? I can't find anything on the list archives that 
> helps. I am thinking it has something to do with Windows Firewall, as 
> I can get to
>
> http://ftp.ncbi.nlm.nih.gov/geo/series/GSE9nnn/GSE9514/matrix/
>
> using a browser, but not
>
> ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE9nnn/GSE9514/matrix/
>
> but setting up a specific rule under Windows Firewall to allow R.exe 
> ftp access doesn't seem to help.
>
> Best,
>
> Jim
>
>
>
>
> On 5/2/2014 12:20 PM, Sean Davis wrote:
>> Hi, again, James.
>>
>> NCBI is still checking into the issue (may have been a storm-related
>> issue), but your (simplified) example now works for me.
>>
>>> gpl = getGEO('GPL90')
>> File stored at:
>> /var/folders/21/8t47kwys6vqb8606kdfn71780000gn/T//RtmpQXZfrr/GPL90.soft
>>> sessionInfo()
>> R version 3.0.2 Patched (2014-01-22 r64855)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] GEOquery_2.28.0      Biobase_2.21.7       BiocGenerics_0.7.5
>> [4] BiocInstaller_1.12.0
>>
>> loaded via a namespace (and not attached):
>> [1] RCurl_1.95-4.1 XML_3.95-0.2
>>
>>
>> Sean
>>
>> On Thu, May 1, 2014 at 1:11 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>> Hi, James.
>>>
>>> Thanks for the report.  This is due to a change at NCBI.  I am
>>> checking with them to see if the change is meant to be permanent or is
>>> simply a transient issue.  I'll let everyone know as soon as I hear
>>> back from NCBI.
>>>
>>> Sean
>>>
>>>
>>> On Thu, May 1, 2014 at 9:19 AM, James W. MacDonald <jmacdon at uw.edu> wrote:
>>>> Hi Sean,
>>>>
>>>>> geoq <- getGEO("GSE9514")
>>>> ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE9nnn/GSE9514/matrix/
>>>> Found 1 file(s)
>>>> GSE9514_series_matrix.txt.gz
>>>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>>>                                  Dload  Upload   Total   Spent    Left  Speed
>>>> 100  378k  100  378k    0     0   204k      0  0:00:01  0:00:01 --:--:--  204k
>>>> File stored at:
>>>> /data3/tmp/RtmpkDXZzR/GPL90.soft
>>>> Error in xj[i] : only 0's may be mixed with negative subscripts
>>>>
>>>> And the error appears to come from this section in parseGPL():
>>>>
>>>> if (hasDataTable) {
>>>>          nLinesToRead <- NULL
>>>>          if (!is.null(n)) {
>>>>              nLinesToRead <- n - length(txt)
>>>>          }
>>>>          dat3 <- fastTabRead(con, n = nLinesToRead, quote = "")
>>>>          geoDataTable <- new("GEODataTable", columns = cols,
>>>>              table = dat3[1:(nrow(dat3) - 1), ])
>>>>      }
>>>>
>>>> There is no error trapping here for the case where fastTabRead()
>>>> returns a zero-row data.frame:
>>>>
>>>> debug: dat3 <- fastTabRead(con, n = nLinesToRead, quote = "")
>>>> Browse[3]> dim(dat3)
>>>> [1]  0 17
>>>> Browse[3]> dat3
>>>>  [1] ID                               ORF
>>>>  [3] SPOT_ID                          Species Scientific Name
>>>>  [5] Annotation Date                  Sequence Type
>>>>  [7] Sequence Source                  Target Description
>>>>  [9] Representative Public ID         Gene Title
>>>> [11] Gene Symbol                      ENTREZ_GENE_ID
>>>> [13] RefSeq Transcript ID             SGD accession number
>>>> [15] Gene Ontology Biological Process Gene Ontology Cellular Component
>>>> [17] Gene Ontology Molecular Function
>>>> <0 rows> (or 0-length row.names)
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>> -- 
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
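
On the earlier parseGPL() error quoted above: with a zero-row data.frame, 1:(nrow(dat3) - 1) evaluates to 1:-1, i.e. c(1, 0, -1), which mixes positive and negative subscripts and so fails. A minimal sketch of the failure and one possible guard follows. The toy dat3 and the use of seq_len() are only an illustration under my own assumptions, not the fix adopted in GEOquery.

```r
# Toy stand-in for the zero-row result of fastTabRead() (illustration only).
dat3 <- data.frame(ID = character(0), ORF = character(0))

# The index used in parseGPL() counts down when nrow(dat3) is 0.
idx <- 1:(nrow(dat3) - 1)
print(idx)

# Subsetting with mixed positive and negative subscripts is an error.
failed <- inherits(try(dat3[idx, ], silent = TRUE), "try-error")
print(failed)

# One possible guard: seq_len() never counts down, so zero rows stay zero rows.
n_keep <- max(nrow(dat3) - 1, 0)
safe <- dat3[seq_len(n_keep), , drop = FALSE]
print(dim(safe))

# The same expression on a non-empty table still drops the last row.
full <- data.frame(ID = c("a", "b", "c"), ORF = c("x", "y", "z"))
trimmed <- full[seq_len(max(nrow(full) - 1, 0)), , drop = FALSE]
print(trimmed)
```

Whether dropping the last row is still the right thing to do when the table is empty is a separate question; the guard only keeps the subsetting itself from failing.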

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099


