[BioC] GEOquery returns error "scan() expected 'an integer'"

Mon Oct 3 16:47:33 CEST 2011

2011/10/3 Timothée Flutre <timflutre at gmail.com>:
> Thanks a lot Sean for your help (and for providing us with GEOquery ;).
>
> However, in this case, I am not sure that using "combine" is enough to
> effectively put together the two files into a single object:
>
>> library(GEOquery)
>
>> gse = getGEO('GSE25935',destdir='.')
> Found 2 file(s)
> GSE25935_series_matrix-1.txt.gz
>  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                 Dload  Upload   Total   Spent    Left  Speed
> 100 50.3M  100 50.3M    0     0  11.4M      0  0:00:04  0:00:04 --:--:-- 13.0M
> File stored at:
> /tmp/Rtmpz2pEno/GPL4133.soft
> GSE25935_series_matrix-2.txt.gz
>  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                 Dload  Upload   Total   Spent    Left  Speed
> 100 44.5M  100 44.5M    0     0  1929k      0  0:00:23  0:00:23 --:--:-- 1920k
> Using locally cached version of GPL4133 found here:
> /tmp/Rtmpz2pEno/GPL4133.soft
>
> => There are some warnings "seek on a gzfile connection returned an internal
> error": can this cause a problem?

These are warnings due to changes in base R.  You can ignore them.

>> gse = combine(gse[[1]],gse[[2]])
> There were 12 warnings (use warnings() to see them)
>> warnings()
> Warning messages:
> 1: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>  Lengths (255, 209) differ (string compare on first 209)209 string
> mismatches
> 2: data frame column 'title' levels not all.equal
> 3: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>  Lengths (255, 209) differ (string compare on first 209)209 string
> mismatches
> 4: data frame column 'geo_accession' levels not all.equal
> 5: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>  Lengths (255, 209) differ (string compare on first 209)209 string
> mismatches
> 6: data frame column 'source_name_ch1' levels not all.equal
> 7: In alleq(levels(x[[nm]]), levels(y[[nm]])) : 59 string mismatches
> 8: data frame column 'characteristics_ch1.2' levels not all.equal
> 9: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>  Lengths (50, 48) differ (string compare on first 48)47 string mismatches
> 10: data frame column 'characteristics_ch1.3' levels not all.equal
> 11: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>  Lengths (255, 209) differ (string compare on first 209)209 string
> mismatches
> 12: data frame column 'supplementary_file' levels not all.equal
>
> => These warnings seem to indicate that the resulting object won't be well
> defined.
>
> And indeed:
>> Meta(gse)

gse is an ExpressionSet, so Meta will not work.

You probably want something like:

pData(gse)

> Error in function (classes, fdef, mtable)  :
>  unable to find an inherited method for function "Meta", for signature
> "ExpressionSet"
>
> The same happens when I try to extract a samples x probes matrix (which is
> really what I want here):
>> m <- matrix(nrow=nrow(Table(GSMList(gse)[[1]])),
> ncol=length(names(GSMList(gse))),
> dimnames=list(probe=Table(GSMList(gse)[[1]])$ID_REF,
> ind=names(GSMList(gse))))
> Error in Table(GSMList(gse)[[1]]) :
>  error in evaluating the argument 'object' in selecting a method for
> function 'Table': Error in function (classes, fdef, mtable)  :
>  unable to find an inherited method for function "GSMList", for signature
> "ExpressionSet"

There is no need to do any of this.  With GSEMatrix=TRUE, which has been
the default for the last couple of years, getGEO returns ExpressionSet
objects directly, so the GSMList/Table code quoted above is unnecessary.
If you want a samples x probes matrix, the data from the two approaches
will be equivalent, but one is clearly simpler than the other.
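
As a rough illustration of the samples x probes matrix discussed above, here is a minimal sketch. It assumes the per-file expression matrices share the same probes in the same order, which seems plausible for a single platform (GPL4133) but should be checked. The helper name samples_by_probes is made up for this example, and small plain matrices stand in for what Biobase's exprs() would return, so the block runs without downloading anything.

```r
# Hypothetical helper (not part of GEOquery): bind per-file expression
# matrices (probes x samples) column-wise, then transpose the result.
samples_by_probes <- function(mats) {
  probes <- rownames(mats[[1]])
  # Refuse to bind if any file has different probes or a different order.
  for (m in mats) stopifnot(identical(rownames(m), probes))
  t(do.call(cbind, mats))
}

# Toy stand-ins for exprs(gse[[1]]) and exprs(gse[[2]]).
part1 <- matrix(1:6, nrow = 3,
                dimnames = list(c("p1", "p2", "p3"), c("GSM_a", "GSM_b")))
part2 <- matrix(7:9, nrow = 3,
                dimnames = list(c("p1", "p2", "p3"), "GSM_c"))

m <- samples_by_probes(list(part1, part2))
dim(m)  # rows are samples, columns are probes

# With real data (assumption: gse is still the list returned by getGEO,
# i.e. before it is overwritten by combine()):
# m <- samples_by_probes(lapply(gse, Biobase::exprs))
```

Working on the expression matrices alone sidesteps the factor-level warnings that combine() reports for the sample annotation, at the cost of leaving that annotation to be handled separately.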

Sean

> I am not familiar with "combine", sorry. Do you think there is a simple way
> to fix this, i.e. to rebuild a single series record from these two files?
> Otherwise, I may have to parse the files by other means, I guess.
> Thanks,
> Tim
>
> On Mon, Oct 3, 2011 at 6:18 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
>> 2011/10/2 Timothée Flutre <timflutre at gmail.com>:
>> > Hello,
>> >
>> > I downloaded a dataset from the GEO at the NCBI and launched the following
>> > commands:
>> >> library(GEOquery)
>> >> gse <- getGEO(filename="GSE25935_family.soft.gz")
>> >
>> > Here is the error message I got:
>> > Parsing....
>> > Found 465 entities...
>> > GPL4133 (1 of 465 entities)
>> > GSM636943 (2 of 465 entities)
>> > ...
>> > GSM637180 (239 of 465 entities)
>> > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>> >  scan() expected 'an integer', got '5.845752745'
>> > Calls: getGEO ... .parseGSMWithLimits -> fastTabRead -> read.delim -> read.table -> scan
>> >
>> > Is the input file badly formatted?
>>
>> Sorry for the bug.  In order to read some of the larger files in GEO,
>> I borrowed a trick from the limma package: read just the first part of
>> the file to get the column types, then read the entire file after
>> telling R about the column types.  This can speed up reading large
>> files by an order of magnitude.  That is the background.
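
The two-pass idea described above, and the way it can fail, can be sketched in base R. This is only an illustration under stated assumptions, not GEOquery's actual fastTabRead: the function name two_pass_read, its arguments, and the toy file are invented here. The toy VALUE column has only NA and whole numbers in its first 100 rows, so the first pass guesses "integer", and a decimal further down then breaks the second pass. Widening integer guesses to numeric is one possible guard, not necessarily the fix that will go into the package.

```r
# Hypothetical helper: guess column classes from the first n_probe rows,
# then read the whole file with those classes fixed.
two_pass_read <- function(path, n_probe = 100, widen_integers = FALSE) {
  head_rows <- read.delim(path, nrows = n_probe, stringsAsFactors = FALSE)
  col_classes <- vapply(head_rows, function(x) class(x)[1], character(1))
  # Optional guard: treat guessed integer columns as numeric.
  if (widen_integers) col_classes[col_classes == "integer"] <- "numeric"
  read.delim(path, colClasses = col_classes, stringsAsFactors = FALSE)
}

# Toy table: the first 120 VALUE entries are NA or whole numbers,
# decimals appear only after the rows inspected in the first pass.
tab <- data.frame(
  ID_REF = paste0("P", 1:150),
  VALUE  = c(rep(NA, 60), rep(7, 60), rep(5.845752745, 30))
)
tmp <- tempfile(fileext = ".txt")
write.table(tab, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Naive two-pass read: the second pass fails inside scan().
failed <- tryCatch(two_pass_read(tmp), error = function(e) conditionMessage(e))

# Same read with integer guesses widened to numeric: succeeds.
ok <- two_pass_read(tmp, widen_integers = TRUE)
```

Inspecting more rows in the first pass would also avoid this particular failure, but any fixed row count can be defeated by a long enough run of unrepresentative values.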
>>
>> In this case, the problem arises from a sample (GSM637180) whose
>> first 178 records are missing values.  Since I read only the first
>> 100 rows, R assumes that this column is full of integers.  I'll
>> need to fix the code for table reading, but in the meantime, I would
>> suggest this as the workaround:
>>
>> gse = getGEO('GSE25935',destdir='.')
>> gse = combine(gse[[1]],gse[[2]])
>>
>> Using destdir in the getGEO call will allow you to reuse the
>> downloaded files (they are cached in the current directory, in other
>> words) in the case of having to run the code more than once.  The
>> combine() call is needed because NCBI GEO built the original series
>> matrix format to have at most 255 columns per file, so two such files
>> are needed to capture all the samples.
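
The numbers in this thread are consistent with that 255-column limit, as a quick check shows. The check assumes the 465 entities reported while parsing the SOFT file are one platform (GPL4133) plus samples only. The variable names are just for illustration.

```r
n_entities  <- 465   # "Found 465 entities" in the SOFT parsing log
n_platforms <- 1     # GPL4133 (assumption: the only non-sample entity)
n_samples   <- n_entities - n_platforms

max_cols <- 255      # per-file column limit described above
n_files  <- ceiling(n_samples / max_cols)

# Full files first, the remainder in the last file.
file_sizes <- c(rep(max_cols, n_files - 1),
                n_samples - max_cols * (n_files - 1))
file_sizes
```

Under that assumption the split is 255 and 209 samples, which matches the "Lengths (255, 209)" reported in the combine() warnings.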
>>
>> Hope that helps,
>> Sean
>>
>>
>> > Thanks for any help,
>> > TF
>> >
>> >> sessionInfo()
>> > R version 2.13.1 (2011-07-08)
>> > Platform: x86_64-redhat-linux-gnu (64-bit)
>> >
>> > locale:
>> >  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> >  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> >  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>> >  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> >
>> > attached base packages:
>> > [1] stats     graphics  grDevices utils     datasets  methods   base
>> >
>> > other attached packages:
>> > [1] GEOquery_2.19.4 Biobase_2.10.0
>> >
>> > loaded via a namespace (and not attached):
>> > [1] RCurl_1.5-0 XML_3.2-0
>> >
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >
>>
>