[BioC] GEOquery: getGEO() doesn't work (error "invalid 'nlines' argument")
James W. MacDonald
jmacdon at uw.edu
Tue May 29 17:03:35 CEST 2012
Hi Simone,
On 5/29/2012 10:25 AM, ecsi at gmx.net wrote:
> Hi Jim,
>
>> Why are you using system.file() in this context?
>
> Because there is an example in the GEOquery vignette ("2 Getting
> Started using GEOquery") which does it like this.
I see. That is one of the downsides of the vignette system - in order to
have a vignette work correctly, using some external data, those data
have to be parked somewhere in the package directory. An alternative
would be to have a separate data package, but that means end users have
to download one additional thing.
So the vignette uses that paradigm because the data being used are in
the package directory. However, as you note below, you _haven't_
downloaded data to the package directory, so system.file() isn't the way
to go. In other words, system.file() is only designed to locate files
inside the package directory of a given install of R - it is not
intended for reading files in general.
>
>> Did you really download the soft file to your GEOquery library
>> directory? That seems odd to me.
>
> I downloaded it to a local data repository in our network (it is
> obligatory to do it this way in this case).
>
> Why does it seem odd to you? Because I downloaded the soft file?
No, not that you downloaded the file, what seemed odd was that you were
using system.file(), which implies that you had downloaded the soft file
to a very specific place. Let me give you an example:
On my Linux box
> system.file(package="GEOquery")
[1] "/misc/staff/jmacdon/R-devel/library/GEOquery"
On my Windows box
> system.file(package="GEOquery")
[1] "C:/Users/bioinf_admin/R/win-library/2.14/GEOquery"
So when you use system.file() you are specifically telling GEOquery to
look for a file that is in your GEOquery library directory, rather than
telling GEOquery the actual directory where you saved the file. That is
what Sean was getting at in his response to you.
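To make the difference concrete, here is a minimal sketch in base R. It
uses the "base" package only because every R install has it; the data
directory and the file name are hypothetical placeholders, not where
your file actually is.

```r
# Directory of an installed package ("base" ships with every R install).
pkg_dir <- system.file(package = "base")

# A file that was never placed inside that package directory is not found:
# system.file() returns an empty string rather than a usable path.
not_found <- system.file("extdata", "GSE19711_family.soft.gz",
                         package = "base")

# For data stored elsewhere, build the path directly. This directory is a
# hypothetical stand-in for wherever the file was really saved.
data_dir  <- file.path(tempdir(), "geo_data")
soft_path <- file.path(data_dir, "GSE19711_family.soft.gz")

nzchar(pkg_dir)         # is there a package directory?
nzchar(not_found)       # did system.file() find the file in the package?
file.exists(soft_path)  # is the file at the location given directly?
```

The point is that an ordinary path built with file.path() (or typed as
a string) is what you want for files you saved yourself.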
> This was a recommendation of a colleague who works a lot with GEO, we
> thought the soft files would be the best option because they contain
> all the information available and furthermore they are available for
> all the GEO series I have to analyze. As I already wrote in reply to the
> answer of Sean, if there is any better way to do it, I will be happy
> to hear about it!
Sean already gave it to you. To further elaborate:
> mypath <- "C:/Users/bioinf_admin/Desktop/"
> GSE19711 <- getGEO('GSE19711',destdir=mypath)
This returns a list of ExpressionSets:
> length(GSE19711)
[1] 3
> GSE19711[[1]]
ExpressionSet (storageMode: lockedEnvironment)
assayData: 27578 features, 255 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: GSM491937 GSM491938 ... GSM492191 (255 total)
  varLabels: title geo_accession ... data_row_count (44 total)
  varMetadata: labelDescription
featureData
  featureNames: cg00000292 cg00002426 ... cg27665659 (27578 total)
  fvarLabels: ID Name ... ORF (38 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL8490
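If the SOFT file has to stay where you downloaded it, getGEO() can also
parse it in place through its filename argument. The following is only a
sketch: the path is a hypothetical placeholder, and the call is guarded
so that nothing happens unless GEOquery is installed and the file
exists. Note that a family SOFT file read this way typically comes back
as a GSE object rather than a list of ExpressionSets.

```r
# Hypothetical location of a SOFT file that was downloaded by hand.
soft_path <- file.path("path", "to", "data", "GSE19711_family.soft.gz")

if (requireNamespace("GEOquery", quietly = TRUE) && file.exists(soft_path)) {
  # 'filename' tells getGEO() to parse a local file instead of downloading.
  gse <- GEOquery::getGEO(filename = soft_path)
  print(class(gse))
} else {
  message("GEOquery not installed or file not found: ", soft_path)
}
```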
I doubt you will be able to automate too much of this, as the phenoData
slots for these ExpressionSets can contain whatever the experimenter
thought was interesting, in addition to what is required by GEO:
> names(pData(phenoData(GSE19711[[1]])))
 [1] "title"                   "geo_accession"
 [3] "status"                  "submission_date"
 [5] "last_update_date"        "type"
 [7] "channel_count"           "source_name_ch1"
 [9] "organism_ch1"            "characteristics_ch1"
[11] "characteristics_ch1.1"   "characteristics_ch1.2"
[13] "characteristics_ch1.3"   "characteristics_ch1.4"
[15] "characteristics_ch1.5"   "characteristics_ch1.6"
[17] "characteristics_ch1.7"   "characteristics_ch1.8"
[19] "characteristics_ch1.9"   "characteristics_ch1.10"
[21] "characteristics_ch1.11"  "characteristics_ch1.12"
[23] "characteristics_ch1.13"  "molecule_ch1"
[25] "extract_protocol_ch1"    "label_ch1"
[27] "label_protocol_ch1"      "taxid_ch1"
[29] "hyb_protocol"            "scan_protocol"
[31] "description"             "data_processing"
[33] "platform_id"             "contact_name"
[35] "contact_email"           "contact_phone"
[37] "contact_department"      "contact_institute"
[39] "contact_address"         "contact_city"
[41] "contact_zip/postal_code" "contact_country"
[43] "supplementary_file"      "data_row_count"
And we can then see what the characteristics are:
> head(pData(phenoData(GSE19711[[1]])), 2)[,11:23]
          characteristics_ch1.1           characteristics_ch1.2
GSM491937 agegroupatsampledraw: 65 to 70  ageatrecruitment: 68
GSM491938 agegroupatsampledraw: Over 75   ageatrecruitment: 81
          characteristics_ch1.3 characteristics_ch1.4     characteristics_ch1.5
GSM491937 ageatdiagnosis: 68    histology: Endometrioid   stage: Ic
GSM491938 ageatdiagnosis: 80    histology: Carcinosarcoma stage: IIIb
          characteristics_ch1.6 characteristics_ch1.7
GSM491937 grade: Grade 2        pre-treatment sample: Yes
GSM491938 grade: Grade 3        pre-treatment sample: No
          characteristics_ch1.8      characteristics_ch1.9
GSM491937 post-treatment sample: No  ca125: 1717
GSM491938 post-treatment sample: Yes ca125: 32.89
          characteristics_ch1.10 characteristics_ch1.11
GSM491937 batch: 1               beadchip_well: 4447820175_A
GSM491938 batch: 1               beadchip_well: 4447820175_B
          characteristics_ch1.12     characteristics_ch1.13
GSM491937 bs conversion c1: Grn 5706 bs conversion c2: Grn 5538
GSM491938 bs conversion c1: Grn 6861 bs conversion c2: Grn 6141
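Since each of these columns holds "key: value" strings, one series at a
time can be tidied up with a few lines of base R. This is a rough sketch
only: the small data frame copies three columns for the two samples
shown above, split_characteristic() is a hypothetical helper rather than
anything in GEOquery, and it assumes every sample uses the same key
within a column, which will not hold for every GEO series.

```r
# Two samples and three columns copied from the output above.
chars <- data.frame(
  characteristics_ch1.3 = c("ageatdiagnosis: 68", "ageatdiagnosis: 80"),
  characteristics_ch1.4 = c("histology: Endometrioid",
                            "histology: Carcinosarcoma"),
  characteristics_ch1.5 = c("stage: Ic", "stage: IIIb"),
  row.names = c("GSM491937", "GSM491938"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn one "key: value" column into a one-column
# data frame named after the key, holding only the values.
split_characteristic <- function(x) {
  key   <- unique(sub(":.*$", "", x))
  value <- sub("^[^:]*:[[:space:]]*", "", x)
  # Only safe if every sample uses the same key in this column.
  stopifnot(length(key) == 1)
  setNames(data.frame(value, stringsAsFactors = FALSE), key)
}

# Apply the helper to each column and bind the results side by side.
parsed <- do.call(cbind, unname(lapply(chars, split_characteristic)))
rownames(parsed) <- rownames(chars)
parsed
```

The values stay character strings, so something like ageatdiagnosis
would still need as.numeric() before use, and the stopifnot() will halt
on any column where the keys differ between samples.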
Does that help?
Best,
Jim
>
> Best,
> Simone
>
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099