[BioC] R: R: R: Probe-level analysis of exon arrays using xps

cstrato cstrato at aon.at
Mon Sep 12 21:08:25 CEST 2011


Dear Giovanni,

Late night seems to be a bad time to reply to questions.

Briefly, the reason for your crash is that by default rma() does import 
the table with the expression data as data.frame. This is usually fine 
but in your case with 185 exon arrays and exonlevel="core+extended" this 
table is too large for 4GB RAM so that R crashes.

The explanation and the workaround to prevent the crash is as follows:

If you look at the help ?rma you will see rma(..., add.data=TRUE). This 
setting will import the expression table as data.frame into slot 'data'. 
Thus in your case you need to do:

 > data.rma <- rma(data.exon, ..., add.data=FALSE)

The same is true if you want to run the medpol step:

 > data.mp.rma <- summarize.rma(data.qu.rma, "HuExonRMASum",
 > +              exonlevel="core+affx", add.data=FALSE)

Note that at a later time in a new R session you can still add the 
expression table to slot 'data':

 > data.rma <- attachExpr(data.rma)

Alternatively, you could add a subset of expression data only:

 > subdata.rma <- attachExpr(data.rma, treenames=c("Tree1.mdp", 
"Tree2.mdp"))


The same holds true for dabg.call. By default it will store the p-values 
in slot 'data' and the detection calls in slot 'detcall'. If you do not 
have enough memory you could do:

 > call.dabg <- dabg.call(data.exon, ..., add.data=FALSE)

You can still import the data afterwards, see: ?attachCall and ?attachPVal.

I hope that this could clarify your problems.

Best regards
Christian


On 9/11/11 8:38 PM, Lavorgna Giovanni wrote:
> Dear Christian,
>
> many thanks for your answer and for clarifying matters with the different ROOT releases.
>
> Yes, before I was using root_v5.26.00 without patch and I can confirm you that the pipeline you designed without the median-polish sumamrization ran without problems. However, I tried also the more classical classical rma/dabg approach on my dataset and this time the script crashed at the end of the rma step without writing the root and the text file.
>
> I also installed the latest xps_1.12.1 and root_v5.27.04 as you recommended. As usual, the script without the medpolish step went fine. Instead, the classical rma/dabg approach ran until the end of the RMA step, wrote the root and the text files and then crashed. However, I was still able to get dabg data by running this step before rma. So I guess I am still able to get what I am looking for.
>
> Thanks again for your help and for making xps available to our community.
>
> Giovanni
>
>
>
>
> -----------------------------------------------------------
> Giovanni Lavorgna
> DIBIT-HSR
> Via Olgettina, 58
> 20132 Milan Italy
> Tel: +39-02-2643-4776
> Fax: +39-02-2643-4767
> Email: giovanni.lavorgna at hsr.it
> ________________________________________
> Da: cstrato [cstrato at aon.at]
> Inviato: venerdì 2 settembre 2011 21.57
> A: Lavorgna Giovanni
> Cc: bioconductor at r-project.org
> Oggetto: Re: R: R: [BioC] Probe-level analysis of exon arrays using xps
>
> Dear Giovanni,
>
> I am glad to hear that my suggestions did work for you.
>
> Regarding your questions:
>
> ad 1, Yes, you can change to 'exonlevel="core+extended+affx"'. If you
> want you can even use 'exonlevel="all"' which is short for
> 'exonlevel="core+extended+full+affx"' (see ?exonLevel).
>
>
> ad 2, Thank you for your report about the memory problems with
> root_v5.28.00. However, please note that the currently recommended
> version in the Description file of xps is root_v5.27.04.
>
> Sadly, the production version root_v5.28.00 had a severe problem when
> writing a tree, which only appeared when running xps from within R but
> not when running xps as root macro, see:
> http://root.cern.ch/phpBB3/viewtopic.php?f=3&t=11887&p=51313&sid=5bbfb7ef1aba101060cc67de9417ea2e#p51313
> For this reason I could not use root_v5.28.00 for the current release
> and had to ask the BioC maintainers to install root_v5.27.04 on the BioC
> servers. Thus all users of xps which are downloading the xps binary must
> install root_v5.27.04 first, and I recommend always that people who are
> compiling the source themselves, use the officially recommended version
> of ROOT, too.
>
> Since you do not have a problem with root_v5.26.00 I am wondering if you
> have installed the patch release of root_v5.26.00, since the original
> version had a severe memory problem, too, which had only been solved in
> patch release root_v5.26.00c and later.
>
> BTW, the version of xps, xps_1.8.3, that you mention is pretty old,
> since the current verion for BioC2.8 is xps_1.12.1.
>
> I would recommend that you use xps_1.12.1 with root_v5.27.04, or you try
> to install the newest production version of ROOT, root_v5.30.01, which I
> am currently testing with my development version xps_1.13.7. In both
> cases I would be interested to know if you experience any memory
> problems with root.
>
> As you might already know, all versions of root can be downloaded from:
> ftp://root.cern.ch/root/
>
> I hope this information could clarify these issues.
>
> Best regards
> Christian
>
>
> On 9/2/11 11:08 AM, Lavorgna Giovanni wrote:
>> Dear Christian,
>>
>> I had the chance to use the whole pipeline you designed and it performed very well. Basically, I got the probe intensities I was looking using the whole 185 array dataset.  However, if you don't mind, I have a couple of more questions:
>>
>> 1) I would need to check also the probe intensities of a transcript whose existence is supported only by ESTs. Therefore, it falls in the "Extended" category. Would it be OK if I change the 'exonlevel' parameter in this line from:
>>
>> ### 1.step: background - rma
>>    >   data.bg.rma<- bgcorrect.rma(data.exon, "HuExonRMABgrd", filedir=datdir,
>> +                select="antigenomic", exonlevel="core+affx")
>>
>> to:
>>
>>    >   data.bg.rma<- bgcorrect.rma(data.exon, "HuExonRMABgrd", filedir=datdir,
>> +                select="antigenomic", exonlevel="core+extended+affx")
>>
>>
>> and in this other line from:
>>
>> ### 2step: normalization - quantile
>>    >   data.qu.rma<- normalize.quantiles(data.bg.rma, "HuExonRMANorm",
>> filedir=datdir ,
>> +                exonlevel="core+affx")
>>
>> to:
>>
>> ### 2step: normalization - quantile
>>    >   data.qu.rma<- normalize.quantiles(data.bg.rma, "HuExonRMANorm",
>> filedir=datdir ,
>> +                exonlevel="core+extended+affx")
>>
>>
>>
>> 2)  On my 4Gb Linux system, using xps version 1.8.3, I noticed some problems.
>> Since ROOT version 5.28, the background correction step steadily increases memory consumption, until the job is killed by the system after having completed 142 out of 185 arrays. Memory usage at this critical point is well over 80%.
>> Instead, up  to ROOT version 5.26, using the same test case, memory occupancy stays consistently under 20% during the run, with the script smoothly reaching the end. I suspect that a bug related to memory leakage might have been introduced in version 5.28. Probably, it has gone unnoticed until now because not many people are analyzing so many arrays and people that do so might be using systems with much more RAM than I.
>>     My test case is rather big and doesn't easily lend itself to a bug report to ROOT people.  However, you mentioned that you already observed a lot of disk swapping on your 2Gb system, using the 6 arrays breast and prostate data from the Affymetrix. I don't know exactly which root version you are using. However, if you  are using the latest version 5.30 or version 5.28 and by installing ROOT version 5.26  memory usage dramatically drops, this would confirm the bug existence and would be an excellent test-case to provide to people at CERN.
>> Please let me know if you would like me to provide more details about this issue.
>> Thanks again.
>>
>> Giovanni
>>
>> This is the session info (well, before crushing :-) )
>>
>>> sessionInfo()
>> R version 2.11.0 (2010-04-22)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>>    [1] LC_CTYPE=en_US       LC_NUMERIC=C         LC_TIME=en_US
>>    [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=en_US
>>    [7] LC_PAPER=en_US       LC_NAME=C            LC_ADDRESS=C
>> [10] LC_TELEPHONE=C       LC_MEASUREMENT=en_US LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.11.0
>>
>>
>>
>>
>>
>>
>> ____
>> Da: cstrato [cstrato at aon.at]
>> Inviato: domenica 21 agosto 2011 19.56
>> A: Lavorgna Giovanni
>> Cc: bioconductor at r-project.org
>> Oggetto: Re: R: [BioC] Probe-level analysis of exon arrays using xps
>>
>> Dear Giovanni,
>>
>> I will try to sketch the beginning of your pipeline using a subset of
>> the Affymetrix exon dataset:
>>
>> First, let me suggest to use the new annotation files version na32 which
>> Affymetrix have just released, to create the scheme file:
>> ### import libary and annotation files
>>    >   libdir<- "/Volumes/GigaDrive/Affy/libraryfiles"
>>    >   anndir<- "/Volumes/GigaDrive/Affy/Annotation"
>>    >   scmdir<- "/Volumes/GigaDrive/CRAN/Workspaces/Schemes/na32"
>>    >   scheme.exon<- import.exon.scheme("huex10stv2", filedir = scmdir,
>> +                file.path(libdir, "HuEx-1_0-st-v2_libraryfile",
>> "HuEx-1_0-st-r2", "HuEx-1_0-st-v2.r2.clf"),
>> +                file.path(libdir, "HuEx-1_0-st-v2_libraryfile",
>> "HuEx-1_0-st-r2", "HuEx-1_0-st-v2.r2.pgf"),
>> +                file.path(anndir, "Version11Jul",
>> "HuEx-1_0-st-v2.na32.hg19.probeset.csv"),
>> +                file.path(anndir, "Version11Jul",
>> "HuEx-1_0-st-v2.na32.hg19.transcript.csv"))
>>
>> For this example I import the breast and prostate data from the
>> Affymetrix exon dataset:
>> ### import CEL-files
>>    >   celdir<- "/Volumes/GigaDrive/ChipData/Exon/HuMixture"
>>    >   datdir<- getwd()
>>    >   celfiles<-
>> c("huex_wta_breast_A.CEL","huex_wta_breast_B.CEL","huex_wta_breast_C.CEL",
>> +
>> "huex_wta_prostate_A.CEL","huex_wta_prostate_B.CEL","huex_wta_prostate_C.CEL")
>>    >   celnames<-
>> c("BreastA","BreastB","BreastC","ProstateA","ProstateB","ProstateC")
>>    >   data.exon<- import.data(scheme.exon, "HuMixtureExon", filedir=datdir,
>> +              celdir=celdir, celfiles=celfiles, celnames=celnames)
>>
>> Now I suggest to start a new R-session for preprocessing:
>> ### first, load ROOT scheme file and ROOT data file
>>    >   scmdir<- "/Volumes/GigaDrive/CRAN/Workspaces/Schemes/na32"
>>    >   scheme.exon<- root.scheme(file.path(scmdir,"huex10stv2.root"))
>>    >   datdir<- getwd()
>>    >   data.exon<- root.data(scheme.exon,
>> file.path(datdir,"HuMixtureExon_cel.root"))
>>    >   str(data.exon)
>>
>> ### 1.step: background - rma
>>    >   data.bg.rma<- bgcorrect.rma(data.exon, "HuExonRMABgrd", filedir=datdir,
>> +                select="antigenomic", exonlevel="core+affx")
>>
>> ### 2step: normalization - quantile
>>    >   data.qu.rma<- normalize.quantiles(data.bg.rma, "HuExonRMANorm",
>> filedir=datdir ,
>> +                exonlevel="core+affx")
>>
>> ### 3.step: summarization - medpol (not necessary in your case)
>> #data.mp.rma<- summarize.rma(data.qu.rma, "HuExonRMASum", filedir=datdir,
>> #               exonlevel="core+affx")
>>
>> To dump the probes to a text file you simply do:
>> ### export normalized probes intensities
>>    >   export(data.qu.rma, treetype="cqu", varlist = "fInten",
>> +        outfile=paste(datdir, "BreastProstate_cqu.txt", sep="/"))
>>
>> Now you have table "BreastProstate_cqu.txt" containing the
>> (X,Y)-coordinates followed by the normalized intensities for each
>> sample. However, please note that for 6 samples this file has already a
>> size of 180 MB, thus for your 185 samples it will probably have a size
>> of more than 4 GB. Thus it is not quite clear to me how you want to proceed.
>>
>> Nevertheless, since you are interested in selected transcripts only, you
>> need to get the (X,Y)-coordinates for these transcripts in order to
>> extract the intensities from the table. Let us assume that you want to
>> get the (X,Y)-coordinates for the CD44 gene. For this purpose you have
>> three options:
>>
>> 1, If you are using the development version xps_1.13.6, which will be
>> available for download from the BioC development site in a few days, you
>> can do the following:
>> ### get internal UNIT_ID for CD44
>>    >   id<- symbol2unitID(scheme.exon, symbol="CD44",
>> unittype="transcript", as.list =TRUE)
>>    >   id
>> $`3326635`
>> [1] "185195"
>>
>> $`3326730`
>> [1] "185196"
>>
>> ### attach (x,y)-coordinates for all UNIT_IDs
>>    >   data.qu.rma<- attachDataXY(data.qu.rma)
>>
>> ### get (x,y)-coordinates for CD44
>>    >   xy<- indexUnits(data.qu.rma, which="core+affx", unitID=unlist(id))
>> Error in .local(object, ...) : only 1 of 2 UNIT_ID are valid
>>
>> The reason for this error is that only transcript_cluster_id "3326635"
>> belongs to level "core" while "3326730" belongs to level "extended".
>> Thus you need to do:
>>    >   xy<- indexUnits(data.qu.rma, which="core+affx", unitID=id[[1]])
>>    >   dim(xy)
>> [1] 103   4
>>    >   head(xy)
>>            UNIT_ID    X    Y      XY
>> 3151152  185195 1638  388  994919
>> 3151153  185195 2407   79  204648
>> 3151154  185195 2393  882 2260314
>> 3151155  185195 1728 1223 3132609
>> 3151156  185195 1843 1404 3596084
>> 3151157  185195 1369 2167 5548890
>>
>> Now you have the (X,Y)-coordinates for the CD44 gene which you can use
>> to get the normalized intensities from table "BreastProstate_cqu.txt".
>>
>>
>> 2, If you would have enough RAM and the new version xps_1.13.6 then the
>> most simple way would be to attach the data. In the following I will
>> attach only the first two trees, however this causes already a lot of
>> swapping on my Mac with 2GB RAM:
>> ### attach data for 2 trees
>>    >   treenames<- unlist(treeNames(data.qu.rma))
>>    >   data.qu.rma<- attachInten(data.qu.rma, treenames=treenames[1:2])
>>
>> ## get the normalized intensities for CD44
>>    >   data<- validData(data.qu.rma, which="core+affx", unitID=185195)
>>    >   dim(data)
>> [1] 103   2
>>    >   head(data)
>>            BreastA.cqu_MEAN BreastB.cqu_MEAN
>> 994919          29.51080          3.60245
>> 204648           4.94509         58.01030
>> 2260314         25.84620          2.86042
>> 3132609          2.91305          3.93297
>> 3596084        123.34400        147.01000
>> 5548890        434.51000        367.34400
>>
>> ### remove the data
>>    >   data.qu.rma<- removeInten(data.qu.rma)
>>
>>
>> 3, With the current version of xps extracting the (X,Y)-coordinates for
>> the CD44 gene is more complicated:
>> First you need to get the transcript_cluster_id for CD44:
>>    >   ann<- export(scheme.exon, treetype="ann",
>> varlist="fTranscriptID:fSymbol",
>> +        as.dataframe=TRUE, outfile="tmp_ann.txt")
>>    >   id<- split(ann[,"GeneSymbol"], ann[,"TranscriptClusterID"]);
>>    >   id<- lapply("CD44", function(x) names(which(id == x)));
>>    >   id
>> [[1]]
>> [1] "3326635" "3326730"
>>
>> Then you need to get the internal UNIT_ID:
>>    >   idx<- export(scheme.exon, treetype="idx",
>> varlist="fUnitName:fTranscriptID",
>> +        as.dataframe=TRUE, outfile="tmp_idx.txt")
>>    >   unitid<- split(idx[,"UnitName"], idx[,"UNIT_ID"]);
>>    >   unitid<- lapply(unlist(id), function(x) names(which(unitid == x)));
>>    >   unitid
>> [[1]]
>> [1] "185195"
>>
>> [[2]]
>> [1] "185196"
>>
>> Finally you need to get the (X,Y)-coordinates for "3326635" only:
>>    >   scm<- export(scheme.exon, treetype="scm", varlist="fUnitID:fX:fY:fMask",
>> +        as.dataframe=TRUE, outfile="tmp_scm.txt")
>>    >   xy<- scm[which(scm[,"UNIT_ID"] == unlist(unitid)[1]),]
>>    >   dim(xy)
>> [1] 365   4
>>
>> As you see dim(xy) has more rows than above, thus you need to get the
>> "core" subset only (see ?exonLevel):
>>    >   unique(xy[,"Mask"])
>> [1] 8192 1024 2048  256  512 4096
>>    >   xy<- rbind(xy[which(xy[,"Mask"] == 8192),], xy[which(xy[,"Mask"] ==
>> 1024),])
>>    >   dim(xy)
>> [1] 103   4
>>    >   head(xy)
>>            UNIT_ID    X    Y Mask
>> 3151152  185195 1638  388 8192
>> 3151153  185195 2407   79 8192
>> 3151154  185195 2393  882 8192
>> 3151155  185195 1728 1223 8192
>> 3151216  185195 1781   15 8192
>> 3151217  185195 1442  762 8192
>>
>> Now you have the (X,Y)-coordinates for the CD44 gene as above.
>>
>>
>> ad a) Remove probes that hybridize to multiple loci in the genome:
>>
>> I do not know how you want to remove these probes, however you can get
>> the sequences of all probes from the following table:
>>    >   export(scheme.exon, treetype="prb")
>>
>> Furthermore, when exporting the probeset annotation by:
>>    >   export(scheme.exon, treetype="anp")
>> you will see that the table contains column "CrossHybType" (see the Affy
>> README file for exon arrays). Only crosshyb_type = 1 (unique) does
>> contain unique probesets.
>>
>> Please let me know if this is the info you were looking for.
>>
>> Best regards
>> Christian
>>
>>
>> On 8/18/11 11:48 PM, Lavorgna Giovanni wrote:
>>> Dear Cristian,
>>> many thanks for your prompt answer. In my case, once I have the probe intensities, I would like to do the following two preliminary steps:
>>> a) Remove probes that hybridize to multiple loci in the genome.
>>> b) Remove probes that show a correlation coefficient below a certain threshold.
>>> Then, I would like to calculate the differential expression in diseased vs. healthy samples for a few transcripts of mine by averaging the ratio of the probe intensities. I hope that the increased granularity and the reduced background of this method can give a better resolution to my analysis. As I said before, similar probe selection methods have been already described (see for example Xing Y, Kapur K, Wong WH. PLoS ONE. 2006 20;1:e88) and applied to genome wide studies.
>>>
>>> In my case, after step a), I would like simply to dump the probes to a text file, select those of my interest and read them into to a spreadsheet in order to calculate the correlation coefficient and the fold-change of my transcripts. You already started to sketch the beginning of pipeline I should use: if I am not asking for too much, I would be grateful if you could elaborate it a little more to include also these final two steps.
>>> Thanks again for your assistance and keep up the good work.
>>> Giovanni
>>>
>>> ________________________________________
>>> Da: cstrato [cstrato at aon.at]
>>> Inviato: giovedì 18 agosto 2011 20.28
>>> A: Lavorgna Giovanni
>>> Cc: bioconductor at r-project.org
>>> Oggetto: Re: [BioC] Probe-level analysis of exon arrays using xps
>>>
>>> Dear Giovanni,
>>>
>>> In principle you could do the background and normalization steps
>>> separately, e.g.:
>>>     >    data.bg.rma<- bgcorrect.rma(data.exon, ...)
>>>     >    data.qu.rma<- normalize.quantiles(data.bg.rma, ...)
>>>
>>> Now you have the normalized probes, however, you cannot do any
>>> summarization such as median-polish or mean.
>>>
>>> It is not quite clear to me how you want to proceed with the normalized
>>> probe intensities?
>>>
>>> Best regards
>>> Christian
>>> _._._._._._._._._._._._._._._._._._
>>> C.h.r.i.s.t.i.a.n   S.t.r.a.t.o.w.a
>>> V.i.e.n.n.a           A.u.s.t.r.i.a
>>> e.m.a.i.l:        cstrato at aon.at
>>> _._._._._._._._._._._._._._._._._._
>>>
>>>
>>> On 8/18/11 7:08 PM, Lavorgna Giovanni wrote:
>>>>
>>>> As you guys know, there is a growing evidence showing that a probe-level analysis  (as opposed to a probeset-level or a gene-level analysis) could be useful in analyzing Exon arrays. I am currently analyzing several human exon chips (185!) on a 4 GB machine and I am using the xps package. I would like to stick to this software since it allows to manage several chips with the resources at hand and I was wondering if anyone has ever tried to perform probe level analysis using xps. Also, I would be grateful if anyone could point out alternative resources to perform this job.
>>>> Thanks in advance.
>>>> Giovanni
>>>>
>>>>
>>>>
>>>>
>>>> -----------------------------------------------------------------
>>>>
>>>> Dai il tuo 5XMILLE al San Raffaele. Basta una firma.
>>>> Se firmi per la ricerca sanitaria del San Raffaele di Milano, firmi per tutti.
>>>> C.F. 03 06 42 80 153.
>>>> INFO: 5xmille at hsr.it - www.5xmille.org
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> -----------------------------------------------------------------
>>>
>>> Dai il tuo 5XMILLE al San Raffaele. Basta una firma.
>>> Se firmi per la ricerca sanitaria del San Raffaele di Milano, firmi per tutti.
>>> C.F. 03 06 42 80 153.
>>> INFO: 5xmille at hsr.it - www.5xmille.org
>>>
>>
>



More information about the Bioconductor mailing list