[BioC] Loading quantiles normalized root data in XPS
Paul Geeleher
paulgeeleher at gmail.com
Tue Feb 21 13:23:42 CET 2012
Hi Christian,
Thanks again for your very helpful replies. I think using the C++
macros might be the way I go once I get everything ironed out.
I have what I think is one last question. When try to convert gene
symbol to internal ID:
allInternalGeneIds[[i]] <- symbol2unitID(scheme.huex10stv2 ,
symbol=geneSym, unittype="transcript", as.list =TRUE)
I get the following message:
slot ‘unitname’ is empty, importing data from ‘HuEx-1_0-st-v2.idx’ ...
Which leads me to believe there is a way to attach "unitname"? I've
been unable to find how to to this though? I think if you know if this
is possible it could speed up my pipeline quite a bit, as I have to
repeat this step for every gene!
Thanks again,
Paul.
On 2/14/12, cstrato <cstrato at aon.at> wrote:
> Dear Paul,
>
> You could export the whole data as table using:
>
> export(data_qn, treename="*", treetype="cqu", varlist="fInten",
> outfile="data_cqu.txt",as.dataframe=FALSE)
>
> Then you have a huge text-file, which you can then use for your
> purposes. However, since you do not have enough memory to work with R
> you would need to either import only parts of the table into R, or use
> other languages such as perl for further analysis.
>
> For the DABG data you could use function export.call().
>
> Alternatively, you could also work completely from within ROOT, using
> C++ macros. Examples how to use xps from within ROOT are shown in
> directory xps/examples/macro4XPS.C
>
> Best regards
> Christian
>
>
> On 2/14/12 1:54 PM, Paul Geeleher wrote:
>> Hi Christian,
>>
>> Thanks for your replies.
>>
>> As per your suggestions I've been been able to map between xy
>> co-ordinates, probe IDs and PROBESET_IDS.
>>
>> I've been able to extract the expression level for a given gene (for a
>> subset of samples) using this:
>>
>> id<- symbol2unitID(scheme.huex10stv2 , symbol="CDH11",
>> unittype="transcript", as.list =TRUE)
>> treenames<- unlist(treeNames(data_qn))
>> data_qn<- attachInten(data_qn, treenames=treenames[1:3])
>> xyIntens<- validData(data_qn, unitID=unlist(id))
>>
>> With "validData()" is that it seems I can only get the expression
>> levels for samples which are "attached". This means I'd probably have
>> to attach each sample separately (which will take a long time) as I
>> don't have enough memory to attach all samples.
>>
>> I'm wondering if there is a way to extract this data without having to
>> attach the data? If something similar is possible for the dabg data
>> (at probeset level) that would also be a big help in speeding up my
>> pipeline (which will be complete if I can implement these last two
>> steps)!
>>
>> Thanks again for all your help its much appreciated,
>>
>> Paul.
>>
>>
>> On Tue, Feb 7, 2012 at 8:48 PM, cstrato<cstrato at aon.at> wrote:
>>> Dear Paul,
>>>
>>> see replies below:
>>>
>>>
>>> On 2/7/12 1:52 PM, Paul Geeleher wrote:
>>>>
>>>> Hi Christian, apologies for the lack of clarity in my previous email,
>>>> I'll try to clear this up, see replies below:
>>>>
>>>> On Mon, Feb 6, 2012 at 7:51 PM, cstrato<cstrato at aon.at> wrote:
>>>>>
>>>>> Dear Paul,
>>>>>
>>>>> I am afraid that it is not quite clear to me what you want to do.
>>>>>
>>>>> What do you mean with "obtain the expression levels of the probes"? Do
>>>>> you
>>>>> mean "probes" (i.e. oligos) or do you mean "probesets" as in your
>>>>> question
>>>>> about DABG?
>>>>
>>>>
>>>> I've read in and quantiles normalized probe level data for 176 arrays
>>>> and what I'm trying to do now is set every probe (individual probe,
>>>> not probeset) that is not detected above background to zero. But from
>>>> what I've read dabg.call() can only create p-values for probeset
>>>> level? So as compromise I'm now trying to set the expression of all
>>>> *individual probes* which belong to a *probeset* not detected above
>>>> background to zero. I hope that makes sense.
>>>>
>>>
>>>
>>> Did you read the exon array whitepapers which you can download from
>>> Affymetrix?
>>>
>>> Every probeset consists of 1-4 probes only, and every exon consists
>>> usually
>>> of 1-2 probesets. Each gene has a transcript_cluster_id, which consists
>>> of
>>> one or more probeset_ids. (You can see the mapping between ids using
>>> function export.scheme(..,treetype="anp",..)
>>>
>>> Since the smallest unit is the probeset, function dabg.call() will only
>>> work
>>> at the probeset (and transcript) level. If you set all probes of a
>>> probeset
>>> to zero you may loose an entire exon.
>>>
>>>
>>>
>>>>>
>>>>> How did you create "data_hapMap"? Maybe you could send me your complete
>>>>> code
>>>>> otherwise I am only able to guess.
>>>>
>>>>
>>>> "data_hapMap" was a mistake, sorry I copied the wrong object name it
>>>> should have been "data_qn" which is quantiles normalized probe level
>>>> data that I created like this:
>>>>
>>>> data_hapMap<- root.data(scheme.huex10stv2,
>>>> "/data2/paul/normalization_project/root_data/hapMap_root_data_cel.root")
>>>> #read raw data
>>>> data_qn<- normalize.quantiles(data_hapMap, "exon_quantiles",
>>>> filedir=rootdata_Dir, exonlevel="all") #quantiles normalize at *probe*
>>>> level
>>>>
>>>> So "data_qn" is the object I want to work with, based on your advice
>>>> I've figured out that to access the expression levels in "data_qn" I
>>>> need to use "attachInten()", then I can use "intensities()" to access
>>>> the expression levels:
>>>>
>>>> treenames<- unlist(treeNames(data_qn))
>>>> data_qn<- attachInten(data_qn, treenames=treenames[1:2]) #attach the
>>>> first two samples
>>>>
>>>>> head(intensity(data_qn))
>>>>
>>>> X Y GSM188869.cqu_MEAN GSM188870.cqu_MEAN
>>>> 1 0 0 0.0000 0.000
>>>> 2 1 0 0.0000 0.000
>>>> 3 2 0 0.0000 0.000
>>>> 4 3 0 0.0000 0.000
>>>> 5 4 0 0.0000 0.000
>>>> 6 5 0 85.3523 121.733
>>>>
>>>> Does the X column in the output above represent the probe ID? If so
>>>> and I have a mapping of probeset IDs to corresponding probe IDs, it
>>>> should be fairly straightforward to set probes that are not detected
>>>> above background to zero? Perhaps there is a more straightforward way
>>>> of doing this though?
>>>>
>>>
>>> No, (X,Y) are the coordinates of the single probes on the exon array. You
>>> can use function export.scheme(..,treetype="scm",..) to get the mapping
>>> between (X,Y) and an internal PROBESET_ID, e.g.:
>>>
>>> UNIT_ID X Y ProbeLength Mask EXON_ID PROBESET_ID
>>> 31 986 1674 25 512 31 31
>>> 31 1092 677 25 512 31 31
>>> 31 796 1862 25 512 31 31
>>> 31 917 193 25 512 31 31
>>> 31 341 1677 25 512 32 32
>>> 31 144 2250 25 512 32 32
>>> 31 689 262 25 512 32 32
>>> 31 579 1670 25 512 32 32
>>>
>>> Then you can use export.scheme(..,treetype="pbs",..) to map PROBESET_ID
>>> (=UNIT_ID) to the ProbesetID, e.g.:
>>>
>>> UNIT_ID ProbesetID NumCells NumAtoms NumSubunits
>>> UnitType
>>> 31 2315101 4 4 1 512
>>> 32 2315102 4 4 1 512
>>>
>>> As you see PROBESET_IDs 31 and 32 have each 4 probes and belong to the
>>> Affymetrix ProbesetIDs 2315101 and 2315102, respectively.
>>>
>>> You could also use functions indexUnits(), and unitID2probesetID() or
>>> unitID2transcriptID(), respectively.
>>>
>>> Best regards
>>> Christian
>>>
>>>
>>>
>>>>
>>>>>
>>>>> I guess that "data_hapMap" contains the raw data. For these the slot
>>>>> "data"
>>>>> is empty to save memory. So you need to use either attachData() or
>>>>> attachInten(). However since you are using exon arrays you may not have
>>>>> enough RAM, so it would be better to use function export() or
>>>>> export.data(),
>>>>> or attach only a subset, see help ?attachData. See also vignette
>>>>> xps.pdf
>>>>> (chapter 2.3).
>>>>
>>>>
>>>> I think RAM shouldn't be an issue if I attach the samples one at a
>>>> time? (I actually have access to a machine with 32gigs RAM but ideally
>>>> would like to get what I'm doing to run on a standard desktop, which
>>>> is actually why I'm using XPS!).
>>>>
>>>>>
>>>>> When you talk about "expression matrix", how did you create it? Maybe
>>>>> you
>>>>> could use function validExpr(), but w/o seeing your code it is hard to
>>>>> tell.
>>>>> For DABG there are functions pvalData() and presCall(), see the
>>>>> examples
>>>>> in
>>>>> help ?dabg.call.
>>>>
>>>>
>>>> Yes I've managed to use dabg.call() at probeset level and access the
>>>> p-values using pvalData() alright.
>>>>
>>>> Thanks again for all of your help and patience!
>>>>
>>>> Kind Regards,
>>>>
>>>> Paul.
>>>>
>>>>
>>>>>
>>>>> Best regards
>>>>> Christian
>>>>>
>>>>>
>>>>>
>>>>> On 2/6/12 4:33 PM, Paul Geeleher wrote:
>>>>>>
>>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> Thanks for your quick and informative reply.
>>>>>>
>>>>>> I have re-run the analysis and saved the R objects as you suggest. The
>>>>>> next thing I'm trying to do is to obtain the expression levels of the
>>>>>> probes, but this doesn't seem to be working for me:
>>>>>>
>>>>>>> a<- validData(data_hapMap)
>>>>>>
>>>>>>
>>>>>> Error in .local(object, ...) : slot "data" has no data
>>>>>>
>>>>>> Based on the documentation I think validData() is the correct
>>>>>> function.
>>>>>>
>>>>>> I've also performed probeset level DABG and I'm trying to set
>>>>>> individual probes which belong to probesets with DABG< .05 to 0
>>>>>> in
>>>>>> the expression matrix.
>>>>>>
>>>>>> But it seems I can't see the expression matrix using validData().
>>>>>> Perhaps there is another function. Any ideas?
>>>>>>
>>>>>> Thank you again for your help with this, I'm very grateful!
>>>>>>
>>>>>> Paul.
>>>>>>
>>>>>> On 2/2/12, cstrato<cstrato at aon.at> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Dear Paul,
>>>>>>>
>>>>>>> The functions root.data(), root.call() and root.expr() were created
>>>>>>> to
>>>>>>> allow you access to the corresponding root files just in case that
>>>>>>> you
>>>>>>> did not save your R session.
>>>>>>>
>>>>>>> In the cases where you compute expression levels stepwise, or only
>>>>>>> part
>>>>>>> of them such as normalize.quantiles(), as seems to be the matter in
>>>>>>> your
>>>>>>> case, there is no corresponding root.xxx() function to access the
>>>>>>> root
>>>>>>> file directly. In these cases you need to save your R session to have
>>>>>>> continued access to the resulting root file.
>>>>>>>
>>>>>>> Please note that saving the R session is the usual case to have
>>>>>>> access
>>>>>>> to the root files.
>>>>>>>
>>>>>>> Best regards
>>>>>>> Christian
>>>>>>>
>>>>>>>
>>>>>>> On 2/2/12 1:12 PM, Paul Geeleher wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>> Thanks for your quick reply. I check what kind of trees I have using
>>>>>>>> "getTreeNames()" as you'd suggested, it seems they are of type "cqu"
>>>>>>>> rather than "int", this is presumably because my analysis required
>>>>>>>> no
>>>>>>>> background correction step?
>>>>>>>>
>>>>>>>> So I then tried:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> data_qn<- root.expr(scheme.huex10stv2, "exon_quantiles.root",
>>>>>>>>> "cqu")
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> but that gives me a huge number of errors that look like this:
>>>>>>>>
>>>>>>>> Error in<TFile::cd>: Unknown directory PreprocesSet
>>>>>>>> Error: Could not get directory<PreprocesSet>.
>>>>>>>> Error in<TFile::cd>: Unknown directory PreprocesSet
>>>>>>>> Error: Could not get directory<PreprocesSet>.
>>>>>>>> Error in<TFile::cd>: Unknown directory PreprocesSet
>>>>>>>> Error: Could not get directory<PreprocesSet>.
>>>>>>>> Error: Could not get tree<ExportSet>.
>>>>>>>> Error in root.expr(scheme.huex10stv2, "exon_quantiles.root", :
>>>>>>>> error in function ‘ExportData’
>>>>>>>>
>>>>>>>>
>>>>>>>> This file "exon_quantiles.root" definitely exists in the current
>>>>>>>> working directory though... Thanks again for your help!
>>>>>>>>
>>>>>>>> Paul.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Feb 1, 2012 at 9:01 PM, cstrato<cstrato at aon.at> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dear Paul,
>>>>>>>>>
>>>>>>>>> Please have a look at the help ?root.expr.
>>>>>>>>>
>>>>>>>>> If I understand you correctly, you did only do quantile
>>>>>>>>> normalization?
>>>>>>>>>
>>>>>>>>> To see the tree names in your file you should do:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> getTreeNames("exon_quantiles.root")
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> You will probably see trees with extension "int", see help
>>>>>>>>> ?validTreetype.
>>>>>>>>>
>>>>>>>>> To load these trees you need to do:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> data_qn<- root.expr(scheme.huex10stv2, "exon_quantiles.root",
>>>>>>>>>> "int")
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please let me know if this did solve your problem.
>>>>>>>>>
>>>>>>>>> Best regards
>>>>>>>>> Christian
>>>>>>>>> _._._._._._._._._._._._._._._._._._
>>>>>>>>> C.h.r.i.s.t.i.a.n S.t.r.a.t.o.w.a
>>>>>>>>> V.i.e.n.n.a A.u.s.t.r.i.a
>>>>>>>>> e.m.a.i.l: cstrato at aon.at
>>>>>>>>> _._._._._._._._._._._._._._._._._._
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2/1/12 7:07 PM, Paul Geeleher wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I've used xps to quantiles normalize (at probe level) some Affy
>>>>>>>>>> Exon
>>>>>>>>>> Array data. I now have a "root" file called "exon_quantiles.root",
>>>>>>>>>> but
>>>>>>>>>> if I try to load it the same was I'd load my raw data (using the
>>>>>>>>>> scheme file I created for Affy exon arrays) I get the error below?
>>>>>>>>>> I
>>>>>>>>>> can load my raw data just fine though. Any ideas? Do I perhaps
>>>>>>>>>> need
>>>>>>>>>> a
>>>>>>>>>> different "root scheme" file for this normalized data?
>>>>>>>>>> Unfortunately,
>>>>>>>>>> I haven't been able to find an answer.
>>>>>>>>>>
>>>>>>>>>>> scheme.huex10stv2<- root.scheme("huex10stv2.root")
>>>>>>>>>>> data_qn<- root.data(scheme.huex10stv2, "exon_quantiles.root")
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Error in if (chipname != treetitle) { : argument is of length zero
>>>>>>>>>>
>>>>>>>>>> Hope someone can help,
>>>>>>>>>>
>>>>>>>>>> Paul.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> sessionInfo()
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> R version 2.11.0 (2010-04-22)
>>>>>>>>>> x86_64-redhat-linux-gnu
>>>>>>>>>>
>>>>>>>>>> locale:
>>>>>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>>>>>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>>>>>>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>>>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>>>>>
>>>>>>>>>> attached base packages:
>>>>>>>>>> [1] stats graphics grDevices utils datasets methods
>>>>>>>>>> base
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
--
Paul Geeleher (PhD Student)
School of Mathematics, Statistics and Applied Mathematics
National University of Ireland
Galway
Ireland
--
www.bioinformaticstutorials.com
More information about the Bioconductor
mailing list