[BioC] Loading quantiles normalized root data in XPS

Paul Geeleher paulgeeleher at gmail.com
Tue Feb 14 13:54:56 CET 2012


Hi Christian,

Thanks for your replies.

As per your suggestions I've been been able to map between xy
co-ordinates, probe IDs and PROBESET_IDS.

I've been able to extract the expression level for a given gene (for a
subset of samples) using this:

id <- symbol2unitID(scheme.huex10stv2 , symbol="CDH11",
unittype="transcript", as.list =TRUE)
treenames <- unlist(treeNames(data_qn))
data_qn <- attachInten(data_qn, treenames=treenames[1:3])
xyIntens <- validData(data_qn, unitID=unlist(id))

With "validData()" is that it seems I can only get the expression
levels for samples which are "attached". This means I'd probably have
to attach each sample separately (which will take a long time) as I
don't have enough memory to attach all samples.

I'm wondering if there is a way to extract this data without having to
attach the data? If something similar is possible for the dabg data
(at probeset level) that would also be a big help in speeding up my
pipeline (which will be complete if I can implement these last two
steps)!

Thanks again for all your help its much appreciated,

Paul.


On Tue, Feb 7, 2012 at 8:48 PM, cstrato <cstrato at aon.at> wrote:
> Dear Paul,
>
> see replies below:
>
>
> On 2/7/12 1:52 PM, Paul Geeleher wrote:
>>
>> Hi Christian, apologies for the lack of clarity in my previous email,
>> I'll try to clear this up, see replies below:
>>
>> On Mon, Feb 6, 2012 at 7:51 PM, cstrato<cstrato at aon.at>  wrote:
>>>
>>> Dear Paul,
>>>
>>> I am afraid that it is not quite clear to me what you want to do.
>>>
>>> What do you mean with "obtain the expression levels of the probes"? Do
>>> you
>>> mean "probes" (i.e. oligos) or do you mean "probesets" as in your
>>> question
>>> about DABG?
>>
>>
>> I've read in and quantiles normalized probe level data for 176 arrays
>> and what I'm trying to do now is set every probe (individual probe,
>> not probeset) that is not detected above background to zero. But from
>> what I've read dabg.call() can only create p-values for probeset
>> level? So as compromise I'm now trying to set the expression of all
>> *individual probes* which belong to a *probeset* not detected above
>> background to zero. I hope that makes sense.
>>
>
>
> Did you read the exon array whitepapers which you can download from
> Affymetrix?
>
> Every probeset consists of 1-4 probes only, and every exon consists usually
> of 1-2 probesets. Each gene has a transcript_cluster_id, which consists of
> one or more probeset_ids. (You can see the mapping between ids using
> function export.scheme(..,treetype="anp",..)
>
> Since the smallest unit is the probeset, function dabg.call() will only work
> at the probeset (and transcript) level. If you set all probes of a probeset
> to zero you may loose an entire exon.
>
>
>
>>>
>>> How did you create "data_hapMap"? Maybe you could send me your complete
>>> code
>>> otherwise I am only able to guess.
>>
>>
>> "data_hapMap" was a mistake, sorry I copied the wrong object name it
>> should have been "data_qn" which is quantiles normalized probe level
>> data that I created like this:
>>
>> data_hapMap<- root.data(scheme.huex10stv2,
>> "/data2/paul/normalization_project/root_data/hapMap_root_data_cel.root")
>> #read raw data
>> data_qn<- normalize.quantiles(data_hapMap, "exon_quantiles",
>> filedir=rootdata_Dir, exonlevel="all") #quantiles normalize at *probe*
>> level
>>
>> So "data_qn" is the object I want to work with, based on your advice
>> I've figured out that to access the expression levels in "data_qn" I
>> need to use "attachInten()", then I can use "intensities()" to access
>> the expression levels:
>>
>> treenames<- unlist(treeNames(data_qn))
>> data_qn<- attachInten(data_qn, treenames=treenames[1:2]) #attach the
>> first two samples
>>
>>> head(intensity(data_qn))
>>
>>   X Y GSM188869.cqu_MEAN GSM188870.cqu_MEAN
>> 1 0 0             0.0000              0.000
>> 2 1 0             0.0000              0.000
>> 3 2 0             0.0000              0.000
>> 4 3 0             0.0000              0.000
>> 5 4 0             0.0000              0.000
>> 6 5 0            85.3523            121.733
>>
>> Does the X column in the output above represent the probe ID? If so
>> and I have a mapping of probeset IDs to corresponding probe IDs, it
>> should be fairly straightforward to set probes that are not detected
>> above background to zero? Perhaps there is a more straightforward way
>> of doing this though?
>>
>
> No, (X,Y) are the coordinates of the single probes on the exon array. You
> can use function export.scheme(..,treetype="scm",..) to get the mapping
> between (X,Y) and an internal PROBESET_ID, e.g.:
>
> UNIT_ID X       Y       ProbeLength     Mask    EXON_ID PROBESET_ID
> 31      986     1674    25      512     31      31
> 31      1092    677     25      512     31      31
> 31      796     1862    25      512     31      31
> 31      917     193     25      512     31      31
> 31      341     1677    25      512     32      32
> 31      144     2250    25      512     32      32
> 31      689     262     25      512     32      32
> 31      579     1670    25      512     32      32
>
> Then you can use export.scheme(..,treetype="pbs",..) to map PROBESET_ID
> (=UNIT_ID) to the ProbesetID, e.g.:
>
> UNIT_ID ProbesetID      NumCells        NumAtoms        NumSubunits
> UnitType
> 31      2315101 4       4       1       512
> 32      2315102 4       4       1       512
>
> As you see PROBESET_IDs 31 and 32 have each 4 probes and belong to the
> Affymetrix ProbesetIDs 2315101 and 2315102, respectively.
>
> You could also use functions indexUnits(), and unitID2probesetID() or
> unitID2transcriptID(), respectively.
>
> Best regards
> Christian
>
>
>
>>
>>>
>>> I guess that "data_hapMap" contains the raw data. For these the slot
>>> "data"
>>> is empty to save memory. So you need to use either attachData() or
>>> attachInten(). However since you are using exon arrays you may not have
>>> enough RAM, so it would be better to use function export() or
>>> export.data(),
>>> or attach only a subset, see help ?attachData. See also vignette xps.pdf
>>> (chapter 2.3).
>>
>>
>> I think RAM shouldn't be an issue if I attach the samples one at a
>> time? (I actually have access to a machine with 32gigs RAM but ideally
>> would like to get what I'm doing to run on a standard desktop, which
>> is actually why I'm using XPS!).
>>
>>>
>>> When you talk about "expression matrix", how did you create it? Maybe you
>>> could use function validExpr(), but w/o seeing your code it is hard to
>>> tell.
>>> For DABG there are functions pvalData() and presCall(), see the examples
>>> in
>>> help ?dabg.call.
>>
>>
>> Yes I've managed to use dabg.call() at probeset level and access the
>> p-values using pvalData() alright.
>>
>> Thanks again for all of your help and patience!
>>
>> Kind Regards,
>>
>> Paul.
>>
>>
>>>
>>> Best regards
>>> Christian
>>>
>>>
>>>
>>> On 2/6/12 4:33 PM, Paul Geeleher wrote:
>>>>
>>>>
>>>> Hi Christian,
>>>>
>>>> Thanks for your quick and informative reply.
>>>>
>>>> I have re-run the analysis and saved the R objects as you suggest. The
>>>> next thing I'm trying to do is to obtain the expression levels of the
>>>> probes, but this doesn't seem to be working for me:
>>>>
>>>>> a<- validData(data_hapMap)
>>>>
>>>>
>>>> Error in .local(object, ...) : slot "data" has no data
>>>>
>>>> Based on the documentation I think validData() is the correct function.
>>>>
>>>> I've also performed probeset level DABG and I'm trying to set
>>>> individual probes which belong to probesets with DABG<    .05 to 0 in
>>>> the expression matrix.
>>>>
>>>> But it seems I can't see the expression matrix using validData().
>>>> Perhaps there is another function. Any ideas?
>>>>
>>>> Thank you again for your help with this, I'm very grateful!
>>>>
>>>> Paul.
>>>>
>>>> On 2/2/12, cstrato<cstrato at aon.at>    wrote:
>>>>>
>>>>>
>>>>> Dear Paul,
>>>>>
>>>>> The functions root.data(), root.call() and root.expr() were created to
>>>>> allow you access to the corresponding root files just in case that you
>>>>> did not save your R session.
>>>>>
>>>>> In the cases where you compute expression levels stepwise, or only part
>>>>> of them such as normalize.quantiles(), as seems to be the matter in
>>>>> your
>>>>> case, there is no corresponding root.xxx() function to access the root
>>>>> file directly. In these cases you need to save your R session to have
>>>>> continued access to the resulting root file.
>>>>>
>>>>> Please note that saving the R session is the usual case to have access
>>>>> to the root files.
>>>>>
>>>>> Best regards
>>>>> Christian
>>>>>
>>>>>
>>>>> On 2/2/12 1:12 PM, Paul Geeleher wrote:
>>>>>>
>>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> Thanks for your quick reply. I check what kind of trees I have using
>>>>>> "getTreeNames()" as you'd suggested, it seems they are of type "cqu"
>>>>>> rather than "int", this is presumably because my analysis required no
>>>>>> background correction step?
>>>>>>
>>>>>> So I then tried:
>>>>>>>
>>>>>>>
>>>>>>> data_qn<- root.expr(scheme.huex10stv2, "exon_quantiles.root", "cqu")
>>>>>>
>>>>>>
>>>>>>
>>>>>> but that gives me a huge number of errors that look like this:
>>>>>>
>>>>>> Error in<TFile::cd>: Unknown directory PreprocesSet
>>>>>> Error: Could not get directory<PreprocesSet>.
>>>>>> Error in<TFile::cd>: Unknown directory PreprocesSet
>>>>>> Error: Could not get directory<PreprocesSet>.
>>>>>> Error in<TFile::cd>: Unknown directory PreprocesSet
>>>>>> Error: Could not get directory<PreprocesSet>.
>>>>>> Error: Could not get tree<ExportSet>.
>>>>>> Error in root.expr(scheme.huex10stv2, "exon_quantiles.root",  :
>>>>>>    error in function ‘ExportData’
>>>>>>
>>>>>>
>>>>>> This file "exon_quantiles.root" definitely exists in the current
>>>>>> working directory though... Thanks again for your help!
>>>>>>
>>>>>> Paul.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 1, 2012 at 9:01 PM, cstrato<cstrato at aon.at>     wrote:
>>>>>>>
>>>>>>>
>>>>>>> Dear Paul,
>>>>>>>
>>>>>>> Please have a look at the help ?root.expr.
>>>>>>>
>>>>>>> If I understand you correctly, you did only do quantile
>>>>>>> normalization?
>>>>>>>
>>>>>>> To see the tree names in your file you should do:
>>>>>>>>
>>>>>>>>
>>>>>>>> getTreeNames("exon_quantiles.root")
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> You will probably see trees with extension "int", see help
>>>>>>> ?validTreetype.
>>>>>>>
>>>>>>> To load these trees you need to do:
>>>>>>>>
>>>>>>>>
>>>>>>>> data_qn<- root.expr(scheme.huex10stv2, "exon_quantiles.root", "int")
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Please let me know if this did solve your problem.
>>>>>>>
>>>>>>> Best regards
>>>>>>> Christian
>>>>>>> _._._._._._._._._._._._._._._._._._
>>>>>>> C.h.r.i.s.t.i.a.n   S.t.r.a.t.o.w.a
>>>>>>> V.i.e.n.n.a           A.u.s.t.r.i.a
>>>>>>> e.m.a.i.l:        cstrato at aon.at
>>>>>>> _._._._._._._._._._._._._._._._._._
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2/1/12 7:07 PM, Paul Geeleher wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've used xps to quantiles normalize (at probe level) some Affy Exon
>>>>>>>> Array data. I now have a "root" file called "exon_quantiles.root",
>>>>>>>> but
>>>>>>>> if I try to load it the same was I'd load my raw data (using the
>>>>>>>> scheme file I created for Affy exon arrays) I get the error below? I
>>>>>>>> can load my raw data just fine though. Any ideas? Do I perhaps need
>>>>>>>> a
>>>>>>>> different "root scheme" file for this normalized data?
>>>>>>>> Unfortunately,
>>>>>>>> I haven't been able to find an answer.
>>>>>>>>
>>>>>>>>> scheme.huex10stv2<- root.scheme("huex10stv2.root")
>>>>>>>>> data_qn<- root.data(scheme.huex10stv2, "exon_quantiles.root")
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Error in if (chipname != treetitle) { : argument is of length zero
>>>>>>>>
>>>>>>>> Hope someone can help,
>>>>>>>>
>>>>>>>> Paul.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> sessionInfo()
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> R version 2.11.0 (2010-04-22)
>>>>>>>> x86_64-redhat-linux-gnu
>>>>>>>>
>>>>>>>> locale:
>>>>>>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>>>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>>>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>>>
>>>>>>>> attached base packages:
>>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>



-- 
Paul Geeleher (PhD Student)
School of Mathematics, Statistics and Applied Mathematics
National University of Ireland
Galway
Ireland
--
www.bioinformaticstutorials.com



More information about the Bioconductor mailing list