[BioC] Help needed with CopyNumber analysis for Affymetrix 250K arrays

Tue Feb 5 15:50:11 CET 2008

Dear Henrik

I have just downloaded and installed the aroma.affymetrix package. However,
it seems that the package enforces a certain naming convention and directory
structure, so I need some time to test it.
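
For reference, here is a minimal base R sketch of the layout as I understand
it from the package documentation. The project root and the data set name
"MyDataSet" are placeholders, and the chip type is assumed to be
Mapping250K_Sty; nothing here calls aroma.affymetrix itself.

```r
# Sketch of the directory layout aroma.affymetrix expects (as understood from
# its documentation); "MyDataSet" is a placeholder data set name.
root     <- file.path(tempdir(), "aromaProject")   # hypothetical project root
chipType <- "Mapping250K_Sty"                      # assumed chip type
dataSet  <- "MyDataSet"

# annotationData/chipTypes/<chipType>/ holds the CDF, e.g. Mapping250K_Sty.cdf
annotDir <- file.path(root, "annotationData", "chipTypes", chipType)
# rawData/<dataSet>/<chipType>/ holds the CEL files of one data set
rawDir   <- file.path(root, "rawData", dataSet, chipType)

for (d in c(annotDir, rawDir)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# List the directories that were created, relative to the project root
print(list.dirs(root, full.names = FALSE, recursive = TRUE))
```

The CDF file would then be copied into the annotation directory and the CEL
files into the raw data directory before starting R in the project root.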

Best regards
Christian

==============================================
Christian Stratowa, PhD
Boehringer Ingelheim Austria
Dept NCE Lead Discovery - Bioinformatics
Dr. Boehringergasse 5-11
A-1121 Vienna, Austria
Tel.: ++43-1-80105-2470
Fax: ++43-1-80105-2782
email: christian.stratowa at vie.boehringer-ingelheim.com

-----Original Message-----
From: henrik.bengtsson at gmail.com [mailto:henrik.bengtsson at gmail.com] On
Behalf Of Henrik Bengtsson
Sent: Tuesday, February 05, 2008 12:29 AM
To: Stratowa,Dr.,Christian FEX BIG-AT-V
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Help needed with CopyNumber analysis for Affymetrix 250K
arrays

Hi Christian,

On Feb 4, 2008 1:45 AM, <Christian.Stratowa at vie.boehringer-ingelheim.com>
wrote:
> Dear all,
>
> Until now I have done all CopyNumber (and LOH) analysis using 
> Affymetrix CNAT4. However, I would prefer to use Bioconductor for this 
> purpose, thus I have a couple of questions:
>
>
> 1, Normalization and summarization of mapping array 50K and 250K 
> CEL-files:
>
> Currently, there seem to be only two packages available, which are 
> able to read mapping array CEL-files, namely: package "oligo" and 
> packages "PLASQ" and "PLASQ500K", respectively.
>
> Using package "oligo" I can do:
> > library(oligo)
> > snprma250 <- justSNPRMA(cels250, phenoData=pheno250)
> Then I get the normalized intensities:
> > asTA250 <- antisenseThetaA(snprma250)
> > asTB250 <- antisenseThetaB(snprma250)
> > sTA250  <- senseThetaA(snprma250)
> > sTB250  <- senseThetaB(snprma250)
>
> Using package "PLASQ500K" I can do:
> > library(PLASQ500K)
> > ref <- celExtNorm("SND", "Sty")
> > sam <- celExtract("STD", "Sty")
> I get a matrix of normalized probe intensities for reference (ref) and 
> samples (sam).
>
> Are there other packages available which can use mapping array 
> CEL-files?

The aroma.affymetrix package [http://www.braju.com/R/aroma.affymetrix/] works
directly off CEL (and CDF) files.  See the URL for examples and documentation.

>
>
> 2, Genotyping:
>
> Package "oligo" can be used for genotyping:
> > crlmmOut250 <- justCRLMM(cels250, phenoData=pheno250)
> > genocall250 <- calls(crlmmOut250)
> > genoconf250 <- callsConfidence(crlmmOut250)
>
> However, the following results in an error:
> > snprma250 <- justSNPRMA(cels250, phenoData=pheno250)
> > crlmmOut250 <- crlmm(snprma250, correctionFile="outputEM.rda")
> see:
> https://stat.ethz.ch/pipermail/bioconductor/attachments/20080128/5049506c/attachment.pl
>
> Package "PLASQ500K" could also be used for genotyping:
> > geno <- EMSNP(???)
> Although I did not try it, this function seems to have a huge memory 
> problem, see below.
>
>
> 3, CopyNumber analysis:
>
> Although there seem to be some packages which could use the output 
> from the Affymetrix CNAT4 results, it seems that there is currently no 
> package able to do copynumber analysis for Affymetrix mapping arrays. 
> Is this correct?

The aroma.affymetrix package can estimate paired & non-paired total raw copy
numbers, cf.

H. Bengtsson; R. Irizarry; B. Carvalho; T.P. Speed, Estimation and assessment
of raw copy numbers at the single locus level, Bioinformatics, 2008. [pmid:
18204055]

The package also implements other methods for estimating raw CNs. Currently,
GLAD and CBS are the supported segmentation methods, and it is not hard
to add wrappers for other segmentation methods.
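
To illustrate what a non-paired raw total copy number looks like, here is a
toy sketch in base R on simulated data. It does not call aroma.affymetrix, and
all variable names and simulation settings are made up for illustration. The
total signal of a sample (allele A plus allele B, on the non-log scale) is
divided by a robust per-SNP average over the reference samples and scaled to
two copies:

```r
# Toy sketch of non-paired raw total copy numbers on the non-log scale.
# All data below are simulated; nothing here calls aroma.affymetrix.
set.seed(1)
nSnps <- 5
nRef  <- 6

# Simulated per-SNP affinity: some probes are simply brighter than others
affinity <- runif(nSnps, min = 500, max = 2000)

# Total signal theta = thetaA + thetaB for the reference samples (all CN = 2)
thetaRef <- sapply(seq_len(nRef), function(i) affinity * rlnorm(nSnps, sdlog = 0.1))

# One test sample with a single-copy gain (CN = 3) at SNPs 4 and 5
trueCn   <- c(2, 2, 2, 3, 3)
thetaSam <- affinity * (trueCn / 2) * rlnorm(nSnps, sdlog = 0.1)

# Robust per-SNP reference: median across the reference samples
thetaR <- apply(thetaRef, 1, median)

# Raw copy number and the corresponding log2 ratio
rawCn    <- 2 * thetaSam / thetaR
logRatio <- log2(thetaSam / thetaR)

print(round(cbind(trueCn, rawCn, logRatio), 2))
```

Note that a log2 ratio M relative to a diploid reference maps back to a copy
number as 2 * 2^M, i.e. 2^(M + 1). The segmentation methods would then be
applied to such raw estimates ordered along the genome.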

The aroma.affymetrix package does not do genotyping; at one stage there was a
wrapper to call CRLMM in 'oligo', but due to lack of time it has become
obsolete.  It's on the (way too long) to-do list to fix that.

Best,

Henrik

>
> 3a, CNRLMM:
> In a Johns Hopkins Tech Report, Paper 122, 2006, Wang, Carvalho et al.
> describe a new copynumber algorithm, which they want to make available 
> at Bioconductor.
> Does anybody know when the CNRLMM algorithm will be available?
>
> 3b, PLASQ500K
> I tried to compute parent-specific copy number using PLASQ500K:
> > library(PLASQ500K)
> > psCN <- pscn(StyFolder="STD", normStyFolder="SND", betasSty=NULL, quantSty=NULL,
> >              betasStyFile="betasSty.Rdata", rawCNStyfile="rawCNSty.Rdata")
>
> Using only 18 250K Sty CEL-files it was impossible to finish this
> calculation. On a 32GB RAM Linux server the job got killed, since
> function EMSNP(), which is called from function getBetas(), used up all
> RAM. On our 64GB RAM Linux server, function EMSNP() could be executed;
> nevertheless, we had to kill the job at a later stage, when its memory
> consumption reached 74GB!
>
>
> 3c, Compute raw copy numbers for unpaired copynumber analysis:
> Using the results from justSNPRMA() I tried to compute the copynumbers in
> the following way:
>
> # Reference files
> snprma250ref <- justSNPRMA(cels250ref, phenoData=pheno250ref)
>
> # Sample files
> snprma250sam <- justSNPRMA(cels250sam, phenoData=pheno250sam)
>
> ## separate alleles combined as in CNAT4, see cnat_4_algorithm_whitepaper.pdf, page 9:
> # TCN(sumLog) = log2(SamA/RefA) + log2(SamB/RefB)
>
> # Reference for allele A:
> # allele A as array
> ref250A <- array(NA,
>   dim=c(nrow(antisenseThetaA(snprma250ref)), ncol(antisenseThetaA(snprma250ref)), 2),
>   dimnames=list(rownames(antisenseThetaA(snprma250ref)), colnames(antisenseThetaA(snprma250ref)), c("antisense","sense")))
> ref250A[,,1] <- antisenseThetaA(snprma250ref)
> ref250A[,,2] <- senseThetaA(snprma250ref)
>
> # Reference A: rowMeans over sense and antisense strand
> refA <- sapply(1:dim(ref250A)[2], function(x) rowMeans(ref250A[,x,], na.rm=TRUE))
> colnames(refA) <- colnames(ref250A)
>
> # Reference for allele B:
> # allele B as array
> ref250B <- array(NA,
>   dim=c(nrow(antisenseThetaB(snprma250ref)), ncol(antisenseThetaB(snprma250ref)), 2),
>   dimnames=list(rownames(antisenseThetaB(snprma250ref)), colnames(antisenseThetaB(snprma250ref)), c("antisense","sense")))
> ref250B[,,1] <- antisenseThetaB(snprma250ref)
> ref250B[,,2] <- senseThetaB(snprma250ref)
>
> # Reference B: rowMeans over sense and antisense strand
> refB <- sapply(1:dim(ref250B)[2], function(x) rowMeans(ref250B[,x,], na.rm=TRUE))
> colnames(refB) <- colnames(ref250B)
>
> # Sample for allele A:
> # allele A as array
> sam250A <- array(NA,
>   dim=c(nrow(antisenseThetaA(snprma250sam)), ncol(antisenseThetaA(snprma250sam)), 2),
>   dimnames=list(rownames(antisenseThetaA(snprma250sam)), colnames(antisenseThetaA(snprma250sam)), c("antisense","sense")))
> sam250A[,,1] <- antisenseThetaA(snprma250sam)
> sam250A[,,2] <- senseThetaA(snprma250sam)
>
> # Sample A: rowMeans over sense and antisense strand
> samA <- sapply(1:dim(sam250A)[2], function(x) rowMeans(sam250A[,x,], na.rm=TRUE))
> colnames(samA) <- colnames(sam250A)
>
> # Sample for allele B:
> # allele B as array
> sam250B <- array(NA,
>   dim=c(nrow(antisenseThetaB(snprma250sam)), ncol(antisenseThetaB(snprma250sam)), 2),
>   dimnames=list(rownames(antisenseThetaB(snprma250sam)), colnames(antisenseThetaB(snprma250sam)), c("antisense","sense")))
> sam250B[,,1] <- antisenseThetaB(snprma250sam)
> sam250B[,,2] <- senseThetaB(snprma250sam)
>
> # Sample B: rowMeans over sense and antisense strand
> samB <- sapply(1:dim(sam250B)[2], function(x) rowMeans(sam250B[,x,], na.rm=TRUE))
> colnames(samB) <- colnames(sam250B)
>
> # Total CopyNumber TCN(sumLog), see cnat_4_algorithm_whitepaper.pdf, page 9
> TCN.sL <- (samA - rowMeans(refA)) + (samB - rowMeans(refB))
>
> # real copy number is: cn = 2^(2^cn) ?? (or 2^(cn+1) ??)
> cn.sL <- 2^(2^TCN.sL)
> head(cn.sL)
> #              CEU_NA06993_STY.CEL CEU_NA06994_STY.CEL CEU_NA07022_STY.CEL
> #SNP_A-1780271            1.801377            3.034645            2.314986
> #SNP_A-1780274            2.017805            2.494345            2.370112
> #SNP_A-1780277            1.558268            2.446690            2.983195
> #SNP_A-1780278            1.879762            1.859002            1.697422
> #SNP_A-1780283            2.064631            1.639300            1.912674
> #SNP_A-1780290            2.142572            2.738094            2.029215
>
> # or alternatively: cn = 2^cnA + 2^cnB ??
> cn <- 2^(samA - rowMeans(refA)) + 2^(samB - rowMeans(refB))
> head(cn)
> #              CEU_NA06993_STY.CEL CEU_NA06994_STY.CEL CEU_NA07022_STY.CEL
> #SNP_A-1780271            1.859447            2.786287            2.369363
> #SNP_A-1780274            2.160573            2.315243            2.271201
> #SNP_A-1780277            3.203198            2.453341            2.773667
> #SNP_A-1780278            1.908932            1.990323            1.748716
> #SNP_A-1780283            2.046767            1.691257            1.937375
> #SNP_A-1780290            2.098621            2.416547            2.020832
>
> - Is this computation correct?
>
> - Is this way to compute the copynumbers a valuable option?
>
> - Are there any alternatives to compute the copynumbers using R 
> packages?
>
>
> Thank you in advance
> Best regards
> Christian
>