[BioC] general CGH threshold question

Fri Aug 21 19:04:10 CEST 2009

Dear Bioconductors,

for segmenting CGH data there are several packages available and most of 
them tend to give me similar overall results.
However, when it comes to drawing conclusions about a collection of 
cancer specimens (= patients), I have to decide how to combine all the 
so nicely segmented individual profiles.  And at some point I'm forced 
to make an arbitrary decision on a threshold that determines whether a 
given position/segment from a specimen (patient) should be 
considered/counted as aberrant or not.

Of course one could say that in theory a given segment should either be 
present as a single copy, or doubled, tripled (etc.), or lost, and that 
the expected ratios should follow this.  However, in my view reality is 
quite different. Surgeons tend to remove (a bit) more tissue than the 
tumor itself, so there is reason to assume some normal tissue is 
present, plus tumors may be heterogeneous.  All these reasons contribute 
to the fact that I see log-ratios smaller in magnitude than +/- 1 (the 
value that would describe this ideal case), and I wonder how many of 
them could still represent "true" alterations.
Now I've seen people make fairly arbitrary decisions about such 
thresholds, like 0.5 (corresponds to ~40% of the molecules tested 
carrying doubled DNA while the rest are normal) or other values in that 
range.  Unfortunately the biologists/clinicians can't help me with the 
question of which fraction of cells must be altered for a segment to 
still be considered.
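
To make that arithmetic explicit, here is a minimal sketch of the mixing model I have in mind. It is only an illustration under simple assumptions (a diploid normal background, two cell populations, no dye bias or ratio compression), and the function names are made up for this post, not taken from any package.

```python
import math


def expected_log2_ratio(altered_fraction, altered_copies, normal_copies=2):
    """Expected log2 ratio when only a fraction of the cells is altered.

    Assumption: the sample is a simple mixture of altered cells
    (altered_copies) and normal cells (normal_copies).
    """
    if not 0.0 <= altered_fraction <= 1.0:
        raise ValueError("altered_fraction must lie in [0, 1]")
    mean_copies = (altered_fraction * altered_copies
                   + (1.0 - altered_fraction) * normal_copies)
    if mean_copies <= 0:
        raise ValueError("complete loss in all cells has no finite log ratio")
    return math.log2(mean_copies / normal_copies)


def altered_fraction_for_threshold(threshold, altered_copies, normal_copies=2):
    """Fraction of altered cells needed to reach a given log2-ratio threshold.

    A result above 1 means the threshold cannot be reached with this
    copy number, even in a pure sample.
    """
    if altered_copies == normal_copies:
        raise ValueError("altered_copies must differ from normal_copies")
    ratio = 2.0 ** threshold
    return (ratio - 1.0) * normal_copies / (altered_copies - normal_copies)


# Threshold 0.5 for doubled DNA (4 copies) and for a single-copy gain (3 copies)
for copies in (4, 3):
    fraction = altered_fraction_for_threshold(0.5, copies)
    print(f"{copies} copies: fraction of altered cells = {fraction:.2f}")
```

With doubled DNA (4 copies against 2) a threshold of 0.5 corresponds to a fraction of 2^0.5 - 1, i.e. about 0.41, which is where the ~40% comes from. Under the same assumptions a single-copy gain needs a much larger fraction of altered cells to reach the same threshold.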

Now another part of the story enters the scene.  From some (preliminary) 
comparisons I've seen that Agilent software may give quite different 
results about the frequency of lost/amplified zones of the genome (while 
at least CBS, GLAD, aCGH and snapCGH were in major agreement for 
penetration counts at a given threshold - I apologize for not mentioning 
all the other BioC packages available). And non-bioinformaticians 
keep asking me why this might be so.  I wonder whether this might 
have something to do with the choice of the threshold mentioned above.  
Of course, if you choose a threshold closer to 0 (like 0.1 or 0.2) 
you'll find more aberrations above threshold, but not just more: to my 
surprise, entire chromosome arms suddenly show up as enriched for 
gains or losses, making the results look (a bit) more like the Agilent 
results.
So when looking at the distribution of all log2-ratios (say for some 
100 patients) I see a rather bell-shaped (slightly asymmetric) 
distribution. A qqplot has a slightly sigmoid character, and the 99.9% 
(t-distribution) confidence interval with that many df is way too close 
to 0 to serve as a threshold.
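
To show what I mean by counting aberrations at a given threshold, here is a minimal sketch. It assumes the segmented log2 ratios of all patients have already been mapped onto the same positions, and it uses a symmetric threshold for gains and losses. The function name and the toy numbers are invented for illustration; they are not real data and not the method of any package mentioned above.

```python
def aberration_frequencies(profiles, threshold):
    """Per-position fraction of patients with a gain or a loss.

    profiles:  one list of segmented log2 ratios per patient, all of the
               same length (same positions in the same order).
    threshold: positive cut-off; values >= threshold count as gains,
               values <= -threshold count as losses.
    """
    if not profiles:
        raise ValueError("need at least one profile")
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    n_positions = len(profiles[0])
    gains = [0] * n_positions
    losses = [0] * n_positions
    for profile in profiles:
        if len(profile) != n_positions:
            raise ValueError("all profiles must cover the same positions")
        for i, value in enumerate(profile):
            if value >= threshold:
                gains[i] += 1
            elif value <= -threshold:
                losses[i] += 1
    n_patients = len(profiles)
    return ([g / n_patients for g in gains],
            [l / n_patients for l in losses])


# Made-up toy data: 4 patients x 5 positions
toy_profiles = [
    [0.15, 0.15, 0.0, -0.60, 0.0],
    [0.20, 0.20, 0.0, -0.70, 0.0],
    [0.60, 0.12, 0.0, 0.00, 0.0],
    [0.00, 0.00, 0.0, -0.15, 0.0],
]

for cutoff in (0.5, 0.1):
    gain_freq, loss_freq = aberration_frequencies(toy_profiles, cutoff)
    print(cutoff, gain_freq, loss_freq)
```

In this toy example the first two positions go from rarely or never counted at 0.5 to gained in most patients at 0.1, which is the kind of jump I see for whole chromosome arms in the real data.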

So my question: what procedure do you suggest for defining a 
threshold to decide whether a given position/segment may be considered 
altered when pooling all the biopsies/patients in a study?

Besides statistical ideas, I also wonder if anybody has data from 
comparisons with other experimental techniques that would help to 
understand the "true" status and the discrepancy with the Agilent 
software?

Thanks in advance,
Wolfgang Raffelsberger

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Wolfgang Raffelsberger, PhD
Laboratoire de BioInformatique et Génomique Intégratives
CNRS UMR7104, IGBMC,  
1 rue Laurent Fries,  67404 Illkirch  Strasbourg,  France
Tel (+33) 388 65 3300         Fax (+33) 388 65 3276
wolfgang.raffelsberger (at) igbmc.fr