[BioC] probe alignment clustering

Tue Apr 5 21:16:54 CEST 2005

On Apr 5, 2005, at 1:05 PM, Stan Smiley wrote:

> I'm in the process of validating the annotation from Affymetrix, and
> positioning each 25mer probe
> on the genome (just mouse and human so far). I'm in a position now to
> use this
> positioning data to validate the affy annotated alignments, which I'm
> doing now.
>

Stan,

My first question would be:  Have you looked at the annotation done by 
EnsEMBL (or any of several other groups)?  Presumably, yes.

Second, are you going to be using this for expression or "genome 
localization"?  Keep in mind that aligning to the genome is NOT the 
same as aligning to transcripts.  There are, of course, probes that 
align to the genome but do not hit any transcript, and probes that 
align to transcripts but will not align to the genome (those that 
span an exon-exon junction).  So a consensus genomic sequence will not 
necessarily be very useful by itself; you would still have to re-align 
it to transcripts to determine what you are measuring.  An alternative method 
is to align the individual probes and look for overlap with 
annotated regions (exons).  You can do this fairly easily in the UCSC 
genome browser:  make a custom track of your data and then get the 
genes that overlap with your annotation.  Then you can see whether the 
probes in each probeset generally hit the gene that Affy says they do.  
You could also do this in R by downloading the table of interest from 
UCSC, parsing it, and writing a function to look for overlaps.
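
To make the overlap idea concrete, here is a rough sketch of such a 
function.  It is written in Python only to keep it short; the same 
logic ports directly to R.  The function name, the tuple layout, and 
the coordinates are all made up for illustration, and I am assuming 
0-based, half-open intervals and ignoring strand.

```python
from collections import defaultdict

def overlapping_genes(probes, exons):
    """Map each probe id to the set of genes whose exons it overlaps.

    probes: iterable of (probe_id, chrom, start, end)
    exons:  iterable of (gene, chrom, start, end)
    Coordinates are assumed 0-based, half-open; strand is ignored.
    """
    # Index exons by chromosome so each probe is only compared
    # against exons on its own chromosome.
    exons_by_chrom = defaultdict(list)
    for gene, chrom, start, end in exons:
        exons_by_chrom[chrom].append((start, end, gene))

    hits = {}
    for probe_id, chrom, p_start, p_end in probes:
        # Two half-open intervals overlap when each one starts
        # before the other ends.
        hits[probe_id] = {
            gene
            for e_start, e_end, gene in exons_by_chrom[chrom]
            if p_start < e_end and e_start < p_end
        }
    return hits

# Invented example data, not real annotation.
exons = [
    ("geneA", "chr1", 100, 200),
    ("geneA", "chr1", 500, 600),
    ("geneB", "chr2", 100, 200),
]
probes = [
    ("p1", "chr1", 150, 175),  # inside an exon of geneA
    ("p2", "chr1", 300, 325),  # intronic, hits nothing
    ("p3", "chr2", 190, 215),  # partially overlaps geneB's exon
]
hits = overlapping_genes(probes, exons)
```

This is a plain nested scan, which is fine for a quick check but slow 
for a whole array against a whole genome; an interval index or sorted 
sweep would be the next step if that matters.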

> My challenge now is to settle on the best approach in BioC/R to find
> 'consensus' sequences
> in the genome that best match the alignments I've come up with.

Do you really want consensus genomic sequence?

> I'm
> thinking some
> clustering package, but not sure which one is most appropriate.
>

Any should work in theory, but you will still have to decide at what 
"distance" to cut the tree.  That is hard because some genes are 
small, some are large, some are close to each other, and some are far 
apart.  As an example, try to think of a clustering method that can 
accurately define a "gene" in both the HOXA region [short, 
closely-spaced genes] AND the NF1 gene [long, complicated gene].
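
A toy illustration of the cutoff problem, again as a Python sketch 
under made-up assumptions: probe positions along one chromosome are 
grouped whenever neighbouring positions lie within a fixed gap (for 
1-D positions this is what cutting a single-linkage tree at that 
distance amounts to).  The positions are invented and are not real 
HOXA or NF1 coordinates.

```python
def cluster_by_gap(positions, max_gap):
    """Group 1-D positions; a new cluster starts when the gap to the
    previous position exceeds max_gap."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

# Two short genes about 5 kb apart, then one long gene whose
# exons are 60 kb apart (all positions hypothetical).
short_gene_1 = [1_000, 1_500, 2_000]
short_gene_2 = [7_000, 7_500, 8_000]
long_gene = [500_000, 560_000, 620_000]
positions = short_gene_1 + short_gene_2 + long_gene

# A tight cutoff keeps the short genes apart but shatters the long one.
tight = cluster_by_gap(positions, max_gap=1_000)

# A loose cutoff holds the long gene together but merges the short ones.
loose = cluster_by_gap(positions, max_gap=100_000)
```

With these numbers, no single value of max_gap recovers all three 
"genes", which is the difficulty I mean.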

I know I didn't give you a direct answer, and you have a hard problem, 
in many senses.  In any case, hope this helps a bit....

Sean