[BioC] probe alignment clustering
Sean Davis
sdavis2 at mail.nih.gov
Tue Apr 5 21:16:54 CEST 2005
On Apr 5, 2005, at 1:05 PM, Stan Smiley wrote:
> I'm in the process of validating the annotation from Affymetrix, and
> positioning each 25mer probe on the genome (just mouse and human so
> far). I'm in a position now to use this positioning data to validate
> the affy annotated alignments, which I'm doing now.
>
Stan,
My first question would be: Have you looked at the annotation done by
EnsEMBL (or any of several other groups)? Presumably, yes.
Second, are you going to be using this for expression or "genome
localization"? Keep in mind that aligning to the genome is NOT the
same as aligning to transcripts: some probes align to the genome but
do not hit any transcript, and some probes align to transcripts but
will not align to the genome (those that span exon-exon junctions).
So, coming up with a consensus genomic sequence will not necessarily
be very useful by itself; you will have to re-align it to transcripts
to determine what you are measuring. An alternative method is to align
the individual probes and look for overlap with annotated regions
(exons). You can do this fairly easily in the UCSC genome browser:
make a custom track of your data and then get the genes that overlap
with your annotation. Then you can see whether the various probesets
generally hit the genes that Affy says they do. You could also do this
in R by downloading the table of interest from UCSC, parsing it, and
writing a function to look for overlaps.
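A minimal sketch of such an overlap function, written in Python here
rather than R. The function name, the tuple layout, and all gene names
and coordinates are made up for illustration; intervals are assumed to
be 0-based and half-open, as in UCSC tables.

```python
from collections import defaultdict


def find_overlaps(probes, exons):
    """Map each probe ID to the set of genes whose exons it overlaps.

    probes: iterable of (probe_id, chrom, start, end)
    exons:  iterable of (gene, chrom, start, end)
    Coordinates are assumed 0-based, half-open: [start, end).
    """
    # Group exons by chromosome so each probe is only compared
    # against exons on its own chromosome.
    exons_by_chrom = defaultdict(list)
    for gene, chrom, start, end in exons:
        exons_by_chrom[chrom].append((start, end, gene))

    hits = {}
    for probe_id, chrom, p_start, p_end in probes:
        genes = set()
        for e_start, e_end, gene in exons_by_chrom.get(chrom, []):
            # Two half-open intervals overlap when each one starts
            # before the other ends.
            if p_start < e_end and e_start < p_end:
                genes.add(gene)
        hits[probe_id] = genes
    return hits


# Hypothetical 25mer probe alignments and exon annotations.
probes = [
    ("p1", "chr1", 100, 125),
    ("p2", "chr1", 500, 525),
    ("p3", "chr2", 100, 125),
]
exons = [
    ("GENE_A", "chr1", 90, 200),
    ("GENE_B", "chr1", 520, 600),
]
hits = find_overlaps(probes, exons)
```

The loop over all exons on a chromosome is fine for a sketch; for
whole-genome tables, sorting the exons or using an interval index
would be the natural next step.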
> My challenge now is to settle on the best approach in BioC/R to find
> 'consensus' sequences in the genome that best match the alignments
> I've come up with.
Do you really want consensus genomic sequence?
> I'm thinking some clustering package, but not sure which one is most
> appropriate.
>
Any should work in theory, but you will still have to decide at what
"distance" to cut the tree. That is hard because some genes are
small, some are large, some are close to each other, and some are far
apart. As an example, try to think of a clustering method that can
accurately define a "gene" in both the HOXA region (short,
closely-spaced genes) AND the NF1 gene (a long, complicated gene).
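To illustrate the cut-distance problem, here is a small Python sketch
that groups probe positions on one chromosome whenever neighbouring
positions are within a fixed gap, which is comparable to cutting a
single-linkage tree at a fixed height. The function name and all
coordinates are invented for illustration; they are not real HOXA or
NF1 positions.

```python
def cluster_by_gap(positions, max_gap):
    """Group sorted positions; start a new cluster when the gap to the
    previous position exceeds max_gap."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


# Made-up probe positions: two short, closely-spaced "genes" ...
small_gene_a = [1000, 1200, 1400]
small_gene_b = [3000, 3200]
# ... and one long "gene" with widely spaced probes.
long_gene = [100000, 150000, 220000]
positions = small_gene_a + small_gene_b + long_gene

# A small cut keeps the short genes apart but splits the long gene.
tight = cluster_by_gap(positions, max_gap=500)

# A large cut keeps the long gene whole but merges the short genes.
loose = cluster_by_gap(positions, max_gap=80000)
```

With these numbers, neither cut recovers all three "genes" at once,
which is the difficulty with choosing a single distance.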
I know I didn't give you a direct answer, and you have a hard problem,
in many senses. In any case, hope this helps a bit....
Sean