[BioC] hs133phsentrezg metadata

Wed Oct 18 15:15:51 CEST 2006

Hi Jim,

	It is always inaccurate to assign one or multiple chromosome locations
to each probeset, whether for Affy's original CDF or for our custom CDFs.

	We know a probeset's genome location should be calculated from its
probes' genome locations, and that in a small percentage of cases it
should be a range or multiple ranges. However, it is hard to design a
universal criterion for partitioning the probes' locations, because the
result depends on how wide one decides a range should be.
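
To illustrate how the result depends on the chosen width, here is a minimal
sketch in base R. The helper name partition_positions, the max_gap parameter,
and the example positions are all hypothetical and made up for illustration;
this is not how any annotation package or our custom CDF pipeline computes
locations.

```r
# Hypothetical helper: split probe start positions (on one chromosome) into
# ranges, starting a new range whenever the gap to the previous position
# exceeds max_gap.
partition_positions <- function(pos, max_gap) {
  pos <- sort(pos)
  # TRUE marks the first position of each new range; cumsum turns it into an id.
  group <- cumsum(c(TRUE, diff(pos) > max_gap))
  data.frame(
    start = as.vector(tapply(pos, group, min)),
    end   = as.vector(tapply(pos, group, max)),
    n     = as.vector(table(group))
  )
}

# Made-up probe positions, for illustration only.
pos <- c(100, 125, 160, 5000, 5030, 90000)

narrow <- partition_positions(pos, max_gap = 1000)   # three ranges
wide   <- partition_positions(pos, max_gap = 10000)  # two ranges
print(narrow)
print(wide)
```

The same six positions give three ranges with a 1 kb threshold and two ranges
with a 10 kb threshold, which is the ambiguity described above.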

	Probeset '1007_s_at' shows exactly this problem. It seems the result
from get("1007_s_at", hgu133plus2CHRLOC) shows only locations, not
ranges. In addition, the locations shown in the current annotation file
are likely based on an older version of the genome assembly.
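
As a rough sketch of what a range view of that result could look like, the
following base R snippet collapses the positions into one span per chromosome
name. The vector is typed in by hand from the get() output quoted further down
in this thread, so no annotation package is needed; the column names start_min
and start_max are my own labels. I am assuming the values are start positions,
so the spans say nothing about where the matches end.

```r
# Values copied from the hgu133plus2CHRLOC output quoted later in this thread.
loc <- c("6_qbl_hap2" = 2098794, "6" = 30959839, "6_cox_hap1" = 2300465,
         "6_qbl_hap2" = 2099260, "6_cox_hap1" = 2300931, "6" = 30960305,
         "6_cox_hap1" = 2305069, "6" = 30964443, "6_qbl_hap2" = 2103398)

# Group the positions by chromosome name and take min/max within each group.
span <- t(sapply(split(loc, names(loc)), range))
colnames(span) <- c("start_min", "start_max")
print(span)
```

Each of the three names (chromosome 6 and its two alternative haplotype
assemblies) ends up with a single span of a few kilobases.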

	One could argue that this result should be the genome locations of
the probes rather than of the probeset, but this probeset has 16 probes,
each of which has a genome location according to our alignment result:
http://arrayanalysis.mbni.med.umich.edu/ps/ps_pb.jsp?p=1007_s_at&c=Hs133P_AFFY_ORIGINAL

	So we could add CHRLOC to the annotation package in the next release
of our custom CDF, but it would only indicate the locations of the
individual probes that have a genomic sequence match, not the genomic
span of the probesets.

Best,
Manhong Dai

On Tue, 2006-10-17 at 16:31 -0400, James W. MacDonald wrote:
> Hi Manhong,
> 
> OK, I understand that part. However, for most of the annotation data 
> (including the chromosomal location), what is normally supplied is the 
> information at the gene level, rather than the probe level. I guess one 
> could argue that knowing where exactly the probesets are supposed to 
> bind might be of interest, but the annotation packages are intended to 
> annotate probesets to genes.
> 
> While it is true that some of the probes might bind to different parts 
> of the genome, this can be handled by supplying multiple locations. For 
> instance, in the hgu133plus2 package we have:
> 
>  > get("1007_s_at", hgu133plus2CHRLOC)
> 6_qbl_hap2          6 6_cox_hap1 6_qbl_hap2 6_cox_hap1
>     2098794   30959839    2300465    2099260    2300931
>           6 6_cox_hap1          6 6_qbl_hap2
>    30960305    2305069   30964443    2103398
> 
> Best,
> 
> Jim
> 
> 
> Manhong Dai wrote:
> > Hi Jim,
> > 
> > 	In our custom CDF, some probes with hits<1 are used. For example, when
> > a probe matches one allele of a SNP, and the SNP's other allele has a
> > hits=1 match to the genome, we use the probe and its genome location
> > as a candidate for all custom CDFs, even though the probe itself has
> > no hit on the genome at all. The proportion of such probes is
> > small.
> > 
> > 
> > 	Our UG and ENTREZG custom CDFs do have a rule that each probe must
> > hit only one genome location and one UG cluster.
> > 
> > 
> > 	But in the REFSEQ custom CDF, when a probe matches a REFSEQ sequence
> > but has no match to the genome at all, the probe is still used because
> > REFSEQ is more reliable than the genome.
> > 
> > 	For example, probe 4 of
> > http://arrayanalysis.mbni.med.umich.edu/ps/ps_pb.jsp?p=NM_000019_at&c=Hs133P_Hs_REFSEQ_8  has no match to genome.
> > 
> > 
> > Best,
> > Manhong Dai
> > 
> > 
> > 	
> > On Tue, 2006-10-17 at 14:46 -0400, James W. MacDonald wrote:
> > 
> >>Hi Manhong,
> >>
> >>Manhong Dai wrote:
> >>
> >>>Hi An,
> >>>
> >>>	Our custom CDF annotation package has only the gene name for each
> >>>probeset because we designed it this way.
> >>>
> >>>	A probeset's probes could match different locations or
> >>>chromosomes, and some probes have no match on the genome at all, but
> >>>they belong to this probeset because they all have a perfect match to
> >>>the gene's sequence.
> >>
> >>This doesn't make sense to me. How can a probe not match to the genome, 
> >>yet have a perfect match to a gene's sequence?
> >>
> >>I was also under the impression that the matching for the probes that 
> >>remain in an MBNI cdf was first done to the genome, and those probes 
> >>that didn't blast to the genome were discarded. From
> >>
> >>http://brainarray.mhri.med.umich.edu/Brainarray/Database/CustomCDF/cdfreadme.htm
> >>
> >>I get:
> >>
> >>A probe must only hit one UniGene cluster and one genomic location
> >>
> >>A probe must hit only one genomic location
> >>
> >>Does this mean a probe that hits < 1 genomic location will be included? 
> >>I assumed this meant a probe had to hit exactly one location.
> >>
> >>Best,
> >>
> >>Jim
> >>
> >>
> >>
> 
>