[BioC] matchPDict with mismatches allowed appears to drop names

Fri Aug 5 08:43:25 CEST 2011

Hi Ian,

On 11-08-04 08:11 AM, Ian Henry wrote:
> Hello again,
>
> I'm not sure whether to append this or add a new thread but I have
> another issue with matchPDict.
>
> I can create my PDict object:
> ps_pdict<-PDict(probeset, max.mismatch=max_mis)
>
>
> I then use vwhichPDict to filter for transcripts that match one of my
> probes
> tx_matches<- vwhichPDict(ps_pdict, transcriptLibrary,
> max.mismatch=max_mis, min.mismatch=min_mis)
>
> and then run matchPDict over my list of transcripts that have a probe
> match using various numbers of mismatches for fuzzy matching.
> txmatches<- matchPDict(ps_pdict, transcript[[1]],
> max.mismatch=max_mis, min.mismatch=min_mis)
>
> Exact matches work well, as do 1 and 2 mismatches, but above 3 things
> go a bit wrong and I get back results with only 2 mismatches with
> max.mismatch=3, and min.mismatch=3
>
> Exact Matches:
> "names
> "	"start
> "	"end
> "	"width
> "	"TranscriptID
> "	"Matched_Sequence
> "	"probe_sequence
> "	"maxMismatchSet
> "	"minMismatchSet"	"MisMatch_Position(s)"	"NumberOfMismatchesObserved"	
> "HW:11"	169	193	
> 25
> 	"NM_001144941.1
> "	"GAACGGCTACACGGCGGTCATCGAA"	"GAACGGCTACACGGCGGTCATCGAA"	0	0	""	"0"	
> "HW:11"	169	193	
> 25
> 	"NM_001144940.1
> "	"GAACGGCTACACGGCGGTCATCGAA"	"GAACGGCTACACGGCGGTCATCGAA"	0	0	""	"0"	
> "HW:11"	169	193	
> 25
> 	"NM_182566.2"	"GAACGGCTACACGGCGGTCATCGAA"	"GAACGGCTACACGGCGGTCATCGAA"	
> 0	0	""	"0"	
> "HW:11"	169	193	
> 25
> 	"NM_001144939.1
> "	"GAACGGCTACACGGCGGTCATCGAA"	"GAACGGCTACACGGCGGTCATCGAA"	0	0	""	
> "HW:13"	120	144	
> 25
> 	"NM_007128.2"	"GCCCTTGGAACCACAATCCGCCTCA"	"GCCCTTGGAACCACAATCCGCCTCA"	
> 0	0	""	"0"	
>
> One MisMatch:
> "names
> "	"start
> "	"end
> "	"width
> "	"TranscriptID
> "	"Matched_Sequence
> "	"probe_sequence
> "	"maxMismatchSet
> "	"minMismatchSet"	"MisMatch_Position(s)"	"NumberOfMismatchesObserved"
> "HW:9"	826	850	
> 25
> 	"NM_198152.3"	"CCTAACTTTGTTATCCGTGTTGAGT"	"CCTAACTTTGTTATCCGTGTTGATT"	
> 1	1	"24"		"1"	
> "HW:18"	551	575	
> 25
> 	"NR_037144.1"	"GACTCTGCTGTGACCTCCTTGTTCA"	"GACTCTACTGTGACCTCCTTGTTCA"	
> 1	1	"7"	"1"	
>
> Two MisMatches:
> names	start	 end	width	TranscriptID	Matched_Sequence	probe_sequence	
> maxMismatchSet	minMismatchSet	MisMatch_Position(s)	
> NumberOfMismatchesObserved
> HW:22	2324	2348	25	XR_115554.1	GGAGGCCGAGGCAGGATGATTGCTT	
> GGAGGCCGAGGTAGGAGGATTGCTT	2	2	12; 17	2
> IV:14	805	829	25	XR_115163.1	CGCACACACACACACACATACACAC	
> CGCACACACACACACACACACACAT	2	2	19; 25	2
> HW::22	1078	1102	25	XR_115114.1	GGAGGCTGAGGTGGGAGGATTGCTT	
> GGAGGCCGAGGTAGGAGGATTGCTT	2	2	7; 13	2
> IV:14	108	132	25	XR_114935.1	CACACACACACACACACACACACAC	
> CGCACACACACACACACACACACAT	2	2	2; 25	2
> IV::14	102	126	25	XR_114935.1	CGCGCACACACACACACACACACAC	
> CGCACACACACACACACACACACAT	2	2	4; 25	2
> IV:14	106	130	25	XR_114935.1	CACACACACACACACACACACACAC	
> CGCACACACACACACACACACACAT	2	2	2; 25	2
>
> Three MisMatches:
> names	start	end	width	TranscriptID	Matched_Sequence	probe_sequence	
> maxMismatchSet	minMismatchSet	MisMatch_Position(s)	
> NumberOfMismatchesObserved
> HGW::17::7::22	2324	2348	25	XR_115554.1	GGAGGCCGAGGCAGGATGATTGCTT	
> GGAGGCCGAGGTAGGAGGATTGCTT	3	3	12; 17	2
> INV::5::14::14	805	829	25	XR_115163.1	CGCACACACACACACACATACACAC	
> CGCACACACACACACACACACACAT	3	3	19; 25	2
> HGW::17::7::22	1078	1102	25	XR_115114.1	GGAGGCTGAGGTGGGAGGATTGCTT	
> GGAGGCCGAGGTAGGAGGATTGCTT	3	3	7; 13	2
> INV::5::14::14	108	132	25	XR_114935.1	CACACACACACACACACACACACAC	
> CGCACACACACACACACACACACAT	3	3	2; 25	2
> INV::5::14::14	102	126	25	XR_114935.1	CGCGCACACACACACACACACACAC	
> CGCACACACACACACACACACACAT	3	3	4; 25	2
> INV::5::14::14	106	130	25	XR_114935.1	CACACACACACACACACACACACAC	
> CGCACACACACACACACACACACAT	3	3	2; 25	2

Please provide a self-contained reproducible example.

Below I try to run something only *very* approximate to what you
seem to have done (based on the information you provide above),
but I can't reproduce the problem.

   library(Biostrings)

   ## Using the 6 probes and 6 matched sequences from your "Three
   ## MisMatches" example:
   probes <- DNAStringSet(c("GGAGGCCGAGGTAGGAGGATTGCTT",
                            "CGCACACACACACACACACACACAT",
                            "GGAGGCCGAGGTAGGAGGATTGCTT",
                            "CGCACACACACACACACACACACAT",
                            "CGCACACACACACACACACACACAT",
                            "CGCACACACACACACACACACACAT"))

   matched_sequences <- DNAStringSet(c("GGAGGCCGAGGCAGGATGATTGCTT",
                                       "CGCACACACACACACACATACACAC",
                                       "GGAGGCTGAGGTGGGAGGATTGCTT",
                                       "CACACACACACACACACACACACAC",
                                       "CGCGCACACACACACACACACACAC",
                                       "CACACACACACACACACACACACAC"))

   ## Computing the Hamming distances:
   > sapply(1:6, function(i) neditStartingAt(matched_sequences[[i]],
                                             probes[[i]]))
   [1] 2 2 2 2 2 2

   ## Using matchPDict() with max.mismatch=3 and min.mismatch=3.
   ps_pdict <- PDict(probeset, max.mismatch=3)

   ## probe 1 vs sequence 1:
   > m1 <- matchPDict(ps_pdict, matched_sequences[[1]],
                      max.mismatch=3, min.mismatch=3)
   > m1[[1]]
   IRanges of length 0
   ## --> no hit

   ## probe 2 vs sequence 2:
   > m2 <- matchPDict(ps_pdict, matched_sequences[[2]],
                      max.mismatch=3, min.mismatch=3)
   > m2[[2]]
   IRanges of length 0
   ## --> no hit

   ## probe 3 vs sequence 3:
   > m3 <- matchPDict(ps_pdict, matched_sequences[[3]],
                      max.mismatch=3, min.mismatch=3)
   > m3[[3]]
   IRanges of length 0
   ## --> no hit

   ## probe 4 vs sequence 4:
   > m4 <- matchPDict(ps_pdict, matched_sequences[[4]],
                      max.mismatch=3, min.mismatch=3)
   > m4[[4]]
   IRanges of length 2
       start end width
   [1]    -1  23    25
   [2]     3  27    25
   ## --> probe 4 hits sequence 4 twice but those 2 hits
   ## are "out-of-limit" hits. The reason those hits are
   ## considered to have 3 mismatches is that letters in
   ## the pattern that are not facing a letter in the subject
   ## are counted as mismatches:
   > neditStartingAt(probes[[4]], matched_sequences[[4]],
                     starting.at=-2:4)
   [1] 25  3 25  2 25  3 25

   ## probe 5 vs sequence 5:
   > m5 <- matchPDict(ps_pdict, matched_sequences[[5]],
                      max.mismatch=3, min.mismatch=3)
   > m5[[5]]
   IRanges of length 0
   ## --> no hit

   ## probe 6 vs sequence 6:
   > m6 <- matchPDict(ps_pdict, matched_sequences[[6]],
                      max.mismatch=3, min.mismatch=3)
   > m6[[6]]
   IRanges of length 2
       start end width
   [1]    -1  23    25
   [2]     3  27    25
   ## -> same as probe 4 vs sequence 4

   ## Consistency with vwhichPDict():

   > countIndex(m1)
   [1] 0 0 0 0 0 0
   > countIndex(m2)
   [1] 0 0 0 0 0 0
   > countIndex(m3)
   [1] 0 0 0 0 0 0
   > countIndex(m4)
   [1] 0 2 0 2 2 2
   > countIndex(m5)
   [1] 0 0 0 0 0 0
   > countIndex(m6)
   [1] 0 2 0 2 2 2

   > vwhichPDict(ps_pdict, matched_sequences,
                 max.mismatch=3, min.mismatch=3)
   [[1]]
   integer(0)

   [[2]]
   integer(0)

   [[3]]
   integer(0)

   [[4]]
   [1] 2 4 5 6

   [[5]]
   integer(0)

   [[6]]
   [1] 2 4 5 6

   ## --> things are consistent

 > sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.20.2 IRanges_1.10.4

loaded via a namespace (and not attached):
[1] tools_2.13.0

H.

>
>
> When running vwhichPDict with three mismatches I do get a warning that
> the algorithm may not be efficient, but this seems to indicate a time
> problem rather than an accuracy problem.
> Warning message:
> In .MTB_PDict(x, max.mismatch, algo) :
>     given the characteristics of dictionary 'x', this value of
> 'max.mismatch' will
>     give poor performance when you call matchPDict() on this MTB_PDict
> object
>     (it will of course depend ultimately on the length of the subject)
>
> Indeed the output looks OK:
>
> Indeed the results of vwhichPDict gives:
> 6428 tx_matches for probes allowing max 0 min 0 mismatches
> 1077 tx_matches for probes allowing max 1 min 1 mismatches
> 1953 tx_matches for probes allowing max 2 min 2 mismatches
> 3729 tx_matches for probes allowing max 3 min 3 mismatches
>
> However, If I compare this to the output of looping over matchPDict
> for transcripts with probe matches (from vwhichPDict) I get:
> 6428 unique tx_matches for probes allowing max 0 min 0 mismatches
> 1077 unique tx_matches for probes allowing max 1 min 1 mismatches
> 1953 unique tx_matches for probes allowing max 2 min 2 mismatches
> 1953 unique tx_matches for probes allowing max 3 min 3 mismatches but
> the results show only 2 mismatches
>
> Suggesting that matchPDict reverted back to 2 mismatch settings
> (max.mismatch=2, min.mismatch=2 )in the case of max.mismatch=3,
> min.mismatch=3.  (It also seems to be the case for max=3, min=0)
>
> Probably 3 mismatches is excessive for what I want to do, but I
> thought I'd report the issue.  It seems like matchPDict is reverting
> back to 2 mismatches in this case but vwhichPDict probably does 3.
>
> Thanks,
>
> Ian
>
>
>
> On Aug 3, 2011, at 11:06 AM, Ian Henry wrote:
>
>> Excellent thanks for the help and the fix!
>>
>> The workaround works well for me for now.  I did spend a while
>> yesterday writing my own workaround but Harris' workaround is much
>> simpler for now than my solution, which is given below but is rather
>> dirty.
>>
>>> txmatches<- matchPDict(ps_pdict, transcript[[1]],
>> max.mismatch=max_mis, min.mismatch=min_mis)
>>> tdf<-as.data.frame(unlist(txmatches))
>>> patternIndex<-which(countIndex(txmatches)>0)
>>> duptimes<- function(duptimes)
>> rep(duptimes,length(txmatches[[duptimes]]))
>>> tmp<-sapply(patternIndex,duptimes)
>>> tdf$patternIndex<-unlist(tmp)
>>> getNames<- function(getNames) names(ps_pdict)[getNames]
>>> tdf$name<-sapply(tdf$patternIndex, getNames)
>>
>>
>>
>> On Aug 3, 2011, at 3:04 AM, Hervé Pagès wrote:
>>
>>> Hi Ian, Harris,
>>>
>>> On 11-08-02 08:45 AM, Harris A. Jaffee wrote:
>>>> The short answer for the fuzzy matching case appears to be that,
>>>> even
>>>> though your names are maintained through most of the function
>>>>
>>>> matchPDict.R:.match.MTB_PDict,
>>>>
>>>> and in fact they are still present in the list 'ans_compons', they
>>>> get
>>>> dropped in this call:
>>>>
>>>> ans<- ByPos_MIndex.combine(ans_compons)
>>>>
>>>> By contrast, for exact matching, the function
>>>>
>>>> matchPDict.R:.match.TB_PDict
>>>>
>>>> returns your names here:
>>>>
>>>> new("ByPos_MIndex", width0=width(pdict), NAMES=names(pdict),
>>>> ends=C_ans)
>>>>
>>>> So my naive guess is that
>>>>
>>>> MIndex-class.R:ByPos_MIndex.combine
>>>>
>>>> should do that also, something like
>>>>
>>>> ans_names<- mi_list[[1]]@NAMES
>>>> new("ByPos_MIndex", width0=ans_width0, ends=ans_ends,
>>>> NAMES=ans_names)
>>>
>>> Exactly :-) This is now fixed in Biostrings 2.20.2 (release) and
>>> will be soon in Biostrings devel. With Biostrings 2.20.2:
>>>
>>>   library(Biostrings)
>>>   probes<- DNAStringSet(c(probe1="aaaaaa", probe2="accacc"))
>>>   pdict1<- PDict(probes, max.mismatch=1)
>>>   m1<- matchPDict(pdict1, DNAString("ggaaaaaccccc"), max.mismatch=1)
>>>
>>> Then:
>>>
>>>   >  names(m1)
>>>   [1] "probe1" "probe2"
>>>
>>>   >  unlist(m1)
>>>   IRanges of length 3
>>>       start end width  names
>>>   [1]     2   7     6 probe1
>>>   [2]     3   8     6 probe1
>>>   [3]     7  12     6 probe2
>>>
>>> Thanks guys for the report and for the fix!
>>> H.
>>>
>>>>
>>>> On Aug 2, 2011, at 5:24 AM, Ian Henry wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a question regarding the inheritance of the names attribute
>>>>> when using matchPDict.
>>>>>
>>>>> If I use matchPDict as follows:
>>>>>
>>>>> #Get transcript information
>>>>>> hg19txdb<- makeTranscriptDbFromUCSC(genome = "hg19", tablename =
>>>>> "refGene")
>>>>>> hg19_tx<- extractTranscriptsFromGenome(Hsapiens, hg19txdb)
>>>>>
>>>>> #Create DNAStringSet with names associated with each probe
>>>>>> probeset<- DNAStringSet(probelist$sequence)
>>>>>> names(probeset)<-probelist$probenames
>>>>>
>>>>> #Create PDict object and match against human transcript 14 (I
>>>>> know it
>>>>> should match)
>>>>>> ps_pdict<-PDict(probeset)
>>>>>> txmatches<- matchPDict(ps_pdict, hg19_tx[[14]])
>>>>>
>>>>> this compares the probes in ps_pdict to transcript 14 in hg19 and
>>>>> gives:
>>>>>> unlist(txmatches):
>>>>>
>>>>> start end width names
>>>>> [1] 749 773 25 HW:6
>>>>> [2] 569 593 25 HW:16
>>>>> [3] 804 828 25 HW:26
>>>>> [4] 757 781 25 HW:36
>>>>>
>>>>> which works :)
>>>>>
>>>>> However, if I search allowing for mismatches then the names
>>>>> appear to
>>>>> be lost:
>>>>>
>>>>>> ps_pdict1<-PDict(probeset, max.mismatch=1)
>>>>>> txmatches1<- matchPDict(ps_pdict1, hg19_tx[[14]],
>>>>> max.mismatch=1,
>>>>> min.mismatch=0)
>>>>>> unlist(txmatches1)
>>>>>
>>>>> IRanges of length 4
>>>>> start end width
>>>>> [1] 749 773 25
>>>>> [2] 569 593 25
>>>>> [3] 804 828 25
>>>>> [4] 757 781 25
>>>>>
>>>>> The result of matchPDict is a MIndex object that I named txmatches
>>>>> with exact matches, and txmatches1 with 1 mismatch
>>>>>> names(txmatches) #gives character vector containing probe names
>>>>>> names(txmatches1) #returns NULL
>>>>>
>>>>> So it appears the names are not inherited. I tried to added them
>>>>> manually to my MIndex object
>>>>>> names(txmatches1)<-names(probeset)
>>>>>
>>>>> but I get Error:
>>>>> attempt to modify the names of a ByPos_MIndex instance
>>>>>
>>>>> Therefore I'm not sure how to keep my probe names associated with
>>>>> the
>>>>> Transcript match, which is important for inexact matching.
>>>>>
>>>>> Any help would be greatly appreciated,
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Ian
>>>>>
>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>> R version 2.13.0 beta (2011-03-31 r55221)
>>>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>>>
>>>>> locale:
>>>>> [1] C/UTF-8/C/C/C/C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>
>>>>> other attached packages:
>>>>> [1] plyr_1.5.2 BSgenome.Hsapiens.UCSC.hg19_1.3.17
>>>>> [3] BSgenome_1.19.5 Biostrings_2.19.17
>>>>> [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31
>>>>> [7] IRanges_1.9.28
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0
>>>>> [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1
>>>>> [7] rtracklayer_1.11.12 tools_2.13.0
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>
>
>
> 	[[alternative HTML version deleted]]
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319