[BioC] matchPDict with mismatches allowed appears to drop names

Fri Aug 5 09:02:48 CEST 2011

Answering to the list...

On 11-08-04 12:28 PM, Ian Henry wrote:
> I did still see the bug with your code but I fixed the problem:
>
> I have been making a new pdict object with the specific number of mismatches before running matchPDict.
>
> However, when preparing you a small example to illustrate my point I discovered the code worked correctly when using just one transcript.
>
> When I ran this again iterating over the whole transcriptome I noticed my code was erroring.
>
> I think the real issue is with transcript XM_003119554
>
> One probe matches it  with 3 mismatches but the match starts at -2 which screwed up my later subseq() call and then failed causing corruption between my two runs (with 2 and 3 mismatches)
>
>    start end width        names   TranscriptID
> 1    -2  22    25 KP:5 XM_003119554.1
>
> I now don't think there is an issue with matchPDict, it was just this unexpected error that I missed in my codeblock.

OK I see. Yes when doing fuzzy matching you can see "out-of-limits"
hits which can then cause you troubles downstream if not handled
carefully.

Cheers,
H.

>
> I'm still getting to grips with debugging my R code effectively and I guess I missed this one.
>
> Sorry Harris and Herve thanks for your help.
>
> Ian
>
>
>
> On Aug 4, 2011, at 6:29 PM, Harris A. Jaffee wrote:
>
>> Do you still see a bug with this, i.e. some matches with only 2 errors?
>>
>>>> 	pdict3 = PDict(probeset, max.mismatch=3)
>>>> 	m3 = matchPDict(pdict3, subject3, max.mismatch=3, min.mismatch=3)
>>
>> If so, can you give a small example?
>>
>> Or did my advice just avoid your original problem?
>>
>> On Aug 4, 2011, at 11:43 AM, Ian Henry wrote:
>>
>>> Sorry I wasn't clear, I am doing that as I thought it best too.
>>>
>>> On Aug 4, 2011, at 5:38 PM, Harris A. Jaffee wrote:
>>>
>>>> I'm not certain, but I think it best to build and use a different
>>>> PDict object for each 'max_mis' value for which you are going to
>>>> call matchPDict:
>>>>
>>>> 	pdict0 = PDict(probeset)
>>>> 	pdict1 = PDict(probeset, max.mismatch=1)
>>>> 	pdict2 = PDict(probeset, max.mismatch=2)
>>>> 	pdict3 = PDict(probeset, max.mismatch=3)
>>>>
>>>> 	m0 = matchPDict(pdict0, subject0)
>>>> 	m1 = matchPDict(pdict1, subject1, max.mismatch=1, min.mismatch=1)
>>>> 	m2 = matchPDict(pdict2, subject2, max.mismatch=2, min.mismatch=2)
>>>> 	m3 = matchPDict(pdict3, subject3, max.mismatch=3, min.mismatch=3)
>>>>
>>>> But you still may have found a bug.
>>>>
>>>> On Aug 4, 2011, at 11:11 AM, Ian Henry wrote:
>>>>
>>>>> Hello again,
>>>>>
>>>>> I'm not sure whether to append this or add a new thread but I have
>>>>> another issue with matchPDict.
>>>>>
>>>>> I can create my PDict object:
>>>>> ps_pdict<-PDict(probeset, max.mismatch=max_mis)
>>>>>
>>>>>
>>>>> I then use vwhichPDict to filter for transcripts that match one of my
>>>>> probes
>>>>> tx_matches<- vwhichPDict(ps_pdict, transcriptLibrary,
>>>>> max.mismatch=max_mis, min.mismatch=min_mis)
>>>>>
>>>>> and then run matchPDict over my list of transcripts that have a probe
>>>>> match using various numbers of mismatches for fuzzy matching.
>>>>> txmatches<- matchPDict(ps_pdict, transcript[[1]],
>>>>> max.mismatch=max_mis, min.mismatch=min_mis)
>>>>>
>>>>> Exact matches work well, as do 1 and 2 mismatches, but above 3 things
>>>>> go a bit wrong and I get back results with only 2 mismatches with
>>>>> max.mismatch=3, and min.mismatch=3
>>>>>
>>>>> Exact Matches:
>>>>> "names
>>>>> "	"start
>>>>> "	"end
>>>>> "	"width
>>>>> "	"TranscriptID
>>>>> "	"Matched_Sequence
>>>>> "	"probe_sequence
>>>>> "	"maxMismatchSet
>>>>> "	"minMismatchSet"	"MisMatch_Position(s)"	"NumberOfMismatchesObserved"	
>>>>> "HW:11"	169	193	
>>>>> 25
>>>>> 	"NM_001144941.1
>>>>> "	"GAACGGCTACACGGCGGTCATCGAA"	"GAACGGCTACACGGCGGTCATCGAA"	0	0	""	"0"	
>>>>> "HW:11"	169	193	
>>>>> 25
>>>>> 	"NM_001144940.1
>>>>> "	"GAACGGCTACACGGCGGTCATCGAA"	"GAACGGCTACACGGCGGTCATCGAA"	0	0	""	"0"	
>>>>> "HW:11"	169	193	
>>>>> 25
>>>>> 	"NM_182566.2"	"GAACGGCTACACGGCGGTCATCGAA"	"GAACGGCTACACGGCGGTCATCGAA"	
>>>>> 0	0	""	"0"	
>>>>> "HW:11"	169	193	
>>>>> 25
>>>>> 	"NM_001144939.1
>>>>> "	"GAACGGCTACACGGCGGTCATCGAA"	"GAACGGCTACACGGCGGTCATCGAA"	0	0	""	
>>>>> "HW:13"	120	144	
>>>>> 25
>>>>> 	"NM_007128.2"	"GCCCTTGGAACCACAATCCGCCTCA"	"GCCCTTGGAACCACAATCCGCCTCA"	
>>>>> 0	0	""	"0"	
>>>>>
>>>>> One MisMatch:
>>>>> "names
>>>>> "	"start
>>>>> "	"end
>>>>> "	"width
>>>>> "	"TranscriptID
>>>>> "	"Matched_Sequence
>>>>> "	"probe_sequence
>>>>> "	"maxMismatchSet
>>>>> "	"minMismatchSet"	"MisMatch_Position(s)"	"NumberOfMismatchesObserved"
>>>>> "HW:9"	826	850	
>>>>> 25
>>>>> 	"NM_198152.3"	"CCTAACTTTGTTATCCGTGTTGAGT"	"CCTAACTTTGTTATCCGTGTTGATT"	
>>>>> 1	1	"24"		"1"	
>>>>> "HW:18"	551	575	
>>>>> 25
>>>>> 	"NR_037144.1"	"GACTCTGCTGTGACCTCCTTGTTCA"	"GACTCTACTGTGACCTCCTTGTTCA"	
>>>>> 1	1	"7"	"1"	
>>>>>
>>>>> Two MisMatches:
>>>>> names	start	 end	width	TranscriptID	Matched_Sequence	probe_sequence	
>>>>> maxMismatchSet	minMismatchSet	MisMatch_Position(s)	
>>>>> NumberOfMismatchesObserved
>>>>> HW:22	2324	2348	25	XR_115554.1	GGAGGCCGAGGCAGGATGATTGCTT	
>>>>> GGAGGCCGAGGTAGGAGGATTGCTT	2	2	12; 17	2
>>>>> IV:14	805	829	25	XR_115163.1	CGCACACACACACACACATACACAC	
>>>>> CGCACACACACACACACACACACAT	2	2	19; 25	2
>>>>> HW::22	1078	1102	25	XR_115114.1	GGAGGCTGAGGTGGGAGGATTGCTT	
>>>>> GGAGGCCGAGGTAGGAGGATTGCTT	2	2	7; 13	2
>>>>> IV:14	108	132	25	XR_114935.1	CACACACACACACACACACACACAC	
>>>>> CGCACACACACACACACACACACAT	2	2	2; 25	2
>>>>> IV::14	102	126	25	XR_114935.1	CGCGCACACACACACACACACACAC	
>>>>> CGCACACACACACACACACACACAT	2	2	4; 25	2
>>>>> IV:14	106	130	25	XR_114935.1	CACACACACACACACACACACACAC	
>>>>> CGCACACACACACACACACACACAT	2	2	2; 25	2
>>>>>
>>>>> Three MisMatches:
>>>>> names	start	end	width	TranscriptID	Matched_Sequence	probe_sequence	
>>>>> maxMismatchSet	minMismatchSet	MisMatch_Position(s)	
>>>>> NumberOfMismatchesObserved
>>>>> HGW::17::7::22	2324	2348	25	XR_115554.1	GGAGGCCGAGGCAGGATGATTGCTT	
>>>>> GGAGGCCGAGGTAGGAGGATTGCTT	3	3	12; 17	2
>>>>> INV::5::14::14	805	829	25	XR_115163.1	CGCACACACACACACACATACACAC	
>>>>> CGCACACACACACACACACACACAT	3	3	19; 25	2
>>>>> HGW::17::7::22	1078	1102	25	XR_115114.1	GGAGGCTGAGGTGGGAGGATTGCTT	
>>>>> GGAGGCCGAGGTAGGAGGATTGCTT	3	3	7; 13	2
>>>>> INV::5::14::14	108	132	25	XR_114935.1	CACACACACACACACACACACACAC	
>>>>> CGCACACACACACACACACACACAT	3	3	2; 25	2
>>>>> INV::5::14::14	102	126	25	XR_114935.1	CGCGCACACACACACACACACACAC	
>>>>> CGCACACACACACACACACACACAT	3	3	4; 25	2
>>>>> INV::5::14::14	106	130	25	XR_114935.1	CACACACACACACACACACACACAC	
>>>>> CGCACACACACACACACACACACAT	3	3	2; 25	2
>>>>>
>>>>>
>>>>> When running vwhichPDict with three mismatches I do get a warning that
>>>>> the algorithm may not be efficient, but this seems to indicate a time
>>>>> problem rather than an accuracy problem.
>>>>> Warning message:
>>>>> In .MTB_PDict(x, max.mismatch, algo) :
>>>>>   given the characteristics of dictionary 'x', this value of
>>>>> 'max.mismatch' will
>>>>>   give poor performance when you call matchPDict() on this MTB_PDict
>>>>> object
>>>>>   (it will of course depend ultimately on the length of the subject)
>>>>>
>>>>> Indeed the output looks OK:
>>>>>
>>>>> Indeed the results of vwhichPDict gives:
>>>>> 6428 tx_matches for probes allowing max 0 min 0 mismatches
>>>>> 1077 tx_matches for probes allowing max 1 min 1 mismatches
>>>>> 1953 tx_matches for probes allowing max 2 min 2 mismatches
>>>>> 3729 tx_matches for probes allowing max 3 min 3 mismatches
>>>>>
>>>>> However, If I compare this to the output of looping over matchPDict
>>>>> for transcripts with probe matches (from vwhichPDict) I get:
>>>>> 6428 unique tx_matches for probes allowing max 0 min 0 mismatches
>>>>> 1077 unique tx_matches for probes allowing max 1 min 1 mismatches
>>>>> 1953 unique tx_matches for probes allowing max 2 min 2 mismatches
>>>>> 1953 unique tx_matches for probes allowing max 3 min 3 mismatches but
>>>>> the results show only 2 mismatches
>>>>>
>>>>> Suggesting that matchPDict reverted back to 2 mismatch settings
>>>>> (max.mismatch=2, min.mismatch=2 )in the case of max.mismatch=3,
>>>>> min.mismatch=3.  (It also seems to be the case for max=3, min=0)
>>>>>
>>>>> Probably 3 mismatches is excessive for what I want to do, but I
>>>>> thought I'd report the issue.  It seems like matchPDict is reverting
>>>>> back to 2 mismatches in this case but vwhichPDict probably does 3.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Ian
>>>>>
>>>>>
>>>>>
>>>>> On Aug 3, 2011, at 11:06 AM, Ian Henry wrote:
>>>>>
>>>>>> Excellent thanks for the help and the fix!
>>>>>>
>>>>>> The workaround works well for me for now.  I did spend a while
>>>>>> yesterday writing my own workaround but Harris' workaround is much
>>>>>> simpler for now than my solution, which is given below but is rather
>>>>>> dirty.
>>>>>>
>>>>>>> txmatches<- matchPDict(ps_pdict, transcript[[1]],
>>>>>> max.mismatch=max_mis, min.mismatch=min_mis)
>>>>>>> tdf<-as.data.frame(unlist(txmatches))
>>>>>>> patternIndex<-which(countIndex(txmatches)>0)
>>>>>>> duptimes<- function(duptimes)
>>>>>> rep(duptimes,length(txmatches[[duptimes]]))
>>>>>>> tmp<-sapply(patternIndex,duptimes)
>>>>>>> tdf$patternIndex<-unlist(tmp)
>>>>>>> getNames<- function(getNames) names(ps_pdict)[getNames]
>>>>>>> tdf$name<-sapply(tdf$patternIndex, getNames)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Aug 3, 2011, at 3:04 AM, Hervé Pagès wrote:
>>>>>>
>>>>>>> Hi Ian, Harris,
>>>>>>>
>>>>>>> On 11-08-02 08:45 AM, Harris A. Jaffee wrote:
>>>>>>>> The short answer for the fuzzy matching case appears to be that,
>>>>>>>> even
>>>>>>>> though your names are maintained through most of the function
>>>>>>>>
>>>>>>>> matchPDict.R:.match.MTB_PDict,
>>>>>>>>
>>>>>>>> and in fact they are still present in the list 'ans_compons', they
>>>>>>>> get
>>>>>>>> dropped in this call:
>>>>>>>>
>>>>>>>> ans<- ByPos_MIndex.combine(ans_compons)
>>>>>>>>
>>>>>>>> By contrast, for exact matching, the function
>>>>>>>>
>>>>>>>> matchPDict.R:.match.TB_PDict
>>>>>>>>
>>>>>>>> returns your names here:
>>>>>>>>
>>>>>>>> new("ByPos_MIndex", width0=width(pdict), NAMES=names(pdict),
>>>>>>>> ends=C_ans)
>>>>>>>>
>>>>>>>> So my naive guess is that
>>>>>>>>
>>>>>>>> MIndex-class.R:ByPos_MIndex.combine
>>>>>>>>
>>>>>>>> should do that also, something like
>>>>>>>>
>>>>>>>> ans_names<- mi_list[[1]]@NAMES
>>>>>>>> new("ByPos_MIndex", width0=ans_width0, ends=ans_ends,
>>>>>>>> NAMES=ans_names)
>>>>>>>
>>>>>>> Exactly :-) This is now fixed in Biostrings 2.20.2 (release) and
>>>>>>> will be soon in Biostrings devel. With Biostrings 2.20.2:
>>>>>>>
>>>>>>> library(Biostrings)
>>>>>>> probes<- DNAStringSet(c(probe1="aaaaaa", probe2="accacc"))
>>>>>>> pdict1<- PDict(probes, max.mismatch=1)
>>>>>>> m1<- matchPDict(pdict1, DNAString("ggaaaaaccccc"), max.mismatch=1)
>>>>>>>
>>>>>>> Then:
>>>>>>>
>>>>>>>> names(m1)
>>>>>>> [1] "probe1" "probe2"
>>>>>>>
>>>>>>>> unlist(m1)
>>>>>>> IRanges of length 3
>>>>>>>     start end width  names
>>>>>>> [1]     2   7     6 probe1
>>>>>>> [2]     3   8     6 probe1
>>>>>>> [3]     7  12     6 probe2
>>>>>>>
>>>>>>> Thanks guys for the report and for the fix!
>>>>>>> H.
>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 2, 2011, at 5:24 AM, Ian Henry wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have a question regarding the inheritance of the names attribute
>>>>>>>>> when using matchPDict.
>>>>>>>>>
>>>>>>>>> If I use matchPDict as follows:
>>>>>>>>>
>>>>>>>>> #Get transcript information
>>>>>>>>>> hg19txdb<- makeTranscriptDbFromUCSC(genome = "hg19", tablename =
>>>>>>>>> "refGene")
>>>>>>>>>> hg19_tx<- extractTranscriptsFromGenome(Hsapiens, hg19txdb)
>>>>>>>>>
>>>>>>>>> #Create DNAStringSet with names associated with each probe
>>>>>>>>>> probeset<- DNAStringSet(probelist$sequence)
>>>>>>>>>> names(probeset)<-probelist$probenames
>>>>>>>>>
>>>>>>>>> #Create PDict object and match against human transcript 14 (I
>>>>>>>>> know it
>>>>>>>>> should match)
>>>>>>>>>> ps_pdict<-PDict(probeset)
>>>>>>>>>> txmatches<- matchPDict(ps_pdict, hg19_tx[[14]])
>>>>>>>>>
>>>>>>>>> this compares the probes in ps_pdict to transcript 14 in hg19 and
>>>>>>>>> gives:
>>>>>>>>>> unlist(txmatches):
>>>>>>>>>
>>>>>>>>> start end width names
>>>>>>>>> [1] 749 773 25 HW:6
>>>>>>>>> [2] 569 593 25 HW:16
>>>>>>>>> [3] 804 828 25 HW:26
>>>>>>>>> [4] 757 781 25 HW:36
>>>>>>>>>
>>>>>>>>> which works :)
>>>>>>>>>
>>>>>>>>> However, if I search allowing for mismatches then the names
>>>>>>>>> appear to
>>>>>>>>> be lost:
>>>>>>>>>
>>>>>>>>>> ps_pdict1<-PDict(probeset, max.mismatch=1)
>>>>>>>>>> txmatches1<- matchPDict(ps_pdict1, hg19_tx[[14]],
>>>>>>>>> max.mismatch=1,
>>>>>>>>> min.mismatch=0)
>>>>>>>>>> unlist(txmatches1)
>>>>>>>>>
>>>>>>>>> IRanges of length 4
>>>>>>>>> start end width
>>>>>>>>> [1] 749 773 25
>>>>>>>>> [2] 569 593 25
>>>>>>>>> [3] 804 828 25
>>>>>>>>> [4] 757 781 25
>>>>>>>>>
>>>>>>>>> The result of matchPDict is a MIndex object that I named txmatches
>>>>>>>>> with exact matches, and txmatches1 with 1 mismatch
>>>>>>>>>> names(txmatches) #gives character vector containing probe names
>>>>>>>>>> names(txmatches1) #returns NULL
>>>>>>>>>
>>>>>>>>> So it appears the names are not inherited. I tried to added them
>>>>>>>>> manually to my MIndex object
>>>>>>>>>> names(txmatches1)<-names(probeset)
>>>>>>>>>
>>>>>>>>> but I get Error:
>>>>>>>>> attempt to modify the names of a ByPos_MIndex instance
>>>>>>>>>
>>>>>>>>> Therefore I'm not sure how to keep my probe names associated with
>>>>>>>>> the
>>>>>>>>> Transcript match, which is important for inexact matching.
>>>>>>>>>
>>>>>>>>> Any help would be greatly appreciated,
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Ian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> sessionInfo()
>>>>>>>>>
>>>>>>>>> R version 2.13.0 beta (2011-03-31 r55221)
>>>>>>>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>>>>>>>
>>>>>>>>> locale:
>>>>>>>>> [1] C/UTF-8/C/C/C/C
>>>>>>>>>
>>>>>>>>> attached base packages:
>>>>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>>>>
>>>>>>>>> other attached packages:
>>>>>>>>> [1] plyr_1.5.2 BSgenome.Hsapiens.UCSC.hg19_1.3.17
>>>>>>>>> [3] BSgenome_1.19.5 Biostrings_2.19.17
>>>>>>>>> [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31
>>>>>>>>> [7] IRanges_1.9.28
>>>>>>>>>
>>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>>> [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0
>>>>>>>>> [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1
>>>>>>>>> [7] rtracklayer_1.11.12 tools_2.13.0
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives:
>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Hervé Pagès
>>>>>>>
>>>>>>> Program in Computational Biology
>>>>>>> Division of Public Health Sciences
>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>> P.O. Box 19024
>>>>>>> Seattle, WA 98109-1024
>>>>>>>
>>>>>>> E-mail: hpages at fhcrc.org
>>>>>>> Phone:  (206) 667-5791
>>>>>>> Fax:    (206) 667-1319
>>>>>>
>>>>>
>>>>>
>>>>> 	[[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319