[BioC] matchPDict with mismatches allowed appears to drop names
Hervé Pagès
hpages at fhcrc.org
Fri Aug 5 09:02:48 CEST 2011
Answering to the list...
On 11-08-04 12:28 PM, Ian Henry wrote:
> I did still see the bug with your code but I fixed the problem:
>
> I have been making a new pdict object with the specific number of mismatches before running matchPDict.
>
> However, when preparing you a small example to illustrate my point I discovered the code worked correctly when using just one transcript.
>
> When I ran this again iterating over the whole transcriptome I noticed my code was erroring.
>
> I think the real issue is with transcript XM_003119554
>
> One probe matches it with 3 mismatches but the match starts at -2 which screwed up my later subseq() call and then failed causing corruption between my two runs (with 2 and 3 mismatches)
>
> start end width names TranscriptID
> 1 -2 22 25 KP:5 XM_003119554.1
>
> I now don't think there is an issue with matchPDict, it was just this unexpected error that I missed in my codeblock.
OK I see. Yes when doing fuzzy matching you can see "out-of-limits"
hits which can then cause you troubles downstream if not handled
carefully.
Cheers,
H.
>
> I'm still getting to grips with debugging my R code effectively and I guess I missed this one.
>
> Sorry Harris and Herve thanks for your help.
>
> Ian
>
>
>
> On Aug 4, 2011, at 6:29 PM, Harris A. Jaffee wrote:
>
>> Do you still see a bug with this, i.e. some matches with only 2 errors?
>>
>>>> pdict3 = PDict(probeset, max.mismatch=3)
>>>> m3 = matchPDict(pdict3, subject3, max.mismatch=3, min.mismatch=3)
>>
>> If so, can you give a small example?
>>
>> Or did my advice just avoid your original problem?
>>
>> On Aug 4, 2011, at 11:43 AM, Ian Henry wrote:
>>
>>> Sorry I wasn't clear, I am doing that as I thought it best too.
>>>
>>> On Aug 4, 2011, at 5:38 PM, Harris A. Jaffee wrote:
>>>
>>>> I'm not certain, but I think it best to build and use a different
>>>> PDict object for each 'max_mis' value for which you are going to
>>>> call matchPDict:
>>>>
>>>> pdict0 = PDict(probeset)
>>>> pdict1 = PDict(probeset, max.mismatch=1)
>>>> pdict2 = PDict(probeset, max.mismatch=2)
>>>> pdict3 = PDict(probeset, max.mismatch=3)
>>>>
>>>> m0 = matchPDict(pdict0, subject0)
>>>> m1 = matchPDict(pdict1, subject1, max.mismatch=1, min.mismatch=1)
>>>> m2 = matchPDict(pdict2, subject2, max.mismatch=2, min.mismatch=2)
>>>> m3 = matchPDict(pdict3, subject3, max.mismatch=3, min.mismatch=3)
>>>>
>>>> But you still may have found a bug.
>>>>
>>>> On Aug 4, 2011, at 11:11 AM, Ian Henry wrote:
>>>>
>>>>> Hello again,
>>>>>
>>>>> I'm not sure whether to append this or add a new thread but I have
>>>>> another issue with matchPDict.
>>>>>
>>>>> I can create my PDict object:
>>>>> ps_pdict<-PDict(probeset, max.mismatch=max_mis)
>>>>>
>>>>>
>>>>> I then use vwhichPDict to filter for transcripts that match one of my
>>>>> probes
>>>>> tx_matches<- vwhichPDict(ps_pdict, transcriptLibrary,
>>>>> max.mismatch=max_mis, min.mismatch=min_mis)
>>>>>
>>>>> and then run matchPDict over my list of transcripts that have a probe
>>>>> match using various numbers of mismatches for fuzzy matching.
>>>>> txmatches<- matchPDict(ps_pdict, transcript[[1]],
>>>>> max.mismatch=max_mis, min.mismatch=min_mis)
>>>>>
>>>>> Exact matches work well, as do 1 and 2 mismatches, but above 3 things
>>>>> go a bit wrong and I get back results with only 2 mismatches with
>>>>> max.mismatch=3, and min.mismatch=3
>>>>>
>>>>> Exact Matches:
>>>>> "names
>>>>> " "start
>>>>> " "end
>>>>> " "width
>>>>> " "TranscriptID
>>>>> " "Matched_Sequence
>>>>> " "probe_sequence
>>>>> " "maxMismatchSet
>>>>> " "minMismatchSet" "MisMatch_Position(s)" "NumberOfMismatchesObserved"
>>>>> "HW:11" 169 193
>>>>> 25
>>>>> "NM_001144941.1
>>>>> " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" "0"
>>>>> "HW:11" 169 193
>>>>> 25
>>>>> "NM_001144940.1
>>>>> " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" "0"
>>>>> "HW:11" 169 193
>>>>> 25
>>>>> "NM_182566.2" "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA"
>>>>> 0 0 "" "0"
>>>>> "HW:11" 169 193
>>>>> 25
>>>>> "NM_001144939.1
>>>>> " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 ""
>>>>> "HW:13" 120 144
>>>>> 25
>>>>> "NM_007128.2" "GCCCTTGGAACCACAATCCGCCTCA" "GCCCTTGGAACCACAATCCGCCTCA"
>>>>> 0 0 "" "0"
>>>>>
>>>>> One MisMatch:
>>>>> "names
>>>>> " "start
>>>>> " "end
>>>>> " "width
>>>>> " "TranscriptID
>>>>> " "Matched_Sequence
>>>>> " "probe_sequence
>>>>> " "maxMismatchSet
>>>>> " "minMismatchSet" "MisMatch_Position(s)" "NumberOfMismatchesObserved"
>>>>> "HW:9" 826 850
>>>>> 25
>>>>> "NM_198152.3" "CCTAACTTTGTTATCCGTGTTGAGT" "CCTAACTTTGTTATCCGTGTTGATT"
>>>>> 1 1 "24" "1"
>>>>> "HW:18" 551 575
>>>>> 25
>>>>> "NR_037144.1" "GACTCTGCTGTGACCTCCTTGTTCA" "GACTCTACTGTGACCTCCTTGTTCA"
>>>>> 1 1 "7" "1"
>>>>>
>>>>> Two MisMatches:
>>>>> names start end width TranscriptID Matched_Sequence probe_sequence
>>>>> maxMismatchSet minMismatchSet MisMatch_Position(s)
>>>>> NumberOfMismatchesObserved
>>>>> HW:22 2324 2348 25 XR_115554.1 GGAGGCCGAGGCAGGATGATTGCTT
>>>>> GGAGGCCGAGGTAGGAGGATTGCTT 2 2 12; 17 2
>>>>> IV:14 805 829 25 XR_115163.1 CGCACACACACACACACATACACAC
>>>>> CGCACACACACACACACACACACAT 2 2 19; 25 2
>>>>> HW::22 1078 1102 25 XR_115114.1 GGAGGCTGAGGTGGGAGGATTGCTT
>>>>> GGAGGCCGAGGTAGGAGGATTGCTT 2 2 7; 13 2
>>>>> IV:14 108 132 25 XR_114935.1 CACACACACACACACACACACACAC
>>>>> CGCACACACACACACACACACACAT 2 2 2; 25 2
>>>>> IV::14 102 126 25 XR_114935.1 CGCGCACACACACACACACACACAC
>>>>> CGCACACACACACACACACACACAT 2 2 4; 25 2
>>>>> IV:14 106 130 25 XR_114935.1 CACACACACACACACACACACACAC
>>>>> CGCACACACACACACACACACACAT 2 2 2; 25 2
>>>>>
>>>>> Three MisMatches:
>>>>> names start end width TranscriptID Matched_Sequence probe_sequence
>>>>> maxMismatchSet minMismatchSet MisMatch_Position(s)
>>>>> NumberOfMismatchesObserved
>>>>> HGW::17::7::22 2324 2348 25 XR_115554.1 GGAGGCCGAGGCAGGATGATTGCTT
>>>>> GGAGGCCGAGGTAGGAGGATTGCTT 3 3 12; 17 2
>>>>> INV::5::14::14 805 829 25 XR_115163.1 CGCACACACACACACACATACACAC
>>>>> CGCACACACACACACACACACACAT 3 3 19; 25 2
>>>>> HGW::17::7::22 1078 1102 25 XR_115114.1 GGAGGCTGAGGTGGGAGGATTGCTT
>>>>> GGAGGCCGAGGTAGGAGGATTGCTT 3 3 7; 13 2
>>>>> INV::5::14::14 108 132 25 XR_114935.1 CACACACACACACACACACACACAC
>>>>> CGCACACACACACACACACACACAT 3 3 2; 25 2
>>>>> INV::5::14::14 102 126 25 XR_114935.1 CGCGCACACACACACACACACACAC
>>>>> CGCACACACACACACACACACACAT 3 3 4; 25 2
>>>>> INV::5::14::14 106 130 25 XR_114935.1 CACACACACACACACACACACACAC
>>>>> CGCACACACACACACACACACACAT 3 3 2; 25 2
>>>>>
>>>>>
>>>>> When running vwhichPDict with three mismatches I do get a warning that
>>>>> the algorithm may not be efficient, but this seems to indicate a time
>>>>> problem rather than an accuracy problem.
>>>>> Warning message:
>>>>> In .MTB_PDict(x, max.mismatch, algo) :
>>>>> given the characteristics of dictionary 'x', this value of
>>>>> 'max.mismatch' will
>>>>> give poor performance when you call matchPDict() on this MTB_PDict
>>>>> object
>>>>> (it will of course depend ultimately on the length of the subject)
>>>>>
>>>>> Indeed the output looks OK:
>>>>>
>>>>> Indeed the results of vwhichPDict gives:
>>>>> 6428 tx_matches for probes allowing max 0 min 0 mismatches
>>>>> 1077 tx_matches for probes allowing max 1 min 1 mismatches
>>>>> 1953 tx_matches for probes allowing max 2 min 2 mismatches
>>>>> 3729 tx_matches for probes allowing max 3 min 3 mismatches
>>>>>
>>>>> However, If I compare this to the output of looping over matchPDict
>>>>> for transcripts with probe matches (from vwhichPDict) I get:
>>>>> 6428 unique tx_matches for probes allowing max 0 min 0 mismatches
>>>>> 1077 unique tx_matches for probes allowing max 1 min 1 mismatches
>>>>> 1953 unique tx_matches for probes allowing max 2 min 2 mismatches
>>>>> 1953 unique tx_matches for probes allowing max 3 min 3 mismatches but
>>>>> the results show only 2 mismatches
>>>>>
>>>>> Suggesting that matchPDict reverted back to 2 mismatch settings
>>>>> (max.mismatch=2, min.mismatch=2 )in the case of max.mismatch=3,
>>>>> min.mismatch=3. (It also seems to be the case for max=3, min=0)
>>>>>
>>>>> Probably 3 mismatches is excessive for what I want to do, but I
>>>>> thought I'd report the issue. It seems like matchPDict is reverting
>>>>> back to 2 mismatches in this case but vwhichPDict probably does 3.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Ian
>>>>>
>>>>>
>>>>>
>>>>> On Aug 3, 2011, at 11:06 AM, Ian Henry wrote:
>>>>>
>>>>>> Excellent thanks for the help and the fix!
>>>>>>
>>>>>> The workaround works well for me for now. I did spend a while
>>>>>> yesterday writing my own workaround but Harris' workaround is much
>>>>>> simpler for now than my solution, which is given below but is rather
>>>>>> dirty.
>>>>>>
>>>>>>> txmatches<- matchPDict(ps_pdict, transcript[[1]],
>>>>>> max.mismatch=max_mis, min.mismatch=min_mis)
>>>>>>> tdf<-as.data.frame(unlist(txmatches))
>>>>>>> patternIndex<-which(countIndex(txmatches)>0)
>>>>>>> duptimes<- function(duptimes)
>>>>>> rep(duptimes,length(txmatches[[duptimes]]))
>>>>>>> tmp<-sapply(patternIndex,duptimes)
>>>>>>> tdf$patternIndex<-unlist(tmp)
>>>>>>> getNames<- function(getNames) names(ps_pdict)[getNames]
>>>>>>> tdf$name<-sapply(tdf$patternIndex, getNames)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Aug 3, 2011, at 3:04 AM, Hervé Pagès wrote:
>>>>>>
>>>>>>> Hi Ian, Harris,
>>>>>>>
>>>>>>> On 11-08-02 08:45 AM, Harris A. Jaffee wrote:
>>>>>>>> The short answer for the fuzzy matching case appears to be that,
>>>>>>>> even
>>>>>>>> though your names are maintained through most of the function
>>>>>>>>
>>>>>>>> matchPDict.R:.match.MTB_PDict,
>>>>>>>>
>>>>>>>> and in fact they are still present in the list 'ans_compons', they
>>>>>>>> get
>>>>>>>> dropped in this call:
>>>>>>>>
>>>>>>>> ans<- ByPos_MIndex.combine(ans_compons)
>>>>>>>>
>>>>>>>> By contrast, for exact matching, the function
>>>>>>>>
>>>>>>>> matchPDict.R:.match.TB_PDict
>>>>>>>>
>>>>>>>> returns your names here:
>>>>>>>>
>>>>>>>> new("ByPos_MIndex", width0=width(pdict), NAMES=names(pdict),
>>>>>>>> ends=C_ans)
>>>>>>>>
>>>>>>>> So my naive guess is that
>>>>>>>>
>>>>>>>> MIndex-class.R:ByPos_MIndex.combine
>>>>>>>>
>>>>>>>> should do that also, something like
>>>>>>>>
>>>>>>>> ans_names<- mi_list[[1]]@NAMES
>>>>>>>> new("ByPos_MIndex", width0=ans_width0, ends=ans_ends,
>>>>>>>> NAMES=ans_names)
>>>>>>>
>>>>>>> Exactly :-) This is now fixed in Biostrings 2.20.2 (release) and
>>>>>>> will be soon in Biostrings devel. With Biostrings 2.20.2:
>>>>>>>
>>>>>>> library(Biostrings)
>>>>>>> probes<- DNAStringSet(c(probe1="aaaaaa", probe2="accacc"))
>>>>>>> pdict1<- PDict(probes, max.mismatch=1)
>>>>>>> m1<- matchPDict(pdict1, DNAString("ggaaaaaccccc"), max.mismatch=1)
>>>>>>>
>>>>>>> Then:
>>>>>>>
>>>>>>>> names(m1)
>>>>>>> [1] "probe1" "probe2"
>>>>>>>
>>>>>>>> unlist(m1)
>>>>>>> IRanges of length 3
>>>>>>> start end width names
>>>>>>> [1] 2 7 6 probe1
>>>>>>> [2] 3 8 6 probe1
>>>>>>> [3] 7 12 6 probe2
>>>>>>>
>>>>>>> Thanks guys for the report and for the fix!
>>>>>>> H.
>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 2, 2011, at 5:24 AM, Ian Henry wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have a question regarding the inheritance of the names attribute
>>>>>>>>> when using matchPDict.
>>>>>>>>>
>>>>>>>>> If I use matchPDict as follows:
>>>>>>>>>
>>>>>>>>> #Get transcript information
>>>>>>>>>> hg19txdb<- makeTranscriptDbFromUCSC(genome = "hg19", tablename =
>>>>>>>>> "refGene")
>>>>>>>>>> hg19_tx<- extractTranscriptsFromGenome(Hsapiens, hg19txdb)
>>>>>>>>>
>>>>>>>>> #Create DNAStringSet with names associated with each probe
>>>>>>>>>> probeset<- DNAStringSet(probelist$sequence)
>>>>>>>>>> names(probeset)<-probelist$probenames
>>>>>>>>>
>>>>>>>>> #Create PDict object and match against human transcript 14 (I
>>>>>>>>> know it
>>>>>>>>> should match)
>>>>>>>>>> ps_pdict<-PDict(probeset)
>>>>>>>>>> txmatches<- matchPDict(ps_pdict, hg19_tx[[14]])
>>>>>>>>>
>>>>>>>>> this compares the probes in ps_pdict to transcript 14 in hg19 and
>>>>>>>>> gives:
>>>>>>>>>> unlist(txmatches):
>>>>>>>>>
>>>>>>>>> start end width names
>>>>>>>>> [1] 749 773 25 HW:6
>>>>>>>>> [2] 569 593 25 HW:16
>>>>>>>>> [3] 804 828 25 HW:26
>>>>>>>>> [4] 757 781 25 HW:36
>>>>>>>>>
>>>>>>>>> which works :)
>>>>>>>>>
>>>>>>>>> However, if I search allowing for mismatches then the names
>>>>>>>>> appear to
>>>>>>>>> be lost:
>>>>>>>>>
>>>>>>>>>> ps_pdict1<-PDict(probeset, max.mismatch=1)
>>>>>>>>>> txmatches1<- matchPDict(ps_pdict1, hg19_tx[[14]],
>>>>>>>>> max.mismatch=1,
>>>>>>>>> min.mismatch=0)
>>>>>>>>>> unlist(txmatches1)
>>>>>>>>>
>>>>>>>>> IRanges of length 4
>>>>>>>>> start end width
>>>>>>>>> [1] 749 773 25
>>>>>>>>> [2] 569 593 25
>>>>>>>>> [3] 804 828 25
>>>>>>>>> [4] 757 781 25
>>>>>>>>>
>>>>>>>>> The result of matchPDict is a MIndex object that I named txmatches
>>>>>>>>> with exact matches, and txmatches1 with 1 mismatch
>>>>>>>>>> names(txmatches) #gives character vector containing probe names
>>>>>>>>>> names(txmatches1) #returns NULL
>>>>>>>>>
>>>>>>>>> So it appears the names are not inherited. I tried to added them
>>>>>>>>> manually to my MIndex object
>>>>>>>>>> names(txmatches1)<-names(probeset)
>>>>>>>>>
>>>>>>>>> but I get Error:
>>>>>>>>> attempt to modify the names of a ByPos_MIndex instance
>>>>>>>>>
>>>>>>>>> Therefore I'm not sure how to keep my probe names associated with
>>>>>>>>> the
>>>>>>>>> Transcript match, which is important for inexact matching.
>>>>>>>>>
>>>>>>>>> Any help would be greatly appreciated,
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Ian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> sessionInfo()
>>>>>>>>>
>>>>>>>>> R version 2.13.0 beta (2011-03-31 r55221)
>>>>>>>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>>>>>>>
>>>>>>>>> locale:
>>>>>>>>> [1] C/UTF-8/C/C/C/C
>>>>>>>>>
>>>>>>>>> attached base packages:
>>>>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>>>>
>>>>>>>>> other attached packages:
>>>>>>>>> [1] plyr_1.5.2 BSgenome.Hsapiens.UCSC.hg19_1.3.17
>>>>>>>>> [3] BSgenome_1.19.5 Biostrings_2.19.17
>>>>>>>>> [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31
>>>>>>>>> [7] IRanges_1.9.28
>>>>>>>>>
>>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>>> [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0
>>>>>>>>> [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1
>>>>>>>>> [7] rtracklayer_1.11.12 tools_2.13.0
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives:
>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Hervé Pagès
>>>>>>>
>>>>>>> Program in Computational Biology
>>>>>>> Division of Public Health Sciences
>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>> P.O. Box 19024
>>>>>>> Seattle, WA 98109-1024
>>>>>>>
>>>>>>> E-mail: hpages at fhcrc.org
>>>>>>> Phone: (206) 667-5791
>>>>>>> Fax: (206) 667-1319
>>>>>>
>>>>>
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list