[BioC] matchPDict with mismatches allowed appears to drop names
Hervé Pagès
hpages at fhcrc.org
Fri Aug 5 08:43:25 CEST 2011
Hi Ian,
On 11-08-04 08:11 AM, Ian Henry wrote:
> Hello again,
>
> I'm not sure whether to append this or add a new thread but I have
> another issue with matchPDict.
>
> I can create my PDict object:
> ps_pdict<-PDict(probeset, max.mismatch=max_mis)
>
>
> I then use vwhichPDict to filter for transcripts that match one of my
> probes
> tx_matches<- vwhichPDict(ps_pdict, transcriptLibrary,
> max.mismatch=max_mis, min.mismatch=min_mis)
>
> and then run matchPDict over my list of transcripts that have a probe
> match using various numbers of mismatches for fuzzy matching.
> txmatches<- matchPDict(ps_pdict, transcript[[1]],
> max.mismatch=max_mis, min.mismatch=min_mis)
>
> Exact matches work well, as do 1 and 2 mismatches, but above 3 things
> go a bit wrong and I get back results with only 2 mismatches with
> max.mismatch=3, and min.mismatch=3
>
> Exact Matches:
> "names
> " "start
> " "end
> " "width
> " "TranscriptID
> " "Matched_Sequence
> " "probe_sequence
> " "maxMismatchSet
> " "minMismatchSet" "MisMatch_Position(s)" "NumberOfMismatchesObserved"
> "HW:11" 169 193
> 25
> "NM_001144941.1
> " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" "0"
> "HW:11" 169 193
> 25
> "NM_001144940.1
> " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 "" "0"
> "HW:11" 169 193
> 25
> "NM_182566.2" "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA"
> 0 0 "" "0"
> "HW:11" 169 193
> 25
> "NM_001144939.1
> " "GAACGGCTACACGGCGGTCATCGAA" "GAACGGCTACACGGCGGTCATCGAA" 0 0 ""
> "HW:13" 120 144
> 25
> "NM_007128.2" "GCCCTTGGAACCACAATCCGCCTCA" "GCCCTTGGAACCACAATCCGCCTCA"
> 0 0 "" "0"
>
> One MisMatch:
> "names
> " "start
> " "end
> " "width
> " "TranscriptID
> " "Matched_Sequence
> " "probe_sequence
> " "maxMismatchSet
> " "minMismatchSet" "MisMatch_Position(s)" "NumberOfMismatchesObserved"
> "HW:9" 826 850
> 25
> "NM_198152.3" "CCTAACTTTGTTATCCGTGTTGAGT" "CCTAACTTTGTTATCCGTGTTGATT"
> 1 1 "24" "1"
> "HW:18" 551 575
> 25
> "NR_037144.1" "GACTCTGCTGTGACCTCCTTGTTCA" "GACTCTACTGTGACCTCCTTGTTCA"
> 1 1 "7" "1"
>
> Two MisMatches:
> names start end width TranscriptID Matched_Sequence probe_sequence
> maxMismatchSet minMismatchSet MisMatch_Position(s)
> NumberOfMismatchesObserved
> HW:22 2324 2348 25 XR_115554.1 GGAGGCCGAGGCAGGATGATTGCTT
> GGAGGCCGAGGTAGGAGGATTGCTT 2 2 12; 17 2
> IV:14 805 829 25 XR_115163.1 CGCACACACACACACACATACACAC
> CGCACACACACACACACACACACAT 2 2 19; 25 2
> HW::22 1078 1102 25 XR_115114.1 GGAGGCTGAGGTGGGAGGATTGCTT
> GGAGGCCGAGGTAGGAGGATTGCTT 2 2 7; 13 2
> IV:14 108 132 25 XR_114935.1 CACACACACACACACACACACACAC
> CGCACACACACACACACACACACAT 2 2 2; 25 2
> IV::14 102 126 25 XR_114935.1 CGCGCACACACACACACACACACAC
> CGCACACACACACACACACACACAT 2 2 4; 25 2
> IV:14 106 130 25 XR_114935.1 CACACACACACACACACACACACAC
> CGCACACACACACACACACACACAT 2 2 2; 25 2
>
> Three MisMatches:
> names start end width TranscriptID Matched_Sequence probe_sequence
> maxMismatchSet minMismatchSet MisMatch_Position(s)
> NumberOfMismatchesObserved
> HGW::17::7::22 2324 2348 25 XR_115554.1 GGAGGCCGAGGCAGGATGATTGCTT
> GGAGGCCGAGGTAGGAGGATTGCTT 3 3 12; 17 2
> INV::5::14::14 805 829 25 XR_115163.1 CGCACACACACACACACATACACAC
> CGCACACACACACACACACACACAT 3 3 19; 25 2
> HGW::17::7::22 1078 1102 25 XR_115114.1 GGAGGCTGAGGTGGGAGGATTGCTT
> GGAGGCCGAGGTAGGAGGATTGCTT 3 3 7; 13 2
> INV::5::14::14 108 132 25 XR_114935.1 CACACACACACACACACACACACAC
> CGCACACACACACACACACACACAT 3 3 2; 25 2
> INV::5::14::14 102 126 25 XR_114935.1 CGCGCACACACACACACACACACAC
> CGCACACACACACACACACACACAT 3 3 4; 25 2
> INV::5::14::14 106 130 25 XR_114935.1 CACACACACACACACACACACACAC
> CGCACACACACACACACACACACAT 3 3 2; 25 2
Please provide a self-contained reproducible example.
Below I try to run something only *very* approximate to what you
seem to have done (based on the information you provide above),
but I can't reproduce the problem.
library(Biostrings)
## Using the 6 probes and 6 matched sequences from your "Three
## MisMatches" example:
probes <- DNAStringSet(c("GGAGGCCGAGGTAGGAGGATTGCTT",
"CGCACACACACACACACACACACAT",
"GGAGGCCGAGGTAGGAGGATTGCTT",
"CGCACACACACACACACACACACAT",
"CGCACACACACACACACACACACAT",
"CGCACACACACACACACACACACAT"))
matched_sequences <- DNAStringSet(c("GGAGGCCGAGGCAGGATGATTGCTT",
"CGCACACACACACACACATACACAC",
"GGAGGCTGAGGTGGGAGGATTGCTT",
"CACACACACACACACACACACACAC",
"CGCGCACACACACACACACACACAC",
"CACACACACACACACACACACACAC"))
## Computing the Hamming distances:
> sapply(1:6, function(i) neditStartingAt(matched_sequences[[i]],
probes[[i]]))
[1] 2 2 2 2 2 2
## Using matchPDict() with max.mismatch=3 and min.mismatch=3.
ps_pdict <- PDict(probeset, max.mismatch=3)
## probe 1 vs sequence 1:
> m1 <- matchPDict(ps_pdict, matched_sequences[[1]],
max.mismatch=3, min.mismatch=3)
> m1[[1]]
IRanges of length 0
## --> no hit
## probe 2 vs sequence 2:
> m2 <- matchPDict(ps_pdict, matched_sequences[[2]],
max.mismatch=3, min.mismatch=3)
> m2[[2]]
IRanges of length 0
## --> no hit
## probe 3 vs sequence 3:
> m3 <- matchPDict(ps_pdict, matched_sequences[[3]],
max.mismatch=3, min.mismatch=3)
> m3[[3]]
IRanges of length 0
## --> no hit
## probe 4 vs sequence 4:
> m4 <- matchPDict(ps_pdict, matched_sequences[[4]],
max.mismatch=3, min.mismatch=3)
> m4[[4]]
IRanges of length 2
start end width
[1] -1 23 25
[2] 3 27 25
## --> probe 4 hits sequence 4 twice but those 2 hits
## are "out-of-limit" hits. The reason those hits are
## considered to have 3 mismatches is that letters in
## the pattern that are not facing a letter in the subject
## are counted as mismatches:
> neditStartingAt(probes[[4]], matched_sequences[[4]],
starting.at=-2:4)
[1] 25 3 25 2 25 3 25
## probe 5 vs sequence 5:
> m5 <- matchPDict(ps_pdict, matched_sequences[[5]],
max.mismatch=3, min.mismatch=3)
> m5[[5]]
IRanges of length 0
## --> no hit
## probe 6 vs sequence 6:
> m6 <- matchPDict(ps_pdict, matched_sequences[[6]],
max.mismatch=3, min.mismatch=3)
> m6[[6]]
IRanges of length 2
start end width
[1] -1 23 25
[2] 3 27 25
## -> same as probe 4 vs sequence 4
## Consistency with vwhichPDict():
> countIndex(m1)
[1] 0 0 0 0 0 0
> countIndex(m2)
[1] 0 0 0 0 0 0
> countIndex(m3)
[1] 0 0 0 0 0 0
> countIndex(m4)
[1] 0 2 0 2 2 2
> countIndex(m5)
[1] 0 0 0 0 0 0
> countIndex(m6)
[1] 0 2 0 2 2 2
> vwhichPDict(ps_pdict, matched_sequences,
max.mismatch=3, min.mismatch=3)
[[1]]
integer(0)
[[2]]
integer(0)
[[3]]
integer(0)
[[4]]
[1] 2 4 5 6
[[5]]
integer(0)
[[6]]
[1] 2 4 5 6
## --> things are consistent
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.20.2 IRanges_1.10.4
loaded via a namespace (and not attached):
[1] tools_2.13.0
H.
>
>
> When running vwhichPDict with three mismatches I do get a warning that
> the algorithm may not be efficient, but this seems to indicate a time
> problem rather than an accuracy problem.
> Warning message:
> In .MTB_PDict(x, max.mismatch, algo) :
> given the characteristics of dictionary 'x', this value of
> 'max.mismatch' will
> give poor performance when you call matchPDict() on this MTB_PDict
> object
> (it will of course depend ultimately on the length of the subject)
>
> Indeed the output looks OK:
>
> Indeed the results of vwhichPDict gives:
> 6428 tx_matches for probes allowing max 0 min 0 mismatches
> 1077 tx_matches for probes allowing max 1 min 1 mismatches
> 1953 tx_matches for probes allowing max 2 min 2 mismatches
> 3729 tx_matches for probes allowing max 3 min 3 mismatches
>
> However, If I compare this to the output of looping over matchPDict
> for transcripts with probe matches (from vwhichPDict) I get:
> 6428 unique tx_matches for probes allowing max 0 min 0 mismatches
> 1077 unique tx_matches for probes allowing max 1 min 1 mismatches
> 1953 unique tx_matches for probes allowing max 2 min 2 mismatches
> 1953 unique tx_matches for probes allowing max 3 min 3 mismatches but
> the results show only 2 mismatches
>
> Suggesting that matchPDict reverted back to 2 mismatch settings
> (max.mismatch=2, min.mismatch=2 )in the case of max.mismatch=3,
> min.mismatch=3. (It also seems to be the case for max=3, min=0)
>
> Probably 3 mismatches is excessive for what I want to do, but I
> thought I'd report the issue. It seems like matchPDict is reverting
> back to 2 mismatches in this case but vwhichPDict probably does 3.
>
> Thanks,
>
> Ian
>
>
>
> On Aug 3, 2011, at 11:06 AM, Ian Henry wrote:
>
>> Excellent thanks for the help and the fix!
>>
>> The workaround works well for me for now. I did spend a while
>> yesterday writing my own workaround but Harris' workaround is much
>> simpler for now than my solution, which is given below but is rather
>> dirty.
>>
>>> txmatches<- matchPDict(ps_pdict, transcript[[1]],
>> max.mismatch=max_mis, min.mismatch=min_mis)
>>> tdf<-as.data.frame(unlist(txmatches))
>>> patternIndex<-which(countIndex(txmatches)>0)
>>> duptimes<- function(duptimes)
>> rep(duptimes,length(txmatches[[duptimes]]))
>>> tmp<-sapply(patternIndex,duptimes)
>>> tdf$patternIndex<-unlist(tmp)
>>> getNames<- function(getNames) names(ps_pdict)[getNames]
>>> tdf$name<-sapply(tdf$patternIndex, getNames)
>>
>>
>>
>> On Aug 3, 2011, at 3:04 AM, Hervé Pagès wrote:
>>
>>> Hi Ian, Harris,
>>>
>>> On 11-08-02 08:45 AM, Harris A. Jaffee wrote:
>>>> The short answer for the fuzzy matching case appears to be that,
>>>> even
>>>> though your names are maintained through most of the function
>>>>
>>>> matchPDict.R:.match.MTB_PDict,
>>>>
>>>> and in fact they are still present in the list 'ans_compons', they
>>>> get
>>>> dropped in this call:
>>>>
>>>> ans<- ByPos_MIndex.combine(ans_compons)
>>>>
>>>> By contrast, for exact matching, the function
>>>>
>>>> matchPDict.R:.match.TB_PDict
>>>>
>>>> returns your names here:
>>>>
>>>> new("ByPos_MIndex", width0=width(pdict), NAMES=names(pdict),
>>>> ends=C_ans)
>>>>
>>>> So my naive guess is that
>>>>
>>>> MIndex-class.R:ByPos_MIndex.combine
>>>>
>>>> should do that also, something like
>>>>
>>>> ans_names<- mi_list[[1]]@NAMES
>>>> new("ByPos_MIndex", width0=ans_width0, ends=ans_ends,
>>>> NAMES=ans_names)
>>>
>>> Exactly :-) This is now fixed in Biostrings 2.20.2 (release) and
>>> will be soon in Biostrings devel. With Biostrings 2.20.2:
>>>
>>> library(Biostrings)
>>> probes<- DNAStringSet(c(probe1="aaaaaa", probe2="accacc"))
>>> pdict1<- PDict(probes, max.mismatch=1)
>>> m1<- matchPDict(pdict1, DNAString("ggaaaaaccccc"), max.mismatch=1)
>>>
>>> Then:
>>>
>>> > names(m1)
>>> [1] "probe1" "probe2"
>>>
>>> > unlist(m1)
>>> IRanges of length 3
>>> start end width names
>>> [1] 2 7 6 probe1
>>> [2] 3 8 6 probe1
>>> [3] 7 12 6 probe2
>>>
>>> Thanks guys for the report and for the fix!
>>> H.
>>>
>>>>
>>>> On Aug 2, 2011, at 5:24 AM, Ian Henry wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a question regarding the inheritance of the names attribute
>>>>> when using matchPDict.
>>>>>
>>>>> If I use matchPDict as follows:
>>>>>
>>>>> #Get transcript information
>>>>>> hg19txdb<- makeTranscriptDbFromUCSC(genome = "hg19", tablename =
>>>>> "refGene")
>>>>>> hg19_tx<- extractTranscriptsFromGenome(Hsapiens, hg19txdb)
>>>>>
>>>>> #Create DNAStringSet with names associated with each probe
>>>>>> probeset<- DNAStringSet(probelist$sequence)
>>>>>> names(probeset)<-probelist$probenames
>>>>>
>>>>> #Create PDict object and match against human transcript 14 (I
>>>>> know it
>>>>> should match)
>>>>>> ps_pdict<-PDict(probeset)
>>>>>> txmatches<- matchPDict(ps_pdict, hg19_tx[[14]])
>>>>>
>>>>> this compares the probes in ps_pdict to transcript 14 in hg19 and
>>>>> gives:
>>>>>> unlist(txmatches):
>>>>>
>>>>> start end width names
>>>>> [1] 749 773 25 HW:6
>>>>> [2] 569 593 25 HW:16
>>>>> [3] 804 828 25 HW:26
>>>>> [4] 757 781 25 HW:36
>>>>>
>>>>> which works :)
>>>>>
>>>>> However, if I search allowing for mismatches then the names
>>>>> appear to
>>>>> be lost:
>>>>>
>>>>>> ps_pdict1<-PDict(probeset, max.mismatch=1)
>>>>>> txmatches1<- matchPDict(ps_pdict1, hg19_tx[[14]],
>>>>> max.mismatch=1,
>>>>> min.mismatch=0)
>>>>>> unlist(txmatches1)
>>>>>
>>>>> IRanges of length 4
>>>>> start end width
>>>>> [1] 749 773 25
>>>>> [2] 569 593 25
>>>>> [3] 804 828 25
>>>>> [4] 757 781 25
>>>>>
>>>>> The result of matchPDict is a MIndex object that I named txmatches
>>>>> with exact matches, and txmatches1 with 1 mismatch
>>>>>> names(txmatches) #gives character vector containing probe names
>>>>>> names(txmatches1) #returns NULL
>>>>>
>>>>> So it appears the names are not inherited. I tried to added them
>>>>> manually to my MIndex object
>>>>>> names(txmatches1)<-names(probeset)
>>>>>
>>>>> but I get Error:
>>>>> attempt to modify the names of a ByPos_MIndex instance
>>>>>
>>>>> Therefore I'm not sure how to keep my probe names associated with
>>>>> the
>>>>> Transcript match, which is important for inexact matching.
>>>>>
>>>>> Any help would be greatly appreciated,
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Ian
>>>>>
>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>> R version 2.13.0 beta (2011-03-31 r55221)
>>>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>>>
>>>>> locale:
>>>>> [1] C/UTF-8/C/C/C/C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>
>>>>> other attached packages:
>>>>> [1] plyr_1.5.2 BSgenome.Hsapiens.UCSC.hg19_1.3.17
>>>>> [3] BSgenome_1.19.5 Biostrings_2.19.17
>>>>> [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31
>>>>> [7] IRanges_1.9.28
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0
>>>>> [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1
>>>>> [7] rtracklayer_1.11.12 tools_2.13.0
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>
>
>
> [[alternative HTML version deleted]]
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list