[Bioc-sig-seq] SNP counting (Biostrings ?) question

Wolfgang Raffelsberger wraff at titus.u-strasbg.fr
Thu Oct 23 19:24:48 CEST 2008


Dear list,

I would like to count the occurrences of (mostly single) nucleotide 
polymorphisms in nucleotide sequences.
I came across the Biostrings package and pairwiseAlignment(), which gets 
me closer to what I want, but
1) I noticed that the score produced by pairwiseAlignment() is quite 
different from that of other implementations of the Needleman-Wunsch 
algorithm (e.g. in EMBOSS)
2) the score is not directly the information I'm looking for, since it 
combines gaps & mismatches (and I don't see if/how one could 
modify that).

What I am primarily interested in is finding where a given 
nucleotide differs from the reference (based on a pairwise alignment), so 
that I can compute some statistics on the differences, i.e. at which 
position I get which other nucleotide instead. Note that my sample 
sequences may start or end slightly later/earlier than the reference.
Any suggestions?

Sample code might look like (of course, my real sequences are longer ...):

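library(Biostrings)
# NOTE: 'mat' is not defined in this snippet; the match/mismatch values
# below are an assumption for illustration only
mat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE)

ref <- DNAString("ACTTCACCAGCTCCCTGGC")
# the 3rd one has no mutations, it's simply shorter ...
samp <- DNAStringSet(c("CTTCTCCAGCTCCCTGG", "ACTTCTCCAGCTACCTGG", "TTCACCAGCTCCCTG"))
pairwiseAlignment(ref, samp[[1]], substitutionMatrix = mat,
                  gapOpening = -5, gapExtension = -2)
# collect the alignment score of each sample against the reference
alignScores <- numeric()
for (i in seq_along(samp)) alignScores[i] <- pairwiseAlignment(ref, samp[[i]],
    substitutionMatrix = mat, gapOpening = -5, gapExtension = -2, scoreOnly = TRUE)
alignScores     # the 3rd sequence without mismatches gets worst score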
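
To make the kind of output I am after more concrete, here is a rough sketch in base R only (it does not use Biostrings). It assumes the samples have already been aligned to the reference and padded with "-" to the reference length; the padded strings and the names s1 to s3 below were written by hand for illustration and are not the output of pairwiseAlignment(). Positions where a sample only has a gap are skipped, and insertions relative to the reference cannot be represented this way.

```r
ref <- "ACTTCACCAGCTCCCTGGC"
# hand-aligned versions of the three samples (assumption: "-" marks
# positions not covered by the sample)
aligned <- c(s1 = "-CTTCTCCAGCTCCCTGG-",
             s2 = "ACTTCTCCAGCTACCTGG-",
             s3 = "--TTCACCAGCTCCCTG--")

refChars <- strsplit(ref, "")[[1]]

# one row per mismatch: sample name, position in the reference,
# reference base and observed base
mm <- do.call(rbind, lapply(names(aligned), function(nm) {
  s <- strsplit(aligned[[nm]], "")[[1]]
  pos <- which(s != "-" & s != refChars)   # ignore gaps, keep true mismatches
  data.frame(sample = rep(nm, length(pos)), position = pos,
             ref = refChars[pos], obs = s[pos], stringsAsFactors = FALSE)
}))

mm                              # list of all mismatches
table(mm$position, mm$obs)      # counts of each observed base per position
```

For the three sequences above this should report A>T at position 6 for s1 and s2, C>A at position 13 for s2, and nothing for s3.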

 
(Following a previous post on BioC) I just subscribed to 
bioc-sig-sequencing at r-project.org, but I have not managed to 
search the earlier mail archives (on http://search.gmane.org/), since I 
keep getting (general) Bioconductor messages.

Thanks in advance,
Wolfgang
 

By the way, if that matters, I'm (still) running R-2.7.2
 > sessionInfo()
R version 2.7.2 (2008-08-25)
i386-pc-mingw32

locale:
LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252

attached base packages:
[1] stats     graphics  grDevices datasets  tcltk     utils     methods   base

other attached packages:
[1] Biostrings_2.8.18 svSocket_0.9-5    svIO_0.9-5        R2HTML_1.59       svMisc_0.9-5      svIDE_0.9-5

loaded via a namespace (and not attached):
[1] tools_2.7.2

 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Wolfgang Raffelsberger, PhD
Laboratoire de BioInformatique et Génomique Intégratives
CNRS UMR7104, IGBMC 
1 rue Laurent Fries,  67404 Illkirch  Strasbourg,  France
Tel (+33) 388 65 3300         Fax (+33) 388 65 3276
wolfgang.raffelsberger (at) igbmc.fr


