[BioC] pairwiseAlignment of PDB files to canonical protein structure

Sun Jun 3 21:15:19 CEST 2012

Hi Everyone,

I am new to this list so please forgive me if I miss something. Over the 
past few weeks, I have been attempting to match the positions provided 
by the PDB to the canonical protein structure. For instance, if a PDB 
file puts a leucine residue (CA atom) at position 5, that does not mean 
that position 5 in the canonical protein structure (as shown by UniProt 
or other databases) is a leucine, because the PDB uses its own residue 
numbering. Using CIF files from the PDB database I am more or less able 
to reconstruct the canonical numbering for about 70% of all files.

However, I would like to also align the residues I pull from the CIF 
file with the canonical structure for the structures that my algorithm 
fails to process. To do this, I am using the pairwiseAlignment function 
in the Biostrings package. The function seems to work very well, but I 
am new to alignment, so I am wondering which parameters are best suited 
to this problem.
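To make the goal concrete: once two gapped strings are available, what I 
want is a lookup from each residue of the extracted sequence to its 
position in the canonical sequence. Below is a minimal base R sketch of 
that step. The helper name "map.positions" and the toy strings are made 
up for illustration, and it assumes both strings have equal length and 
use "-" for gaps.

```r
# Hypothetical helper (not part of Biostrings): for each residue of the
# extracted sequence, return its index in the canonical sequence, or NA
# where the extracted residue sits opposite a gap in the canonical sequence.
map.positions <- function(aligned.canonical, aligned.extracted) {
  can <- strsplit(aligned.canonical, "")[[1]]
  ext <- strsplit(aligned.extracted, "")[[1]]
  stopifnot(length(can) == length(ext))

  can.index <- cumsum(can != "-")   # running canonical position per column
  can.index[can == "-"] <- NA       # column holds no canonical residue
  can.index[ext != "-"]             # keep columns with an extracted residue
}

# Toy example: the extracted residues T, A, Y, A, K
map.positions("MKTAYIAK", "--TAY-AK")
# 3 4 5 7 8
```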

Suppose I have the canonical protein sequence in "canonical.protein" and 
the CIF sequence that I pull from the PDB database in 
"protein.extracted". I then run "pairwiseAlignment(pattern = 
canonical.protein, subject = protein.extracted)" and use the default 
settings for the other parameters. If someone has done something 
similar, could they point me to parameter values that work well, 
especially for things like gapOpening and gapExtension?
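For reference, here is a sketch of the kind of call I mean with the 
parameters written out explicitly. The sequences are invented toy 
strings, and the values for type, substitutionMatrix, gapOpening and 
gapExtension are only candidates I am considering, not settings I know 
to be right. I chose type = "local-global" on the assumption that the 
extracted sequence is usually a fragment of the canonical one. In newer 
Bioconductor releases pairwiseAlignment may be provided by the pwalign 
package rather than Biostrings.

```r
# Sketch only: toy data and candidate parameter values, not a recommendation.
library(Biostrings)  # newer Bioconductor releases may need library(pwalign) as well

# Hypothetical toy sequences: the extracted one is a fragment of the canonical one.
canonical.protein <- AAString("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
protein.extracted <- AAString("IAKQRQISFVKSHFSRQ")

aln <- pairwiseAlignment(pattern = canonical.protein,
                         subject = protein.extracted,
                         type = "local-global",          # part of pattern vs. whole subject
                         substitutionMatrix = "BLOSUM62",
                         gapOpening = 10,                # candidate value
                         gapExtension = 1)               # candidate value

score(aln)           # alignment score under the chosen scoring scheme
start(pattern(aln))  # first canonical position covered by the alignment
aln                  # print the aligned ranges of both sequences
```

With settings like these, the start of the aligned pattern range would 
give the offset between the extracted numbering and the canonical 
numbering whenever the alignment contains no gaps.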

Thank you for your help,
Greg