[BioC] how to modify a Biostrings object

Tue Feb 10 02:27:02 CET 2009

Hi Paul,

There is very little support for modifying sequences (XString objects)
in Biostrings so far. Patrick already mentioned replaceLetterAt() which
is limited to replacing some letters in the original sequence by new
letters. The letters to replace are specified by position, and each
letter to replace is replaced by 1 letter so the length of the sequence
is conserved.

An "in place" version of replaceLetterAt() is in fact what is used behind
the scene when you inject SNPs from a SNPlocs package with injectSNPs()
(?replaceLetterAt gives some details about this).

The SNPs contained in the SNPlocs.Hsapiens.dbSNP.20071016 package are
accessible via 2 functions: getSNPcount() and getSNPlocs(), both
defined in the SNPlocs pkg. The 1st function returns the nb of known
SNPs per chromosome:

   > library(SNPlocs.Hsapiens.dbSNP.20071016)
   > getSNPcount()
     chr1   chr2   chr3   chr4   chr5   chr6   chr7   chr8   chr9  chr10  chr11
   694713 673384 563502 580624 512891 583597 480870 441049 377816 449366 451615
    chr12  chr13  chr14  chr15  chr16  chr17  chr18  chr19  chr20  chr21  chr22
   423277 330632 265729 242233 256253 224656 237958 190296 253012 118161 149667
     chrX   chrY
   308827   6054

and the 2nd function returns a data frame describing the SNPs for a
given chromosome:

   > chr9snps <- getSNPlocs("chr9")
   > dim(chr9snps)
   [1] 377816      3
   > head(chr9snps)
     RefSNP_id alleles_as_ambig   loc
   1   9407282                R 42963
   2   2260177                M 44415
   3  10217501                S 44444
   4   9299000                R 44551
   5  10217133                K 44553
   6  10217138                K 44692

It contains 1 SNP per row: the 1st column is the RefSNP id assigned by
dbSNP to the SNP, the 2nd col is the IUPAC ambiguity letter describing
the alleles for the SNP, and the 3rd col is its position in the chromosome.

You could for example extract region 9q32 and then inject the SNPs that
belong to this region with:

   > hg18.9q32 <- unmasked(subseq(Hsapiens$chr9, 113900001, 116700000))
   > snp_is_in_range <- 113900001 <= chr9snps$loc & chr9snps$loc <= 116700000
   > hg18.9q32.snps <- chr9snps[snp_is_in_range, ]
   > dim(hg18.9q32.snps)
   [1] 8994    3
   > head(hg18.9q32.snps)
          RefSNP_id alleles_as_ambig       loc
   293622  12350562                M 113900013
   293623  10124666                W 113900343
   293624   3739698                K 113900698
   293625   3739697                Y 113900837
   293626  12377671                Y 113901284
   293627  10119477                R 113901385
   > hg18.9q32.withsnps <- replaceLetterAt(hg18.9q32,
                             hg18.9q32.snps$loc - 113900001 + 1,
                             hg18.9q32.snps$alleles_as_ambig,
                             if.not.extending="merge")

Note that you get the same thing if you inject the SNPs genome-wide
first and then extract region 9q32:

   > hg18.withsnps <- injectSNPs(Hsapiens, "SNPlocs.Hsapiens.dbSNP.20071016")
   > hg18.9q32.withsnps2 <- unmasked(subseq(hg18.withsnps$chr9,
                                            113900001, 116700000))
   > hg18.9q32.withsnps2 == hg18.9q32.withsnps
   [1] TRUE

Now if you want to inject SNPs incrementally, one region at a time, in chr9,
you can do:

   injectSNPsInRegion <- function(x, start, end, snps)
   {
      ii <- start <= snps$loc & snps$loc <= end
      snps2 <- snps[ii, ]
      replaceLetterAt(x, snps2$loc - start + 1,
                         snps2$alleles_as_ambig,
                         if.not.extending="merge")
   }

   > chr9 <- unmasked(Hsapiens$chr9)
   > chr9withsnps <- injectSNPsInRegion(chr9, 55001, 95000, chr9snps)

If you put this in a loop, e.g.:

   chr9withsnps <- chr9
   for (...) {
     get the start/end of a gene in chr9 from somewhere
     ...
     chr9withsnps <- injectSNPsInRegion(chr9withsnps, start, end, chr9snps)
     ...
     do some analysis on chr9withsnps
   }

don't try to keep all the intermediate versions of chr9withsnps otherwise
you'll run into memory problems. In the above code the old 'chr9withsnps'
is replaced by the new 'chr9withsnps' at each iteration so the memory used
by the old 'chr9withsnps' can be garbage-collected.

More facilities for modifying XString objects will need to be added to the
Biostrings package if you want to be able to delete or insert regions in
the chromosome. For example I could add a replacement method for subseq()
that could be used like this:

   > subseq(chr9, 201, 300) <- ""  # deletion of region 201-300

   > subseq(chr9, 201, 300) <- "ACCACGTAATG"  # replacement of region 201-300
                                              # by a shorter sequence

   > subseq(chr9, 201, 205) <- "ACCACGTAATG"  # replacement of region 201-205
                                              # by a longer sequence

   > subseq(chr9, 201, 200) <- "ACCACGTAATG"  # insertion of ACCACGTAATG between
                                              # position 200 and 201

The last form looks a little bit strange but is consistent with the general
behaviour of subseq() and with the representation of empty ranges (see ?Ranges).
Let me know if that would be useful.

H.

Paul Shannon wrote:
> I wish to introduce some mutations into a segment of Hg18 chromosome 9, 
> as part of a simulation I wish to run.  I am having difficulty figuring 
> out how to modify the BSgenome-related data structures.
> 
> I create my (wild-type) variables like this:
> 
>   chr9 <- Hsapiens [['chr9']]
>   hg18.9q32 <<- subseq (chr9, 113900001, 116700000)
> 
> How would one create a mutant variant of hg18.9q32?  My wish is to 
> incrementally -- one gene at a time -- introduce SNPs (and later, indels 
> and translocations). Sometimes these changes are intentionally random, 
> sometimes I want them at very specific places, to 'spike in' an 
> interesting mutation.
> 
> I see that one can injectSNPs into a genome:
> 
>    Hsapiens.snp <- injectSNPs (Hsapiens, "SNPlocs.Hsapiens.dbSNP.20071016")
> 
> I could find no documentation for the SNPlocs data structure.
> 
> hg18.9q32 appears to be a MaskedDNAString, apparently derived from 
> Biostring and XString, all of which seem to be immutable.
> 
> Is there any way to do this?
> 
> Let me apologize in advance if this is documented and I missed it.  I 
> have combed the vignettes and the help pages and found nothing.
> 
> Many thanks,
> 
>   - Paul
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319