[BioC] how to modify a Biostrings object
Hervé Pagès
hpages at fhcrc.org
Tue Feb 10 02:27:02 CET 2009
Hi Paul,
There is very little support for modifying sequences (XString objects)
in Biostrings so far. Patrick already mentioned replaceLetterAt() which
is limited to replacing some letters in the original sequence by new
letters. The letters to replace are specified by position, and each
letter to replace is replaced by 1 letter so the length of the sequence
is conserved.
An "in place" version of replaceLetterAt() is in fact what is used behind
the scene when you inject SNPs from a SNPlocs package with injectSNPs()
(?replaceLetterAt gives some details about this).
The SNPs contained in the SNPlocs.Hsapiens.dbSNP.20071016 package are
accessible via 2 functions: getSNPcount() and getSNPlocs(), both
defined in the SNPlocs pkg. The 1st function returns the nb of known
SNPs per chromosome:
> library(SNPlocs.Hsapiens.dbSNP.20071016)
> getSNPcount()
chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11
694713 673384 563502 580624 512891 583597 480870 441049 377816 449366 451615
chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
423277 330632 265729 242233 256253 224656 237958 190296 253012 118161 149667
chrX chrY
308827 6054
and the 2nd function returns a data frame describing the SNPs for a
given chromosome:
> chr9snps <- getSNPlocs("chr9")
> dim(chr9snps)
[1] 377816 3
> head(chr9snps)
RefSNP_id alleles_as_ambig loc
1 9407282 R 42963
2 2260177 M 44415
3 10217501 S 44444
4 9299000 R 44551
5 10217133 K 44553
6 10217138 K 44692
It contains 1 SNP per row: the 1st column is the RefSNP id assigned by
dbSNP to the SNP, the 2nd col is the IUPAC ambiguity letter describing
the alleles for the SNP, and the 3rd col is its position in the chromosome.
You could for example extract region 9q32 and then inject the SNPs that
belong to this region with:
> hg18.9q32 <- unmasked(subseq(Hsapiens$chr9, 113900001, 116700000))
> snp_is_in_range <- 113900001 <= chr9snps$loc & chr9snps$loc <= 116700000
> hg18.9q32.snps <- chr9snps[snp_is_in_range, ]
> dim(hg18.9q32.snps)
[1] 8994 3
> head(hg18.9q32.snps)
RefSNP_id alleles_as_ambig loc
293622 12350562 M 113900013
293623 10124666 W 113900343
293624 3739698 K 113900698
293625 3739697 Y 113900837
293626 12377671 Y 113901284
293627 10119477 R 113901385
> hg18.9q32.withsnps <- replaceLetterAt(hg18.9q32,
hg18.9q32.snps$loc - 113900001 + 1,
hg18.9q32.snps$alleles_as_ambig,
if.not.extending="merge")
Note that you get the same thing if you inject the SNPs genome-wide
first and then extract region 9q32:
> hg18.withsnps <- injectSNPs(Hsapiens, "SNPlocs.Hsapiens.dbSNP.20071016")
> hg18.9q32.withsnps2 <- unmasked(subseq(hg18.withsnps$chr9,
113900001, 116700000))
> hg18.9q32.withsnps2 == hg18.9q32.withsnps
[1] TRUE
Now if you want to inject SNPs incrementally, one region at a time, in chr9,
you can do:
injectSNPsInRegion <- function(x, start, end, snps)
{
ii <- start <= snps$loc & snps$loc <= end
snps2 <- snps[ii, ]
replaceLetterAt(x, snps2$loc - start + 1,
snps2$alleles_as_ambig,
if.not.extending="merge")
}
> chr9 <- unmasked(Hsapiens$chr9)
> chr9withsnps <- injectSNPsInRegion(chr9, 55001, 95000, chr9snps)
If you put this in a loop, e.g.:
chr9withsnps <- chr9
for (...) {
get the start/end of a gene in chr9 from somewhere
...
chr9withsnps <- injectSNPsInRegion(chr9withsnps, start, end, chr9snps)
...
do some analysis on chr9withsnps
}
don't try to keep all the intermediate versions of chr9withsnps otherwise
you'll run into memory problems. In the above code the old 'chr9withsnps'
is replaced by the new 'chr9withsnps' at each iteration so the memory used
by the old 'chr9withsnps' can be garbage-collected.
More facilities for modifying XString objects will need to be added to the
Biostrings package if you want to be able to delete or insert regions in
the chromosome. For example I could add a replacement method for subseq()
that could be used like this:
> subseq(chr9, 201, 300) <- "" # deletion of region 201-300
> subseq(chr9, 201, 300) <- "ACCACGTAATG" # replacement of region 201-300
# by a shorter sequence
> subseq(chr9, 201, 205) <- "ACCACGTAATG" # replacement of region 201-205
# by a longer sequence
> subseq(chr9, 201, 200) <- "ACCACGTAATG" # insertion of ACCACGTAATG between
# position 200 and 201
The last form looks a little bit strange but is consistent with the general
behaviour of subseq() and with the representation of empty ranges (see ?Ranges).
Let me know if that would be useful.
H.
Paul Shannon wrote:
> I wish to introduce some mutations into a segment of Hg18 chromosome 9,
> as part of a simulation I wish to run. I am having difficulty figuring
> out how to modify the BSgenome-related data structures.
>
> I create my (wild-type) variables like this:
>
> chr9 <- Hsapiens [['chr9']]
> hg18.9q32 <<- subseq (chr9, 113900001, 116700000)
>
> How would one create a mutant variant of hg18.9q32? My wish is to
> incrementally -- one gene at a time -- introduce SNPs (and later, indels
> and translocations). Sometimes these changes are intentionally random,
> sometimes I want them at very specific places, to 'spike in' an
> interesting mutation.
>
> I see that one can injectSNPs into a genome:
>
> Hsapiens.snp <- injectSNPs (Hsapiens, "SNPlocs.Hsapiens.dbSNP.20071016")
>
> I could find no documentation for the SNPlocs data structure.
>
> hg18.9q32 appears to be a MaskedDNAString, apparently derived from
> Biostring and XString, all of which seem to be immutable.
>
> Is there any way to do this?
>
> Let me apologize in advance if this is documented and I missed it. I
> have combed the vignettes and the help pages and found nothing.
>
> Many thanks,
>
> - Paul
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list