[BioC] HowTo: "Extract masked sequences" / "insert Ns for repeat masked regions"
Cook, Malcolm
MEC at stowers.org
Fri Sep 16 22:36:27 CEST 2011
Hi, fellow bioconductors,
I offer the following function as an implementation of a solution to the problem presented in
[BioC] insert Ns for repeat masked regions - https://stat.ethz.ch/pipermail/bioconductor/2011-March/038134.html
[Bioc-sig-seq] Extract masked sequences - https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2010-January/000786.html
I have cc:ed the original discussants on those two threads.
I embrace any criticisms or suggestions for improvement, and I hope it helps any colleague faced with the same issue.
Cheers,
Malcolm Cook
getSeqHardMasked <-
function(BSg,GR,maskList,letter) {
### PURPOSE: return list of DNAString sequences extracted from the
### BSgenome <BSg> corresponding to each location in GenomicRange
### <GR>, and masked with <letter> according to the masks named in
### <maskList> (which are encoded following BSParams convention).
###
### USE CASE - write fasta file of hard masked regions, using
### RepeatMasker (RM) and Tandem Repeat Finder (TRF):
###
### GR <- GRanges('chr2L',IRanges(c(1,100),c(15,125)))
### writeFASTA(getSeqHardMasked(BSgenome, GR, c(RM=TRUE,TRF=TRUE), "N")
### ,"myExtractForGR.fa"
### ,paste(seqnames(GR),start(GR),end(GR),strand(GR),sep=':')
### )
###
### NB: The implementation was coded 'pay(ing) attention to memory
### management' following suggestions from Herve in:
### https://stat.ethz.ch/pipermail/bioconductor/2011-March/038143.html.
### In particular, the inidividual chromosomes and their
### subseq(uences) should be garbage collectable after the function
### exits and they go out of scope, (although the chromosomes _are_
### all simultaneously loaded which I think is unavoidable if the
### results are to preserve the arbitrary order of GR).
###
### NB: My initial implementation FAILed as it used bsapply & BSParams
### whose FUN can not 'know' the name of the sequence (which was
### needed to know which subseqs to extract).
']]' <-
## utility to subscript b by a.
function(a,b) b[[a]]
Vsubseq <-
## vectorized version of subseq.
Vectorize(subseq)
VinjectHardMask <-
## vectorized version of injectHardMask.
Vectorize(injectHardMask)
activeMask <-
## A logical vector indicating whether each mask should be ON or
## OFF to be applied to each chromosome in BSg.
masknames(BSg) %in% names(maskList[which(maskList)])
BSg_seqList <-
## BSg as a list of named MaskedDNAString (one per chromosome)...
sapply(seqnames(BSg),']]',BSg)
BSg_seqList <-
## ... with the masks for each chromosome activated.
sapply(BSg_seqList,function(x) {active(masks(x)) <- activeMask;x})
GR_seq <-
## the MaskedDNAString corresponding to GR
sapply(as.character(seqnames(GR)),']]',BSg_seqList)
VinjectHardMask(Vsubseq(GR_seq,start(GR),end(GR)),letter=letter)
}
More information about the Bioconductor
mailing list