[BioC] HowTo: "Extract masked sequences" / "insert Ns for repeat masked regions"

Cook, Malcolm MEC at stowers.org
Fri Sep 16 22:36:27 CEST 2011


Hi, fellow bioconductors,

I offer the following function as an implementation of a solution to the problem presented in

	[BioC] insert Ns for repeat masked regions - https://stat.ethz.ch/pipermail/bioconductor/2011-March/038134.html
	[Bioc-sig-seq] Extract masked sequences - https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2010-January/000786.html 

I have cc:ed the original discussants on those two threads.

I embrace any criticisms or suggestions for improvement, and I hope it helps any colleague faced with the same issue.

Cheers,

Malcolm Cook


getSeqHardMasked <-
  function(BSg,GR,maskList,letter) {
### PURPOSE: return list of DNAString sequences extracted from the
### BSgenome <BSg> corresponding to each location in GenomicRange
### <GR>, and masked with <letter> according to the masks named in
### <maskList> (which are encoded following BSParams convention).
###
### USE CASE - write fasta file of hard masked regions, using
###            RepeatMasker (RM) and Tandem Repeat Finder (TRF):
###
### GR <- GRanges('chr2L',IRanges(c(1,100),c(15,125)))
### writeFASTA(getSeqHardMasked(BSgenome, GR, c(RM=TRUE,TRF=TRUE), "N")
###            ,"myExtractForGR.fa"
###            ,paste(seqnames(GR),start(GR),end(GR),strand(GR),sep=':')
###            )
###
### NB: The implementation was coded 'pay(ing) attention to memory
### management' following suggestions from Herve in:
### https://stat.ethz.ch/pipermail/bioconductor/2011-March/038143.html.
### In particular, the inidividual chromosomes and their
### subseq(uences) should be garbage collectable after the function
### exits and they go out of scope, (although the chromosomes _are_
### all simultaneously loaded which I think is unavoidable if the
### results are to preserve the arbitrary order of GR).
###
### NB: My initial implementation FAILed as it used bsapply & BSParams
### whose FUN can not 'know' the name of the sequence (which was
### needed to know which subseqs to extract).
    ']]' <-
      ## utility to subscript b by a.
      function(a,b) b[[a]]
    Vsubseq <-
      ## vectorized version of subseq.
      Vectorize(subseq)
    VinjectHardMask <-
      ## vectorized version of injectHardMask.
      Vectorize(injectHardMask)
    activeMask <-
      ## A logical vector indicating whether each mask should be ON or
      ## OFF to be applied to each chromosome in BSg.
      masknames(BSg) %in% names(maskList[which(maskList)])
    BSg_seqList <-
      ## BSg as a list of named MaskedDNAString (one per chromosome)...
      sapply(seqnames(BSg),']]',BSg)
    BSg_seqList <-
      ## ... with the masks for each chromosome activated.
      sapply(BSg_seqList,function(x) {active(masks(x)) <- activeMask;x})
    GR_seq <-
      ## the MaskedDNAString corresponding to GR
      sapply(as.character(seqnames(GR)),']]',BSg_seqList)
    VinjectHardMask(Vsubseq(GR_seq,start(GR),end(GR)),letter=letter)
}



More information about the Bioconductor mailing list