[BioC] getting masked sequence from BSgenome: can I use Views or getSeq?

Cook, Malcolm MEC at stowers.org
Fri Nov 9 15:26:27 CET 2012


> Hi there,
> 
> I have some ranges of interest in the human genome, specified as a GRanges object (a bunch of them, in a list), and I'd like to use
> alphabetFrequency on their RepeatMasked sequences. I've found an inefficient way to do this, but I wondered
> (a) whether I'm missing a better way to do it, or
> (b) whether it'd be possible for you to implement some version of Views that returns masked sequences, rather than dropping the
> masks

Hi Janet - perhaps this will help:


getSeqHardMasked <-  function(BSg,GR,maskList=NULL,letter='N') {
### PURPOSE: return list of DNAString sequences extracted from the
### BSgenome <BSg> corresponding to each location in GenomicRange
### <GR>, and masked with <letter> according to the masks named in
### <maskList> (which are encoded following BSParams convention).
###
### USE CASE - write fasta file of hard masked regions, using
###            RepeatMasker (RM) and Tandem Repeat Finder (TRF):
###
### GR <- GRanges('chr2L',IRanges(c(1,100),c(15,125)))
### writeFASTA(getSeqHardMasked(BSgenome, GR, c(RM=TRUE,TRF=TRUE), "N")
###            ,"myExtractForGR.fa"
###            ,paste(seqnames(GR),start(GR),end(GR),strand(GR),sep=':')
###            )
###
### AUTHOR: malcolm_cook at stowers.org
###
### NB: The implementation was coded 'pay(ing) attention to memory
### management' following suggestions from Herve in:
### https://stat.ethz.ch/pipermail/bioconductor/2011-March/038143.html.
### In particular, the inidividual chromosomes and their
### subseq(uences) should be garbage collectable after the function
### exits and they go out of scope, (although the chromosomes _are_
### all simultaneously loaded which I think is unavoidable if the
### results are to preserve the arbitrary order of GR).
###
### NB: My initial implementation FAILed as it used bsapply & BSParams
### whose FUN can not 'know' the name of the sequence (which was
### needed to know which subseqs to extract).
###
### TODO: factor out the creation of masked BSGenome seqlist for use
### when this needs to be done to multiple sets of GRanges for the
### same genome.
###
### TODO? - provide options to: filter out COMPLETELY masked regions?
### trim ends that are masked?
    ']]' <-
      ## utility to subscript b by a.
      function(a,b) b[[a]]
    Vsubseq <-
      ## vectorized version of subseq.
      Vectorize(subseq)
    VinjectHardMask <-
      ## vectorized version of injectHardMask.
      Vectorize(injectHardMask)
    activeMask <-
      ## A logical vector indicating whether each mask should be ON or
      ## OFF to be applied to each chromosome in BSg.
      masknames(BSg) %in% names(maskList[which(maskList)])
    BSg_seqList <-
      ## BSg as a list of named MaskedDNAString (one per chromosome)...
      sapply(seqnames(BSg),']]',BSg)
    BSg_seqList <-
      ## ... with the masks for each chromosome activated.
      sapply(BSg_seqList,function(x) {active(masks(x)) <- activeMask;x})
    GR_seq <-
      ## the MaskedDNAString corresponding to GR
      sapply(as.character(seqnames(GR)),']]',BSg_seqList)
    VinjectHardMask(Vsubseq(GR_seq,start(GR),end(GR)),letter=letter)
}

> 
> My inefficient (in memory) version is to create a hard-masked version of the genome like this:
> 
> allMaskedSeqsHsapiens <- list()
> for (thisChr in seqnames(Hsapiens)) {
>     cat("chr",thisChr,"\n")
>     myChr <- Hsapiens [[thisChr]]
>     active(masks(myChr)) <- rep(TRUE, 4)
>     myChr <- injectHardMask(myChr)
>     allMaskedSeqsHsapiens[[thisChr]] <- myChr
> }
> 
> and then for each of the GRanges objects I'm looking at (each has ranges only on a single chromosome), I get the sequence using
> Views on the relevant hard-masked sequence from allMaskedSeqsHsapiens[[thisChr]].
> 
> It seems I shouldn't have to create the hard-masked genome list - is there a direct way to do this?  Views always seems to drop the
> masks, and I don't think getSeq works for MaskedDNAString objects.  Maybe I should be doing some roundabout coercion to get the
> MaskedDNAString chromosome into a DNAString format, retaining the masks?  But perhaps it would be nice to have be able to use
> Views or getSeq directly to get masked sequences.
> 
> Motivation: I want to calculating repeat-masked intronic GC content for each human gene.
> 
> thanks very much,
> 
> Janet
> 
> 
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list