[BioC] getting masked sequence from BSgenome: can I use Views or getSeq?
Cook, Malcolm
MEC at stowers.org
Fri Nov 9 15:26:27 CET 2012
> Hi there,
>
> I have some ranges of interest in the human genome, specified as a GRanges object (a bunch of them, in a list), and I'd like to use
> alphabetFrequency on their RepeatMasked sequences. I've found an inefficient way to do this, but I wondered
> (a) whether I'm missing a better way to do it, or
> (b) whether it'd be possible for you to implement some version of Views that returns masked sequences, rather than dropping the
> masks
Hi Janet - perhaps this will help:
getSeqHardMasked <- function(BSg,GR,maskList=NULL,letter='N') {
### PURPOSE: return list of DNAString sequences extracted from the
### BSgenome <BSg> corresponding to each location in GenomicRange
### <GR>, and masked with <letter> according to the masks named in
### <maskList> (which are encoded following BSParams convention).
###
### USE CASE - write fasta file of hard masked regions, using
### RepeatMasker (RM) and Tandem Repeat Finder (TRF):
###
### GR <- GRanges('chr2L',IRanges(c(1,100),c(15,125)))
### writeFASTA(getSeqHardMasked(BSgenome, GR, c(RM=TRUE,TRF=TRUE), "N")
### ,"myExtractForGR.fa"
### ,paste(seqnames(GR),start(GR),end(GR),strand(GR),sep=':')
### )
###
### AUTHOR: malcolm_cook at stowers.org
###
### NB: The implementation was coded 'pay(ing) attention to memory
### management' following suggestions from Herve in:
### https://stat.ethz.ch/pipermail/bioconductor/2011-March/038143.html.
### In particular, the inidividual chromosomes and their
### subseq(uences) should be garbage collectable after the function
### exits and they go out of scope, (although the chromosomes _are_
### all simultaneously loaded which I think is unavoidable if the
### results are to preserve the arbitrary order of GR).
###
### NB: My initial implementation FAILed as it used bsapply & BSParams
### whose FUN can not 'know' the name of the sequence (which was
### needed to know which subseqs to extract).
###
### TODO: factor out the creation of masked BSGenome seqlist for use
### when this needs to be done to multiple sets of GRanges for the
### same genome.
###
### TODO? - provide options to: filter out COMPLETELY masked regions?
### trim ends that are masked?
']]' <-
## utility to subscript b by a.
function(a,b) b[[a]]
Vsubseq <-
## vectorized version of subseq.
Vectorize(subseq)
VinjectHardMask <-
## vectorized version of injectHardMask.
Vectorize(injectHardMask)
activeMask <-
## A logical vector indicating whether each mask should be ON or
## OFF to be applied to each chromosome in BSg.
masknames(BSg) %in% names(maskList[which(maskList)])
BSg_seqList <-
## BSg as a list of named MaskedDNAString (one per chromosome)...
sapply(seqnames(BSg),']]',BSg)
BSg_seqList <-
## ... with the masks for each chromosome activated.
sapply(BSg_seqList,function(x) {active(masks(x)) <- activeMask;x})
GR_seq <-
## the MaskedDNAString corresponding to GR
sapply(as.character(seqnames(GR)),']]',BSg_seqList)
VinjectHardMask(Vsubseq(GR_seq,start(GR),end(GR)),letter=letter)
}
>
> My inefficient (in memory) version is to create a hard-masked version of the genome like this:
>
> allMaskedSeqsHsapiens <- list()
> for (thisChr in seqnames(Hsapiens)) {
> cat("chr",thisChr,"\n")
> myChr <- Hsapiens [[thisChr]]
> active(masks(myChr)) <- rep(TRUE, 4)
> myChr <- injectHardMask(myChr)
> allMaskedSeqsHsapiens[[thisChr]] <- myChr
> }
>
> and then for each of the GRanges objects I'm looking at (each has ranges only on a single chromosome), I get the sequence using
> Views on the relevant hard-masked sequence from allMaskedSeqsHsapiens[[thisChr]].
>
> It seems I shouldn't have to create the hard-masked genome list - is there a direct way to do this? Views always seems to drop the
> masks, and I don't think getSeq works for MaskedDNAString objects. Maybe I should be doing some roundabout coercion to get the
> MaskedDNAString chromosome into a DNAString format, retaining the masks? But perhaps it would be nice to have be able to use
> Views or getSeq directly to get masked sequences.
>
> Motivation: I want to calculating repeat-masked intronic GC content for each human gene.
>
> thanks very much,
>
> Janet
>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list