[BioC] insert Ns for repeat masked regions
Pages, Herve
hpages at fhcrc.org
Wed Mar 2 06:39:59 CET 2011
Hi Pete,
The getSeq() function always drops the masks. See the man page:
Note that the masks in ‘x’, if any, are always ignored. In other
words, masked regions in the genome are extracted in the same way
as unmasked regions (this is achieved by dropping the masks before
extraction).
If you want to extract sequences with their masks on, you will need to
extract them one at a time by: (1) loading the corresponding chromosome,
(2) activate/deactivate the masks you want, (3) use subseq():
> library(BSgenome.Hsapiens.UCSC.hg19)
> chr1 <- Hsapiens$chr1 # this loads the sequence in memory
> active(masks(chr1))
AGAPS AMB RM TRF
TRUE TRUE FALSE FALSE
> active(masks(chr1))["RM"] <- TRUE # activate the RepeatMasker mask
> active(masks(chr1))
AGAPS AMB RM TRF
TRUE TRUE TRUE FALSE
> myseq <- subseq(chr1, start=530000, end=530999)
> myseq
1000-letter "MaskedDNAString" instance (# for masking)
seq: ##############################CCGGTT...CCGGCCGTGTAGGCCCCTCTAGAGCCAAGACGCTGC
masks:
maskedwidth maskedratio active names desc
1 0 0.000 TRUE AGAPS assembly gaps
2 0 0.000 TRUE AMB intra-contig ambiguities (empty)
3 594 0.594 TRUE RM RepeatMasker
4 0 0.000 FALSE TRF Tandem Repeats Finder [period<=12]
all masks together:
maskedwidth maskedratio
594 0.594
all active masks together:
maskedwidth maskedratio
594 0.594
> unmasked(myseq)
1000-letter "DNAString" instance
seq: TAAAAGTACCCTGTCAAAGGGTGAGCATTTCCGGTT...CCGGCCGTGTAGGCCCCTCTAGAGCCAAGACGCTGC
Then you can use injectHardMask() on 'myseq' to replace the masked regions
with Ns:
> injectHardMask(myseq, "N")
1000-letter "DNAString" instance
seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCGGTT...CCGGCCGTGTAGGCCCCTCTAGAGCCAAGACGCTGC
If you need to extract sequences from many different chromosomes you'll
need to pay attention to memory management. Ideally you want to load one
chromosome at a time (this is a costly operation), and, when you are done
with it, you want to make it a candidate for garbage collection by removing
explicitly any object holding a reference to it. In the above example,
both 'chr1' and 'myseq' are holding a reference to the chr1 sequence that
was loaded in memory (in the case of 'myseq' this is because subseq()
doesn't copy the sequence data) so they both need to be removed with
'rm(chr1)' and 'rm(myseq)'. Keeping the result of injectHardMask() around
is ok because injectHardMask() does copy the sequence data.
See the PASS-BY-ADDRESS SEMANTIC, CACHING AND MEMORY USAGE subsection
in the examples section of the man page for BSgenome objects for more
information and tips about this.
Cheers,
H.
----- Original Message -----
From: "rna seq" <rna.seeker at gmail.com>
To: "BioC List" <bioconductor at stat.math.ethz.ch>
Sent: Tuesday, March 1, 2011 2:50:39 PM
Subject: [BioC] insert Ns for repeat masked regions
Hello List,
I am trying to retrieve a sequence of ~1000 nts using the getseq() function
from the BSgenomes package
I would like to replace the repeat masked regions with Ns
using something similar to the inject snps function from the
SNPlocs.Hsapiens.dbSNP package.
So far I can grab sequence from the genome either masked: getSeq(Hsapiens,
"chr21", 33665196, 33665435, as.character=FALSE)
or unmasked: getSeq(hg19snp, "chr21", 33665196, 33665435,
as.character=FALSE)
The problem is that the masked function returns a gap:
TCCCAGGATGTGACATTGTTTGCCAGTGCAGAGGC...GGAGCTTTGGAAGAAGAGAGAGTTGACTACGGAAA
and I would like the gap to be filled with Ns?
Thanks in advance,
Pete
[[alternative HTML version deleted]]
_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list