[BioC] insert Ns for repeat masked regions

Pages, Herve hpages at fhcrc.org
Wed Mar 2 06:39:59 CET 2011


Hi Pete,

The getSeq() function always drops the masks. See the man page:

     Note that the masks in ‘x’, if any, are always ignored. In other
     words, masked regions in the genome are extracted in the same way
     as unmasked regions (this is achieved by dropping the masks before
     extraction).

If you want to extract sequences with their masks on, you will need to
extract them one at a time by: (1) loading the corresponding chromosome,
(2) activate/deactivate the masks you want, (3) use subseq():

  > library(BSgenome.Hsapiens.UCSC.hg19)
  > chr1 <- Hsapiens$chr1  # this loads the sequence in memory
  > active(masks(chr1))
  AGAPS   AMB    RM   TRF 
   TRUE  TRUE FALSE FALSE 
  > active(masks(chr1))["RM"] <- TRUE  # activate the RepeatMasker mask
  > active(masks(chr1))
  AGAPS   AMB    RM   TRF 
   TRUE  TRUE  TRUE FALSE 
  > myseq <- subseq(chr1, start=530000, end=530999)
  > myseq
    1000-letter "MaskedDNAString" instance (# for masking)
  seq: ##############################CCGGTT...CCGGCCGTGTAGGCCCCTCTAGAGCCAAGACGCTGC
  masks:
    maskedwidth maskedratio active names                               desc
  1           0       0.000   TRUE AGAPS                      assembly gaps
  2           0       0.000   TRUE   AMB   intra-contig ambiguities (empty)
  3         594       0.594   TRUE    RM                       RepeatMasker
  4           0       0.000  FALSE   TRF Tandem Repeats Finder [period<=12]
  all masks together:
    maskedwidth maskedratio
            594       0.594
  all active masks together:
    maskedwidth maskedratio
            594       0.594
  > unmasked(myseq)
    1000-letter "DNAString" instance
  seq: TAAAAGTACCCTGTCAAAGGGTGAGCATTTCCGGTT...CCGGCCGTGTAGGCCCCTCTAGAGCCAAGACGCTGC

Then you can use injectHardMask() on 'myseq' to replace the masked regions
with Ns:

  > injectHardMask(myseq, "N")
    1000-letter "DNAString" instance
  seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCGGTT...CCGGCCGTGTAGGCCCCTCTAGAGCCAAGACGCTGC

If you need to extract sequences from many different chromosomes you'll
need to pay attention to memory management. Ideally you want to load one
chromosome at a time (this is a costly operation), and, when you are done
with it, you want to make it a candidate for garbage collection by removing
explicitly any object holding a reference to it. In the above example,
both 'chr1' and 'myseq' are holding a reference to the chr1 sequence that
was loaded in memory (in the case of 'myseq' this is because subseq()
doesn't copy the sequence data) so they both need to be removed with
'rm(chr1)' and 'rm(myseq)'. Keeping the result of injectHardMask() around
is ok because injectHardMask() does copy the sequence data.

See the PASS-BY-ADDRESS SEMANTIC, CACHING AND MEMORY USAGE subsection
in the examples section of the man page for BSgenome objects for more
information and tips about this.

Cheers,
H.

----- Original Message -----
From: "rna seq" <rna.seeker at gmail.com>
To: "BioC List" <bioconductor at stat.math.ethz.ch>
Sent: Tuesday, March 1, 2011 2:50:39 PM
Subject: [BioC] insert Ns for repeat masked regions

Hello List,

I am trying to retrieve a sequence of ~1000 nts using the getseq() function
from the BSgenomes package

I would like to replace the repeat masked regions with Ns

using something similar to the inject snps function from the
SNPlocs.Hsapiens.dbSNP package.

So far I can grab sequence from the genome either masked: getSeq(Hsapiens,
"chr21",  33665196, 33665435,  as.character=FALSE)

or unmasked: getSeq(hg19snp, "chr21",  33665196, 33665435,
as.character=FALSE)

The problem is that the masked function returns a gap:

TCCCAGGATGTGACATTGTTTGCCAGTGCAGAGGC...GGAGCTTTGGAAGAAGAGAGAGTTGACTACGGAAA

 and I would like the gap to be filled with Ns?

Thanks in advance,

Pete

	[[alternative HTML version deleted]]

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list