[Bioc-sig-seq] making target from fasta file

Wed Jun 4 04:36:47 CEST 2008

Joseph Dhahbi, P.h.D. wrote:
> 
> I downloaded RepeatMasker from the Table Browser:
> http://genome.ucsc.edu/cgi-bin/hgTables?command=start

Thanks! So I downloaded it and loaded in R with:

   > allrepeats <- read.DNAStringSet("dm3rm", format="fasta")
   Read 1126586 items
   > allrepeats
     A DNAStringSet instance of length 154993
            width seq                                          names
        [1]   433 AATTCGCGTCCGCTTACCCAT...GTTATTCACGGGGTCGGGGA dm3_rmsk_NINJA_I ...
        [2]   177 TGTCGCGGATCGAACGGTGCA...GCGGTAAGATAAGATTAAAT dm3_rmsk_NINJA_LT...
        [3]  1086 ATACGATGGCTGTACCTCATG...CATGGCGCTTTCCATCAGGA dm3_rmsk_Baggins1...
        [4]    62 TCTGCATAGTCTCCGTGTCCA...GGTGCAGCATAAACACTTCT dm3_rmsk_DMLTR5 r...
        [5]   346 AAATAATCCTGGAGGAGGCAA...GTCCTTCAGGACTCTATACT dm3_rmsk_Gypsy_I ...
        [6]   243 CCTGGCTCAGTTATTCCGAGA...GAGTATAGAGTCCTGAAGGA dm3_rmsk_DMRT1C r...
        [7]   149 CGGATGCTATAGGCCGCACCC...CTCAGTTATTCCGAGAGCAC dm3_rmsk_DMRT1C r...
        [8]   230 AACCAGAATCACTGCACGGCT...GCAGATGGTGGGTGTACAGC dm3_rmsk_DMRT1C r...
        [9]   210 CAGCAGCAGTGGTGGCTATCA...CATAGCGGGCCACTGTAGAA dm3_rmsk_DMRT1C r...
        ...   ... ...
   [154985]   164 GAGATGAGATGAGAAGAGAAG...ATAGAATAGAAGAGAATAGA dm3_rmsk_(GAGAA)n...
   [154986]   169 AGAAGAGAAGAGAAGAGAAGA...GATAAGAGAATAGATGAGAA dm3_rmsk_(GAGAA)n...
   [154987]   157 CAACACAACACAACACAACAC...ACACAACATTTCACAACACA dm3_rmsk_(CACAA)n...
   [154988]   153 AGAGAGCAGAGAGAAGAGAGA...CGAGAGAAGAGAGAAGAGAG dm3_rmsk_(GA)n ra...
   [154989]   119 ATCGACTCCCTGGAAATCCCA...AAGAAGCCAAAATGGAAGCC dm3_rmsk_DMRT1B r...
   [154990]    52 ATGGAAGCCGGTCCAAAAAGA...GGCAGGTGTTCCCCAAGGAT dm3_rmsk_DMRT1B r...
   [154991]   155 TCTCTTCTCTTCTCTTCTCTT...TCTCTTCTCTTCTCTTCTCT dm3_rmsk_(TTCTC)n...
   [154992]   150 AGAGAAGAGAAGAGAAGAGAA...AGAGAAGAGAAGAGACGAGA dm3_rmsk_(GAGAA)n...
   [154993]   147 ACAAGACAAGACAAGACAAGA...AAGACAAGACAAGACAAGAC dm3_rmsk_(CAAGA)n...

It works fine for me. Here is my sessionInfo():

   > sessionInfo()
   R version 2.7.0 (2008-04-22)
   x86_64-unknown-linux-gnu

   locale:

LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base

   other attached packages:
   [1] Biostrings_2.8.12

Could you send yours please?

> I will try your suggestion

Note that if you only want to count the number of matches of your PDict object
in the repeat regions of the *entire* genome (without actually caring about
where the matches are actually occuring), then you could do something like
this (use read.XStringViews instead of read.DNAStringSet):

   > allrepeats <- read.XStringViews("dm3rm", format="fasta", subjectClass="DNAString", collapse="-")

and then call countPDict(pdict, subject(allrepeats)).
The purpose of using 'collapse="-"' when reading the file is to separate the
RepeatMasker chunks with this letter when they are all put together in the
big 'subject(allrepeats)' sequence. This is in order to avoid the matches
that would span across several chunks.

Cheers,
H.

> Thank you for your help
> 
> On Tue, 03 Jun 2008 17:14:10 -0700
>  Herve Pages <hpages at fhcrc.org> wrote:
>> Hi Joseph,
>>
>> Joseph Dhahbi, P.h.D. wrote:
>>>
>>> Hi
>>> I downloaded the drosophila RepeatMasker from UCSC GB as a text file 
>>> which is in fasta format and looks like this:
>>>> dm3_rmsk_NINJA_I range=chr4:2-434 5'pad=0 3'pad=0 strand=+ 
>>>> repeatMasking=none
>>> AATTCGCGTCCGCTTA......
>>>> dm3_rmsk_NINJA_LTR range=chr4:435-611 5'pad=0 3'pad=0 strand=+ 
>>>> repeatMasking=none
>>> TGTCGCGGATC....
>>>> dm3_rmsk_Baggins1 range=chr4:638-1723 5'pad=0 3'pad=0 strand=- 
>>>> repeatMasking=none
>>> ATACGATGG......
>>>
>>> I made the input dictionary and I would like to make the RepeatMasker 
>>> sequences as the target.  When I used ‘read.DNAStringSet’ it 
>>> recognized only the first sequence of the fasts file.  Ho do I merge 
>>> all of the sequences in and make them as a target.
>>
>> If your file is really FASTA then read.DNAStringSet() should extract all
>> the records and return a DNAStringSet object where each element 
>> corresponds
>> to a record in the original file. So it seems like you've hit a bug in 
>> the
>> read.DNAStringSet() function. Can you please provide the URL to the file
>> you downloaded so we can try to reproduce?
>>
>> Anyway, what you are trying to achieve can be done in an easier (and more
>> efficient) way. You don't need to download the RepeatMasker sequences for
>> this; just use the BSgenome.Dmelanogaster.UCSC.dm3 package. The 
>> RepeatMasker
>> information is already included in it as part of the built-in masks 
>> provided
>> for each chromosome:
>>
>>   > library(BSgenome.Dmelanogaster.UCSC.dm3)
>>   > Dmelanogaster
>>   Fly genome
>>   |
>>   | organism: Drosophila melanogaster
>>   | provider: UCSC
>>   | provider version: dm3
>>   | release date: Apr. 2006
>>   | release name: BDGP Release 5
>>   |
>>   | single sequences (see '?seqnames'):
>>   |   chr2L      chr2R      chr3L      chr3R      chr4      chrX       
>> chrU
>>   |   chrM       chr2LHet   chr2RHet   chr3LHet  chr3RHet   chrXHet    
>> chrYHet
>>   |   chrUextra
>>   |
>>   | multiple sequences (see '?mseqnames'):
>>   |   upstream1000  upstream2000  upstream5000
>>   |
>>   | (use the '$' or '[[' operator to access a given sequence)
>>   > chr2L <- Dmelanogaster$chr2L
>>   > chr2L
>>     23011544-letter "MaskedDNAString" instance (# for masking)
>>   seq: 
>> CGACAATGCACGACAGAGGAAGCAGAACAGATATTT...GCATATTTGCAAATTTTGATGAACCCCCCTTTCAAA 
>>
>>   masks:
>>     maskedwidth  maskedratio active                             names
>>   1         200 8.691290e-06  FALSE                     assembly gaps
>>   2     1966561 8.545976e-02  FALSE                      RepeatMasker
>>   3       61603 2.677048e-03  FALSE Tandem Repeats Finder [period<=12]
>>   all masks together:
>>     maskedwidth maskedratio
>>         1988181  0.08639929
>>   all active masks together:
>>     maskedwidth maskedratio
>>               0           0
>>
>> Note that the built-in masks are always inactive by default. To activate
>> a mask do:
>>
>>   > active(masks(chr2L))[2] <- TRUE  # activate the RepeatMasker mask
>>
>> Now only the parts of chr2L that are NOT repeat regions are visible.
>> To invert this, use gaps():
>>
>>   > chr2Lrepeats <- gaps(chr2L)
>>   > chr2Lrepeats
>>     23011544-letter "MaskedDNAString" instance (# for masking)
>>   seq: 
>> #GACAATGCACGACAGAGGAAGCAGAACAGATATTT...GCATATTTGCAAATTTT################### 
>>
>>   masks:
>>     maskedwidth maskedratio active
>>   1    21044983   0.9145402   TRUE
>>
>> Then use matchPDict() (or countPDict()) in the usual way.
>>
>> The GenomeSearching vignette in the BSgenome package has more
>> information about masking (some sections are still incomplete but
>> will be completed soon).
>>
>> Hope this helps,
>> H.
>>
>>
>>> Thank you for your help
>>>
>>>
>>> Regards,
>>> Joseph
>>>
>>> Joseph M. Dhahbi, PhD
>>> Childrens Hospital Oakland Research Institute
>>> 5700 Martin Luther King Jr. Way
>>> Oakland, CA 94609
>>> USA
>>> Ph.(510)428-3885 EXT.5743
>>> Cell.(702)335-0795
>>> Fax (510)450-7910
>>> jdhahbi at chori.org
>>> The email message (and any attachments) is for the sole...{{dropped:3}}
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
> 
> 
> 
> 
> Regards,
> Joseph
> 
> Joseph M. Dhahbi, PhD
> Childrens Hospital Oakland Research Institute
> 5700 Martin Luther King Jr. Way
> Oakland, CA 94609
> USA
> Ph.(510)428-3885 EXT.5743
> Cell.(702)335-0795
> Fax (510)450-7910
> jdhahbi at chori.org
> The email message (and any attachments) is for the sol...{{dropped:8}}