[BioC] Extarcting large number of sequences using BSgenome package

Hervé Pagès hpages at fhcrc.org
Thu Aug 18 16:30:30 CEST 2011


Viritha,

On 11-08-16 10:54 AM, viritha kaza wrote:
> Hi Herve,
> Thanks for your response. System is considering as characters in the
> dataframe.So I was thinking of using them as vectors.But if I scan as
> seperate elements and convert them to numeric using lapply.It becomes
> atomic numeric.How do I make them as vectors without using combining
> fuction as c(1,2,3)?I am basically new to sequences using R.So will read
> write.XStringSet().

Your question is particularly unclear and it seems like you would
benefit a lot from reading a little bit about data frames and
read.table(). There are a lot of arguments to this function that
give you a lot of control on how the data will actually be loaded
into the data frame. In particular look at the 'colClasses' arg:
it lets you control the types of the columns of the data frame.

That will help you solve the first step which is loading the
file into a data frame. I don't think you need any
scan/lappl/combining for this step.

Once you've solved this, step 2 will be to use getSeq() to extract
your sequences and step 3 will be to write your sequences to a FASTA
file with write.XStringSet().

Right now, it's almost impossible to know where you are encountering
difficulties. Please have a look at our posting guide before your
next post:

   http://bioconductor.org/help/mailing-list/posting-guide/

Thanks!
H.


> Thanks,
> Viritha
> 2011/8/16 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>     Hi Viritha,
>
>
>     On 11-08-16 08:01 AM, viritha kaza wrote:
>
>         Hi group,
>         I am interested in retrieving about 2000 sequences with the specific
>         chromosome number,start and end site.
>         I was thinking of using BSgenome package for this.
>
>
>             source("http://bioconductor.__org/biocLite.R
>             <http://bioconductor.org/biocLite.R>")
>
>
>             biocLite("BSgenome","BSgenome.__Hsapiens.UCSC.hg19")
>
>
>             library(BSgenome)
>
>
>             library("BSgenome.Hsapiens.__UCSC.hg19")
>
>
>             myseq<- getSeq(Hsapiens,"chr2",start=__10000,end=10020)
>
>
>         # this would work for specific values.
>
>         #So I tried to use a dataframe where it would retrieve
>         chromosome number
>         from by using
>
>             full<-as.matrix(read.table(“__test_seq.txt”,sep=’\t’,quote=”__”,header=True,
>
>         as.is <http://as.is/>=TRUE))
>
>
>     Why are you doing as.matrix here? Do you *really* want all the columns
>     to be coerced to the same type? (which in your case is probably
>     character because you have already one column of that type).
>
>
>
>             full.df<-data.frame(full)
>
>
>     So you're back to a data.frame, except that now all the columns are
>     character vectors :-/
>
>
>
>         # test_seq contains this information
>
>         #chromosome Start End
>         #chr2 10000 10020
>         #chr3 10000 10020
>
>
>     Why not copy and paste what you get when you just type full.df? That
>     way would know exactly what you ended up with (if full.df is too big,
>     just show us the output of head(full.df)).
>     Anyway, it looks like you'll have to get rid of those #. Otherwise
>     passing Hsapiens,full.df$chromosome to getSeq() with those # in it
>     will surely cause problems.
>
>
>
>             myseq<-
>             getSeq(Hsapiens,full.df$__chromosome,start=10000,end=__10020)
>
>
>         #but then when I use start=full.df$Start. It naturally throws an
>         error
>         saying 'start' must be a vector of integers
>
>         Questions:
>
>         How Do I handle this?
>
>
>     By having full.df$Start be a vector of integers and also by reading
>     the man page for getSeq().
>
>
>
>         Does start here mean that each chromosome numbering starts from 1?
>
>
>     Yes, the first base at the 5' end of the + strand (or 3' end of the -
>     strand) is considered to be position 1.
>
>
>
>         How do I split each sequence retrieved and create as fasta
>         format (>)with
>         sequence name attached to them retrieved from my input file?
>
>
>     How do you want to "split" each sequence retrieved? As a good starting
>     point you could have a look at write.XStringSet(). Note that
>     the names you put on your DNAStringSet object will be used as the
>     record description lines in the resulting FASTA file.
>
>     Cheers,
>     H.
>
>
>
>
>             sessionInfo()
>
>
>         R version 2.13.0 (2011-04-13)
>         Platform: i386-pc-mingw32/i386 (32-bit)
>
>         locale:
>         [1] LC_COLLATE=English_United States.1252
>         [2] LC_CTYPE=English_United States.1252
>         [3] LC_MONETARY=English_United States.1252
>         [4] LC_NUMERIC=C
>         [5] LC_TIME=English_United States.1252
>
>         attached base packages:
>         [1] stats     graphics  grDevices utils     datasets  methods   base
>
>         other attached packages:
>         [1] BSgenome.Hsapiens.UCSC.hg19_1.__3.17 BSgenome_1.20.0
>         [3] Biostrings_2.20.1                  GenomicRanges_1.4.6
>         [5] IRanges_1.10.4
>
>         loaded via a namespace (and not attached):
>         [1] tools_2.13.0
>
>
>
>         Waiting for your suggestions,
>
>         Thank you,
>
>         Viritha
>
>                 [[alternative HTML version deleted]]
>
>
>
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>         Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list