[BioC] Extarcting large number of sequences using BSgenome package
Hervé Pagès
hpages at fhcrc.org
Thu Aug 18 16:30:30 CEST 2011
Viritha,
On 11-08-16 10:54 AM, viritha kaza wrote:
> Hi Herve,
> Thanks for your response. System is considering as characters in the
> dataframe.So I was thinking of using them as vectors.But if I scan as
> seperate elements and convert them to numeric using lapply.It becomes
> atomic numeric.How do I make them as vectors without using combining
> fuction as c(1,2,3)?I am basically new to sequences using R.So will read
> write.XStringSet().
Your question is particularly unclear and it seems like you would
benefit a lot from reading a little bit about data frames and
read.table(). There are a lot of arguments to this function that
give you a lot of control on how the data will actually be loaded
into the data frame. In particular look at the 'colClasses' arg:
it lets you control the types of the columns of the data frame.
That will help you solve the first step which is loading the
file into a data frame. I don't think you need any
scan/lappl/combining for this step.
Once you've solved this, step 2 will be to use getSeq() to extract
your sequences and step 3 will be to write your sequences to a FASTA
file with write.XStringSet().
Right now, it's almost impossible to know where you are encountering
difficulties. Please have a look at our posting guide before your
next post:
http://bioconductor.org/help/mailing-list/posting-guide/
Thanks!
H.
> Thanks,
> Viritha
> 2011/8/16 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
> Hi Viritha,
>
>
> On 11-08-16 08:01 AM, viritha kaza wrote:
>
> Hi group,
> I am interested in retrieving about 2000 sequences with the specific
> chromosome number,start and end site.
> I was thinking of using BSgenome package for this.
>
>
> source("http://bioconductor.__org/biocLite.R
> <http://bioconductor.org/biocLite.R>")
>
>
> biocLite("BSgenome","BSgenome.__Hsapiens.UCSC.hg19")
>
>
> library(BSgenome)
>
>
> library("BSgenome.Hsapiens.__UCSC.hg19")
>
>
> myseq<- getSeq(Hsapiens,"chr2",start=__10000,end=10020)
>
>
> # this would work for specific values.
>
> #So I tried to use a dataframe where it would retrieve
> chromosome number
> from by using
>
> full<-as.matrix(read.table(“__test_seq.txt”,sep=’\t’,quote=”__”,header=True,
>
> as.is <http://as.is/>=TRUE))
>
>
> Why are you doing as.matrix here? Do you *really* want all the columns
> to be coerced to the same type? (which in your case is probably
> character because you have already one column of that type).
>
>
>
> full.df<-data.frame(full)
>
>
> So you're back to a data.frame, except that now all the columns are
> character vectors :-/
>
>
>
> # test_seq contains this information
>
> #chromosome Start End
> #chr2 10000 10020
> #chr3 10000 10020
>
>
> Why not copy and paste what you get when you just type full.df? That
> way would know exactly what you ended up with (if full.df is too big,
> just show us the output of head(full.df)).
> Anyway, it looks like you'll have to get rid of those #. Otherwise
> passing Hsapiens,full.df$chromosome to getSeq() with those # in it
> will surely cause problems.
>
>
>
> myseq<-
> getSeq(Hsapiens,full.df$__chromosome,start=10000,end=__10020)
>
>
> #but then when I use start=full.df$Start. It naturally throws an
> error
> saying 'start' must be a vector of integers
>
> Questions:
>
> How Do I handle this?
>
>
> By having full.df$Start be a vector of integers and also by reading
> the man page for getSeq().
>
>
>
> Does start here mean that each chromosome numbering starts from 1?
>
>
> Yes, the first base at the 5' end of the + strand (or 3' end of the -
> strand) is considered to be position 1.
>
>
>
> How do I split each sequence retrieved and create as fasta
> format (>)with
> sequence name attached to them retrieved from my input file?
>
>
> How do you want to "split" each sequence retrieved? As a good starting
> point you could have a look at write.XStringSet(). Note that
> the names you put on your DNAStringSet object will be used as the
> record description lines in the resulting FASTA file.
>
> Cheers,
> H.
>
>
>
>
> sessionInfo()
>
>
> R version 2.13.0 (2011-04-13)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] BSgenome.Hsapiens.UCSC.hg19_1.__3.17 BSgenome_1.20.0
> [3] Biostrings_2.20.1 GenomicRanges_1.4.6
> [5] IRanges_1.10.4
>
> loaded via a namespace (and not attached):
> [1] tools_2.13.0
>
>
>
> Waiting for your suggestions,
>
> Thank you,
>
> Viritha
>
> [[alternative HTML version deleted]]
>
>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list