[BioC] read sequences from fasta file starting with > sign and untill next > sign
Martin Morgan
mtmorgan at fhcrc.org
Fri Sep 14 15:52:14 CEST 2012
On 09/14/2012 06:11 AM, Jack [guest] wrote:
>
> Hi:
>
> I am trying to read sequences from a fasta file starting with > till the next > sign:
>
> library(ShortRead)
> setwd("fastafolder");
> con <- file("somefastafile.fa");
> open(con)
> pattern <- as.character("TACC")
> while(length(res <- readLines(con, n=1)))
> {
> #do something
> }
> close(con)
>
> With this while statement I am able to read a single line from the fasta file each time. But I want to read a chunk of links each time from the fasta file starting with > sign and till the next > sign.
>
> Example
>> AAATTT
> TAGGCT
> ATTTGC
>> CGATTT
>
> And I want to read the following in the first run of while loop
>> AAATTT
> TAGGCT
> ATTTGC
>
> Thanks for your help.
Hi Jack -- in R it's not generally useful to operate one record at a
time, so think about reading many records (a million?) in and then
processing as a vector. You could use FaFile in Rsamtools
fl <- system.file(package="BSgenome", "extdata", "ce2dict0.fa")
## create an index, if neceessary
indexFa(fl)
fa <- FaFile(fl)
hdr <- scanFaIndex(fa)
## record-at-a-time
open(fa)
for (i in seq_along(hdr))
scanFa(fa, hdr[i])
close(fa)
## groups, e.g., 10
idx <- seq_along(hdr)
grps <- split(idx, cut(idx, 10, labels=FALSE))
open(fa)
for (grp in grps)
scanFa(fa, hdr[grp])
close(fa)
or role your own
fl <- system.file(package="BSgenome", "extdata", "ce2dict0.fa")
con <- file(fl)
open(con)
buf <- character()
while (length(inpt <- readLines(con, 10))) {
lastidx <- tail(grep("^>", inpt), 1)
complete <- c(buf, inpt[seq(1, lastidx - 1L)])
buf <- inpt[seq(lastidx, length(inpt))]
if (length(complete)) {
## 'complete' contains complete record(s)
}
}
## 'buf' might contain some final, complete, records
close(con)
Martin
>
> Regards:
> Jack
>
>
> -- output of sessionInfo():
>
>> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] CHNOSZ_0.9-7 ShortRead_1.14.4 latticeExtra_0.6-24 RColorBrewer_1.0-5 Rsamtools_1.8.6 lattice_0.20-10 Biostrings_2.24.1
> [8] GenomicRanges_1.8.13 IRanges_1.14.4 BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.1 hwriter_1.3 stats4_2.15.1 tools_2.15.1 zlibbioc_1.2.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list