[BioC] how to deal with fasta "line is too long"
Martin Morgan
mtmorgan at fhcrc.org
Tue Apr 10 04:52:49 CEST 2012
On 04/09/2012 03:44 PM, Marcus Davy wrote:
> Quick example generating a fasta file that does not contain newlines to
> illustrate;
>
> library(Biostrings)
>
> set.seed(42)
> n<- 20000
> dna<- paste(sample(c("A","T","G","C"), n, replace=TRUE), collapse="")
>
> ## Create a fasta file that does not contain newlines
> write(">test sequence", "test.fasta")
> write(dna, "test.fasta", append=TRUE)
>
> ## n=20,000 bases or above will fail
> try(read.DNAStringSet("test.fasta"))
> Error in .Call2("read_fasta_in_XStringSet", efp_list, nrec, skip,
> use.names, :
> reading FASTA file test.fasta: cannot read line 2, line is too long
It would be good to know the original use case; functions are written
for different purposes, and for instance
library(Rsamtools)
fa = FaFile("test.fasta")
indexFa(fa)
(param = scanFaIndex(fa))
and finally
scanFa(fa, param=param)
> scanFa(fa, param=param)
A DNAStringSet instance of length 1
width seq names
[1] 20000 CCTCGGGAGGTGCTTCCATGCAC...ATTCTGTCTGGCATCACTAGGCC test
One might use scanFa to read (ranges) of long (e.g., genome-scale) fasta
files, whereas read.DNAStringSet or ShortRead::readFasta are more suited
for large collections of shorter sequences.
Martin
>
> n<- 20000-1
> dna<- paste(sample(c("A","T","G","C"), n, replace=TRUE), collapse="")
>
> write(">test sequence", "test.fasta")
> write(dna, "test.fasta", append=TRUE)
>
> ## 19999 bases or less will load
> read.DNAStringSet("test.fasta")
> A DNAStringSet instance of length 1
> width seq
> names
> [1] 19999 GCTCCTTGGACCGCTCACTGCTC...TTAGATTCACCTTGGCATGAAGT test sequence
>
> Marcus
>
>
> sessionInfo()
> R version 2.14.1 (2011-12-22)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_New Zealand.1252 LC_CTYPE=English_New
> Zealand.1252
> [3] LC_MONETARY=English_New Zealand.1252
> LC_NUMERIC=C
> [5] LC_TIME=English_New Zealand.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5
> [4] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicRanges_1.6.7
> [7] IRanges_1.12.6 nlme_3.1-103 NGS_0.9.4
> [10] lattice_0.20-6
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0
> grid_2.14.1
> [5] hwriter_1.3 RCurl_1.91-1.1 rtracklayer_1.14.4
> tools_2.14.1
> [9] XML_3.9-4.1 zlibbioc_1.0.1
>
> On Tue, Apr 10, 2012 at 6:17 AM, Marcus Davy<mdavy86 at gmail.com> wrote:
>
>> Sounds like you have Fasta files which do not contain newlines, use the
>> linux command 'fold'
>> to fix this.
>>
>> fold [malformedFile]> [newFile]
>>
>> From memory, read.DNAStringSet() will fail if the file is larger than
>> 20,000 characters
>> and contains no newline feeds.
>>
>> Marcus
>>
>>
>> On Tue, Apr 10, 2012 at 12:50 AM, wang peter<wng.peter at gmail.com> wrote:
>>
>>> hi all
>>> sorry to disturb you
>>>
>>> i forgot how to deal with too long fasta sequences ?
>>> i remembered a person told me to use linux command line?
>>>
>>> thank you in advances
>>>
>>> --
>>> shan gao
>>> Room 231(Dr.Fei lab)
>>> Boyce Thompson Institute
>>> Cornell University
>>> Tower Road, Ithaca, NY 14853-1801
>>> Office phone: 1-607-254-1267(day)
>>> Official email:sg839 at cornell.edu
>>> Facebook:http://www.facebook.com/profile.php?id=100001986532253too long
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>
> [[alternative HTML version deleted]]
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the Bioconductor
mailing list