[BioC] how to deal with fasta "line is too long"

Tue Apr 10 04:52:49 CEST 2012

On 04/09/2012 03:44 PM, Marcus Davy wrote:
> Quick example generating a fasta file that does not contain newlines to
> illustrate;
>
> library(Biostrings)
>
> set.seed(42)
> n<- 20000
> dna<- paste(sample(c("A","T","G","C"), n, replace=TRUE), collapse="")
>
> ## Create a fasta file that does not contain newlines
> write(">test sequence", "test.fasta")
> write(dna, "test.fasta", append=TRUE)
>
> ## n=20,000 bases or above will fail
> try(read.DNAStringSet("test.fasta"))
> Error in .Call2("read_fasta_in_XStringSet", efp_list, nrec, skip,
> use.names,  :
>    reading FASTA file test.fasta: cannot read line 2, line is too long

It would be good to know the original use case; functions are written 
for different purposes, and for instance

   library(Rsamtools)
   fa = FaFile("test.fasta")
   indexFa(fa)
   (param = scanFaIndex(fa))

and finally

   scanFa(fa, param=param)

 > scanFa(fa, param=param)
   A DNAStringSet instance of length 1
     width seq                                               names 

[1] 20000 CCTCGGGAGGTGCTTCCATGCAC...ATTCTGTCTGGCATCACTAGGCC test

One might use scanFa to read (ranges) of long (e.g., genome-scale) fasta 
files, whereas read.DNAStringSet or ShortRead::readFasta are more suited 
for large collections of shorter sequences.

Martin

>
> n<- 20000-1
> dna<- paste(sample(c("A","T","G","C"), n, replace=TRUE), collapse="")
>
> write(">test sequence", "test.fasta")
> write(dna, "test.fasta", append=TRUE)
>
> ## 19999 bases or less will load
> read.DNAStringSet("test.fasta")
>    A DNAStringSet instance of length 1
>      width seq
> names
> [1] 19999 GCTCCTTGGACCGCTCACTGCTC...TTAGATTCACCTTGGCATGAAGT test sequence
>
> Marcus
>
>
>   sessionInfo()
> R version 2.14.1 (2011-12-22)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_New Zealand.1252  LC_CTYPE=English_New
> Zealand.1252
> [3] LC_MONETARY=English_New Zealand.1252
> LC_NUMERIC=C
> [5] LC_TIME=English_New Zealand.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] ShortRead_1.12.4    latticeExtra_0.6-19 RColorBrewer_1.0-5
>   [4] Rsamtools_1.6.3     Biostrings_2.22.0   GenomicRanges_1.6.7
>   [7] IRanges_1.12.6      nlme_3.1-103        NGS_0.9.4
> [10] lattice_0.20-6
>
> loaded via a namespace (and not attached):
>   [1] Biobase_2.14.0     bitops_1.0-4.1     BSgenome_1.22.0
> grid_2.14.1
>   [5] hwriter_1.3        RCurl_1.91-1.1     rtracklayer_1.14.4
> tools_2.14.1
>   [9] XML_3.9-4.1        zlibbioc_1.0.1
>
> On Tue, Apr 10, 2012 at 6:17 AM, Marcus Davy<mdavy86 at gmail.com>  wrote:
>
>> Sounds like you have Fasta files which do not contain newlines, use the
>> linux command 'fold'
>> to fix this.
>>
>> fold [malformedFile]>  [newFile]
>>
>>  From memory, read.DNAStringSet() will fail if the file is larger than
>> 20,000 characters
>> and contains no newline feeds.
>>
>> Marcus
>>
>>
>> On Tue, Apr 10, 2012 at 12:50 AM, wang peter<wng.peter at gmail.com>  wrote:
>>
>>> hi all
>>> sorry to disturb you
>>>
>>> i forgot how to deal with too long fasta sequences ï¼Ÿ
>>> i remembered a person told me to use linux command line?
>>>
>>> thank you in advances
>>>
>>> --
>>> shan gao
>>> Room 231(Dr.Fei lab)
>>> Boyce Thompson Institute
>>> Cornell University
>>> Tower Road, Ithaca, NY 14853-1801
>>> Office phone: 1-607-254-1267(day)
>>> Official email:sg839 at cornell.edu
>>> Facebook:http://www.facebook.com/profile.php?id=100001986532253too long
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>
> 	[[alternative HTML version deleted]]
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793