[BioC] Limit on number of sequence files for forging a BSgenome

Thu Mar 28 17:58:21 CET 2013

Kasper,

I see your line of thought: is there a particular FASTA file causing
forgeBSgenomeDataPkg() to break?

The answer is no. Once I reach a certain number of FASTA files, adding one
more contig breaks the function. For instance, taking the first 454
contigs of C. brenneri breaks it, while removing either the last or the
first FASTA file from the list (keeping only 453) compiles without a
problem. Neither the last nor the first FASTA file is responsible for
breaking the function; the number of files is the trigger.

What's even more puzzling is that the number that breaks it is not fixed.
Selecting a random set of contigs or changing genome changes the number
that triggers the failure. However, it's always around 440 files, which
might be because the FASTA files are all of very similar sizes.
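
One way to test both hunches (a single over-long line versus the
cumulative size of the files) is to tabulate, per FASTA file, the line
count, the longest line, and the running total of bytes. Below is a
minimal Python sketch, assuming one contig per file in a single directory
with a .fa suffix. The directory layout, the suffix, and the function
names are illustrative assumptions and not part of BSgenome.

```python
from pathlib import Path


def fasta_stats(path):
    """Return (size_in_bytes, line_count, longest_line_in_bytes) for one file."""
    size = lines = longest = 0
    with open(path, "rb") as handle:
        for raw in handle:
            size += len(raw)
            lines += 1
            # Measure the line without its trailing newline characters.
            longest = max(longest, len(raw.rstrip(b"\r\n")))
    return size, lines, longest


def report(seqs_dir, pattern="*.fa"):
    """Collect per-file stats plus the running byte total, in sorted name order.

    Each row is (file_name, line_count, longest_line, cumulative_bytes).
    """
    rows = []
    running = 0
    for path in sorted(Path(seqs_dir).glob(pattern)):
        size, lines, longest = fasta_stats(path)
        running += size
        rows.append((path.name, lines, longest, running))
    return rows


# Example use (the directory name is hypothetical):
# for row in report("seqs_srcdir"):
#     print(*row, sep="\t")
```

If the failing file count always lands near the same cumulative byte
total across different selections of contigs, that would point at total
size rather than any one file. A file with only 2-3 lines and a very
large longest-line value would instead support the single-line idea.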

Any clues? 

--  Marco Blanchette, Ph.D.
Stowers Institute for Medical Research
1000 East 50th Street
Kansas City MO 64110
www.stowers.org

Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018

On 3/27/13 8:22 PM, "Kasper Daniel Hansen" <kasperdanielhansen at gmail.com>
wrote:

>Marco,
>
>You are probably right in diagnosing the problem, but sometimes I
>think I have seen FASTA files with the entire sequence on a single
>line, instead of (say) 80 nucleotides and then a newline.  I could
>believe that a really long contig on a single line without a newline,
>could cause an error like this. You could quickly check if there is a
>suspicious file by
>  wc -l *
>and look for files with #lines like 2-3.  Somehow 460 seems a weird
>number to fail at.
>
>This may not be your problem, and I am sure Herve will respond in due
>time.
>
>Best,
>Kasper
>
>On Wed, Mar 27, 2013 at 4:28 PM, Blanchette, Marco <MAB at stowers.org>
>wrote:
>> Hi,
>>
>> Is there a maximum number of sequence files (chromosomes or contigs in
>>my case) that can be fed to the forgeBSgenomeDataPkg() function? I am
>>trying to build a BSgenome for C. brenneri and C. japonica available
>>from EnsemblGenomes. These genomes are made from thousands of contigs
>>with genes annotated to them. Currently, I get the error "Error: Line
>>longer than buffer size" when running on the full set of contigs.
>>However, it works fine on a seed file containing a subset of the
>>contigs (I can forge a genome with 450 contigs but not with 460!)
>>
>> Any suggestions will be appreciated (I can provide a toy example but I
>>am not sure what would be the merit of it at this point)
>>
>> Thanks
>>
>> --  Marco Blanchette, Ph.D.
>> Stowers Institute for Medical Research
>> 1000 East 50th Street
>> Kansas City MO 64110
>> www.stowers.org
>>
>> Tel: 816-926-4071
>> Cell: 816-726-8419
>> Fax: 816-926-2018