[BioC] rsamtools scanBcf segfault

Valerie Obenchain vobencha at fhcrc.org
Thu Sep 15 01:40:35 CEST 2011


Hi Rob,

I wanted to point out some new functionality in Rsamtools (in devel) for 
reading vcf files,

scanVcfHeader
scanVcf
unpackVcf

and one function in the VariantAnnotation package (also in devel),

readVcf

scanVcf returns a list of the 9 elements in a VCF file and unpackVcf 
further parses the INFO and GENO fields. readVcf() in VariantAnnotation 
puts the data into a SummarizedExperiment object.

Feedback on these functions is welcome.

Valerie



On 07/28/11 06:36, Martin Morgan wrote:
> On 07/28/2011 12:08 AM, Rob Syme wrote:
>>> Thanks Rob for the nice reproducible example. The problem is when 
>>> you open
>>> the BcfFile; 'mode' should be 'rb' to indicate that you want to read a
>>> binary file. Or leave it unspecified and the heuristic will do the 
>>> right
>>> thing.
>
> Thanks that'll be corrected.
>
>>
>> Thanks Martin, the manual at
>> http://www.bioconductor.org/packages/2.8/bioc/manuals/Rsamtools/man/Rsamtools.pdf 
>>
>> includes a minor error that threw me off. The "Usage" section is
>> correct (and I should have read this more closely), but in the
>> "Arguments" section, the mode argument is described as:
>> "A character(1) vector; mode="rw" indicates a binary (BCF) file,
>> mode="r" a text (VCF) file"
>> when it should probably be
>> "A character(1) vector; mode="rb" indicates a binary (BCF) file,
>> mode="r" a text (VCF) file"
>>
>>
>> I have another error - and I've had a closer read of the manual this 
>> time ;)
>> I have generated a filtered vcf using freebayes, but the following
>> commands throw an error:
>>
>> wget 
>> http://gist.github.com/raw/1111101/f8629dfa20b15893684c78a586f321dbc65c6b6c/subset.vcf
>> grep -v "^#" subset.vcf | cut -f 1 | sort -u>  subset.dict
>> bcftools view -bSD subset.dict subset.vcf>  subset.bcf
>> bcftools index subset.bcf
>> wget 
>> http://gist.github.com/raw/1111101/5684d61541b39fb8c02e765d5f7891d9d122244e/scanBcf_error.R
>> Rscript scanBcf_error.R
>>
>> # This gives the error:
>> Error: scanBcf: failed to find fmt encoded as '21057'
>>    path:<<redacted>>/subset.bcf
>> Execution halted
>
> The problem is that the current implementation isn't really general, 
> it implements the formats supported (computed on, rather than just 
> echoed) by bcftools -- PL, DP, GQ, SP, GT, GL -- whereas vcf is really 
> much more flexible than that. This is cryptically documented on 
> ?scanBcf, but I'd like to make this more flexible and will work on this.
>
>> # Calling scanBcfHeader also gives an error:
>> scanBcfHeader(bcf)
>> Error in mapply(f, ..., SIMPLIFY = FALSE) :
>>    'names' attribute [5] must be the same length as the vector [4]
>
> This has been addressed previously in the devel branch.
>
> Thanks Rob for the reports.
>
> Martin
>
>>
>>
>> I can't see anything wrong with the VCF, am I missing something 
>> obvious again?
>
>



More information about the Bioconductor mailing list