[BioC] VCF class: different length when unlisting INFO CompressedCharacterList
Valerie Obenchain
vobencha at fhcrc.org
Tue May 14 17:20:40 CEST 2013
Hi Francesco,
unlist() returns more values than there are variants because a list
element can hold several values for one variant; AC and AF, for
instance, hold one value per alternate allele.
The expand,VCF-method was written for this purpose. Using expand() on a
VCF produces an object that is 'flattened' in the sense that the
variant rows are repeated to match the unlisted ALT column. expand()
unlists ALT and any INFO or FORMAT variables that have one value per
alternate allele, which is indicated by 'Number=A' in the header. For
example,
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
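A minimal sketch of the VCF method, assuming the small example file
'ex2.vcf' that ships in the VariantAnnotation 'extdata' directory and an
"hg19" genome label (both are illustrative choices, not taken from your
data). The requireNamespace() guard is only there so the snippet does
nothing when the package is not installed.

```r
# Sketch only: compare row counts before and after expand() on a VCF.
# Assumes the VariantAnnotation package and its bundled ex2.vcf file.
if (requireNamespace("VariantAnnotation", quietly = TRUE)) {
  library(VariantAnnotation)

  fl  <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
  vcf <- readVcf(fl, "hg19")

  # One row per variant; ALT may hold several alleles per row.
  print(dim(vcf))

  # One row per (variant, alternate allele) pair after flattening.
  vcf_flat <- expand(vcf)
  print(dim(vcf_flat))
}
```

After expand() each row of the flattened object corresponds to a single
alternate allele, so per-allele fields line up with the rows again.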
If you are working with a DataFrame, you can use expand() to specify
exactly which columns you want 'flattened'.
> DF <- DataFrame(one=IntegerList(1:3, 4, 5),
+                 two=letters[1:3],
+                 three=CharacterList("A", c("B", "C"), "D"))
> expand(DF, colnames="three", keepEmptyRows=FALSE)
DataFrame with 4 rows and 3 columns
            one         two       three
  <IntegerList> <character> <character>
1         1,2,3           a           A
2             4           b           B
3             4           b           C
4             5           c           D
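If you only need to know which variant each unlisted value came from,
you can build an index from the per-element lengths. The sketch below
uses a plain base R list as a toy stand-in for info(vcf)$VA (the values
are made up). On a CharacterList the same idea should apply, with the
per-element counts coming from elementNROWS() in recent releases
(elementLengths() in the 2013 releases).

```r
# Toy stand-in for info(vcf)$VA: one list element per variant,
# some holding several values and one holding none (hypothetical data).
va <- list("a", c("b", "c"), character(0), "d")

# Flattening loses the grouping ...
flat <- unlist(va)

# ... so rebuild it: repeat each variant index once per value it holds.
variant <- rep(seq_along(va), lengths(va))

# Each unlisted value now sits next to the variant it belongs to.
res <- data.frame(variant = variant, value = flat)
print(res)
```

Variants with no value simply do not appear in the result, which is the
same behaviour as keepEmptyRows=FALSE in the DataFrame example above.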
Details and examples are in the man pages:
?'VCF-class' ## VCF method
?'expand' ## DataFrame method
I think this is what you were after ... let me know if this doesn't
answer your question.
Valerie
On 05/14/13 01:09, Francesco Lescai wrote:
> Hi all and Hi Valerie (I suppose),
> I was extracting a field of the INFO column from a VCF, but when I unlist it I get a different length compared to the number of variants, so I no longer know which value refers to which variant.
>
> Here's what I'm doing
>
>> vcf
> class: VCF
> dim: 50273 30
> genome: hg19
> exptData(1): header
> fixed(4): REF ALT QUAL FILTER
> info(28): AC AF ... culprit set
> geno(5): AD DP GQ GT PL
> rownames(50273):
> [.. cut for clarity ..]
>
> genotypes<-as.data.frame(geno(vcf)$GT)
> dim(genotypes)
> [1] 50273 30
>
> list.va<-info(vcf)$VA
>> length(info(vcf)$VA)
> [1] 50273
>
>> list.va
> CompressedCharacterList of length 50273
>
> info.va<-unlist(info(vcf)$VA)
>> length(info.va)
> [1] 53391
>
> This is an annotation from Variant Annotation Tool, which modifies the VCF Info.
> But if I do the same for other more "standard" fields, some of them have the same length as the number of variants when unlisted, while others don't
>
>> length(unlist(info(vcf)$HaplotypeScore))
> [1] 50273
>> length(unlist(info(vcf)$AC))
> [1] 50489
>> length(unlist(info(vcf)$AF))
> [1] 50489
>
> am I doing something wrong? or is the unlist method on the CompressedCharacterList splitting on some field delimiter?
>
> below my sessionInfo.
> thanks for any help you might provide,
> cheers,
> Francesco
>
>
>> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] reshape_0.8.4 plyr_1.8 ggbio_1.6.6 ggplot2_0.9.3.1 VariantAnnotation_1.4.12 Rsamtools_1.10.2
> [7] Biostrings_2.26.3 GenomicRanges_1.10.7 IRanges_1.16.6 BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] AnnotationDbi_1.20.7 Biobase_2.18.0 biomaRt_2.14.0 biovizBase_1.6.2 bitops_1.0-4.2 BSgenome_1.26.1 cluster_1.14.4
> [8] colorspace_1.2-1 DBI_0.2-5 dichromat_2.0-0 digest_0.6.3 GenomicFeatures_1.10.2 grid_2.15.1 gridExtra_0.9.1
> [15] gtable_0.1.2 Hmisc_3.10-1 labeling_0.1 lattice_0.20-15 MASS_7.3-23 munsell_0.4 parallel_2.15.1
> [22] proto_0.3-10 RColorBrewer_1.0-5 RCurl_1.95-4.1 reshape2_1.2.2 RSQLite_0.11.2 rtracklayer_1.18.2 scales_0.2.3
> [29] stats4_2.15.1 stringr_0.6.2 tools_2.15.1 XML_3.96-1.1 zlibbioc_1.4.0