[BioC] VCF class: different length when unlisting INFO CompressedCharacterList
Martin Morgan
mtmorgan at fhcrc.org
Tue May 14 19:17:29 CEST 2013
Hi Francesco -- to add to what Valerie said...
On 05/14/2013 08:20 AM, Valerie Obenchain wrote:
> Hi Francesco,
>
> The expand,VCF-method was written for this purpose. Using expand() on a VCF will
I'm not sure, but I think expand,VCF-method was introduced more recently than
the version of R / Bioconductor you have installed; maybe it's time to update?
...
> produce an object that is 'flattened' in the sense that the variant rows are
> repeated to match the unlisted ALT column. expand() will unlist ALT and any INFO
> or FORMAT variables that have one value per alternate allele which is indicated
> by 'Number=A' in the header. For example,
>
> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
>
>
> If you are working with a DataFrame, you can use expand() to specify exactly
> which columns you want 'flattened'.
>
> > DF <- DataFrame(one=IntegerList(1:3, 4, 5),
> two=letters[1:3],
> three=CharacterList("A", c("B", "C"), "D"))
> > expand(DF, colnames="three", keepEmptyRows=FALSE)
> DataFrame with 4 rows and 3 columns
> one two three
> <IntegerList> <character> <character>
> 1 1,2,3 a A
> 2 4 b B
> 3 4 b C
> 4 5 c D
>
> Details and examples are at,
> ?'VCF-class' ## VCF method
> ?'expand' ## DataFrame method
>
> I think this is what you were after ... let me know if this doesn't answer your
> question.
>
> Valerie
>
>
>
> On 05/14/13 01:09, Francesco Lescai wrote:
>> Hi all and Hi Valerie (I suppose),
>> I was extracting a field of the INFO column from a VCF, but when I unlist it I
>> get a different length compared the number of variants, so I don't know
>> anymore which refers to each variant.
>>
>> Here's what I'm doing
>>
>>> vcf
>> class: VCF
>> dim: 50273 30
>> genome: hg19
>> exptData(1): header
>> fixed(4): REF ALT QUAL FILTER
>> info(28): AC AF ... culprit set
>> geno(5): AD DP GQ GT PL
>> rownames(50273):
>> [.. cut for clarity ..]
>>
>> genotypes<-as.data.frame(geno(vcf)$GT)
Hmm, not sure that this is a good idea -- geno(vcf)$GT is a matrix, operations
on a matrix can be much more efficient than on a data.frame, and the matrix
conveys valuable information about data types, in particular that all elements
are the same class.
>> dim(genotypes)
>> [1] 50273 30
>>
>> list.va<-info(vcf)$VA
>>> length(info(vcf)$VA)
>> [1] 50273
>>
>>> list.va
>> CompressedCharacterList of length 50273
>>
>> info.va<-unlist(info(vcf)$VA)
>>> length(info.va)
>> [1] 53391
it's worth understanding why this column is represented as a CharacterList (the
listed form) rather than a simple character vector (the unlisted form).
According to the INFO line from the VCF header Val mentions, some members of
this field have more than one entry, specifically
info(vcf)$VA[elementLengths(info(vcf)$VA != 1]
Here's a simple example
> x = CharacterList(list(letters[1:2], LETTERS[1:3], character(0)))
> x
CharacterList of length 2
[[1]] a b
[[2]] A B C
[[3]] character(0)
> unlist(x)
[1] "a" "b" "A" "B" "C"
I guess you are perhaps hoping for
> sapply(x, paste, collapse=",")
[1] "a,b" "A,B,C" ""
but in some ways this isn't so convenient as it first appears -- now you'll have
to split() these back out to work on them, and you've translated 'no data'
(character(0)) to 'no characters' (""). One pattern that simplifies working with
these CharacterList and similar objects is that they can be unlist'ed and
relist'ed cheaply, so long as the total number of elements and their order don't
change.
> y = toupper(unlist(x))
> relist(y, x)
CharacterList of length 2
[[1]] A B
[[2]] A B C
[[3]] character(0)
(actually, toupper(x) works without unlisting so this isn't a perfect example ...)
Martin
>>
>> This is an annotation from Variant Annotation Tool, which modifies the VCF Info.
>> But if I do the same for other more "standard" fields, some of them have the
>> same length of the variants, others don't when unlisted
>>
>>> length(unlist(info(vcf)$HaplotypeScore))
>> [1] 50273
>>> length(unlist(info(vcf)$AC))
>> [1] 50489
>>> length(unlist(info(vcf)$AF))
>> [1] 50489
>>
>> am I doing something wrong? or is the unlist method on the
>> CompressedCharacterList splitting on some field delimiter?
>>
>> below my sessionInfo.
>> thanks for any help you might provide,
>> cheers,
>> Francesco
>>
>>
>>> sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] reshape_0.8.4 plyr_1.8
>> ggbio_1.6.6 ggplot2_0.9.3.1 VariantAnnotation_1.4.12
>> Rsamtools_1.10.2
>> [7] Biostrings_2.26.3 GenomicRanges_1.10.7
>> IRanges_1.16.6 BiocGenerics_0.4.0
>>
>> loaded via a namespace (and not attached):
>> [1] AnnotationDbi_1.20.7 Biobase_2.18.0 biomaRt_2.14.0
>> biovizBase_1.6.2 bitops_1.0-4.2 BSgenome_1.26.1
>> cluster_1.14.4
>> [8] colorspace_1.2-1 DBI_0.2-5 dichromat_2.0-0
>> digest_0.6.3 GenomicFeatures_1.10.2 grid_2.15.1
>> gridExtra_0.9.1
>> [15] gtable_0.1.2 Hmisc_3.10-1 labeling_0.1
>> lattice_0.20-15 MASS_7.3-23 munsell_0.4
>> parallel_2.15.1
>> [22] proto_0.3-10 RColorBrewer_1.0-5 RCurl_1.95-4.1
>> reshape2_1.2.2 RSQLite_0.11.2 rtracklayer_1.18.2 scales_0.2.3
>> [29] stats4_2.15.1 stringr_0.6.2 tools_2.15.1
>> XML_3.96-1.1 zlibbioc_1.4.0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list