[BioC] VCF class: different length when unlisting INFO CompressedCharacterList
Valerie Obenchain
vobencha at fhcrc.org
Tue May 27 23:06:30 CEST 2014
Hi,
VariantAnnotation doesn't have a function that expands the data 'by
sample' as you've described but there are several helpers that can make
the process more straight forward.
(1) filterVcf()
This function filters records from a vcf and outputs another vcf on
disk. You can use this to create a subset file of records with
FILTER=PASS. The function can't be used to remove records with GT='.'
because of the structure of the vcf file. The filtering removes variants
(rows) that don't meet the criteria. In the case of the GT filtering you
are removing select variants per sample so this is better done at a
later stage.
Filtering by the presence/absence of a value can be done with a pre-filter:
## sample data
fl <- system.file("extdata", "chr7-sub.vcf.gz",
package="VariantAnnotation")
## pre-filter for 'PASS':
pass_filter = function(x) grepl("PASS", x, fixed=TRUE)
PF <- FilterRules(list(pass_filter))
## here the filtered vcf is sent to a temp file:
filtered_vcf <- filterVcf(fl, "hg19", tempfile(), prefilters=PF)
(2) ScanVcfParam
You may have already discovered that a ScanVcfParam object specifies
subsets of data to be read in. Using a 'param' can speed up readVcf()
and makes for a lighter VCF object in R.
## read in specific genotype (FORMAT) fields and no INFO data:
param <- ScanVcfParam(info=NA, geno=c("GT", "CGA_OAD", "CGA_ORDP"))
vcf <- readVcf(filtered_vcf, "hg19", param=param)
>> geno(vcf)
> List of length 3
> names(3): GT CGA_OAD CGA_ORDP
The file has 2 samples:
>> geno(vcf)$GT[1:4,]
> HCC1187-H-200-37-ASM-N1 HCC1187-H-200-37-ASM-T1
> 7:55003988_A/G "1/1" "1/1"
> 7:55010562_GA/G "1/1" "1/1"
> 7:55012097_T/C "1/0" "1/0"
> 7:55015505_C/T "1/0" "1/0"
More below.
On 05/26/2014 06:15 AM, Sigve Nakken wrote:
> Hi,
>
> I’ve had similar challenges as Francesco, and have unsuccessfully tried to use the data structures and functions provided by VariantAnnotation. My experience is that I need the ‘expand' functionality more often with respect to the samples rather than the annotation tags. And as my experience with R is somewhat limited, I tend to like to work with simple data.frames when I want to summarise and characterise the data.
>
> Here is how I generated a simple data frame with all sample variants (variants and samples are dummy encoded):
>
> ## read VCF (100 samples)
>> all_vcf <- readVcf(‘cancer.exome.project.vcf.gz','hg19')
>> seqinfo(all_vcf) <- seqinfo(BSgenome.Hsapiens.UCSC.hg19)
>> rowData(all_vcf)
> GRanges with 50383 ranges and 5 metadata columns:
> …
>
> ## get variants with ‘PASS’ filter
>> all_vcf_PASS <- all_vcf[str_detect(elementMetadata(all_vcf)$FILTER,"PASS"),]
>> rowData(all_vcf_PASS)
> GRanges with 4252 ranges and 5 metadata columns:
> ...
>
>
> ## get genotype information for all samples that have called genotypes (GT != ‘.’)
> ## From my VCF:
> ## FORMAT=<ID=ADT,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed (tumor)">
> ## FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
> ## FORMAT=<ID=ADC,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed (control)">
> ##
>> tumor_ref_support <- t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[1,])
>> normal_ref_support <- t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[1,])
>> tumor_alt_support <- t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[2,])
>> normal_alt_support <- t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[2,])
I don't think it's necessary (or efficient) to subset the data on GT =
'.'. The ADC and ADT values should be NA if GT was '.' and the
subsequent math operations (i.e., '+') will produce NAs but no errors.
If you really want to remove them you can use is.na(geno(all_vcf_PASS)$ADT.
>
> ## get sample ids and variant ids for the ‘expanded set of variants'
>> b <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$col
>> c <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$row
>> sample_ids <- as.data.frame(rownames(colData(all_vcf_PASS))[b])
>> variant_ids <- as.data.frame(rownames(geno(all_vcf_PASS)$GT)[c])
>
>
> ## make simple data.frame of all samples with genotype information
>> row.names(tumor_ref_support) <- NULL
>> row.names(normal_ref_support) <- NULL
>> row.names(tumor_alt_support) <- NULL
>> row.names(external_passed) <- NULL
>> row.names(normal_alt_support) <- NULL
>
>> tmp <- as.data.frame(cbind(variant_ids,sample_ids,tumor_ref_support,tumor_alt_support,normal_ref_support,normal_alt_support))
>> colnames(tmp) <- c('variant_id','sample_id','tumor_ref_support','tumor_alt_support','normal_ref_support','normal_alt_support')
>> tmp$allele_frac_tumor <- rep(0,nrow(tmp))
>> tmp$tumor_depth <- tmp$tumor_ref_support + tmp$tumor_alt_support
>> tmp$normal_depth <- tmp$normal_ref_support + tmp$normal_alt_support
>> tmp[tmp$tumor_depth > 0,]$allele_frac_tumor <- round(tmp[tmp$tumor_depth > 0,]$tumor_alt_support / tmp[tmp$tumor_depth > 0,]$tumor_depth,2)
Because the geno variables are stored in matrices you can add them 'as
is'. Underneath arrays are just vectors with a dimension attached. This
operation on my sample variable doesn't make biological sense but it's
the idea that you can just add the arrays of 'tumor_ref_support' and
'tumor_alt_support'.
## bogus new variable
new_var <- geno(vcf)$CGA_OAD + geno(vcf)$CGA_OAD
The result is another array with the same dimensions:
>> dim(geno(vcf)$CGA_OAD)
> [1] 1184 2 2
>> dim(new_var)
> [1] 1184 2 2
Isolate the samples to separate array elements by rearranging the array
with aperm().
ss <- aperm(new_var, c(1, 3, 2))
array_dims <- dim(ss)
>> array_dims
> [1] 1184 2 2
Create vectors of variant and sample names:
variant_id <- rep.int(dimnames(vcf)[[1]], array_dims[2] * array_dims[3])
sample_id <- rep(dimnames(vcf)[[2]], each=array_dims[1] * array_dims[2])
Create the data.frame:
df <- data.frame(variant_id, sample_id, new_var=as.vector(ss))
>> head(df)
> variant_id sample_id new_var
> 1 7:55003988_A/G HCC1187-H-200-37-ASM-N1 168
> 2 7:55010562_GA/G HCC1187-H-200-37-ASM-N1 100
> 3 7:55012097_T/C HCC1187-H-200-37-ASM-N1 64
> 4 7:55015505_C/T HCC1187-H-200-37-ASM-N1 38
> 5 7:55015541_A/. HCC1187-H-200-37-ASM-N1 NA
> 6 7:55015699_C/T HCC1187-H-200-37-ASM-N1 20
You might also take a look at the VRanges class. It's a flat/light form
of vcf data necessary for variant calling and filtering. It doesn't
expand the genotype data as per this example but it may be useful to you
in other applications. See ?VRanges for details.
Valerie
>
> ##
>> head(tmp)
> variant_id sample_id tumor_ref_support tumor_alt_support normal_ref_support normal_alt_support allele_frac_tumor tumor_depth normal_depth
> 1 chr5_10000000_T_C 001B_001T 17 7 21 0 0.29 24 21
> 2 chr11_10000000_C_T 001B_001T 31 4 33 0 0.11 35 33
> 3 chr18_10000000_C_T 001B_001T 21 7 37 1 0.25 28 38
> 4 chrY_1000000_A_G 001B_001T 10 2 11 0 0.17 12 11
> 5 chr1_1000000_G_C 011B_011T 17 3 48 0 0.15 20 48
> 6 chr1_1000000_A_G 011B_011T 77 13 114 0 0.14 90 114
>
>> str(tmp)
> 'data.frame': 4418 obs. of 9 variables:
> $ variant_id : Factor w/ 4252 levels "chr1_1000000_G_T",..: 3254 620 1644 4251 359 381 391 401 246 1875 ...
> $ sample_id : Factor w/ 99 levels "001B_001T","011B_011T",..: 1 1 1 1 2 2 2 2 2 2 ...
> $ tumor_ref_support : int 17 31 21 10 17 77 12 40 75 53 ...
> $ tumor_alt_support : int 7 4 7 2 3 13 3 6 8 9 ...
> $ normal_ref_support: int 21 33 37 11 48 114 19 52 88 89 ...
> $ normal_alt_support: int 0 0 1 0 0 0 0 0 0 0 ...
> $ allele_frac_tumor : num 0.29 0.11 0.25 0.17 0.15 0.14 0.2 0.13 0.1 0.15 ...
> $ tumor_depth : int 24 35 28 12 20 90 15 46 83 62 ...
> $ normal_depth : int 21 33 38 11 48 114 19 52 88 89 ...
>
> Next, I plan to do a merge with my functional annotations (info) using the variant_id as the key, which I think would be straightforward.
>
> If there is a more convenient way to get here using the VariantAnnotation package, I would be grateful to hear about this.
>
> ---
> Sigve Nakken, PhD
> Postdoctoral Fellow, Dept. of Tumor Biology
> Institute for Cancer Research
> Oslo University Hospital, Norway
> phone: +4795753022
> email: sigven at ifi.uio.no
>
>
>
>
>
>
> [[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109
Email: vobencha at fhcrc.org
Phone: (206) 667-3158
More information about the Bioconductor
mailing list