[BioC] merging VCF files
Valerie Obenchain
vobencha at fhcrc.org
Mon Feb 25 22:27:06 CET 2013
Hi Stephanie,
In the case of same samples, different order I've renamed 'Samples' to
'Samples.*' and kept them all. cbind() also keeps all columns so at
least this is consistent between the two. An error will still be thrown
for all other columns in colData with the same name but different data.
Here is the behavior in 1.5.40.
VCF file 1:
> colData(vcf)
DataFrame with 3 rows and 1 column
Samples
<integer>
NA00001 1
NA00002 2
NA00003 3
VCF file 2 with samples in different order:
> colData(vcf2)
DataFrame with 3 rows and 1 column
Samples
<integer>
NA00003 1
NA00002 2
NA00001 3
Reorder the second file so we can rbind with the first.
> vcf2 <- vcf2[,colnames(vcf)]
> colData(vcf2)
DataFrame with 3 rows and 1 column
Samples
<integer>
NA00001 3
NA00002 2
NA00003 1
The result keeps all columns. The .* extension corresponds to the order
the files were input to rbind().
> res <- rbind(vcf, vcf2)
> colData(res)
DataFrame with 3 rows and 2 columns
Samples.1 Samples.2
<integer> <integer>
1 1 3
2 2 2
3 3 1
I missed the point of reordering by chromosome blocks in your example
below. Was there trouble with the ordering of other slots after rbind()
or ... ?
Thanks,
Valerie
On 02/15/2013 01:56 PM, Stephanie M. Gogarten wrote:
> I don't think that this qualifies as a "problem," but I found that an
> extra step is needed (beyond my initial simple use case) to harmonize
> colData if the colnames of the VCFs have to be re-ordered. From the man
> page for "VCF-class" it is not clear that "colData(x) <- value" is
> possible, but happily it works.
>
> Here is what I did:
>
> svp <- ScanVcfParam(geno="GT")
> vcf1 <- readVcf(vcffile1, "hg19", svp)
> vcf2 <- readVcf(vcffile2, "hg19", svp)
>
> ## vcf1 and vcf2 have the same samples, but in different order
> vcf2 <- vcf2[,colnames(vcf1)]
> vcf <- rbind(vcf1, vcf2)
> ## Error in FUN("Samples"[[1L]], ...) :
> ## column(s) 'Samples' in ‘colData’ are duplicated and the data do not
> match
>
> ## colData is automatically created by readVcf with "Samples" ordered 1:N
> ## after re-ordering columns, have to change it in one of the VCFs to
> use rbind
> colData(vcf2) <- colData(vcf1)
> vcf <- rbind(vcf1, vcf2)
>
> seqnames(vcf)
> ## factor-Rle of length 226660 with 48 runs
> ## Lengths: 19000 7086 11905 9249 3070 ... 2050 1568 1855 824
> 5
> ## Values : 1 10 11 12 13 ... 7 8 9 X
> Y
> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y
>
> ## re-order rows so chromosomes are in blocks
> vcf <- vcf[order(rownames(vcf)),]
> seqnames(vcf)
> ## factor-Rle of length 226660 with 24 runs
> ## Lengths: 23538 8847 14865 11463 3835 ... 10391 8244 9637 4695
> 33
> ## Values : 1 10 11 12 13 ... 7 8 9 X
> Y
> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y
>
> ## we can also order by chromosome and position
> chrom <- as.character(seqnames(vcf))
> chrom[chrom == "X"] <- 23
> chrom[chrom == "Y"] <- 24
> chrom <- as.integer(chrom)
> pos <- start(ranges(rowData(vcf)))
> vcf <- vcf[order(chrom, pos),]
> seqnames(vcf)
> ## factor-Rle of length 226660 with 24 runs
> ## Lengths: 23538 16118 13559 9365 10466 ... 6120 2642 4982 4695
> 33
> ## Values : 1 2 3 4 5 ... 20 21 22 X
> Y
> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y
>
>
> Thanks for the helpful additions!
>
> Stephanie
>
> On 2/5/13 1:43 PM, Valerie Obenchain wrote:
>> rbind and cbind are now implemented for SummarizedExperiment
>> (GenomicRanges 1.11.29) and VCF (VariantAnnotation 1.5.34).
>>
>> rbind is appropriate for the case of different ranges (variants) and the
>> same samples. cbind is appropriate for the same ranges and different
>> samples.
>>
>> Let me know if you run into problems/bugs.
>>
>> Valerie
>>
>>
>> On 01/11/2013 02:22 PM, Stephanie M. Gogarten wrote:
>>> Hi all,
>>>
>>> Does VariantAnnotation currently have a method to merge VCF objects?
>>> I've been looking through the documentation and code and haven't found
>>> anything like this. If not, I think it would be a useful feature to add.
>>>
>>> My use case: I have two VCF files, with the same samples (but in
>>> different order in each file). The two files have non-overlapping
>>> variants. I would love to have an rbind(VCF, VCF) method; then I could
>>> do something like:
>>>
>>> vcf2 <- vcf2[,colnames(vcf1)]
>>> vcf <- rbind(vcf1, vcf2)
>>>
>>> cbind() would also be useful, for combining files with the same variants
>>> but different samples.
>>>
>>> thanks,
>>> Stephanie
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
More information about the Bioconductor
mailing list