[BioC] Complementing alleles in a VCF
Alex Gutteridge
alexg at ruggedtextile.com
Thu Apr 19 11:39:12 CEST 2012
Hi All,
I'm having some difficulty handling VCFs and feeding 1000 Genomes
VCFs into predictCoding() from the VariantAnnotation package. Can anyone
help?
When I read 1000 Genomes VCFs, the SNPs come through with no strand
specified (strand is "*"), e.g. (where vcf.file is a 1000 Genomes VCF
file and param is a GRanges):
> vcf = readVcf(vcf.file, "hg19", param)
> fixed(vcf)
GRanges with 2610 ranges and 5 elementMetadata cols:
seqnames ranges strand | paramRangeID
<Rle> <IRanges> <Rle> | <factor>
rs186838828 chr2 [167051756, 167051756] * | SCN9A
rs191667986 chr2 [167051865, 167051865] * | SCN9A
rs139483482 chr2 [167051900, 167051900] * | SCN9A
rs182687583 chr2 [167052080, 167052080] * | SCN9A
rs115766730 chr2 [167052144, 167052144] * | SCN9A
rs186613025 chr2 [167052168, 167052168] * | SCN9A
rs73017538 chr2 [167052328, 167052328] * | SCN9A
rs114327563 chr2 [167052375, 167052375] * | SCN9A
rs191401619 chr2 [167052418, 167052418] * | SCN9A
... ... ... ... ... ...
rs73025590 chr2 [167231750, 167231750] * | SCN9A
rs141453198 chr2 [167231812, 167231812] * | SCN9A
rs16852069 chr2 [167231890, 167231890] * | SCN9A
rs181276399 chr2 [167231932, 167231932] * | SCN9A
rs185839773 chr2 [167232251, 167232251] * | SCN9A
rs191091185 chr2 [167232439, 167232439] * | SCN9A
rs148362057 chr2 [167232446, 167232446] * | SCN9A
rs141521157 chr2 [167232450, 167232450] * | SCN9A
rs1881440 chr2 [167232463, 167232463] * | SCN9A
REF ALT QUAL FILTER
<DNAStringSet> <DNAStringSetList> <numeric> <character>
rs186838828 T ######## 100 PASS
rs191667986 T ######## 100 PASS
rs139483482 T ######## 100 PASS
rs182687583 G ######## 100 PASS
rs115766730 G ######## 100 PASS
rs186613025 A ######## 100 PASS
rs73017538 A ######## 100 PASS
rs114327563 A ######## 100 PASS
rs191401619 C ######## 100 PASS
... ... ... ... ...
rs73025590 A ######## 100 PASS
rs141453198 C ######## 100 PASS
rs16852069 A ######## 100 PASS
rs181276399 G ######## 100 PASS
rs185839773 A ######## 100 PASS
rs191091185 C ######## 100 PASS
rs148362057 A ######## 100 PASS
rs141521157 A ######## 100 PASS
rs1881440 C ######## 100 PASS
---
seqlengths:
chr2
NA
When I feed these to predictCoding(), genes on the complement (minus)
strand are not handled correctly: the ALT alleles given in the VCF are
not complemented first, so the variant codons are translated incorrectly.
> txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
> aa = predictCoding(vcf.filt, txdb, Hsapiens)
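To make the intended transformation concrete, here is a minimal base R sketch of complementing only those alleles that fall in minus-strand genes. The vectors alt_plus and gene_strand and the helper complement_bases() are hypothetical illustrations, not part of VariantAnnotation, and this is not a description of what predictCoding() does internally. It covers single-base alleles only; multi-base alleles would need a reverse complement.

```r
# Hypothetical example data: alleles as reported on the plus strand,
# and the strand of the overlapping gene (names are illustrative only).
alt_plus    <- c("A", "C", "G", "T")
gene_strand <- c("+", "-", "-", "+")

# Complement a character vector of DNA bases using base R only.
complement_bases <- function(x) chartr("ACGT", "TGCA", x)

# Complement only the alleles that fall in minus-strand genes.
alt_coding <- ifelse(gene_strand == "-", complement_bases(alt_plus), alt_plus)
print(alt_coding)
```

With real data the same idea would apply to a DNAStringSet via Biostrings' complement(), as attempted later in this post.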
It seems like predictCoding() should be able to deal with genes on the
complement strand, but maybe it requires the strand to be set correctly
in the VCF? I couldn't find a way to set the strand directly. I also
tried to edit the VCF, if nothing else just to complement the ALT
alleles before running predictCoding(), but that fails with the error
shown further down, even though the manual seems to suggest it is
possible:
‘alt(x)’, ‘alt(x) <- value’: Returns or sets the alternate allele data
from the ALT column of the VCF file. ‘value’ can be a
‘DNAStringSet’ or a ‘CharacterList’ (for a structural VCF
file).
> tmp.alt = complement(unlist(values(alt(vcf))[["ALT"]]))
> tmp.alt
A DNAStringSet instance of length 2610
width seq
[1] 1 T
[2] 1 G
[3] 1 G
[4] 1 T
[5] 1 T
[6] 1 C
[7] 1 C
[8] 1 C
[9] 1 C
... ... ...
[2602] 1 A
[2603] 1 T
[2604] 1 C
[2605] 1 T
[2606] 1 C
[2607] 1 C
[2608] 1 C
[2609] 1 C
[2610] 1 T
> alt(vcf) = tmp.alt
Error in function (classes, fdef, mtable) :
unable to find an inherited method for function "alt<-", for
signature "VCF", "DNAStringSet"
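One thing worth noting: the fixed(vcf) output above displays the ALT column as <DNAStringSetList>, whereas the value passed to the setter is a flat DNAStringSet produced by unlist(). Whether that class difference is the cause of the error is not confirmed here. The base R sketch below only illustrates how a list-shaped set of alleles can be complemented without flattening it; alt_list is hypothetical stand-in data, and plain lists and character vectors stand in for the Biostrings classes.

```r
# Hypothetical ALT data: one element per variant, possibly several alleles each.
alt_list <- list(rs1 = "A", rs2 = c("C", "T"), rs3 = "G")

# Complement each element separately so the list shape and names are preserved.
alt_comp <- lapply(alt_list, function(a) chartr("ACGT", "TGCA", a))
str(alt_comp)

# By contrast, unlist() flattens the structure: 4 alleles for 3 variants here.
length(unlist(alt_list))
```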
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RColorBrewer_1.0-5
[2] caTools_1.12
[3] bitops_1.0-4.1
[4] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1
[5] GenomicFeatures_1.8.1
[6] org.Hs.eg.db_2.7.1
[7] RSQLite_0.11.1
[8] DBI_0.2-5
[9] AnnotationDbi_1.18.0
[10] Biobase_2.16.0
[11] VariantAnnotation_1.2.5
[12] Rsamtools_1.8.3
[13] BSgenome.Hsapiens.UCSC.hg19_1.3.17
[14] BSgenome_1.24.0
[15] Biostrings_2.24.1
[16] GenomicRanges_1.8.3
[17] IRanges_1.14.2
[18] BiocGenerics_0.2.0
loaded via a namespace (and not attached):
[1] biomaRt_2.12.0 grid_2.15.0 lattice_0.20-6 Matrix_1.0-6
[5] RCurl_1.91-1 rtracklayer_1.16.1 snpStats_1.6.0 splines_2.15.0
[9] stats4_2.15.0 survival_2.36-12 tools_2.15.0 XML_3.9-4
[13] zlibbioc_1.2.0
--
Alex Gutteridge