[BioC] overlap GRangesList and vcf
Valerie Obenchain
vobencha at fhcrc.org
Sat Aug 11 22:10:34 CEST 2012
Hi Georg,
The readVcf() function in the VariantAnnotation package reads data from
a VCF file into a VCF-class object,
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
> vcf
class: VCF
dim: 10376 5
genome: hg19
exptData(1): header
fixed(4): REF ALT QUAL FILTER
info(22): LDAF AVGPOST ... VT SNPSOURCE
geno(3): GT DS GL
rownames(10376): rs7410291 rs147922003 ... rs144055359 rs114526001
rowData values names(1): paramRangeID
colnames(5): HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1): Samples
For details of the class slots and data accessors,
?'VCF-class'
Before counting, make sure the chromosome names are consistent between
the vcf and annotation. Here the annotation precedes chromosome numbers
with 'chr' and the vcf file does not.
> intersect(seqlevels(txdb), seqlevels(vcf))
character(0)
We modify the vcf seqlevels to match the txdb using renameSeqlevels(),
old <- seqlevels(vcf)
new <- paste("chr", old, sep="")
names(new) <- old
vcf <- renameSeqlevels(vcf, new)
> intersect(seqlevels(txdb), seqlevels(vcf))
[1] "chr22"
The rowData slot contains a GRanges of positions which can be overlapped
with your exons by genes,
rd <- rowData(vcf)
You wanted a list of exons hit by variants so we will unlist the
GRangesList. If instead you want the hits grouped by gene, don't
unlist(exbygn).
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
exbygn <- exonsBy(txdb, "gene")
exons <- unlist(exbygn)
Counting can be done with findOverlaps() or summarizeOverlaps(). See the
man pages for details on how the counting is performed.
summarizedOverlaps() is written with read gaps in mind and therefore
requires the 'reads' argument to be a Bam file or GappedAlignements
object. Since you are interested in SNPs for this example we'll use
findOverlaps().
fo <- findOverlaps(exons, rd)
Exons hit by variants are extracted from 'exons' with the 'queryHits'
column,
exonsHit <- exons[queryHits(fo)]
and the corresponding variants are subset from the VCF-class object with
the 'subjectHits' column.
vcfHit <- vcf[subjectHits(fo)]
The positions of the SNPs are in the GRanges in the rowData slot,
rowData(vcfHit)
If you wanted your hits grouped by gene, don't unlist(exbygn) before
counting.
Valerie
On 08/10/12 09:55, Georg Otto wrote:
> Hi,
>
>
> I apologise if this is trivial, but I haven't found a straight forward
> way to do this so far.
>
> Given a GRangesList with exons
>
> ## exons by gene
> range.exon<- exonsBy(txdb.ensgene, "gene")
>
> and a vcf file with SNP data (positions, alleles, etc), how can I
> generate a list of exons that contain SNPs (along with SNP positions)?
>
> Any hint will be appreciated!
>
> Georg
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list