[BioC] mutant allele read counts

Fri Jun 13 21:29:09 CEST 2014

Hi,

Use readVcfAsVRanges() then coerce to a data.frame.

library(VariantAnnotation)

## example VCF shipped with the package
fl <- system.file("extdata", "chr7-sub.vcf.gz", package="VariantAnnotation")
vr <- readVcfAsVRanges(fl, "hg19")
df <- as.data.frame(vr)

The data.frame will have some extra columns, but you can remove or
rename them as necessary.
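
As a rough sketch of that remove/rename step, something like the following could reshape the data.frame into the seven-column layout from your post. It uses a small hand-built data.frame in place of as.data.frame(vr) so it runs without any package, and it assumes the coerced columns are named seqnames, start, strand, ref, alt and sampleNames; check names(df) on your own data. to_cravat() is a hypothetical helper, not a VariantAnnotation function, and mapping the unstranded "*" to "+" is an assumption based on VCF alleles being given on the forward strand.

```r
# Toy stand-in for as.data.frame(vr); the column names are assumed to match
# what the VRanges coercion produces.
df <- data.frame(
  seqnames    = c("chr17", "chr7"),
  start       = c(7577506L, 140453136L),
  strand      = c("*", "*"),
  ref         = c("G", "T"),
  alt         = c("T", "A"),
  sampleNames = c("TCGA-02-0231", "TCGA-02-0023"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the columns needed and rename them.
to_cravat <- function(df, prefix = "TR") {
  strand <- as.character(df$strand)
  # Assumption: VCF alleles are on the forward strand, so "*" becomes "+".
  strand[strand == "*"] <- "+"
  data.frame(
    UID      = paste0(prefix, seq_len(nrow(df))),
    Chr      = as.character(df$seqnames),
    Position = df$start,
    Strand   = strand,
    Ref      = as.character(df$ref),
    Alt      = as.character(df$alt),
    SampleID = as.character(df$sampleNames),
    stringsAsFactors = FALSE
  )
}

cravat <- to_cravat(df)

# Write a tab-delimited file without header, quotes, or row names.
out <- file.path(tempdir(), "cravat_input.txt")
write.table(cravat, file = out, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

The UID prefix and the output file name are placeholders; adjust them to whatever your submission needs.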

Valerie

On 06/13/2014 10:46 AM, Murli [guest] wrote:
> Hi,
> I am interested in extracting information for functional annotation using CRAVAT. It requires the data to be in the following format.
> ===========================================
> # UID / Chr. / Position / Strand / Ref. base / Alt. base / Sample ID (optional)
> TR1	chr17	7577506	-	G	T	TCGA-02-0231
> TR2	chr10	123279680	-	G	A	TCGA-02-3512
> TR3	chr13	49033967	+	C	A	TCGA-02-3532
> TR4	chr7	116417505	+	G	T	TCGA-02-1523
> TR5	chr7	140453136	-	T	A	TCGA-02-0023
> TR6	chr17	37880998	+	G	T	TCGA-02-0252
> Ins1	chr17	37880998	+	G	GT	TCGA-02-0252
> Del1	chr17	37880998	+	GA	G	TCGA-02-0252
> CSub1	chr2	39871235	+	ATGCT	GA	TCGA-02-0252
>
> ===============================================
> http://www.cravat.us/help.jsp?chapter=how_to_cite&article=#
>
> I am trying to extract this information from VCF files generated by MuTect. I am using VariantAnnotation to extract this information. I have read the file using readVcf() and renamed the chromosomes according to the TxDb.
>
> rowData(newVcfData)
> GRanges with 62991 ranges and 5 metadata columns:
>                    seqnames                 ranges strand   | paramRangeID
>                       <Rle>              <IRanges>  <Rle>   |     <factor>
>       1:109641_A/G     chr1       [109641, 109641]      *   |         <NA>
>       1:526561_T/G     chr1       [526561, 526561]      *   |         <NA>
>       1:691958_G/A     chr1       [691958, 691958]      *   |         <NA>
>       1:763781_A/T     chr1       [763781, 763781]      *   |         <NA>
>          rs6594026     chr1       [782981, 782981]      *   |         <NA>
>                ...      ...                    ...    ... ...          ...
>           rs480725     chrX [154903224, 154903224]      *   |         <NA>
>    X:154925893_C/T     chrX [154925893, 154925893]      *   |         <NA>
>    X:155038107_C/G     chrX [155038107, 155038107]      *   |         <NA>
>    X:155204257_G/T     chrX [155204257, 155204257]      *   |         <NA>
>    X:155234730_T/C     chrX [155234730, 155234730]      *   |         <NA>
>                               REF                ALT      QUAL      FILTER
>                    <DNAStringSet> <DNAStringSetList> <numeric> <character>
>       1:109641_A/G              A                  G      8.90        PASS
>       1:526561_T/G              T                  G      9.19        PASS
>       1:691958_G/A              G                  A     13.74        PASS
>       1:763781_A/T              A                  T     16.03        PASS
>          rs6594026              C                  T     11.24        PASS
>                ...            ...                ...       ...         ...
>           rs480725              A                  T      6.39        PASS
>    X:154925893_C/T              C                  T      6.53        PASS
>    X:155038107_C/G              C                  G      6.64        PASS
>    X:155204257_G/T              G                  T      6.35        PASS
>    X:155234730_T/C              T                  C      6.51        PASS
>    ---
>    seqlengths:
>      chr1 chr10 chr11 chr12 chr13 chr14 ...  chr5  chr6  chr7  chr8  chr9  chrX
>        NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    NA    NA
>
>
> Can the information be extracted using VariantAnnotation? I would appreciate your help with this.
> Thanks ../Murli
>
>
>
>   -- output of sessionInfo():
>
>> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.10.1
>   [2] GenomicFeatures_1.14.5
>   [3] AnnotationDbi_1.24.0
>   [4] Biobase_2.22.0
>   [5] VariantAnnotation_1.8.13
>   [6] Rsamtools_1.14.3
>   [7] Biostrings_2.30.1
>   [8] GenomicRanges_1.14.4
>   [9] XVector_0.2.0
> [10] IRanges_1.20.7
> [11] BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
>   [1] biomaRt_2.18.0     bitops_1.0-6       BSgenome_1.30.0    DBI_0.2-7
>   [5] RCurl_1.95-4.1     RSQLite_0.11.4     rtracklayer_1.22.7 stats4_3.0.2
>   [9] tools_3.0.2        XML_3.98-1.1       zlibbioc_1.8.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>

-- 
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fhcrc.org
Phone: (206) 667-3158