[BioC] Filtering BAM files by start position for VariantTools
Valerie Obenchain
vobencha at fhcrc.org
Thu Jul 11 19:02:18 CEST 2013
Hi Sean,
As you've discovered, the 'which' in the 'param' (for reading bam files)
specifies positions to overlap, not start or end. One approach to
isolating reads with a specific starting position would be to filter the
bam by 'pos'.
library(VariantTools)
fl <- LungCancerLines::LungCancerBamFiles()$H1993
mystart <- 1110426
filt <- list(setStart=function(x) x$pos %in% mystart)
dest <- tempfile()
filterBam(fl, dest, filter=FilterRules(filt))
scn <- scanBam(dest)
Confirm all reads start with 'mystart':
> table(scn[[1]]$pos)
1110426
2388
If you want a tally of all nucleotides for all sequences starting with
'mystart' then no need to supply a 'which':
param <- VariantTallyParam(gmapR::TP53Genome(),
readlen=100L,
high_base_quality=23L)
tally <- tallyVariants(fl, param)
Valerie
On 07/09/2013 02:06 PM, Taylor, Sean D wrote:
> I am trying to read a specific set of records from a bam file for use in the VariantTools package. I'm trying to construct a which argument (a GRanges object) that will pull in a set of records from all reads that only start at a specified position. (i.e. all reads that start at position 100). So far I have only been able to specify reads that overlap position 100, but have not been able to find a way to define the start site.
>
> #Example code:
>> bams <- LungCancerLines::LungCancerBamFiles()
>> bam <- bams$H1993
>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+')
>> tally.param <- VariantTallyParam(gmapR::TP53Genome(),
> + readlen = 100L,
> + high_base_quality = 23L,
> + which = which)
>> raw.variants <- tallyVariants(bam, tally.param)
>
> This code shows all the variants at position 1110426, but not all the variants from the reads that start at position 1110426.
>
> Ultimately, I am trying to do this for all start positions in my data set, so I would want something that looks like this pseudocode:
>> raw.variants<-lapply (start(bam), function (x){
> which<-GRanges(seqnames=c('chrM'), '+', start=x)
> tally.param<-VariantTallyParam(gmap, readlen=100L, which=which)
> tallyVariants(bamfile, tally.param)
> })
>
> Thanks,
> Sean
>
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] grid parallel stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2
> [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 AnnotationDbi_1.22.5
> [7] Biobase_2.20.0 gmapR_1.2.0 latticeExtra_0.6-24
> [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8
> [13] ade4_1.5-2 VariantTools_1.2.2 VariantAnnotation_1.6.6
> [16] Rsamtools_1.12.3 Biostrings_2.28.0 GenomicRanges_1.12.4
> [19] IRanges_1.18.1 BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.16.0 bitops_1.0-5
> [3] BSgenome_1.28.0 BSgenome.Hsapiens.UCSC.hg19_1.3.19
> [5] graph_1.38.2 Matrix_1.0-12
> [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2
> [9] RCurl_1.95-4.1 rtracklayer_1.20.2
> [11] stats4_3.0.1 tools_3.0.1
> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1
> [15] zlibbioc_1.6.0
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list