[Bioc-devel] Can not import gapped bam, Rsamtools bug or my parameters?

Martin Morgan mtmorgan at fhcrc.org
Sun Apr 29 04:40:12 CEST 2012


On 04/28/2012 07:23 PM, Martin Morgan wrote:
> On 04/28/2012 07:01 PM, Jesper Gådin wrote:
>> Hi Everyone!
>> I have problems with importing a Bamfile that has its reads aligned,
>> which
>> has gaps.
>> Thought that it only was to change the isUnmappedQuery to FALSE, but
>> looking at my result, that is not the case.
>> Read 1 and 9 (see results below) have gaps and gets wrong positions in
>> the
>> imported GRanges Object.
>> Is it a bug in Rsamtools or have I set the parameters wrong?
>
> Hi Jesper -- I can't tell what your alignment is; can you use samtools
> to view record ERR031838.7783934 and ERR031838.58134419, or use
> filterBam to create a very small bam file with just the reads that are a
> problem?
>
> Also, maybe GenomicRanges::readGappedAlignments does some of the work
> for you?

Sorry, looking at your code a little more closely...

                  ranges<-IRanges(
                          start=bam[[rangeName]][["pos"]],
                          width=bam[[rangeName]][["qwidth"]]
                  )

isn't correct -- qwidth is the width of the query (short read), not the 
width of the alignment. The width of the alignment needs to be 
calculated using the cigar; see GenomicRanges::cigarToWidth. But 
readGappedAlignments (also readBamGappedAlignments / readBamGappedReads) 
will do this work for you.

Martin


>
> Martin

>
>>
>> #Below the script that can reproduce the result I got
>> rm(list=ls())
>>
>> library(Biostrings)
>> library(IRanges)
>> library(GenomicRanges)
>> library(Rsamtools)
>>
>>
>> BamImpGRList<-function(UserDir,searchArea,opt=TRUE){
>> #Set parameters
>> which<- searchArea #A GRanges, RangesList, RangedData, or missing
>> object, from which a IRangesList instance will be constructed.
>> what<- scanBamWhat() #A character vector naming the fields to return.
>> scanBamWhat() returns a vector of available fields. Fields are
>> described on
>> the scanBam help page.
>> flag<- scanBamFlag(isUnmappedQuery=FALSE)
>> #scanBamFlag(isPaired = NA, isProperPair = NA, isUnmappedQuery = FALSE,
>> hasUnmappedMate = NA, isMinusStrand = NA, isMateMinusStrand = NA,
>> isFirstMateRead = NA, isSecondMateRead = NA, isNotPrimaryRead = NA,
>> isValidVendorRead = NA, isDuplicate = NA)
>>
>> param<-ScanBamParam(flag=flag,which=which,what=what) #store
>> ScanBamParam,which and what in param.
>>
>> #Point to correct directory and create a BamFileList object
>> bamDir<- normalizePath(UserDir) #Point to the directory containing
>> your Bam files and its respective bam.bai files.
>> allFiles<- list.files(bamDir,full.names = TRUE) #list files in a
>> folder.
>> bamFiles<- allFiles[grep(".bam$",allFiles)] #list only the files
>> ending in .bam .
>> bamFilesList<- BamFileList(bamFiles) #store all the .bam paths in a
>> BamFile.
>>
>> #Loop through, open scanBam, store in GRList and then close each object
>> in the BamFileList object.
>> BamGRL<-GRangesList()
>> i<-1
>>
>> for(bamName in names(bamFilesList)) {
>> #Description
>> bf<- bamFilesList[[bamName]]
>> open(bf)
>> print(paste("Reading bam file",i,"with
>> filename",basename(bamName))) #Print information to the user
>> bam<-scanBam(bf,param=param)
>> #Description
>> for(rangeName in names(bam)){
>>
>>
>>
>> ranges<-IRanges(
>> start=bam[[rangeName]][["pos"]], #if NA values your
>> in trouble. That means the read didnt map. Use better filter.
>> width=bam[[rangeName]][["qwidth"]]
>> )
>> GRangeBam<-GRanges(
>> seqnames =
>> as.character(bam[[rangeName]][["rname"]]), #Before "mrnm", now"rname"...
>> ranges = ranges,
>> strand = bam[[rangeName]][["strand"]],
>> names=bam[[rangeName]][["qname"]],
>> flag=bam[[rangeName]][["flag"]],
>> cigar=bam[[rangeName]][["cigar"]],
>> mapq=bam[[rangeName]][["cigar"]],
>> mpos=bam[[rangeName]][["mpos"]],
>> isize=bam[[rangeName]][["isize"]],
>> seq=bam[[rangeName]][["seq"]],
>> qual=bam[[rangeName]][["qual"]]
>> )
>>
>> #Store GRangeBam in BamGRL (which is the GRange List
>> object)
>> if(basename(bamName)%in%names(BamGRL)){ #This way of
>> merging the different chromosomes to the same GRangeObject is maybe
>> not the
>> best way. Improve later.
>>
>> BamGRL[[basename(bamName)]]<-c(BamGRL[[basename(bamName)]],GRangeBam)
>> }
>> else {
>> BamGRL[[basename(bamName)]]<-GRangeBam
>> }
>> }
>> print(paste("stored",basename(bamName), "in BamGRL"))
>> i<-1+i
>> gc()
>> close(bf)
>> }
>> return(BamGRL)
>>
>> }
>>
>> #Construct SearchArea
>> IRanges<-IRanges(
>> c(109852192,57522276,216225163),
>> c(109940573,57607134,216300895),
>> )
>> searchArea2<-GRanges(
>> seqnames = c("chr1","chr12","chr2"),
>> ranges = IRanges
>> )
>>
>> #Import and select region of interest
>> BamGRL<-BamImpGRList("../bams",searchArea2)
>> ranges<- IRanges(start=216269079, width=1)
>> GR<- GRanges(seqnames="chr2",ranges)
>> BamGRLplus<- BamGRL[strand(BamGRL)=="+"]
>> subsetReads<- subsetByOverlaps(BamGRLplus[[1]],GR)
>>
>> #Display the part that has the wrong positions.
>> subsetReads
>>
>>
>> ###########Below the result###########
>>> subsetReads
>> GRanges with 159 ranges and 8 elementMetadata cols:
>> seqnames ranges strand | names
>> flag
>> <Rle> <IRanges> <Rle> |<character>
>> <integer>
>> [1] chr2 [216268992, 216269081] + |
>> ERR031838.7783934 99
>> [2] chr2 [216268992, 216269081] + |
>> ERR031838.49313802 99
>> [3] chr2 [216268995, 216269084] + |
>> ERR031838.17036310 99
>> [4] chr2 [216268996, 216269085] + |
>> ERR031838.57612722 99
>> [5] chr2 [216268997, 216269086] + | ERR031838.33383381
>> 163
>> [6] chr2 [216268997, 216269086] + | ERR031838.70228271
>> 163
>> [7] chr2 [216268999, 216269088] + |
>> ERR031838.12543963 99
>> [8] chr2 [216268999, 216269088] + | ERR031838.40297870
>> 163
>> [9] chr2 [216268999, 216269088] + |
>> ERR031838.58134419 99
>>
>>
>> seq
>>
>> <DNAStringSet>
>> [1]
>> CTGATTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAGT
>>
>> [2]
>> CTGATTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAG
>>
>> [3]
>> ATTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAGTTA
>>
>> [4]
>> TTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAA
>>
>> [5]
>> TTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAG
>>
>> [6]
>> TTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAGTTAAG
>>
>> [7]
>> AAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAGAA
>>
>> [8]
>> AAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAGAA
>>
>> [9]
>> AAATTATGTACTAATTTTTTAATGTTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAGA
>>
>>
>>
>>
>>
>> #SessionInfo
>>> sessionInfo()
>> R version 2.16.0 Under development (unstable) (--)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] Rsamtools_1.7.40 GenomicRanges_1.9.7 Biostrings_2.25.2
>> [4] IRanges_1.15.4 BiocGenerics_0.3.0
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-4.1 stats4_2.16.0 zlibbioc_1.3.0
>>
>>
>> /Jesper
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioc-devel mailing list