[Bioc-devel] Can not import gapped bam, Rsamtools bug or my parameters?

Martin Morgan mtmorgan at fhcrc.org
Sun Apr 29 04:23:59 CEST 2012


On 04/28/2012 07:01 PM, Jesper Gådin wrote:
> Hi Everyone!
> I have problems with importing a Bamfile that has its reads aligned, which
> has gaps.
> Thought that it only was to change the isUnmappedQuery to FALSE, but
> looking at my result, that is not the case.
> Read 1 and 9 (see results below) have gaps and gets wrong positions in the
> imported GRanges Object.
> Is it a bug in Rsamtools or have I set the parameters wrong?

Hi Jesper -- I can't tell what your alignment is; can you use samtools 
to view record ERR031838.7783934 and ERR031838.58134419, or use 
filterBam to create a very small bam file with just the reads that are a 
problem?

Also, maybe GenomicRanges::readGappedAlignments does some of the work 
for you?

Martin

>
> #Below the script that can reproduce the result I got
> rm(list=ls())
>
> library(Biostrings)
> library(IRanges)
> library(GenomicRanges)
> library(Rsamtools)
>
>
> BamImpGRList<-function(UserDir,searchArea,opt=TRUE){
>      #Set parameters
>      which<- searchArea #A GRanges, RangesList, RangedData, or missing
> object, from which a IRangesList instance will be constructed.
>      what<- scanBamWhat() #A character vector naming the fields to return.
> scanBamWhat() returns a vector of available fields. Fields are described on
> the scanBam help page.
>      flag<- scanBamFlag(isUnmappedQuery=FALSE)
>      #scanBamFlag(isPaired = NA, isProperPair = NA, isUnmappedQuery = FALSE,
> hasUnmappedMate = NA, isMinusStrand = NA, isMateMinusStrand = NA,
> isFirstMateRead = NA, isSecondMateRead = NA, isNotPrimaryRead = NA,
> isValidVendorRead = NA, isDuplicate = NA)
>
>      param<-ScanBamParam(flag=flag,which=which,what=what) #store
> ScanBamParam,which and what in param.
>
>      #Point to correct directory and create a BamFileList object
>      bamDir<- normalizePath(UserDir) #Point to the directory containing
> your Bam files and its respective bam.bai files.
>      allFiles<- list.files(bamDir,full.names = TRUE) #list files in a
> folder.
>      bamFiles<- allFiles[grep(".bam$",allFiles)]     #list only the files
> ending in .bam .
>      bamFilesList<- BamFileList(bamFiles)    #store all the .bam paths in a
> BamFile.
>
>      #Loop through, open scanBam, store in GRList and then close each object
> in the BamFileList object.
>      BamGRL<-GRangesList()
>      i<-1
>
>      for(bamName in names(bamFilesList)) {
>          #Description
>          bf<- bamFilesList[[bamName]]
>          open(bf)
>          print(paste("Reading bam file",i,"with
> filename",basename(bamName)))    #Print information to the user
>          bam<-scanBam(bf,param=param)
>          #Description
>          for(rangeName in names(bam)){
>
>
>
>                  ranges<-IRanges(
>                          start=bam[[rangeName]][["pos"]], #if NA values your
> in trouble. That means the read didnt map. Use better filter.
>                          width=bam[[rangeName]][["qwidth"]]
>                  )
>                  GRangeBam<-GRanges(
>                          seqnames =
> as.character(bam[[rangeName]][["rname"]]),   #Before "mrnm", now"rname"...
>                          ranges = ranges,
>                          strand = bam[[rangeName]][["strand"]],
>                          names=bam[[rangeName]][["qname"]],
>                          flag=bam[[rangeName]][["flag"]],
>                          cigar=bam[[rangeName]][["cigar"]],
>                          mapq=bam[[rangeName]][["cigar"]],
>                          mpos=bam[[rangeName]][["mpos"]],
>                          isize=bam[[rangeName]][["isize"]],
>                          seq=bam[[rangeName]][["seq"]],
>                          qual=bam[[rangeName]][["qual"]]
>                  )
>
>                  #Store GRangeBam in BamGRL (which is the GRange List
> object)
>                  if(basename(bamName)%in%names(BamGRL)){  #This way of
> merging the different chromosomes to the same GRangeObject is maybe not the
> best way. Improve later.
>
> BamGRL[[basename(bamName)]]<-c(BamGRL[[basename(bamName)]],GRangeBam)
>                  }
>                  else {
>                      BamGRL[[basename(bamName)]]<-GRangeBam
>                  }
>          }
>          print(paste("stored",basename(bamName), "in BamGRL"))
>          i<-1+i
>          gc()
>          close(bf)
>      }
>                  return(BamGRL)
>
> }
>
> #Construct SearchArea
>      IRanges<-IRanges(
>              c(109852192,57522276,216225163),
>              c(109940573,57607134,216300895),
>      )
>      searchArea2<-GRanges(
>              seqnames = c("chr1","chr12","chr2"),
>              ranges = IRanges
>      )
>
> #Import and select region of interest
> BamGRL<-BamImpGRList("../bams",searchArea2)
> ranges<- IRanges(start=216269079, width=1)
> GR<- GRanges(seqnames="chr2",ranges)
> BamGRLplus<- BamGRL[strand(BamGRL)=="+"]
> subsetReads<- subsetByOverlaps(BamGRLplus[[1]],GR)
>
> #Display the part that has the wrong positions.
> subsetReads
>
>
> ###########Below the result###########
>> subsetReads
> GRanges with 159 ranges and 8 elementMetadata cols:
>          seqnames                 ranges strand   |              names
> flag
>             <Rle>               <IRanges>   <Rle>    |<character>
> <integer>
>      [1]     chr2 [216268992, 216269081]      +   |
> ERR031838.7783934        99
>      [2]     chr2 [216268992, 216269081]      +   |
> ERR031838.49313802        99
>      [3]     chr2 [216268995, 216269084]      +   |
> ERR031838.17036310        99
>      [4]     chr2 [216268996, 216269085]      +   |
> ERR031838.57612722        99
>      [5]     chr2 [216268997, 216269086]      +   | ERR031838.33383381
> 163
>      [6]     chr2 [216268997, 216269086]      +   | ERR031838.70228271
> 163
>      [7]     chr2 [216268999, 216269088]      +   |
> ERR031838.12543963        99
>      [8]     chr2 [216268999, 216269088]      +   | ERR031838.40297870
> 163
>      [9]     chr2 [216268999, 216269088]      +   |
> ERR031838.58134419        99
>
>
> seq
>
> <DNAStringSet>
>      [1]
> CTGATTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAGT
>      [2]
> CTGATTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAG
>      [3]
> ATTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAGTTA
>      [4]
> TTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAA
>      [5]
> TTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAG
>      [6]
> TTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAGTTAAG
>      [7]
> AAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAGAA
>      [8]
> AAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAGAA
>      [9]
> AAATTATGTACTAATTTTTTAATGTTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAGA
>
>
>
>
> #SessionInfo
>> sessionInfo()
> R version 2.16.0 Under development (unstable) (--)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Rsamtools_1.7.40    GenomicRanges_1.9.7 Biostrings_2.25.2
> [4] IRanges_1.15.4      BiocGenerics_0.3.0
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-4.1 stats4_2.16.0  zlibbioc_1.3.0
>
>
> /Jesper
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioc-devel mailing list