[Bioc-devel] Can not import gapped bam, Rsamtools bug or my parameters?
Martin Morgan
mtmorgan at fhcrc.org
Sun Apr 29 04:23:59 CEST 2012
On 04/28/2012 07:01 PM, Jesper Gådin wrote:
> Hi Everyone!
> I have problems with importing a Bamfile that has its reads aligned, which
> has gaps.
> Thought that it only was to change the isUnmappedQuery to FALSE, but
> looking at my result, that is not the case.
> Read 1 and 9 (see results below) have gaps and gets wrong positions in the
> imported GRanges Object.
> Is it a bug in Rsamtools or have I set the parameters wrong?
Hi Jesper -- I can't tell what your alignment is; can you use samtools
to view record ERR031838.7783934 and ERR031838.58134419, or use
filterBam to create a very small bam file with just the reads that are a
problem?
Also, maybe GenomicRanges::readGappedAlignments does some of the work
for you?
Martin
>
> #Below the script that can reproduce the result I got
> rm(list=ls())
>
> library(Biostrings)
> library(IRanges)
> library(GenomicRanges)
> library(Rsamtools)
>
>
> BamImpGRList<-function(UserDir,searchArea,opt=TRUE){
> #Set parameters
> which<- searchArea #A GRanges, RangesList, RangedData, or missing
> object, from which a IRangesList instance will be constructed.
> what<- scanBamWhat() #A character vector naming the fields to return.
> scanBamWhat() returns a vector of available fields. Fields are described on
> the scanBam help page.
> flag<- scanBamFlag(isUnmappedQuery=FALSE)
> #scanBamFlag(isPaired = NA, isProperPair = NA, isUnmappedQuery = FALSE,
> hasUnmappedMate = NA, isMinusStrand = NA, isMateMinusStrand = NA,
> isFirstMateRead = NA, isSecondMateRead = NA, isNotPrimaryRead = NA,
> isValidVendorRead = NA, isDuplicate = NA)
>
> param<-ScanBamParam(flag=flag,which=which,what=what) #store
> ScanBamParam,which and what in param.
>
> #Point to correct directory and create a BamFileList object
> bamDir<- normalizePath(UserDir) #Point to the directory containing
> your Bam files and its respective bam.bai files.
> allFiles<- list.files(bamDir,full.names = TRUE) #list files in a
> folder.
> bamFiles<- allFiles[grep(".bam$",allFiles)] #list only the files
> ending in .bam .
> bamFilesList<- BamFileList(bamFiles) #store all the .bam paths in a
> BamFile.
>
> #Loop through, open scanBam, store in GRList and then close each object
> in the BamFileList object.
> BamGRL<-GRangesList()
> i<-1
>
> for(bamName in names(bamFilesList)) {
> #Description
> bf<- bamFilesList[[bamName]]
> open(bf)
> print(paste("Reading bam file",i,"with
> filename",basename(bamName))) #Print information to the user
> bam<-scanBam(bf,param=param)
> #Description
> for(rangeName in names(bam)){
>
>
>
> ranges<-IRanges(
> start=bam[[rangeName]][["pos"]], #if NA values your
> in trouble. That means the read didnt map. Use better filter.
> width=bam[[rangeName]][["qwidth"]]
> )
> GRangeBam<-GRanges(
> seqnames =
> as.character(bam[[rangeName]][["rname"]]), #Before "mrnm", now"rname"...
> ranges = ranges,
> strand = bam[[rangeName]][["strand"]],
> names=bam[[rangeName]][["qname"]],
> flag=bam[[rangeName]][["flag"]],
> cigar=bam[[rangeName]][["cigar"]],
> mapq=bam[[rangeName]][["cigar"]],
> mpos=bam[[rangeName]][["mpos"]],
> isize=bam[[rangeName]][["isize"]],
> seq=bam[[rangeName]][["seq"]],
> qual=bam[[rangeName]][["qual"]]
> )
>
> #Store GRangeBam in BamGRL (which is the GRange List
> object)
> if(basename(bamName)%in%names(BamGRL)){ #This way of
> merging the different chromosomes to the same GRangeObject is maybe not the
> best way. Improve later.
>
> BamGRL[[basename(bamName)]]<-c(BamGRL[[basename(bamName)]],GRangeBam)
> }
> else {
> BamGRL[[basename(bamName)]]<-GRangeBam
> }
> }
> print(paste("stored",basename(bamName), "in BamGRL"))
> i<-1+i
> gc()
> close(bf)
> }
> return(BamGRL)
>
> }
>
> #Construct SearchArea
> IRanges<-IRanges(
> c(109852192,57522276,216225163),
> c(109940573,57607134,216300895),
> )
> searchArea2<-GRanges(
> seqnames = c("chr1","chr12","chr2"),
> ranges = IRanges
> )
>
> #Import and select region of interest
> BamGRL<-BamImpGRList("../bams",searchArea2)
> ranges<- IRanges(start=216269079, width=1)
> GR<- GRanges(seqnames="chr2",ranges)
> BamGRLplus<- BamGRL[strand(BamGRL)=="+"]
> subsetReads<- subsetByOverlaps(BamGRLplus[[1]],GR)
>
> #Display the part that has the wrong positions.
> subsetReads
>
>
> ###########Below the result###########
>> subsetReads
> GRanges with 159 ranges and 8 elementMetadata cols:
> seqnames ranges strand | names
> flag
> <Rle> <IRanges> <Rle> |<character>
> <integer>
> [1] chr2 [216268992, 216269081] + |
> ERR031838.7783934 99
> [2] chr2 [216268992, 216269081] + |
> ERR031838.49313802 99
> [3] chr2 [216268995, 216269084] + |
> ERR031838.17036310 99
> [4] chr2 [216268996, 216269085] + |
> ERR031838.57612722 99
> [5] chr2 [216268997, 216269086] + | ERR031838.33383381
> 163
> [6] chr2 [216268997, 216269086] + | ERR031838.70228271
> 163
> [7] chr2 [216268999, 216269088] + |
> ERR031838.12543963 99
> [8] chr2 [216268999, 216269088] + | ERR031838.40297870
> 163
> [9] chr2 [216268999, 216269088] + |
> ERR031838.58134419 99
>
>
> seq
>
> <DNAStringSet>
> [1]
> CTGATTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAGT
> [2]
> CTGATTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAG
> [3]
> ATTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAGTTA
> [4]
> TTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAA
> [5]
> TTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAG
> [6]
> TTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAGTTAAG
> [7]
> AAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAGAA
> [8]
> AAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAGAA
> [9]
> AAATTATGTACTAATTTTTTAATGTTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAGA
>
>
>
>
> #SessionInfo
>> sessionInfo()
> R version 2.16.0 Under development (unstable) (--)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Rsamtools_1.7.40 GenomicRanges_1.9.7 Biostrings_2.25.2
> [4] IRanges_1.15.4 BiocGenerics_0.3.0
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-4.1 stats4_2.16.0 zlibbioc_1.3.0
>
>
> /Jesper
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the Bioc-devel
mailing list