[Bioc-devel] Rsamtools applyPileups function not merging positions from multiple files if not identical

Martin Morgan mtmorgan at fhcrc.org
Tue Apr 22 03:17:10 CEST 2014


On 04/21/2014 02:33 PM, Jonathon Hill wrote:
> Hi,
>
> I have been trying to use Rsamtools’ applyPileups function to compare two files position-by-position. In order to test it out, I simply ran:
> minBaseQuality <- 20
> minMapQuality <- 30
> minDepth <- 10
> maxDepth <- 1000
> testparam <- PileupParam(what="seq",
>                           which=GRanges(“chr20", IRanges(1, 1000000)),
>                           minBaseQuality=minBaseQuality,
>                           minMapQuality=minMapQuality,
>                           minDepth=minDepth,
>                           maxDepth=maxDepth,
> )
> fl1 <- "10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
> fl2 <- "10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
> r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)
>
> My understanding is that this should result in a three-dimensional array with “ACTGN Counts” in the first dimension, files in the second and position in the third. Positions with overlapping reads in both files should thus be collapsed into a single line in the third dimension. However, selecting one of these positions shows that they are duplicated:
>
> r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]

I think your understanding is basically correct.

The function is assuming that the BAM files are sorted by position (with, e.g., 
sortBam, but the files don't have to be sorted by Rsamtools).

Executing a similar command gives me

 > str(r3[[1]])
List of 3
  $ seqnames: Named int 211195
   ..- attr(*, "names")= chr "chr20"
  $ pos     : int [1:211195] 60026 60027 60028 60029 60030 60031 60032 60033 
60034 60035 ...
  $ seq     : int [1:5, 1:2, 1:211195] 0 0 0 0 0 0 0 0 0 0 ...
   ..- attr(*, "dimnames")=List of 3
   .. ..$ : chr [1:5] "A" "C" "G" "T" ...
   .. ..$ : chr [1:2] "normal_srx113635_sorted.bam" "tumor_srx036691_sorted.bam"
   .. ..$ : NULL

Do you get something similar, especially the identical seqnames, pos dimension, 
and third dimension of seq? 'pos' should apparently be unique; so

 > any(duplicated(r3[[1]][["pos"]]))
[1] FALSE

If there are duplicates, I wonder how many there are and where they occur

   pos = r3[[1]][["pos"]]
   table(table(pos))
   udpos = unique(pos[duplicated(pos)])
   head(pos[match(pos, udpos)], 20)
   head(match(pos, udpos), 20)

If nothing is suggested by the above, can you make a subset of the BAM files 
available to me, e.g., the result of

   param = ScanBamParam(which=GRanges("chr20", IRanges(1, 1000000)))
   filterBam(fls[1], tempfile(), param=param)

Thanks,

Martin

> yields:
>
> , , 1
>
>    10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam 10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
> A                                                   0                                                   0
> C                                                   0                                                   0
> G                                                  10                                                   0
> T                                                   0                                                   0
> N                                                   0                                                   0
>
> , , 2
>
>    10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam 10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
> A                                                   0                                                   0
> C                                                   0                                                   0
> G                                                   0                                                  13
> T                                                   0                                                   0
> N                                                   0                                                   0
>
> Even though the position is the same, it is showing up twice. Each time, one of the files shows zeroes. This is not consistent with what happens if the files are identical (as in the example from the help docs).
>
> For example,
>
> r3 = applyPileups(PileupFiles(c(fl1, fl1)), function(x) x, param=testparam) #file 1 entered twice
>   r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]
>
> yields:
>
>    10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
> A                                                   0                                                   0
> C                                                   0                                                   0
> G                                                  10                                                  10
> T                                                   0                                                   0
> N                                                   0                                                   0
>
> Is this the expected behavior? It seems like each position should only show up once in the output. Is there something I am missing?
>
> Thanks,
>
> Jonathon Hill
> Postdoc
> Yost Lab, University of Utah
>
>
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-devel mailing list