[Bioc-devel] Rsamtools applyPileups function not merging positions from multiple files if not identical

Jonathon Hill jhill at genetics.utah.edu
Tue Apr 22 16:44:04 CEST 2014


Hi Martin,

Thank you for your response. I checked the header and it says that it is coordinate sorted, so that shouldn’t be the problem. Here are the results of the code you provided:

> r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)
> any(duplicated(r3[[1]][["pos"]]))
[1] TRUE
> pos = r3[[1]][["pos"]]
> table(table(pos))

   1    2
4834 3115
> udpos = unique(pos[duplicated(pos)])
> head(pos[match(pos, udpos)], 20)
 [1] 135003 135006 135007 135008 135009 135010 135011 135012 135013 135014 135015 135016 135017 135018 135019 135020 135021 135022 135023
[20] 135024
> head(match(pos, udpos), 20)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> table(pos)[1:50]
pos
134762 134763 134764 134765 134766 134767 134768 134769 134770 134771 134772 134773 134774 134775 134776 134777 134778 134779 134780
     1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1
134781 134782 134783 134784 134785 134786 134787 134788 134991 134992 134993 134994 134995 134999 135001 135002 135003 135004 135005
     1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      2      1      1
135006 135007 135008 135009 135010 135011 135012 135013 135014 135015 135016 135017
     2      2      2      2      2      2      2      2      2      2      2      2

I think the last one shows it well. There are duplicates throughout the file, anywhere that there are reads in both files. I have attached the bam files you requested showing the region used here.

Thanks,

Jonathon

On Apr 21, 2014, at 7:17 PM, Martin Morgan <mtmorgan at fhcrc.org<mailto:mtmorgan at fhcrc.org>> wrote:

On 04/21/2014 02:33 PM, Jonathon Hill wrote:
Hi,

I have been trying to use Rsamtools’ applyPileups function to compare two files position-by-position. In order to test it out, I simply ran:
minBaseQuality <- 20
minMapQuality <- 30
minDepth <- 10
maxDepth <- 1000
testparam <- PileupParam(what="seq",
                         which=GRanges(“chr20", IRanges(1, 1000000)),
                         minBaseQuality=minBaseQuality,
                         minMapQuality=minMapQuality,
                         minDepth=minDepth,
                         maxDepth=maxDepth,
)
fl1 <- "10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
fl2 <- "10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)

My understanding is that this should result in a three-dimensional array with “ACTGN Counts” in the first dimension, files in the second and position in the third. Positions with overlapping reads in both files should thus be collapsed into a single line in the third dimension. However, selecting one of these positions shows that they are duplicated:

r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]

I think your understanding is basically correct.

The function is assuming that the BAM files are sorted by position (with, e.g., sortBam, but the files don't have to be sorted by Rsamtools).

Executing a similar command gives me

> str(r3[[1]])
List of 3
$ seqnames: Named int 211195
 ..- attr(*, "names")= chr "chr20"
$ pos     : int [1:211195] 60026 60027 60028 60029 60030 60031 60032 60033 60034 60035 ...
$ seq     : int [1:5, 1:2, 1:211195] 0 0 0 0 0 0 0 0 0 0 ...
 ..- attr(*, "dimnames")=List of 3
 .. ..$ : chr [1:5] "A" "C" "G" "T" ...
 .. ..$ : chr [1:2] "normal_srx113635_sorted.bam" "tumor_srx036691_sorted.bam"
 .. ..$ : NULL

Do you get something similar, especially the identical seqnames, pos dimension, and third dimension of seq? 'pos' should apparently be unique; so

> any(duplicated(r3[[1]][["pos"]]))
[1] FALSE

If there are duplicates, I wonder how many there are and where they occur

 pos = r3[[1]][["pos"]]
 table(table(pos))
 udpos = unique(pos[duplicated(pos)])
 head(pos[match(pos, udpos)], 20)
 head(match(pos, udpos), 20)

If nothing is suggested by the above, can you make a subset of the BAM files available to me, e.g., the result of

 param = ScanBamParam(which=GRanges("chr20", IRanges(1, 1000000)))
 filterBam(fls[1], tempfile(), param=param)

Thanks,

Martin

yields:

, , 1

  10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam 10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
A                                                   0                                                   0
C                                                   0                                                   0
G                                                  10                                                   0
T                                                   0                                                   0
N                                                   0                                                   0

, , 2

  10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam 10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
A                                                   0                                                   0
C                                                   0                                                   0
G                                                   0                                                  13
T                                                   0                                                   0
N                                                   0                                                   0

Even though the position is the same, it is showing up twice. Each time, one of the files shows zeroes. This is not consistent with what happens if the files are identical (as in the example from the help docs).

For example,

r3 = applyPileups(PileupFiles(c(fl1, fl1)), function(x) x, param=testparam) #file 1 entered twice
 r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]

yields:

  10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
A                                                   0                                                   0
C                                                   0                                                   0
G                                                  10                                                  10
T                                                   0                                                   0
N                                                   0                                                   0

Is this the expected behavior? It seems like each position should only show up once in the output. Is there something I am missing?

Thanks,

Jonathon Hill
Postdoc
Yost Lab, University of Utah



[[alternative HTML version deleted]]



_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-devel mailing list