[Bioc-sig-seq] `+` for GenomeData and coverage from several lanes

Simon Anders anders at ebi.ac.uk
Sun Jul 5 13:35:35 CEST 2009


Hi Patrick

Patrick Aboyoun wrote:
> I worked on beefing up the low-level IRanges Sequence class, which has
> around 80 subclasses including GenomeData, GenomeDataList, RleList, to
> include functional programming tools like Reduce, Filter, Find, Map,
[...]

Thanks, this looks quite useful.


I've just played with it and found one issue with chromsome names.

It seems that you Map function finds corresponding Rle vectors in two
GenomeData objects by looking at the list indices but not at the names.

Here you have two GenomeData objects which have the same values but
swapped chromosome names:

> gd1 <- GenomeData( list(
    chrA = Rle( rep( c(0,3,0), each=20 ) ),
    chrB = Rle( rep( c(0,2,0), each=30 ) ) ) )

> gd2 <- GenomeData( list(
    chrB = Rle( rep( c(0,3,0), each=20 ) ),
    chrA = Rle( rep( c(0,2,0), each=30 ) ) ) )

If I add them up, I do not add chromosome A to chromosome A, but the
first chromsome to the first one:

> Map( "+", gd1, gd2 )
$chrA
  'numeric' Rle instance of length 60 with 3 runs
  Lengths:  20 20 20
  Values :  0 6 0

$chrB
  'numeric' Rle instance of length 90 with 3 runs
  Lengths:  30 30 30
  Values :  0 4 0

This is of course consistent with what base::Map does but violated the
semantics of the GenomeData object. Imagine you use 'coverage' on two
sets of aligned reads, and for some reasons, one chromosome never
appears in the first set, and another one never appears in the second
one. You will have two GenomeData objects with the same number of
chromosomes, and their sum will be completely wrong.


Martin's gdreduce function, as I've just noticed, does not have this issue:

> str( gdreduce( "+", gd1, gd2 ) )
Formal class 'GenomeData' [package "BSgenome"] with 4 slots
  ..@ listData       :List of 2
  .. ..$ chrA:Formal class 'Rle' [package "IRanges"] with 5 slots
  .. .. .. ..@ values         : num [1:6] 0 3 5 2 0 3
  .. .. .. ..@ lengths        : int [1:6] 20 10 10 20 20 10
  .. .. .. ..@ elementMetadata: NULL
  .. .. .. ..@ elementType    : chr "ANYTHING"
  .. .. .. ..@ metadata       : list()
  .. ..$ chrB:Formal class 'Rle' [package "IRanges"] with 5 slots
  .. .. .. ..@ values         : num [1:6] 0 3 5 2 0 3
  .. .. .. ..@ lengths        : int [1:6] 20 10 10 20 20 10
  .. .. .. ..@ elementMetadata: NULL
  .. .. .. ..@ elementType    : chr "ANYTHING"
  .. .. .. ..@ metadata       : list()
  ..@ elementMetadata: NULL
  ..@ elementType    : chr "ANYTHING"
  ..@ metadata       :List of 3
  .. ..$ organism       : NULL
  .. ..$ provider       : NULL
  .. ..$ providerVersion: NULL
Warning messages:
1: In f(init, x[[i]]) :
  longer object length is not a multiple of shorter object length
2: In f(init, x[[i]]) :
  longer object length is not a multiple of shorter object length


Finally: Why did you call it 'map' and Martin called it 'reduce'?
Actually, I think, 'map' is correct for the two-argument case. See the
following use-case here for an example of reducing with a map function.
Assuming that 'lanes' is a list of 'AlignedRead' objects, I expect to
get the coverage, summed over all these lanes (which are from the same
condition), by writing something like

coverageOfAllLanes <-
   Reduce(
      function( gd1, gd2 ) Map( "+", gd1, gd2 ),
      lapply( lanes, coverage, width = chromLengths ) )


Cheers
  Simon



More information about the Bioc-sig-sequencing mailing list