[R] Transport and Earth Mover's Distance
Schuhmacher, Dominic
dominic.schuhmacher at mathematik.uni-goettingen.de
Wed Mar 8 11:28:59 CET 2017
Dear Lorenzo,
No, the code does not do what you are after. R-package transport is for point patterns and histograms in two and more dimensions. You have a distribution in one dimension.
> 1) two distributions with the same bins (I identify each bin by the
> central point in the bin).
>
> n_bin <- 11 # number of bins
>
> bin_structure <- seq(10, by=1, len=n_bin)
>
> set.seed(1234)
>
> x_counts <- rpois(n_bin, 10)
> y_counts <- rpois(n_bin, 10)
>
> x <- pp(as.matrix(cbind(bin_structure, x_counts)))
# 11 points in two dimensions with x-coordinates bin_structure and y-coordinates x_counts
> y <- pp(as.matrix(cbind(bin_structure, y_counts)))
# 11 points in two dimensions with x-coordinates bin_structure and y-coordinates y_counts
> match <- transport(x,y,p=1)
# compute the optimal transport between the two point patterns (assuming mass 1 at each point), i.e. simply match the points in 2-d
> plot(x,y,match)
> wasserstein_dist <- wasserstein(x,y,p=1,match)
# the average distance by which points are moved.
You could trick transport into solving one-dimensional problems. But given that one-dimensional optimal transport is computationally simple, this would be highly inefficient. Depending on what you want to do with „additional mass“ (in your examples, you have count data with differing total counts), the following might be what you want:
x_prob <- x_counts/sum(x_counts)
y_prob <- y_counts/sum(y_counts)
sum(abs(cumsum(x_prob)-cumsum(y_prob)))
This gives the 1-Wasserstein distance between the probability histograms (i.e. the empirical distributions), assuming as in your example equally spaced-bins of width 1. You need the same bins for both samples, but you can treat your Example 2 by adding bins with count 0 to x_counts and y_counts.
>>
>> If you have no particular need for binning, check out the function
>> pppdist in the R-package spatstat, which offers a more flexible way
>> to deal with point patterns of different size.
>
>
> Well, this is not clear, but possibly very important for me.
> My raw data consists of 2 univariate samples of unequal length.
>
> suppose that
>
> x<-rnorm(100)
>
> and
>
> y<-rnorm(90)
>
> Is there a way to define the Wasserstein distance between them without
> going through the binning procedure?
>
Define, yes: the 1-Wasserstein distance in one-dimension is the area between the empirical cumulative distribution functions. If the samples had the same lengths this could be directly computed by
mean(abs(sort(x)-sort(y)))
Otherwise this needs some lines of code. I will include it in the next version of the transport package (soon).
Best regards,
Dominic
> Am 07.03.2017 um 16:32 schrieb Lorenzo Isella <lorenzo.isella at gmail.com>:
>
> Dear Dominic,
> Thanks a lot for the quick reply.
> Just a few questions to make sure I got it all right (I now understand that
> transport and spatstat in particular can do much more than I need
> right now).
> Essentially I am after the Wasserstein distance between univariate
> distributions (and it would be great if I can extend it to the
> case of two distributions with a different bin structure).
>
> 1) two distributions with the same bins (I identify each bin by the
> central point in the bin).
>
> n_bin <- 11 # number of bins
>
> bin_structure <- seq(10, by=1, len=n_bin)
>
> set.seed(1234)
>
> x_counts <- rpois(n_bin, 10)
> y_counts <- rpois(n_bin, 10)
>
> x <- pp(as.matrix(cbind(bin_structure, x_counts)))
> y <- pp(as.matrix(cbind(bin_structure, y_counts)))
>
>
> match <- transport(x,y,p=1)
> plot(x,y,match)
> wasserstein_dist <- wasserstein(x,y,p=1,match)
>
>
> 2) Now I do not have the same bin structure
>
>
> y2 <- pp(as.matrix(cbind(bin_structure+2, y_counts)))
>
>
> match <- transport(x,y2,p=1)
> plot(x,y2,match)
> wasserstein_dist2 <- wasserstein(x,y2,p=1,match)
>
>
> Do 1) and 2) make sense?
>
>>
>> If you have no particular need for binning, check out the function
>> pppdist in the R-package spatstat, which offers a more flexible way
>> to deal with point patterns of different size.
>
>
> Well, this is not clear, but possibly very important for me.
> My raw data consists of 2 univariate samples of unequal length.
>
> suppose that
>
> x<-rnorm(100)
>
> and
>
> y<-rnorm(90)
>
> Is there a way to define the Wasserstein distance between them without
> going through the binning procedure?
>
>
>
> Many thanks!
>
> Lorenzo
------------------------------------
Prof. Dr. Dominic Schuhmacher
Institute for Mathematical Stochastics
University of Goettingen
Goldschmidtstrasse 7
D-37077 Goettingen
Germany
Phone: +49 (0)551 39172107
E-mail: dominic.schuhmacher at mathematik.uni-goettingen.de
More information about the R-help
mailing list