[R] stats 'dist' euclidean distance calculation

Thu Mar 15 11:05:08 CET 2018

Hi Cheyenne,

I noticed one thing that might be helpful to you.
First, I took a shortcut to the case of interest:

> m <- matrix(c(2,1,1,0,1,1,NA,1,1,NA,1,1,2,1,2,0,1,0),nrow=3)
> colnames(m) <- c("1.G","1.A","2.C","2.A","3.G","3.A")
> m
#       1.G  1.A  2.C  2.A  3.G  3.A
# [1,]   2    0     NA   NA   2     0
# [2,]   1   1      1      1      1     1
# [3,]   1   1      1      1      2     0

Computing the distance between the different rows by hand - TREATING THE
NA's AS ZERO -
would give:
dist(row1,row2) = sqrt( 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2) = sqrt(6) = 2.45
dist(row1,row3) = sqrt(  1^2 + 1^2 + 1^2 + 1^2 + 0^2 + 0^2) = sqrt(4) = 2
dist(row2,row3) = sqrt(   0^2 + 0^2 + 0^2 + 0^2 + 1^2 + 1^2) = sqrt(2) =
1.414

Doing the same calculation with the dist() function gives
> dist(m)
#            1                  2
# 2        2.45
# 3        1.73              1.414

i.e. the results match with the manual calculation for dist(row1,row2) and
dist(row2,row3).
However for dist(row1,row3) which should be 2, the dist function gives 1.73
= sqrt(3).
Clearly sqrt(3) makes no sense since |1 - NA|^2 appears twice. Either both
times it should get a value of 1 or neither time. Why only once? Not clear
to me and I did not see any hints on ?dist.

However if you replace the NA's by actual 0's, which seems to be your
preferred methodology, then the problem is "solved", i.e.
> m2 <- m
> m2[1,][is.na(m2[1,])] <- 0
> dist(m2)
#            1                  2
# 2        2.45
# 3        2              1.414

HTH,
Eric

On Thu, Mar 15, 2018 at 2:11 AM, Cheyenne Yancey Payne <cypayne at stanford.edu
> wrote:

> Hello,
>
> I am working with a matrix of multilocus genotypes for ~180 individual
> snail samples, with substantial missing data. I am trying to calculate the
> pairwise genetic distance between individuals using the stats package
> 'dist' function, using euclidean distance. I took a subset of this dataset
> (3 samples x 3 loci) to test how euclidean distance is calculated:
>
> 3x3 subset used
>                          Locus1     Locus2         Locus3
> Samp1               GG           <NA>           GG
> Samp2               AG             CA              GA
> Samp3               AG             CA              GG
>
> The euclidean distance function is defined as: sqrt(sum((x_i - y_i)^2))
> My assumption was that the difference between x_i and y_i would be the
> number of allelic differences at each base pair site between samples. For
> example, the euclidean distance between Samp1 and Samp2 would be:
>
> euclidean distance = sqrt( S1_L1 - S2_L1)^2 + (S1_L2 - S2_L2)^2 + (S1_L3 -
> S2_L3)^2 )
> at locus 1: GG - AG --> one basepair difference --> (1)^2 = 1
> at locus 2: <NA> - CA --> two basepair differences --> (2)^2 = 4
> at locus 3: GG - GA --> one basepair difference --> (1)^2 = 1
>
> euclidean distance = sqrt( 1 + 4 + 1 ) = sqrt( 6 ) = 2.44940
>
> Calculating euclidean distances this way, the distance matrix should be:
> #                   Samp1               Samp2             Samp3
> # Samp1       0.000000           2.449400          2.236068
> # Samp2       2.449400           0.000000          1.000000
> # Samp3       2.236068           1.000000          0.000000
>
> However, this distance matrix differs from the one calculated by the R
> stats package 'dist' function:
> #                   Samp1               Samp2             Samp3
> # Samp1       0.000000           3.478652          2.659285
> # Samp2       3.478652           0.000000          2.124787
> # Samp3       2.659285           2.124787          0.000000
>
> I used the following code (with intermediate output) to achieve the latter
> distance matrix:
>
>
> >>>
> setwd("~/Desktop/R_stuff")
>
> ### Data Prep: load and collect info from matrix file
> infile<-'~/Desktop/R_stuff/good_conch_mplex_03052018.txt'
> Mydata <- read.table(infile, header = TRUE, check.names = FALSE)
> dim(Mydata) # dimensions of data.frame
> ind <- as.character(Mydata$sample) # individual labels
> population <- as.character(Mydata$region) # population labels
> location <- Mydata$location
>
> ### Section 1: Convert data to usable format
> # removes non-genotype data from matrix (i.e. lines 1-4)
> # subset 3 samples, 3 loci for testing
> SAMPS<-3
> locus <- Mydata[c(1:SAMPS), -c(1, 2, 3, 4, 5+SAMPS:ncol(Mydata))]
> locus
> #                       Locus1     Locus2         Locus3
> # Samp1               GG           <NA>           GG
> # Samp2               AG             CA              GA
> # Samp3               AG             CA              GG
>
> # converts geno matrix to genind object (adegenet)
> Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind[1:SAMPS], pop =
> population[1:SAMPS], sep="")
> Mydata1$tab # get stats on newly created genind object
> #                      Locus1.G     Locus1.A    Locus2.C    Locus2.A
> Locus3.G    Locus3.A
> # Samp1                  2                  0               NA
>  NA             2                 0
> # Samp2                  1                  1                1
>     1               1                 1
> # Samp3                  1                  1                1
>     1               2                 0
>
> ### Section 2: Individual genetic distance: euclidean distance (dist
> {adegenet})
>
> # Test dist() function
> # collect euclidean genetic distances
> distgenEUCL <- dist(Mydata1, method = "euclidean", diag = FALSE, upper =
> FALSE, p = 2)
> distgenEUCL
> #                     Samp1              Samp2
> # Samp2         2.449490
> # Samp3         1.732051           1.414214
>
> x<-as.matrix(dist(distgenEUCL))
> x
> #                   Samp1               Samp2             Samp3
> # Samp1       0.000000           3.478652          2.659285
> # Samp2       3.478652           0.000000          2.124787
> # Samp3       2.659285           2.124787          0.000000
> >>>
>
>
>
> I have checked several forums and have been unable to resolve this
> discrepancy. Any and all help will be much appreciated!
>
>
> Thank you!
>
> Cheyenne
>
>
> Cheyenne Payne
>
> Bioinformatics Technician
>
> Palumbi Lab | Hopkins Marine Station
>
> (714) 200-5897
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

	[[alternative HTML version deleted]]