[R] stats 'dist' euclidean distance calculation

Cheyenne Yancey Payne cypayne at stanford.edu
Thu Mar 15 01:11:43 CET 2018


Hello,

I am working with a matrix of multilocus genotypes for ~180 individual snail samples, with substantial missing data. I am trying to calculate the pairwise genetic distance between individuals using the stats package 'dist' function, using euclidean distance. I took a subset of this dataset (3 samples x 3 loci) to test how euclidean distance is calculated:

3x3 subset used
                         Locus1     Locus2         Locus3
Samp1               GG           <NA>           GG
Samp2               AG             CA              GA
Samp3               AG             CA              GG

The euclidean distance function is defined as: sqrt(sum((x_i - y_i)^2))
My assumption was that the difference between x_i and y_i would be the number of allelic differences at each base pair site between samples. For example, the euclidean distance between Samp1 and Samp2 would be:

euclidean distance = sqrt( S1_L1 - S2_L1)^2 + (S1_L2 - S2_L2)^2 + (S1_L3 - S2_L3)^2 )
at locus 1: GG - AG --> one basepair difference --> (1)^2 = 1
at locus 2: <NA> - CA --> two basepair differences --> (2)^2 = 4
at locus 3: GG - GA --> one basepair difference --> (1)^2 = 1

euclidean distance = sqrt( 1 + 4 + 1 ) = sqrt( 6 ) = 2.44940

Calculating euclidean distances this way, the distance matrix should be:
#                   Samp1               Samp2             Samp3
# Samp1       0.000000           2.449400          2.236068
# Samp2       2.449400           0.000000          1.000000
# Samp3       2.236068           1.000000          0.000000

However, this distance matrix differs from the one calculated by the R stats package 'dist' function:
#                   Samp1               Samp2             Samp3
# Samp1       0.000000           3.478652          2.659285
# Samp2       3.478652           0.000000          2.124787
# Samp3       2.659285           2.124787          0.000000

I used the following code (with intermediate output) to achieve the latter distance matrix:


>>>
setwd("~/Desktop/R_stuff")

### Data Prep: load and collect info from matrix file
infile<-'~/Desktop/R_stuff/good_conch_mplex_03052018.txt'
Mydata <- read.table(infile, header = TRUE, check.names = FALSE)
dim(Mydata) # dimensions of data.frame
ind <- as.character(Mydata$sample) # individual labels
population <- as.character(Mydata$region) # population labels
location <- Mydata$location

### Section 1: Convert data to usable format
# removes non-genotype data from matrix (i.e. lines 1-4)
# subset 3 samples, 3 loci for testing
SAMPS<-3
locus <- Mydata[c(1:SAMPS), -c(1, 2, 3, 4, 5+SAMPS:ncol(Mydata))]
locus
#                       Locus1     Locus2         Locus3
# Samp1               GG           <NA>           GG
# Samp2               AG             CA              GA
# Samp3               AG             CA              GG

# converts geno matrix to genind object (adegenet)
Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind[1:SAMPS], pop = population[1:SAMPS], sep="")
Mydata1$tab # get stats on newly created genind object
#                      Locus1.G     Locus1.A    Locus2.C    Locus2.A    Locus3.G    Locus3.A
# Samp1                  2                  0               NA             NA             2                 0
# Samp2                  1                  1                1                1               1                 1
# Samp3                  1                  1                1                1               2                 0

### Section 2: Individual genetic distance: euclidean distance (dist {adegenet})

# Test dist() function
# collect euclidean genetic distances
distgenEUCL <- dist(Mydata1, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
distgenEUCL
#                     Samp1              Samp2
# Samp2         2.449490
# Samp3         1.732051           1.414214

x<-as.matrix(dist(distgenEUCL))
x
#                   Samp1               Samp2             Samp3
# Samp1       0.000000           3.478652          2.659285
# Samp2       3.478652           0.000000          2.124787
# Samp3       2.659285           2.124787          0.000000
>>>



I have checked several forums and have been unable to resolve this discrepancy. Any and all help will be much appreciated!


Thank you!

Cheyenne


Cheyenne Payne

Bioinformatics Technician

Palumbi Lab | Hopkins Marine Station

(714) 200-5897

	[[alternative HTML version deleted]]



More information about the R-help mailing list