[R] stats 'dist' euclidean distance calculation
Eric Berger
ericjberger at gmail.com
Thu Mar 15 11:05:08 CET 2018
Hi Cheyenne,
I noticed one thing that might be helpful to you.
First, I took a shortcut to the case of interest:
> m <- matrix(c(2,1,1,0,1,1,NA,1,1,NA,1,1,2,1,2,0,1,0),nrow=3)
> colnames(m) <- c("1.G","1.A","2.C","2.A","3.G","3.A")
> m
# 1.G 1.A 2.C 2.A 3.G 3.A
# [1,] 2 0 NA NA 2 0
# [2,] 1 1 1 1 1 1
# [3,] 1 1 1 1 2 0
Computing the distance between the different rows by hand - TREATING THE
NA's AS ZERO -
would give:
dist(row1,row2) = sqrt( 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2) = sqrt(6) = 2.45
dist(row1,row3) = sqrt( 1^2 + 1^2 + 1^2 + 1^2 + 0^2 + 0^2) = sqrt(4) = 2
dist(row2,row3) = sqrt( 0^2 + 0^2 + 0^2 + 0^2 + 1^2 + 1^2) = sqrt(2) =
1.414
Doing the same calculation with the dist() function gives
> dist(m)
# 1 2
# 2 2.45
# 3 1.73 1.414
i.e. the results match with the manual calculation for dist(row1,row2) and
dist(row2,row3).
However for dist(row1,row3) which should be 2, the dist function gives 1.73
= sqrt(3).
Clearly sqrt(3) makes no sense since |1 - NA|^2 appears twice. Either both
times it should get a value of 1 or neither time. Why only once? Not clear
to me and I did not see any hints on ?dist.
However if you replace the NA's by actual 0's, which seems to be your
preferred methodology, then the problem is "solved", i.e.
> m2 <- m
> m2[1,][is.na(m2[1,])] <- 0
> dist(m2)
# 1 2
# 2 2.45
# 3 2 1.414
HTH,
Eric
On Thu, Mar 15, 2018 at 2:11 AM, Cheyenne Yancey Payne <cypayne at stanford.edu
> wrote:
> Hello,
>
> I am working with a matrix of multilocus genotypes for ~180 individual
> snail samples, with substantial missing data. I am trying to calculate the
> pairwise genetic distance between individuals using the stats package
> 'dist' function, using euclidean distance. I took a subset of this dataset
> (3 samples x 3 loci) to test how euclidean distance is calculated:
>
> 3x3 subset used
> Locus1 Locus2 Locus3
> Samp1 GG <NA> GG
> Samp2 AG CA GA
> Samp3 AG CA GG
>
> The euclidean distance function is defined as: sqrt(sum((x_i - y_i)^2))
> My assumption was that the difference between x_i and y_i would be the
> number of allelic differences at each base pair site between samples. For
> example, the euclidean distance between Samp1 and Samp2 would be:
>
> euclidean distance = sqrt( S1_L1 - S2_L1)^2 + (S1_L2 - S2_L2)^2 + (S1_L3 -
> S2_L3)^2 )
> at locus 1: GG - AG --> one basepair difference --> (1)^2 = 1
> at locus 2: <NA> - CA --> two basepair differences --> (2)^2 = 4
> at locus 3: GG - GA --> one basepair difference --> (1)^2 = 1
>
> euclidean distance = sqrt( 1 + 4 + 1 ) = sqrt( 6 ) = 2.44940
>
> Calculating euclidean distances this way, the distance matrix should be:
> # Samp1 Samp2 Samp3
> # Samp1 0.000000 2.449400 2.236068
> # Samp2 2.449400 0.000000 1.000000
> # Samp3 2.236068 1.000000 0.000000
>
> However, this distance matrix differs from the one calculated by the R
> stats package 'dist' function:
> # Samp1 Samp2 Samp3
> # Samp1 0.000000 3.478652 2.659285
> # Samp2 3.478652 0.000000 2.124787
> # Samp3 2.659285 2.124787 0.000000
>
> I used the following code (with intermediate output) to achieve the latter
> distance matrix:
>
>
> >>>
> setwd("~/Desktop/R_stuff")
>
> ### Data Prep: load and collect info from matrix file
> infile<-'~/Desktop/R_stuff/good_conch_mplex_03052018.txt'
> Mydata <- read.table(infile, header = TRUE, check.names = FALSE)
> dim(Mydata) # dimensions of data.frame
> ind <- as.character(Mydata$sample) # individual labels
> population <- as.character(Mydata$region) # population labels
> location <- Mydata$location
>
> ### Section 1: Convert data to usable format
> # removes non-genotype data from matrix (i.e. lines 1-4)
> # subset 3 samples, 3 loci for testing
> SAMPS<-3
> locus <- Mydata[c(1:SAMPS), -c(1, 2, 3, 4, 5+SAMPS:ncol(Mydata))]
> locus
> # Locus1 Locus2 Locus3
> # Samp1 GG <NA> GG
> # Samp2 AG CA GA
> # Samp3 AG CA GG
>
> # converts geno matrix to genind object (adegenet)
> Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind[1:SAMPS], pop =
> population[1:SAMPS], sep="")
> Mydata1$tab # get stats on newly created genind object
> # Locus1.G Locus1.A Locus2.C Locus2.A
> Locus3.G Locus3.A
> # Samp1 2 0 NA
> NA 2 0
> # Samp2 1 1 1
> 1 1 1
> # Samp3 1 1 1
> 1 2 0
>
> ### Section 2: Individual genetic distance: euclidean distance (dist
> {adegenet})
>
> # Test dist() function
> # collect euclidean genetic distances
> distgenEUCL <- dist(Mydata1, method = "euclidean", diag = FALSE, upper =
> FALSE, p = 2)
> distgenEUCL
> # Samp1 Samp2
> # Samp2 2.449490
> # Samp3 1.732051 1.414214
>
> x<-as.matrix(dist(distgenEUCL))
> x
> # Samp1 Samp2 Samp3
> # Samp1 0.000000 3.478652 2.659285
> # Samp2 3.478652 0.000000 2.124787
> # Samp3 2.659285 2.124787 0.000000
> >>>
>
>
>
> I have checked several forums and have been unable to resolve this
> discrepancy. Any and all help will be much appreciated!
>
>
> Thank you!
>
> Cheyenne
>
>
> Cheyenne Payne
>
> Bioinformatics Technician
>
> Palumbi Lab | Hopkins Marine Station
>
> (714) 200-5897
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
[[alternative HTML version deleted]]
More information about the R-help
mailing list