[R-sig-eco] FW: Calculating percentile rank of sample dataset compared to reference dataset in R

Glatthorn, Jonas jg|@tth @end|ng |rom gwdg@de
Thu Aug 22 08:28:05 CEST 2019


Dear Matt,

I believe the ecdf() function can also do what you are looking for:

ref_ecdf <- sapply(df_ref, FUN = ecdf)

Then apply each function in ref_ecdf to the corresponding column of df_sample, either with a for loop or (my preference) with functionals:

df_sample_rank <- purrr::map2_dfc(ref_ecdf, purrr::map(df_sample[-1], list), do.call)
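For anyone who prefers to stay in base R, the same idea can be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the seed, the helper vector params, the multiplication by 100 (ecdf() returns a proportion between 0 and 1, not a percentile) and the "_pct" column suffix are illustrative choices, not part of the original question.

```r
# Example data as in the original question; the seed is an assumption
# added only so that the sketch is reproducible.
set.seed(1)
df_sample <- data.frame(site = letters[1:10],
                        iron = runif(10, min = 0, max = 1),
                        nitrate = runif(10, min = 0, max = 10))
df_ref <- data.frame(iron = seq(0, 1, length.out = 1000),
                     nitrate = seq(0, 10, length.out = 1000))

# Parameters present in both data sets (hypothetical helper vector)
params <- intersect(names(df_sample), names(df_ref))

# One empirical CDF per reference column
ref_ecdf <- lapply(df_ref[params], ecdf)

# Evaluate each ECDF on the matching sample column.
# ecdf() returns a proportion in [0, 1]; multiply by 100 for a percentile rank.
ranks <- Map(function(f, x) 100 * f(x), ref_ecdf, df_sample[params])

# The "_pct" suffix is an illustrative naming choice
names(ranks) <- paste0(params, "_pct")
df_sample_rank <- cbind(df_sample, as.data.frame(ranks))
df_sample_rank
```

Matching the columns by name rather than by position should make this a little more robust if df_ref and df_sample do not list the parameters in the same order.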

all the best

Jonas

-----Original Message-----
From: R-sig-ecology <r-sig-ecology-bounces using r-project.org> On Behalf Of Bede-Fazekas Ákos
Sent: Thursday, 22 August 2019 08:01
To: r-sig-ecology using r-project.org
Subject: Re: [R-sig-eco] FW: Calculating percentile rank of sample dataset compared to reference dataset in R

Dear Matthew,

here is one, maybe not the fastest/shortest, solution:
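# 1st to 100th percentile of every reference column (one row per percentile)
percentiles <- apply(X = df_ref, MARGIN = 2, FUN = quantile,
                     probs = seq(from = 0, to = 1, length.out = 101)[-1])
# for each parameter, count how many reference percentiles each sample value reaches;
# the result is stored as a matrix column with one column per parameter
df_sample$percentile_rank <- vapply(
  X = colnames(df_sample)[-1],
  FUN.VALUE = numeric(nrow(df_sample)),
  FUN = function(variable_name) {
    findInterval(x = df_sample[, variable_name, drop = TRUE],
                 vec = percentiles[, variable_name, drop = TRUE])
  })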
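To see how findInterval() turns a value into a percentile rank, here is a small single-column sketch. The vectors ref, breaks and x are hypothetical and chosen only for illustration; ref mimics the nitrate column of the reference data in the original question.

```r
# Hypothetical reference values, mimicking the nitrate column of df_ref
ref <- seq(0, 10, length.out = 1000)

# 1st to 100th percentile of the reference values
breaks <- quantile(ref, probs = seq(from = 0, to = 1, length.out = 101)[-1])

# Illustrative sample values: below, inside and above the reference range
x <- c(-1, 2.5, 4.95, 20)

# findInterval() counts how many percentile breaks are <= each value, so
# values below the 1st percentile map to 0 and values at or above the
# reference maximum map to 100.
pr <- findInterval(x = x, vec = breaks)
pr
```

The result is therefore an integer between 0 and 100, whereas ecdf() gives a continuous proportion between 0 and 1.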

HTH,
Ákos Bede-Fazekas
Hungarian Academy of Sciences

On 2019.08.22 at 0:54, Shank, Matthew wrote:
> Hello R-sig-ecology mailing list,
>
>
>
> I’m working on a multivariate water quality index where the concentration of parameter i at site j is normalized by calculating the percentile rank of the value using a much larger reference dataset.
>
>
>
> As an example, I have generated a sample dataset of water quality parameters (df_sample) and a larger reference dataset (df_ref). I’d like to calculate the percentile rank of each parameter, at each site, using a reference dataset of a much larger size.
>
>
>
> Example data is below. If anyone has a solution that avoids for loops that would be preferred.
>
>
>
>
>
> #generate sample data
>
> df_sample <- data.frame(site = letters[1:10], iron = runif(10, min=0, 
> max=1), nitrate = runif(10, min=0, max=10))
>
> df_sample
>
>
>
>
>
> #generate reference dataset
>
> df_ref <- data.frame(iron = seq(0, 1, length.out = 1000), nitrate = 
> seq(0, 10, length.out = 1000))
>
> df_ref
>
> # now would like to calculate percentile rank of iron and nitrate at 
> all sites (a:j) based on identical columns in df_ref and include as a 
> new column in df_sample
>
>
>
> Many thanks,
> |><̮Mâ̬tt͵)o>
>
>
>
> _______________________________________________
> R-sig-ecology mailing list
> R-sig-ecology using r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
