[R-sig-eco] FW: Calculating percentile rank of sample dataset compared to reference dataset in R

Thu Aug 22 09:23:05 CEST 2019

Hi,

sapply(c("iron", "nitrate"), function(x) round(approx(y =
1:nrow(df_ref), x = df_ref[, x], xout = df_sample[, x])$y/10))

should do the trick, with stats::approx() (shipped with base R) as the workhorse.

You need to replace the /10 with nrow(df_ref)/100, i.e. the number of
reference rows per percentile point (e.g. if there are only 500 rows,
divide by 5).
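
A minimal sketch of that generalisation, assuming the example data from the original question. The set.seed() call and the names rows_per_percent and pct_rank are mine, added only for illustration. Sorting the reference column first is an extra safeguard, since the row index only equals the rank when the column is in ascending order:

```r
# Example data as in the original question (the seed is an assumption, for reproducibility)
set.seed(1)
df_sample <- data.frame(site = letters[1:10],
                        iron = runif(10, min = 0, max = 1),
                        nitrate = runif(10, min = 0, max = 10))
df_ref <- data.frame(iron = seq(0, 1, length.out = 1000),
                     nitrate = seq(0, 10, length.out = 1000))

# Reference rows per percentile point: 10 for 1000 rows, 5 for 500 rows
rows_per_percent <- nrow(df_ref) / 100

pct_rank <- sapply(c("iron", "nitrate"), function(x) {
  ref <- sort(df_ref[, x])
  # Interpolated position of each sample value within the sorted reference column;
  # values outside the reference range come back as NA (approx() default, rule = 1)
  pos <- approx(x = ref, y = seq_along(ref), xout = df_sample[, x])$y
  round(pos / rows_per_percent)
})

pct_rank
```

The result is a matrix with one row per site and one column per parameter, so it can be attached to df_sample with cbind() if that is more convenient.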

The result differs slightly from Ákos's solution: a value of 0.2651 is
assigned percentile rank 27 instead of 26, because round() goes to the
nearest percentile whereas findInterval() effectively rounds down.
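
To see where that one-point difference comes from, here is a small check on the single value 0.2651, using the iron reference column from the example data. It is only a sketch; the variable names are mine, and the ecdf() line is added for comparison with Jonas's suggestion:

```r
# Reference column from the example data in the original question
ref <- seq(0, 1, length.out = 1000)
value <- 0.2651

# approx(): interpolated row position, scaled to 0-100 and rounded to the nearest integer
rank_approx <- round(approx(x = ref, y = seq_along(ref), xout = value)$y / 10)

# quantile() + findInterval(): number of percentile boundaries at or below the value
boundaries <- quantile(ref, probs = seq(from = 0, to = 1, length.out = 101)[-1])
rank_interval <- findInterval(x = value, vec = boundaries)

# ecdf(): share of reference values at or below the value, as a percentage
rank_ecdf <- 100 * ecdf(ref)(value)

c(approx = rank_approx, findInterval = rank_interval, ecdf = rank_ecdf)
```

On this reference column the three numbers should be 27, 26 and 26.5, so the approaches agree up to how the fractional part is handled.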

Cheers!

On Thu, 22 Aug 2019 at 08:29, Glatthorn, Jonas <jglatth using gwdg.de> wrote:

> Dear Matt,
>
> I believe the ecdf() function can also do what you are looking for:
>
> ref_ecdf <- sapply(df_ref, FUN = ecdf)
>
> and then apply each function in ref_ecdf to the corresponding column in
> df_sample, either with a for loop or (my preference) using functionals:
>
> df_sample_rank <- purrr::map2_dfc(ref_ecdf, purrr::map(df_sample[-1],
> list), do.call)
>
> all the best
>
> Jonas
>
> -----Original Message-----
> From: R-sig-ecology <r-sig-ecology-bounces using r-project.org> On Behalf Of
> Bede-Fazekas Ákos
> Sent: Thursday, 22 August 2019 08:01
> To: r-sig-ecology using r-project.org
> Subject: Re: [R-sig-eco] FW: Calculating percentile rank of sample dataset
> compared to reference dataset in R
>
> Dear Matthew,
>
> here is one, maybe not the fastest/shortest, solution:
> percentiles <- apply(X = df_ref, MARGIN = 2, FUN = quantile,
>                      probs = seq(from = 0, to = 1, length.out = 101)[-1])
> df_sample$percentile_rank <- vapply(
>   X = colnames(df_sample)[-1],
>   FUN.VALUE = numeric(nrow(df_sample)),
>   FUN = function(variable_name) findInterval(
>     x = df_sample[, variable_name, drop = TRUE],
>     vec = percentiles[, variable_name, drop = TRUE]))
>
> HTH,
> Ákos Bede-Fazekas
> Hungarian Academy of Sciences
>
> On 2019-08-22 at 0:54, Shank, Matthew wrote:
> > Hello R-sig-ecology mailing list,
> >
> >
> >
> > I’m working on a multivariate water quality index where the
> > concentration of parameter i at site j is normalized by calculating the
> > percentile rank of the value using a much larger reference dataset.
> >
> >
> >
> > As an example, I have generated a sample dataset of water quality
> parameters (df_sample) and a larger reference dataset (df_ref). I’d like to
> calculate the percentile rank of each parameter, at each site, using a
> reference dataset of a much larger size.
> >
> >
> >
> > Example data is below. If anyone has a solution that avoids for loops
> that would be preferred.
> >
> >
> >
> >
> >
> > #generate sample data
> >
> > df_sample <- data.frame(site = letters[1:10], iron = runif(10, min=0,
> > max=1), nitrate = runif(10, min=0, max=10))
> >
> > df_sample
> >
> >
> >
> >
> >
> > #generate reference dataset
> >
> > df_ref <- data.frame(iron = seq(0, 1, length.out = 1000), nitrate =
> > seq(0, 10, length.out = 1000))
> >
> > df_ref
> >
> > # now would like to calculate percentile rank of iron and nitrate at
> > all sites (a:j) based on identical columns in df_ref and include as a
> > new column in df_sample
> >
> >
> >
> > Many thanks,
> > |><̮Mâ̬tt͵)o>
> >
> >
> >       [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > R-sig-ecology mailing list
> > R-sig-ecology using r-project.org
> > https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>