[R] Creating a log-transformed histogram of multiclass data
Tom Woolman
twoo|m@n @end|ng |rom ont@rgettek@com
Wed Aug 4 01:04:29 CEST 2021
Apologies, I left out 3 critical lines of code after the randomized
sample dataframe is created:
group_a <- d[ which(d$label =='A'), ]
group_b <- d[ which(d$label =='B'), ]
group_c <- d[ which(d$label =='C'), ]
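
For reference, here is a minimal sketch of the plot I am trying to reach: the fill-by-label mapping from the first histogram in my original message combined with the log2 scales from the per-group plots. It has not been checked against my real data. The seed, the binwidth of 0.25 (measured in log2 units of the score) and the object name p_log are assumptions made only for this illustration, and theme_ipsum() is left out so that the sketch needs only ggplot2.

```r
library(ggplot2)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible

# Same shape as the sample data in the original message
d <- data.frame(
  label = sample(LETTERS[1:3], 300, replace = TRUE,
                 prob = c(0.65, 0.30, 0.05)),
  anomaly_score = round(runif(300, 0.001, 0.999), 3)
)

# One histogram, three groups by fill colour, both axes on a log2 scale.
# position = "identity" overlays the groups instead of stacking them.
# binwidth is applied on the transformed (log2) x scale; 0.25 is an
# assumed value chosen for this small sample.
p_log <- ggplot(d, aes(x = anomaly_score, fill = label)) +
  geom_histogram(binwidth = 0.25, colour = "#e9ecef", alpha = 0.6,
                 position = "identity") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans = "log2") +
  scale_y_continuous(name = "Log-transformed Frequency Counts",
                     trans = "log2") +
  scale_fill_manual(values = c("#69b3a2", "blue", "#404080")) +
  labs(fill = "")

p_log
```

Bins with a count of zero have no finite log2 value, so ggplot2 may emit warnings about infinite values when this is drawn. If the overlaid bars turn out to be hard to read, adding facet_wrap(~ label) to the same plot object would give one panel per group on shared log axes.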
On 2021-08-03 18:56, Tom Woolman wrote:
> # Resending this message since the original email was held in queue by
> the listserv software because of a "suspicious" subject line, and/or
> because of attached .png histogram chart attachments. I'm guessing
> that the listserv software doesn't like multiple image file
> attachments.
>
>
> Hi everyone. I'm working on a research model now that is calculating
> anomaly scores (RMSE values) for three distinct groups within a large
> dataset. The anomaly scores are a continuous data type and are quite
> small, ranging from approximately 1e-04 to 1e-07 across a population
> of approximately 1 million observations.
>
> I have all of the summary and descriptive statistics for each of the
> anomaly score distributions across each group label in the dataset,
> and I am able to create some useful histograms showing how each of the
> three groups is uniquely distributed across the range of scores.
> However, because of the large variance within the frequency of score
> values and the high density peaks within much of the anomaly scores, I
> need to use a log transformation within the histogram to show both the
> log frequency count of each binned observation range (y-axis) and a
> log transformation of the binned score values (x-axis) to be able to
> appropriately illustrate the distributions within the data and make it
> more readily understandable.
>
> Fortunately, ggplot2 is well suited to creating attractive histograms
> with log-transformed x and y axes.
>
> However, I cannot figure out a way to create the log transformed
> histograms to show each of my three groups by color within the same
> histogram. I would want it to look like this, BUT use a log
> transformation for each axis. This plot below shows the 3 groups in
> one histogram but uses the default normal values.
>
> For log transformed axis values, the best I can do so far is produce
> three separate histograms, one for each group.
>
>
>
> Below is sample R code to illustrate my problem with a
> randomly-generated example dataset and the ggplot2 approaches that I
> have taken so far:
>
> # Sample R code below:
>
> library(ggplot2)
> library(dplyr)
> library(hrbrthemes)
>
> # I created some simple random sample data to produce an example
> # dataset.
> # This produces an example data frame called d, which contains a class
> # label (the independent variable, IV) of either A, B or C for each
> # observation. The target variable (DV) is the continuous anomaly_score
> # value for each observation.
> # There are 300 rows of dummy data in this data frame.
>
> DV_score_generator = round(runif(300,0.001,0.999), 3)
> d <- data.frame( label = sample( LETTERS[1:3], 300, replace=TRUE,
> prob=c(0.65, 0.30, 0.05) ), anomaly_score = DV_score_generator)
>
> # First, I use ggplot to create the untransformed (linear-scale)
> # histogram that shows all 3 groups on the same plot, by color.
> # Please note that with this small set of randomized sample data it
> # doesn't appear to be necessary to use an x and y-axis log
> # transformation to show the distribution patterns, but it does become
> # an issue with the vastly larger and more complex score values in the
> # DV of the actual data.
>
> p <- d %>%
> ggplot( aes(x=anomaly_score, fill=label)) +
> geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
> scale_fill_manual(values=c("#69b3a2", "blue", "#404080")) +
> theme_ipsum() +
> labs(fill="")
>
> p
>
> # Produces an untransformed multiclass histogram.
>
>
>
> # Now produce a series of x and y-axis log-transformed histograms,
> # producing one histogram for each distinct label class in the dataset:
>
>
> # Group A, log transformed
>
> ggplot(group_a, aes(x = anomaly_score)) +
>   geom_histogram(aes(y = ..count..), binwidth = 0.05,
>                  colour = "darkgoldenrod1", fill = "darkgoldenrod2") +
>   scale_x_continuous(name = "Log-scale Anomaly Score", trans = "log2") +
>   scale_y_continuous(name = "Log-transformed Frequency Counts",
>                      trans = "log2") +
>   ggtitle("Transformed Anomaly Scores - Group A Only")
>
>
> # Group A transformed histogram is produced here.
>
>
>
> # Group B, log transformed
>
> ggplot(group_b, aes(x = anomaly_score)) +
>   geom_histogram(aes(y = ..count..), binwidth = 0.05,
>                  colour = "green", fill = "darkgreen") +
>   scale_x_continuous(name = "Log-scale Anomaly Score", trans = "log2") +
>   scale_y_continuous(name = "Log-transformed Frequency Counts",
>                      trans = "log2") +
>   ggtitle("Transformed Anomaly Scores - Group B Only")
>
> # Group B transformed histogram is produced here.
>
>
>
> # Group C, log transformed
>
> ggplot(group_c, aes(x = anomaly_score)) +
>   geom_histogram(aes(y = ..count..), binwidth = 0.05,
>                  colour = "red", fill = "darkred") +
>   scale_x_continuous(name = "Log-scale Anomaly Score", trans = "log2") +
>   scale_y_continuous(name = "Log-transformed Frequency Counts",
>                      trans = "log2") +
>   ggtitle("Transformed Anomaly Scores - Group C Only")
>
> # Group C transformed histogram is produced here.
>
>
> # End.
>
>
>
> Thanks in advance, everyone!
>
>
> - Tom
>
>
> Thomas A. Woolman, PhD Candidate (Indiana State University), MBA, MS,
> MS
> On Target Technologies, Inc.
> Virginia, USA
>