[R] Help with Kmeans output and using broom to tidy etc..
Poling, William
PolingW at aetna.com
Tue May 12 18:10:55 CEST 2020
Hello Eric, thank you so much for your consideration.
Here are snippets of the data that I hope will be helpful.
WHP
geo1a <- geo1[, 2:5]  # drop ID, which is not useful for my purposes anyway
#This is for R-Help use
geo1a <- geo1a %>% top_n(25)
state city latitude longitude
1 ME FAIRFIELD 44.64485 -69.65948
2 ME JONESPORT 44.57935 -67.56743
3 ME CASWELL 46.97529 -67.83023
4 ME ELLSWORTH 44.52916 -68.38717
5 ME VASSALBORO 44.45095 -69.60629
6 ME UNION 44.20059 -69.26123
7 ME PALERMO 44.45142 -69.41115
8 ME ORONO 44.87426 -68.68327
9 ME SANGERVILLE 45.10138 -69.33580
10 ME ISLESBORO 44.29015 -68.90812
11 ME TOPSHAM 43.93600 -69.96565
12 ME FREEPORT 43.84089 -70.11160
13 ME SKOWHEGAN 44.76687 -69.71644
14 ME MILLINOCKET 45.65501 -68.70261
15 ME ORRINGTON 44.72417 -68.74026
16 ME ST. GEORGE 43.96726 -69.20827
17 ME FORT FAIRFIELD 46.80911 -67.88079
18 ME MARS HILL 46.56580 -67.89006
19 ME FREEPORT 43.85302 -70.03726
20 ME EASTON 46.64143 -67.91203
21 ME WATERVILLE 44.53621 -69.65913
22 ME BRUNSWICK 43.87771 -69.96297
23 ME BRUNSWICK 43.91719 -69.89905
24 ME BUCKSPORT 44.60665 -68.81892
25 ME FAYETTE 44.46380 -70.12047
trnd1_tbla <- trnd1_tbl %>% top_n(25)
print(trnd1_tbla)
head(trnd1_tbla,n=25)
# A tibble: 25 x 5
city state Basecountsum Basecount2 prop_of_total
<fct> <fct> <dbl> <dbl> <dbl>
1 ATLANTA GA 2352 12 0.00510
2 BRADENTON FL 2352 8 0.00340
3 BROOKLYN NY 2352 30 0.0128
4 CHARLOTTE NC 2352 8 0.00340
5 CHICAGO IL 2352 17 0.00723
6 COLUMBUS OH 2352 11 0.00468
7 CUMMING GA 2352 8 0.00340
8 DALLAS TX 2352 8 0.00340
9 ERIE PA 2352 12 0.00510
10 HOUSTON TX 2352 12 0.00510
# ... with 15 more rows
WHP
From: Eric Berger <ericjberger using gmail.com>
Sent: Tuesday, May 12, 2020 8:39 AM
To: Poling, William <PolingW using aetna.com>
Cc: r-help using r-project.org
Subject: [EXTERNAL] Re: [R] Help with Kmeans output and using broom to tidy etc..
Can you create a reproducible example?
Your question involves objects that are unknown to us. (geo1, trnd1_tbl)
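A reproducible example of the kind requested here is commonly shared with dput(), which prints R code that rebuilds an object. The following is only a sketch: geo_small is a hypothetical three-row stand-in for geo1, with values copied from the printout in this thread.

```r
# Hypothetical three-row stand-in for geo1 (values taken from the printout in this thread)
geo_small <- data.frame(
  state = c("ME", "ME", "ME"),
  city = c("FAIRFIELD", "JONESPORT", "CASWELL"),
  latitude = c(44.64485, 44.57935, 46.97529),
  longitude = c(-69.65948, -67.56743, -67.83023),
  stringsAsFactors = TRUE
)

# dput() prints R code that recreates the object; that output can be pasted into an email
dput(geo_small)

# Round trip: deparse the object to text, then evaluate the text to rebuild it
rebuilt <- eval(parse(text = paste(deparse(geo_small), collapse = "\n")))
```

Readers can then assign the pasted dput() output to a name and run the rest of the code against it.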
On Tue, May 12, 2020 at 2:41 PM Poling, William via R-help <mailto:r-help using r-project.org> wrote:
# RStudio Version 1.2.1335 (need this one --> 1.2.5019)
sessionInfo()
# R version 4.0.0 Patched (2020-05-03 r78349)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#Running under: Windows 10 x64 (build 17763)
Hello:
I have data that I am trying to manipulate for Kmeans clustering.
The original data look like this:
str(geo1)
# 'data.frame': 2352 obs. of 5 variables:
# $ ID: Factor w/ 2352 levels "101040199600",..: 590 908 976 509 1674 690 1336 86 726 1702 ...
# $ state : Factor w/ 41 levels "AL","AR","AZ",..: 32 10 25 11 9 32 13 31 12 12 ...
# $ city : Factor w/ 1337 levels "ABBOTTSTOWN",..: 932 156 230 698 965 1330 515 727 1127 1304 ...
# $ latitude : num 40.4 31.2 40.8 42.1 26.8 ...
# $ longitude : num -79.9 -81.5 -74 -91.6 -82.1 ...
I created a subset adding column prop_of_total
str(trnd1_tbl)
tibble [1,457 x 5] (S3: tbl_df/tbl/data.frame)
$ city : Factor w/ 1337 levels "ABBOTTSTOWN",..: 1 2 3 4 5 6 7 8 9 10 ...
$ state : Factor w/ 41 levels "AL","AR","AZ",..: 32 36 10 28 12 36 10 11 26 38 ...
$ Basecountsum : num [1:1457] 2352 2352 2352 2352 2352 ...
$ Basecount2 : num [1:1457] 1 1 1 1 1 2 1 1 2 1 ...
$ prop_of_total: num [1:1457] 0.000425 0.000425 0.000425 0.000425 0.000425 ...
Then I spread it
trnd2_tbl <- trnd1_tbl %>%
  dplyr::select(city, state, prop_of_total) %>%
  spread(key = city, value = prop_of_total, fill = 0)  # replace the NAs with fill
str(trnd2_tbl)  # tibble [41 x 1,338] (S3: tbl_df/tbl/data.frame)
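The effect of the spread step on the table's shape can be sketched on a toy table. long_tbl below is a hypothetical stand-in for trnd1_tbl with made-up rows; only the column names follow the post.

```r
library(tidyr)

# Hypothetical long table: one row per (city, state) pair, like trnd1_tbl
long_tbl <- data.frame(
  city = c("ATLANTA", "CUMMING", "DALLAS", "HOUSTON"),
  state = c("GA", "GA", "TX", "TX"),
  prop_of_total = c(0.00510, 0.00340, 0.00340, 0.00510)
)

# spread() turns each distinct city into its own column, leaving one row per state
wide_tbl <- spread(long_tbl, key = city, value = prop_of_total, fill = 0)

names(wide_tbl)               # "state" plus one column per city
"city" %in% names(wide_tbl)   # FALSE: no column named city remains
```

After the spread, the rows are states and the city names live in the column headers. The tidyr documentation lists spread() as superseded by pivot_wider(), which should give the same shape.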
Then I run a Kmeans
kmeans_obj1 <- trnd2_tbl %>%
  dplyr::select(-state) %>%
  kmeans(centers = 20, nstart = 100)
str(kmeans_obj1)
List of 9
$ cluster : int [1:41] 11 11 9 11 11 4 11 11 16 2 ...
$ centers : num [1:20, 1:1337] 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:20] "1" "2" "3" "4" ...
.. ..$ : chr [1:1337] "ABBOTTSTOWN" "ABILENE" "ACWORTH" "ADAMS" ...
$ totss : num 0.00158
$ withinss : num [1:20] 0 0 0 0 0 0 0 0 0 0 ...
$ tot.withinss: num 0.0000848
$ betweenss : num 0.0015
$ size : int [1:20] 1 1 1 1 1 1 1 1 1 1 ...
$ iter : int 3
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
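kmeans() clusters the rows of its input, so on the spread table each state (not each city) receives a cluster label. A minimal sketch on a hypothetical six-row wide table, with invented column names and values:

```r
set.seed(1)  # kmeans uses random starts; fix the seed for repeatability

# Hypothetical wide table: 6 "states" described by 3 "city" proportion columns
wide_tbl <- data.frame(
  state = c("A", "B", "C", "D", "E", "F"),
  CITY1 = c(0.010, 0.011, 0.000, 0.000, 0.000, 0.001),
  CITY2 = c(0.000, 0.000, 0.020, 0.019, 0.000, 0.000),
  CITY3 = c(0.000, 0.001, 0.000, 0.000, 0.030, 0.029)
)

# Cluster on the numeric columns only; the number of centers must be
# smaller than the number of distinct rows
km <- kmeans(wide_tbl[, -1], centers = 3, nstart = 10)

km$size                                                  # rows (states) per cluster
data.frame(state = wide_tbl$state, cluster = km$cluster) # one label per state
```

The length of km$cluster equals the number of rows, which matches the int [1:41] shown in the str() output above.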
Then I go and try to tidy:
# Tidy, glance, augment
# These just make it easier to use or view the components of the kmeans object
broom::tidy(kmeans_obj1) %>% glimpse()
broom::glance(kmeans_obj1)
##A tibble: 1 x 4
# totss tot.withinss betweenss iter
# <dbl> <dbl> <dbl> <int>
# 1 0.00158 0.0000848 0.00150 3
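For orientation, here is a small sketch of what tidy() and glance() return for a kmeans fit. The data frame x is hypothetical and unrelated to the data in this thread.

```r
library(broom)

set.seed(1)  # fix the random starts

# Hypothetical numeric data: 6 rows, 2 columns
x <- data.frame(
  a = c(1, 1.1, 5, 5.1, 9, 9.1),
  b = c(2, 2.1, 6, 6.1, 1, 1.1)
)
km <- kmeans(x, centers = 3, nstart = 10)

per_cluster <- tidy(km)    # one row per cluster: center coordinates, size, withinss, cluster
overall <- glance(km)      # one-row summary: totss, tot.withinss, betweenss, iter

per_cluster
overall
```

tidy() summarizes clusters and glance() summarizes the whole fit; per-row cluster labels come from augment().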
However, when I run this piece I get an error:
broom::augment(kmeans_obj1, trnd2_tbl) %>%
  dplyr::select(city, .cluster)
# Error: Must subset columns with a valid subscript vector.
# The subscript has the wrong type `data.frame<
#   u: double
#   x: double
# >`.
# i It must be numeric or character.
Here is the back trace:
rlang::last_error()
# Backtrace:
# 1. broom::augment(kmeans_obj1, trnd2_tbl)
# 9. dplyr::select(., city, .cluster)
# 11. tidyselect::vars_select(tbl_vars(.data), !!!enquos(...))
# 12. tidyselect:::eval_select_impl(...)
# 20. tidyselect:::vars_select_eval(...)
# 21. tidyselect:::walk_data_tree(expr, data_mask, context_mask)
# 22. tidyselect:::eval_c(expr, data_mask, context_mask)
# 23. tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
# 24. tidyselect:::walk_data_tree(new, data_mask, context_mask)
# 25. tidyselect:::as_indices_sel_impl(...)
# 26. tidyselect:::as_indices_impl(x, vars, strict = strict)
# 27. vctrs::vec_as_subscript(x, logical = "error")
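One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: after the spread, trnd2_tbl holds a state column plus one column per city, so there is no column named city for select() to find. The data.frame<u, x> in the message suggests that the name city was then resolved to a data frame elsewhere on the search path (for instance, the boot package ships a data set called city with columns u and x, though whether boot was attached in this session is not known). The sketch below uses hypothetical toy data and selects state instead.

```r
library(dplyr)
library(tidyr)
library(broom)

set.seed(1)  # fix the random starts

# Hypothetical long table standing in for trnd1_tbl
long_tbl <- data.frame(
  city = c("ATLANTA", "CUMMING", "DALLAS", "HOUSTON", "BROOKLYN", "CHICAGO"),
  state = c("GA", "GA", "TX", "TX", "NY", "IL"),
  prop_of_total = c(0.00510, 0.00340, 0.00340, 0.00510, 0.0128, 0.00723)
)

# Same shape of pipeline as in the post: spread, drop state, cluster the rows
wide_tbl <- spread(long_tbl, key = city, value = prop_of_total, fill = 0)
km <- wide_tbl %>%
  select(-state) %>%
  kmeans(centers = 2, nstart = 10)

# augment() appends .cluster to the data it is given; that data has state, not city
state_clusters <- augment(km, wide_tbl) %>%
  select(state, .cluster)

state_clusters
```

If the goal is to cluster cities rather than states, the table would presumably need to be spread the other way (key = state), so that each row is a city.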
I am not sure what I am supposed to fix.
Maybe someone has had a similar error and can advise me, please?
Thank you.
WHP