[R] Help with Kmeans output and using broom to tidy etc..

Poling, William PolingW at aetna.com
Tue May 12 18:10:55 CEST 2020


Hello Eric, thank you so much for your consideration.

Here are snippets of data that I hope will be helpful

WHP 

geo1a <- geo1[, 2:5]  # drop ID, which is not useful for my purposes anyway

# This is for R-Help use
# (top_n() with no wt argument ranks by the last column, here longitude)
geo1a <- geo1a %>% top_n(25)

state           city latitude longitude
1     ME      FAIRFIELD 44.64485 -69.65948
2     ME      JONESPORT 44.57935 -67.56743
3     ME        CASWELL 46.97529 -67.83023
4     ME      ELLSWORTH 44.52916 -68.38717
5     ME     VASSALBORO 44.45095 -69.60629
6     ME          UNION 44.20059 -69.26123
7     ME        PALERMO 44.45142 -69.41115
8     ME          ORONO 44.87426 -68.68327
9     ME    SANGERVILLE 45.10138 -69.33580
10    ME      ISLESBORO 44.29015 -68.90812
11    ME        TOPSHAM 43.93600 -69.96565
12    ME       FREEPORT 43.84089 -70.11160
13    ME      SKOWHEGAN 44.76687 -69.71644
14    ME    MILLINOCKET 45.65501 -68.70261
15    ME      ORRINGTON 44.72417 -68.74026
16    ME     ST. GEORGE 43.96726 -69.20827
17    ME FORT FAIRFIELD 46.80911 -67.88079
18    ME      MARS HILL 46.56580 -67.89006
19    ME       FREEPORT 43.85302 -70.03726
20    ME         EASTON 46.64143 -67.91203
21    ME     WATERVILLE 44.53621 -69.65913
22    ME      BRUNSWICK 43.87771 -69.96297
23    ME      BRUNSWICK 43.91719 -69.89905
24    ME      BUCKSPORT 44.60665 -68.81892
25    ME        FAYETTE 44.46380 -70.12047


trnd1_tbla <- trnd1_tbl %>% top_n(25)
print(trnd1_tbla)
head(trnd1_tbla,n=25)

# A tibble: 25 x 5
   city      state Basecountsum Basecount2 prop_of_total
   <fct>     <fct>        <dbl>      <dbl>         <dbl>
 1 ATLANTA   GA            2352         12       0.00510
 2 BRADENTON FL            2352          8       0.00340
 3 BROOKLYN  NY            2352         30       0.0128 
 4 CHARLOTTE NC            2352          8       0.00340
 5 CHICAGO   IL            2352         17       0.00723
 6 COLUMBUS  OH            2352         11       0.00468
 7 CUMMING   GA            2352          8       0.00340
 8 DALLAS    TX            2352          8       0.00340
 9 ERIE      PA            2352         12       0.00510
10 HOUSTON   TX            2352         12       0.00510
# ... with 15 more rows
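
The samples above are printed output, which a reader would have to retype. As a minimal sketch of sharing the same kind of sample as runnable code, dput() can be used. The object geo_sample below is a hypothetical stand-in built from the first three rows of the geo1a printout; it is not the real geo1a.

```r
# Hypothetical stand-in for the first rows of geo1a (values copied from the printout)
geo_sample <- data.frame(
  state     = c("ME", "ME", "ME"),
  city      = c("FAIRFIELD", "JONESPORT", "CASWELL"),
  latitude  = c(44.64485, 44.57935, 46.97529),
  longitude = c(-69.65948, -67.56743, -67.83023),
  stringsAsFactors = TRUE
)

# dput() prints R code that recreates the object; this text can be pasted into an email
dput(geo_sample)

# Round trip: parse the dput() text back into an object and compare
geo_text <- paste(capture.output(dput(geo_sample)), collapse = "\n")
geo_copy <- eval(parse(text = geo_text))
all.equal(geo_sample, geo_copy)
```

For the real data, something like dput(head(geo1a, 25)) would give the first 25 rows in this form.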

WHP

From: Eric Berger <ericjberger using gmail.com> 
Sent: Tuesday, May 12, 2020 8:39 AM
To: Poling, William <PolingW using aetna.com>
Cc: r-help using r-project.org
Subject: [EXTERNAL] Re: [R] Help with Kmeans output and using broom to tidy etc..

Can you create a reproducible example? 
Your question involves objects that are unknown to us. (geo1, trnd1_tbl)

On Tue, May 12, 2020 at 2:41 PM Poling, William via R-help <r-help using r-project.org> wrote:
# RStudio version 1.2.1335 (need this one --> 1.2.5019)
sessionInfo() 
# R version 4.0.0 Patched (2020-05-03 r78349)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#Running under: Windows 10 x64 (build 17763)

Hello:

I have data that I am trying to manipulate for Kmeans clustering.

Original data looks like this

str(geo1) 
# 'data.frame': 2352 obs. of  5 variables:
# $ ID: Factor w/ 2352 levels "101040199600",..: 590 908 976 509 1674 690 1336 86 726 1702 ...
# $ state           : Factor w/ 41 levels "AL","AR","AZ",..: 32 10 25 11 9 32 13 31 12 12 ...
# $ city            : Factor w/ 1337 levels "ABBOTTSTOWN",..: 932 156 230 698 965 1330 515 727 1127 1304 ...
# $ latitude        : num  40.4 31.2 40.8 42.1 26.8 ...
# $ longitude       : num  -79.9 -81.5 -74 -91.6 -82.1 ...

I created a subset, adding the column prop_of_total:
str(trnd1_tbl)
tibble [1,457 x 5] (S3: tbl_df/tbl/data.frame)
 $ city         : Factor w/ 1337 levels "ABBOTTSTOWN",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ state        : Factor w/ 41 levels "AL","AR","AZ",..: 32 36 10 28 12 36 10 11 26 38 ...
 $ Basecountsum : num [1:1457] 2352 2352 2352 2352 2352 ...
 $ Basecount2   : num [1:1457] 1 1 1 1 1 2 1 1 2 1 ...
 $ prop_of_total: num [1:1457] 0.000425 0.000425 0.000425 0.000425 0.000425 ...


Then I spread it

trnd2_tbl <- trnd1_tbl %>% 
    dplyr::select(city, state, prop_of_total) %>% 
    spread(key = city, value = prop_of_total, fill = 0) #remove the NA's with fill

str(trnd2_tbl)  # tibble [41 x 1,338] (S3: tbl_df/tbl/data.frame)
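
To illustrate why the spread result is 41 x 1,338 (one row per state, one column per city, plus state), here is a small sketch on toy data. The data frame toy_long is hypothetical; its values are copied from a few rows of the trnd1_tbla printout.

```r
library(tidyr)

# Hypothetical long-format toy data: one row per (state, city) pair
toy_long <- data.frame(
  state         = c("GA", "GA", "NY", "TX"),
  city          = c("ATLANTA", "CUMMING", "BROOKLYN", "DALLAS"),
  prop_of_total = c(0.00510, 0.00340, 0.0128, 0.00340)
)

# Each distinct city becomes its own column; missing combinations are filled with 0
toy_wide <- spread(toy_long, key = city, value = prop_of_total, fill = 0)

dim(toy_wide)    # 3 states by (1 state column + 4 city columns)
names(toy_wide)  # the column names are now "state" and the city names
```

After this step the long-format columns city and prop_of_total no longer exist; their contents have moved into the column names and cell values.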

Then I run a Kmeans

kmeans_obj1 <- trnd2_tbl  %>% 
  dplyr::select(- state) %>% 
  kmeans(centers = 20, nstart = 100)

str(kmeans_obj1)
List of 9
 $ cluster     : int [1:41] 11 11 9 11 11 4 11 11 16 2 ...
 $ centers     : num [1:20, 1:1337] 0 0 0 0 0 0 0 0 0 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:20] "1" "2" "3" "4" ...
  .. ..$ : chr [1:1337] "ABBOTTSTOWN" "ABILENE" "ACWORTH" "ADAMS" ...
 $ totss       : num 0.00158
 $ withinss    : num [1:20] 0 0 0 0 0 0 0 0 0 0 ...
 $ tot.withinss: num 0.0000848
 $ betweenss   : num 0.0015
 $ size        : int [1:20] 1 1 1 1 1 1 1 1 1 1 ...
 $ iter        : int 3
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
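
As a reading aid for the str() output above, here is a minimal sketch of the same components on simulated data (toy_mat and toy_km are hypothetical, not the real trnd2_tbl). It shows that cluster holds one label per input row and that totss equals tot.withinss plus betweenss, which matches the figures above: 0.0000848 + 0.0015 is about 0.00158.

```r
set.seed(1)

# Two well-separated groups of 10 points each, in 2 dimensions
toy_mat <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.1), ncol = 2)
)

toy_km <- kmeans(toy_mat, centers = 2, nstart = 10)

length(toy_km$cluster)  # one cluster label per row of the input
dim(toy_km$centers)     # number of clusters by number of columns
toy_km$size             # rows assigned to each cluster

# The total sum of squares splits into within-cluster and between-cluster parts
all.equal(toy_km$totss, toy_km$tot.withinss + toy_km$betweenss)
```

With 20 centers requested for only 41 rows, as in the call above, clusters of size 1 with a withinss of 0 are to be expected, which is what the first entries of size and withinss show.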

Then I go and try to tidy:

#Tidy, glance, augment
#Just makes it easier to use or view the obj's in the obj list

broom::tidy(kmeans_obj1) %>% glimpse()

broom::glance(kmeans_obj1)
# # A tibble: 1 x 4
#     totss tot.withinss betweenss  iter
#     <dbl>        <dbl>     <dbl> <int>
# 1 0.00158    0.0000848   0.00150     3

However, when I run this piece I get an error:

broom::augment(kmeans_obj1, trnd2_tbl) %>% 
  dplyr::select(city, .cluster)             

# Error: Must subset columns with a valid subscript vector.
# The subscript has the wrong type `data.frame<
#   u: double
#   x: double
# >`.
# i It must be numeric or character.

Here is the back trace:

rlang::last_error()

# Backtrace:
#   1. broom::augment(kmeans_obj1, trnd2_tbl)
#   9. dplyr::select(., city, .cluster)
#  11. tidyselect::vars_select(tbl_vars(.data), !!!enquos(...))
#  12. tidyselect:::eval_select_impl(...)
#  20. tidyselect:::vars_select_eval(...)
#  21. tidyselect:::walk_data_tree(expr, data_mask, context_mask)
#  22. tidyselect:::eval_c(expr, data_mask, context_mask)
#  23. tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
#  24. tidyselect:::walk_data_tree(new, data_mask, context_mask)
#  25. tidyselect:::as_indices_sel_impl(...)
#  26. tidyselect:::as_indices_impl(x, vars, strict = strict)
#  27. vctrs::vec_as_subscript(x, logical = "error")
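
One hedged reading of this error, not a confirmed diagnosis: after spread(), the city names have become column names, so the data passed to augment() has no column literally called city. When select() cannot find city among the columns, it can fall back to an object of that name outside the data. The `data.frame<u: double, x: double>` in the message would fit a data frame named city being found, for instance the city dataset of the boot package if that happens to be attached (an assumption). The sketch below uses hypothetical toy data, with values copied from the trnd1_tbla printout, to check which columns exist after augment() and to select state instead.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same long format as trnd1_tbl
toy_long <- data.frame(
  state         = c("GA", "GA", "NY", "TX", "IL"),
  city          = c("ATLANTA", "CUMMING", "BROOKLYN", "DALLAS", "CHICAGO"),
  prop_of_total = c(0.00510, 0.00340, 0.0128, 0.00340, 0.00723)
)

# Same reshaping as in the post: cities become columns, one row per state
toy_wide <- spread(toy_long, key = city, value = prop_of_total, fill = 0)

set.seed(123)
toy_km <- toy_wide %>%
  select(-state) %>%
  kmeans(centers = 2)

# augment() appends a .cluster column to the data it is given
toy_aug <- broom::augment(toy_km, toy_wide)

"city" %in% names(toy_aug)  # no column named city after spreading

# state is the row identifier that survives the spread, so select that instead
toy_clusters <- toy_aug %>% select(state, .cluster)
print(toy_clusters)
```

Under this reading, the clusters describe states rather than cities, because each row of the spread table is a state.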

I am not sure what I am supposed to fix?

Maybe someone has had a similar error and can advise me, please?

Thank you.

WHP








______________________________________________
R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Proprietary

NOTICE TO RECIPIENT OF INFORMATION:
This e-mail may contain confidential or privileged information. If you think you have received this e-mail in error, please advise the sender by reply e-mail and then delete this e-mail immediately.  
This e-mail may also contain protected health information (PHI) with information about sensitive medical conditions, including, but not limited to, treatment for substance use disorders, behavioral health, HIV/AIDS, or pregnancy. This type of information may be protected by various federal and/or state laws which prohibit any further disclosure without the express written consent of the person to whom it pertains or as otherwise permitted by law. Any unauthorized further disclosure may be considered a violation of federal and/or state law. A general authorization for the release of medical or other information may NOT be sufficient consent for release of this type of information.
Thank you. Aetna

