[R] find unique and summerize

Mon Feb 5 02:20:49 CET 2018

Thank you so much Rui!

On Sun, Feb 4, 2018 at 12:20 AM, Rui Barradas <ruipbarradas at sapo.pt> wrote:

> Hello,
>
> Please always cc the list.
>
> As for the question, I believe the following does it.
>
> a <- strsplit(mydata$ID, "[[:alpha:]]+")
> b <- strsplit(mydata$ID, "[[:digit:]]+")
>
> a <- sapply(a, `[`, 1)
> c <- sapply(a, `[`, 2)
> b <- sapply(b, function(x) x[x != ""])
>
> c2 <- sprintf("%010d", as.integer(c))
>
> newID <- paste0(a, b, c2)
>
>
> Hope this helps,
>
> Rui Barradas
>
> On 2/4/2018 2:01 AM, Val wrote:
>
>> Thank you so much again for your help!
>>
>> I have one more question related to this.
>>
>> 1. How do I further split  this "358USA1540165 " into three parts.
>> a) 358
>> b) USA
>> c) 1540165
>>
>> I want to add leading zeros to the third part  like "0001540165"
>> and then combine   b and c  to get this USA1540165
>> so USA1540165  changed to USA1540165
>>
>> The other one is that the data set has several country codes and if I
>> want to limit my data set to only certain country codes , how do I do that.
>>
>> Thank you again
>>
>>
>>
>>
>> On Sat, Feb 3, 2018 at 1:05 PM, Rui Barradas <ruipbarradas at sapo.pt
>> <mailto:ruipbarradas at sapo.pt>> wrote:
>>
>>     Hello,
>>
>>     As for the first question, instead of writing a xlsx file, maybe it
>>     is easier to write a csv file and then open it with Excel.
>>
>>     tbl2 <- addmargins(tbl1)
>>     write.csv(tbl2, "tt1.csv")
>>
>>     As for the second question, the following does it.
>>
>>     inx <- apply(tbl1, 1, function(x) all(x != 0))
>>     tbl1b <- addmargins(tbl1[inx, ])
>>     tbl1b
>>
>>
>>     Hope this helps,
>>
>>     Rui Barradas
>>
>>     On 2/3/2018 4:42 PM, Val wrote:
>>
>>         Thank you so much Rui.
>>
>>         1. How do I export this table to excel file?
>>         I used this
>>                 tbl1 <- table(Country, IDNum)
>>                 tbl2=addmargins(tbl1)
>>                 write.xlsx(tbl2,"tt1.xlsx"),sheetName="summary",
>>         row.names=FALSE)
>>         The above did not give me that table.
>>
>>
>>         2. I want select those unique Ids that do have records in all
>>         countries.
>>            From the above data set, this ID  "FIN1540166"  should be
>>         excluded from the summary table and the table looks like as follow
>>
>>         IDNum Country 1 33 358 44 Sum CAN1540164 47 141 248 90 526
>>         USA1540165 290 757 321 171 1539 Sum 337 898 569 261 2065
>>
>>         Thank you again
>>
>>
>>         On Fri, Feb 2, 2018 at 11:26 PM, Rui Barradas
>>         <ruipbarradas at sapo.pt <mailto:ruipbarradas at sapo.pt>
>>         <mailto:ruipbarradas at sapo.pt <mailto:ruipbarradas at sapo.pt>>>
>> wrote:
>>
>>              Hello,
>>
>>              Thanks for the reproducible example.
>>              See if the following does what you want.
>>
>>              IDNum <- sub("^(\\d+).*", "\\1", mydata$ID)
>>              Country <- sub("^\\d+(.*)", "\\1", mydata$ID)
>>
>>              tbl1 <- table(Country, IDNum)
>>              addmargins(tbl1)
>>
>>              tbl2 <- xtabs(Y ~ Country + IDNum, mydata)
>>              addmargins(tbl2)
>>
>>
>>              Hope this helps,
>>
>>              Rui Barradas
>>
>>
>>              On 2/3/2018 3:00 AM, Val wrote:
>>
>>                  Hi all,
>>
>>                  I have a data set  need to be summarized by unique ID
>>         (count and
>>                  sum of a
>>                  variable)
>>                  A unique individual ID (country name  Abbreviation
>>    followed by
>>                  an integer
>>                  numbers)  may  have observation in several countries.
>>         Then the         ID was
>>                  changed by adding the country code as a prefix  and
>>    new ID was
>>                  constructed
>>                  or recorded like (country code, + the original unique
>>         ID  Example
>>                  original ID   "CAN1540164" , if this ID has an
>>         observation in
>>                  CANADA then
>>                  the ID was changed to    "1CAN1540164".   From this new
>>         ID I
>>                  want get out
>>                  the country code  get the  original unique ID  and
>>      summarize
>>                  the data by
>>                  unique ID and country code
>>
>>                  The data set look like
>>                  mydata <- read.table(textConnection("GR ID iflag Y
>>                  A 1CAN1540164 1 20
>>                  A 1CAN1540164 1 12
>>                  A 1CAN1540164 1 15
>>                  A 44CAN1540164 1 30
>>                  A 44CAN1540164 1 24
>>                  A 44CAN1540164 1 25
>>                  A 44CAN1540164 1 11
>>                  A 33CAN1540164 1 12
>>                  A 33CAN1540164 1 23
>>                  A 33CAN1540164 1 65
>>                  A 33CAN1540164 1 41
>>                  A 358CAN1540164 1 28
>>                  A 358CAN1540164 1 32
>>                  A 358CAN1540164 1 41
>>                  A 358CAN1540164 1 54
>>                  A 358CAN1540164 1 29
>>                  A 358CAN1540164 1 64
>>                  B 1USA1540165 1 125
>>                  B 1USA1540165 1 165
>>                  B 44USA1540165 1 171
>>                  B 33USA1540165 1 254
>>                  B 33USA1540165 1 241
>>                  B 33USA1540165 1 262
>>                  B 358USA1540165 1 321
>>                  C 358FIN1540166 1 225 "),header = TRUE
>>         ,stringsAsFactors = FALSE)
>>
>>                    From the above data there are three unique IDs and
>> four
>>                  country codes (1,
>>                  44, 33 and 358)
>>
>>                  I want the following two tables
>>
>>                  Table 1. count  the  unique ID by country code
>>                                              1   44   33   358     TOT
>>                  CAN1540164     3    4     4      6        17
>>                  USA1540165      2   1      3     1          7
>>                  FIN1540166       -     -       -      1         1
>>                               TOT         5    5      7      8       25
>>
>>
>>                  Table 2  Sum of Y variable by unique ID and country. code
>>
>>                                              1       44       33
>>    358      TOT
>>                  CAN1540164    47     90      141      248       526
>>                  USA1540165   290   171      757      321     1539
>>                  FIN1540166        -        -         -         225
>>          225
>>                                TOT      337     261      898    794
>>  2290
>>
>>
>>                  How do I do it in R?
>>
>>                     The first step is to get the unique country codes
>>         unique ID
>>                  by splitting
>>                  the new ID
>>
>>                  Thank you in advance
>>
>>                           [[alternative HTML version deleted]]
>>
>>                  ______________________________________________
>>         R-help at r-project.org <mailto:R-help at r-project.org>
>>         <mailto:R-help at r-project.org <mailto:R-help at r-project.org>>
>>         mailing list
>>                  -- To UNSUBSCRIBE and more, see
>>         https://stat.ethz.ch/mailman/listinfo/r-help
>>         <https://stat.ethz.ch/mailman/listinfo/r-help>
>>                  <https://stat.ethz.ch/mailman/listinfo/r-help
>>         <https://stat.ethz.ch/mailman/listinfo/r-help>>
>>                  PLEASE do read the posting guide
>>         http://www.R-project.org/posting-guide.html
>>         <http://www.R-project.org/posting-guide.html>
>>                  <http://www.R-project.org/posting-guide.html
>>         <http://www.R-project.org/posting-guide.html>>
>>                  and provide commented, minimal, self-contained,
>>         reproducible code.
>>
>>
>>
>>

	[[alternative HTML version deleted]]