[R] simplify a dataframe

Mon Jul 15 03:47:00 CEST 2013

HI Michel,
This gives the same order as that of df2.
df1$contrat[grep("^CDD",df1$contrat)]<- "CDD détaché ext. Cirad"
df1[48,8]<- "31/12/2013"
indx<-as.numeric(interaction(df1[,1:6],drop=TRUE))
lst1<-split(df1,indx)
 lst2<-lst1[match(unique(indx),names(lst1))]
res<-do.call(rbind,lapply(lst2,function(x){x1<- as.Date(x$Debut,format="%d/%m/%Y");x2<- as.Date(x$Fin,format="%d/%m/%Y");do.call(rbind,lapply(split(x,cumsum(c(FALSE,(x1[-1]-x2[-nrow(x)])!=1))),function(x) data.frame(x[1,1:6],Debut=head(x$Debut,1),Fin=tail(x$Fin,1),stringsAsFactors=FALSE)))}))
 row.names(res)<- 1:nrow(res)
 df2[11,8]<- "31/12/2013"
 names(res)[1]<- "Mat"
 identical(res,df2)
#[1] TRUE

A.K.

----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: Arnaud Michel <michel.arnaud at cirad.fr>
Cc: R help <r-help at r-project.org>
Sent: Sunday, July 14, 2013 2:39 PM
Subject: Re: [R] simplify a dataframe

Hi,
May be this helps you.
df1$contrat[grep("^CDD",df1$contrat)]<- "CDD détaché ext. Cirad" 
df1[48,8]
[1] "31/12/4712" #strange value

df1[48,8]<- "31/12/2013"  #changed

indx<-as.numeric(interaction(df1[,1:6],drop=TRUE))
res<-do.call(rbind,lapply(split(df1,indx),function(x) {x1<- as.Date(x$Debut,format="%d/%m/%Y");x2<- as.Date(x$Fin,format="%d/%m/%Y");do.call(rbind,lapply(split(x,cumsum(c(FALSE,(x1[-1]-x2[-nrow(x)])!=1))),function(x) data.frame(x[1,1:6],Debut=head(x$Debut,1),Fin=tail(x$Fin,1),stringsAsFactors=FALSE)))}))

 res[order(res$Matricule),]  #the order of rows is a bit different than df2.
    Matricule    Nom     Sexe DateNaissance                contrat        Pays
5           1  VERON  Féminin    02/09/1935             CDI commun      France
4.0         6 BENARD Masculin    01/04/1935             CDI commun      France
4.1         6 BENARD Masculin    01/04/1935             CDI commun      France
10          6 BENARD Masculin    01/04/1935             CDI commun Philippines
6           8 DALNIC  Féminin    19/02/1940             CDI commun      France
9           8 DALNIC  Féminin    19/02/1940             CDI commun  Martinique
1         934  FORNI Masculin    10/07/1961 CDD détaché ext. Cirad    Cameroun
2         934  FORNI Masculin    10/07/1961             CDI commun       Congo
3         934  FORNI Masculin    10/07/1961    CDI Détachés Autres       Congo
7         934  FORNI Masculin    10/07/1961    CDI Détachés Autres      France
8         934  FORNI Masculin    10/07/1961             CDI commun       Gabon
         Debut        Fin
5   24/01/1995 31/12/1997
4.0 13/03/1995 30/06/1995
4.1 01/01/1996 31/01/1996
10  02/02/1995 12/03/1995
6   24/01/1995 31/08/1995
9   01/09/1995 29/02/2000
1   26/01/1995 31/08/2001
2   05/09/2012 31/12/2013
3   01/09/2004 31/08/2007
7   01/09/2001 31/08/2004
8   01/09/2007 04/09/2012

A.K.

________________________________
From: Arnaud Michel <michel.arnaud at cirad.fr>
To: arun <smartpink111 at yahoo.com> 
Cc: R help <r-help at r-project.org>; jholtman at gmail.com; Rui Barradas <ruipbarradas at sapo.pt> 
Sent: Sunday, July 14, 2013 12:17 PM
Subject: Re: [R] simplify a dataframe

Hi,
Excuse me for the indistinctness

Le 13/07/2013 17:18, arun a écrit :

Hi,
"when the value of Debut of lines i = value Fin of lines i-1"
That part is not clear esp. when it is looked upon with the expected output (df2).
I want to group the lines which have the same caracteristics (Matricule, Nom, Sexe, DateNaissance, Contrat, Pays) and with period of time (Debut/start and Fin/end) without interruption of time.
For exemple :
The following three lines
    :                                                                 
    Debut/Start  Fin/End  
1  VERON  Féminin    02/09/1935             CDI commun      France 24/01/1995 30/04/1997
1  VERON  Féminin    02/09/1935             CDI commun      France
    01/05/1997 30/12/1997
1  VERON  Féminin    02/09/1935             CDI commun      France
    31/12/1997 31/12/1997
are transformed into 1 line
1  VERON  Féminin    02/09/1935             CDI commun      France 24/01/1995 31/12/1997
because same caracteristicsand period of time without interruption
    of time (from 24/01/1995 to 31/12/1997)

The following six lines :
6 BENARD Masculin    01/04/1935             CDI commun Philippines 02/02/1995 27/02/1995  
6 BENARD Masculin    01/04/1935             CDI commun Philippines
    28/02/1995 28/02/1995 
6 BENARD Masculin    01/04/1935             CDI commun Philippines
    01/03/1995 12/03/1995
6 BENARD Masculin    01/04/1935             CDI commun      France 13/03/1995 30/06/1995
6 BENARD Masculin    01/04/1935             CDI commun      France 01/01/1996 30/01/1996
6 BENARD Masculin    01/04/1935             CDI commun      France
    31/01/1996 31/01/1996
are transformed into
6 BENARD Masculin    01/04/1935             CDI commun Philippines 02/02/1995 12/03/1995
6 BENARD Masculin    01/04/1935             CDI commun      France 13/03/1995 30/06/1995
6 BENARD Masculin    01/04/1935             CDI commun      France 01/01/1996 31/01/1996
because
lines 1-3 identical for caracteristics and without interruption in
    time
lines 4 and lines 5-6 are not grouped because there is an
    interruption in time beetween 30/06/1995 and 01/01/1996

Thank you for your help
Michel

  Also, in your example dataset: df1$contrat[grep("^CDD",df1$contrat)]
#[1] "CDD détaché ext. Cirad" "CDD détaché ext. Cirad" "CDD détaché ext. Cirad"
#[4] "CDD détaché ext. Cirad" "CDD détaché ext.Cirad"  "CDD détaché ext. Cirad"
#[7] "CDD détaché ext. Cirad" "CDD détaché ext.Cirad"  "CDD détaché ext. Cirad"
##Looks like there are extra spaces in some of them.  I guess these are the same
df1$contrat[grep("^CDD",df1$contrat)]<- "CDD détaché ext. Cirad" I tried this:
indx<-as.numeric(interaction(df1[,1:6],drop=FALSE))  df1New<- df1
res2<-unique(within(df1New,{Debut<-ave(seq_along(indx),indx,FUN=function(x) Debut[head(x,1)]);Fin<- ave(seq_along(indx),indx,FUN=function(x) Fin[tail(x,1)])}))
 row.names(res2)<- 1:nrow(res2) res2[,c(1,2,7:8)]
   Matricule    Nom      Debut        Fin
1          1  VERON 24/01/1995 31/12/1997
2          6 BENARD 02/02/1995 12/03/1995
3          6 BENARD 13/03/1995 31/01/1996 ###here not correct
4          8 DALNIC 24/01/1995 31/08/1995
5          8 DALNIC 01/09/1995 29/02/2000
6        934  FORNI 26/01/1995 31/08/2001
7        934  FORNI 01/09/2001 31/08/2004
8        934  FORNI 01/09/2004 31/08/2007
9        934  FORNI 01/09/2007 04/09/2012
10       934  FORNI 05/09/2012 31/12/4712 df2[,c(1,2,7:8)]
   Mat    Nom      Debut        Fin
1    1  VERON 24/01/1995 31/12/1997
2    6 BENARD 02/02/1995 12/03/1995
3    6 BENARD 13/03/1995 30/06/1995
4    6 BENARD 01/01/1996 31/01/1996 #missing this row 
5    8 DALNIC 24/01/1995 31/08/1995
6    8 DALNIC 01/09/1995 29/02/2000
7  934  FORNI 26/01/1995 31/08/2001
8  934  FORNI 01/09/2001 31/08/2004
9  934  FORNI 01/09/2004 31/08/2007
10 934  FORNI 01/09/2007 04/09/2012
11 934  FORNI 05/09/2012 31/12/4712 Here, the dates look similar to the ones on df2 except for one row in df2. A.K. ----- Original Message -----
From: Arnaud Michel <michel.arnaud at cirad.fr> To: R help <r-help at r-project.org> Cc: 
Sent: Friday, July 12, 2013 3:45 PM
Subject: [R] simplify a dataframe Hello I have the following problem : group the lines of a dataframe when no 
information change (Matricule, Nom, Sexe, DateNaissance, Contrat, Pays) 
and when the value of Debut of lines i = value Fin of lines i-1
I can obtain it with a do loop. Is it possible to avoid the loop ? The dataframe initial is df1
dput(df1)
structure(list(Matricule = c(1L, 1L, 1L, 6L, 6L, 6L, 6L, 6L,
6L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 934L, 934L, 934L, 934L,
934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L,
934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L, 934L,
934L, 934L, 934L, 934L), Nom = c("VERON", "VERON", "VERON", "BENARD",
"BENARD", "BENARD", "BENARD", "BENARD", "BENARD", "DALNIC", "DALNIC",
"DALNIC", "DALNIC", "DALNIC", "DALNIC", "DALNIC", "DALNIC", "DALNIC",
"FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI",
"FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI",
"FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI",
"FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI",
"FORNI", "FORNI"), Sexe = c("Féminin", "Féminin", "Féminin",
"Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
"Féminin", "Féminin", "Féminin", "Féminin", "Féminin", "Féminin",
"Féminin", "Féminin", "Féminin", "Masculin", "Masculin", "Masculin",
"Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
"Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
"Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
"Masculin", "Masculin", "Masculin", "Masculin", "Masculin", "Masculin",
"Masculin", "Masculin", "Masculin"), DateNaissance = c("02/09/1935",
"02/09/1935", "02/09/1935", "01/04/1935", "01/04/1935", "01/04/1935",
"01/04/1935", "01/04/1935", "01/04/1935", "19/02/1940", "19/02/1940",
"19/02/1940", "19/02/1940", "19/02/1940", "19/02/1940", "19/02/1940",
"19/02/1940", "19/02/1940", "10/07/1961", "10/07/1961", "10/07/1961",
"10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
"10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
"10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
"10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
"10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961",
"10/07/1961", "10/07/1961"), contrat = c("CDI commun", "CDI commun",
"CDI commun", "CDI commun", "CDI commun", "CDI commun", "CDI commun",
"CDI commun", "CDI commun", "CDI commun", "CDI commun", "CDI commun",
"CDI commun", "CDI commun", "CDI commun", "CDI commun", "CDI commun",
"CDI commun", "CDD détaché ext. Cirad", "CDD détaché ext. Cirad",
"CDD détaché ext. Cirad", "CDD détaché ext. Cirad", "CDD détaché ext. 
Cirad",
"CDD détaché ext. Cirad", "CDD détaché ext. Cirad", "CDD détaché ext. 
Cirad",
"CDD détaché ext. Cirad", "CDI Détachés Autres", "CDI Détachés Autres",
"CDI Détachés Autres", "CDI Détachés Autres", "CDI Détachés Autres",
"CDI Détachés Autres", "CDI Détachés Autres", "CDI Détachés Autres",
"CDI Détachés Autres", "CDI Détachés Autres", "CDI Détachés Autres",
"CDI Détachés Autres", "CDI Détachés Autres", "CDI commun", "CDI commun",
"CDI commun", "CDI commun", "CDI commun", "CDI commun", "CDI commun",
"CDI commun"), Pays = c("France", "France", "France", "Philippines",
"Philippines", "Philippines", "France", "France", "France", "France",
"France", "Martinique", "Martinique", "Martinique", "Martinique",
"Martinique", "Martinique", "Martinique", "Cameroun", "Cameroun",
"Cameroun", "Cameroun", "Cameroun", "Cameroun", "Cameroun", "Cameroun",
"Cameroun", "France", "France", "France", "France", "France",
"France", "France", "Congo", "Congo", "Congo", "Congo", "Congo",
"Congo", "Gabon", "Gabon", "Gabon", "Gabon", "Gabon", "Gabon",
"Congo", "Congo"), Debut = c("24/01/1995", "01/05/1997", "31/12/1997",
"02/02/1995", "28/02/1995", "01/03/1995", "13/03/1995", "01/01/1996",
"31/01/1996", "24/01/1995", "01/07/1995", "01/09/1995", "01/07/1997",
"01/01/1998", "01/08/1998", "01/01/2000", "17/01/2000", "29/02/2000",
"26/01/1995", "01/07/1996", "16/09/1997", "01/01/1998", "01/07/1998",
"04/11/1999", "01/01/2001", "01/04/2001", "31/08/2001", "01/09/2001",
"02/09/2001", "01/12/2001", "01/02/2003", "01/04/2003", "01/01/2004",
"01/03/2004", "01/09/2004", "01/01/2005", "01/04/2005", "28/10/2006",
"01/01/2007", "01/04/2007", "01/09/2007", "01/01/2009", "01/04/2009",
"01/01/2010", "01/01/2011", "01/04/2011", "05/09/2012", "01/01/2013"
), Fin = c("30/04/1997", "30/12/1997", "31/12/1997", "27/02/1995",
"28/02/1995", "12/03/1995", "30/06/1995", "30/01/1996", "31/01/1996",
"30/06/1995", "31/08/1995", "30/06/1997", "31/12/1997", "31/07/1998",
"31/12/1999", "16/01/2000", "28/02/2000", "29/02/2000", "30/06/1996",
"15/09/1997", "31/12/1997", "30/06/1998", "03/11/1999", "31/12/2000",
"31/03/2001", "30/08/2001", "31/08/2001", "01/09/2001", "30/11/2001",
"31/01/2003", "31/03/2003", "31/12/2003", "29/02/2004", "31/08/2004",
"31/12/2004", "31/03/2005", "27/10/2006", "31/12/2006", "31/03/2007",
"31/08/2007", "31/12/2008", "31/03/2009", "31/12/2009", "31/12/2010",
"31/03/2011", "04/09/2012", "31/12/2012", "31/12/4712")), .Names = 
c("Matricule",
"Nom", "Sexe", "DateNaissance", "contrat", "Pays", "Debut", "Fin"
), class = "data.frame", row.names = c(NA, -48L)) The dataframe to be obtained is df2
dput(df2)
structure(list(Mat = c(1L, 6L, 6L, 6L, 8L, 8L, 934L, 934L, 934L,
934L, 934L), Nom = c("VERON", "BENARD", "BENARD", "BENARD", "DALNIC",
"DALNIC", "FORNI", "FORNI", "FORNI", "FORNI", "FORNI"), Sexe = c("Féminin",
"Masculin", "Masculin", "Masculin", "Féminin", "Féminin", "Masculin",
"Masculin", "Masculin", "Masculin", "Masculin"), DateNaissance = 
c("02/09/1935",
"01/04/1935", "01/04/1935", "01/04/1935", "19/02/1940", "19/02/1940",
"10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961", "10/07/1961"
), contrat = c("CDI commun", "CDI commun", "CDI commun", "CDI commun",
"CDI commun", "CDI commun", "CDD détaché ext. Cirad", "CDI Détachés 
Autres",
"CDI Détachés Autres", "CDI commun", "CDI commun"), Pays = c("France",
"Philippines", "France", "France", "France", "Martinique", "Cameroun",
"France", "Congo", "Gabon", "Congo"), Debut = c("24/01/1995",
"02/02/1995", "13/03/1995", "01/01/1996", "24/01/1995", "01/09/1995",
"26/01/1995", "01/09/2001", "01/09/2004", "01/09/2007", "05/09/2012"
), Fin = c("31/12/1997", "12/03/1995", "30/06/1995", "31/01/1996",
"31/08/1995", "29/02/2000", "31/08/2001", "31/08/2004", "31/08/2007",
"04/09/2012", "31/12/4712")), .Names = c("Mat", "Nom", "Sexe",
"DateNaissance", "contrat", "Pays", "Debut", "Fin"), class = 
"data.frame", row.names = c(NA,
-11L)) Thank you for your help 

-- 
Michel ARNAUD
Chargé de mission auprès du DRH
DGDRD-Drh - TA 174/04
Av Agropolis 34398 Montpellier cedex 5
tel : 04.67.61.75.38  
fax : 04.67.61.57.87
port: 06.47.43.55.31