[R] R issue with unequal large data frames with multiple columns

Adeel - SafeGreenCapital adeel.amin at gmail.com
Thu May 2 17:25:32 CEST 2013


Thank you Arun (and everyone else). This is in the right direction.
I'll post the code that worked shortly for everyone else, in case you are
curious.

-----Original Message-----
From: arun [mailto:smartpink111 at yahoo.com] 
Sent: Thursday, May 02, 2013 7:09 AM
To: Adeel Amin
Cc: R help
Subject: Re: [R] R issue with unequal large data frames with multiple
columns

Hi, maybe this helps:


dat1<-structure(list(X.DATE = c("01052007", "01072007", "01072007", 
"02182007", "02182007", "02242007", "03252007"), X.TIME = c("0230", 
"0330", "0440", "0440", "0440", "0330", "0230"), VALUE = c(37, 
42, 45, 45, 45, 42, 45), VALUE2 = c(29, 24, 28, 27, 35, 32, 32
)), .Names = c("X.DATE", "X.TIME", "VALUE", "VALUE2"), class = "data.frame",
row.names = c(NA, 
-7L))
dat2<- structure(list(X.DATE = c("01052007", "01182007", "01242007", 
"02142007", "02182007", "03242007", "03252007"), X.TIME = c("0230", 
"0330", "0430", "0330", "0440", "0230", "0230"), VALUE = c(34, 
41, 42, 44, 45, 21, 42), VALUE2 = c(28, 25, 26, 28, 32, 35, 36
)), .Names = c("X.DATE", "X.TIME", "VALUE", "VALUE2"), class = "data.frame",
row.names = c(NA, 
-7L))
dat3<- structure(list(X.DATE = c("01052007", "01182007", "01252007", 
"02142007", "02182007", "03222007", "03252007"), X.TIME = c("0230", 
"0330", "0430", "0330", "0440", "0230", "0230"), VALUE = c(32, 
42, 44, 44, 47, 42, 46), VALUE2 = c(24, 29, 32, 34, 38, 39, 42
)), .Names = c("X.DATE", "X.TIME", "VALUE", "VALUE2"), class = "data.frame",
row.names = c(NA, 
-7L))


library(xts)
lst1<-lapply(list(dat1,dat2,dat3),function(x){ xts(x[,-c(1,2)],
order.by=as.POSIXct(paste0(x[,1],x[,2]),format="%m%d%Y%H%M"))})

#subset by date and time
 lapply(lst1,function(x) x['2007-01-05 02:30:00/2007-01-25 04:30:00'])
#[[1]]
#                    VALUE VALUE2
#2007-01-05 02:30:00    37     29
#2007-01-07 03:30:00    42     24
#2007-01-07 04:40:00    45     28
#
#[[2]]
#                    VALUE VALUE2
#2007-01-05 02:30:00    34     28
#2007-01-18 03:30:00    41     25
#2007-01-24 04:30:00    42     26
#
#[[3]]
#                    VALUE VALUE2
#2007-01-05 02:30:00    32     24
#2007-01-18 03:30:00    42     29
#2007-01-25 04:30:00    44     32

#subset by time
lapply(lst1,function(x) x['T02:30/T03:30'])
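
For comparison, the same two kinds of subsetting can be sketched in base R
on a plain data frame, without converting to xts. The toy data frame `d`,
the bounds `from` and `to`, and the choice of tz = "UTC" are assumptions
made for illustration; they are not part of the original data.

```r
# Hypothetical toy data with the thread's X.DATE / X.TIME layout
d <- data.frame(X.DATE = c("01052007", "01072007", "02182007"),
                X.TIME = c("0230", "0330", "0440"),
                VALUE  = c(37, 42, 45),
                stringsAsFactors = FALSE)

# Parse MMDDYYYY + HHMM into date-times (tz = "UTC" is an assumption)
ts <- as.POSIXct(paste0(d$X.DATE, d$X.TIME), format = "%m%d%Y%H%M", tz = "UTC")

# Subset by date and time: keep rows inside [from, to]
from <- as.POSIXct("2007-01-05 02:30", tz = "UTC")
to   <- as.POSIXct("2007-01-25 04:30", tz = "UTC")
in_range <- d[ts >= from & ts <= to, ]

# Subset by time of day only: zero-padded HHMM strings compare in clock order
in_window <- d[d$X.TIME >= "0230" & d$X.TIME <= "0330", ]
```

With 2 million rows per file, parsing the timestamps once and reusing `ts`
avoids repeating the conversion for every subset.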

res<-na.omit(Reduce(function(...) merge(...),lst1))
res
#                    VALUE VALUE2 VALUE.1 VALUE2.1 VALUE.2 VALUE2.2
#2007-01-05 02:30:00    37     29      34       28      32       24
#2007-02-18 04:40:00    45     27      45       32      47       38
#2007-03-25 02:30:00    45     32      42       36      46       42

lst2<-as.list(res)
lst3<-lapply(list(c("VALUE","VALUE2"),c("VALUE.1","VALUE2.1"),
  c("VALUE.2","VALUE2.2")),function(x) do.call(cbind,lst2[x]))
#or
lst3<-lapply(split(names(lst2),((seq_along(names(lst2))-1)%/%2)+1),
  function(x) do.call(cbind,lst2[x])) #change according to the number of columns

lst3
#$`1`
#                    VALUE VALUE2
#2007-01-05 02:30:00    37     29
#2007-02-18 04:40:00    45     27
#2007-03-25 02:30:00    45     32
#
#$`2`
#                    VALUE.1 VALUE2.1
#2007-01-05 02:30:00      34       28
#2007-02-18 04:40:00      45       32
#2007-03-25 02:30:00      42       36
#
#$`3`
#                    VALUE.2 VALUE2.2
#2007-01-05 02:30:00      32       24
#2007-02-18 04:40:00      47       38
#2007-03-25 02:30:00      46       42
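
The alignment step can also be sketched in base R, which keeps each data
frame separate instead of merging and splitting again. This is only a
sketch under assumptions: `d1`, `d2` and `make_key` are hypothetical names,
and "redundant" is taken to mean a repeated date/time pair, of which the
first row is kept.

```r
# Hypothetical toy inputs with the thread's key columns
d1 <- data.frame(X.DATE = c("01052007", "02182007", "02182007", "03252007"),
                 X.TIME = c("0230", "0440", "0440", "0230"),
                 VALUE  = c(37, 45, 45, 45),
                 stringsAsFactors = FALSE)
d2 <- data.frame(X.DATE = c("01052007", "01182007", "02182007", "03252007"),
                 X.TIME = c("0230", "0330", "0440", "0230"),
                 VALUE  = c(34, 41, 45, 42),
                 stringsAsFactors = FALSE)
dfs <- list(d1, d2)

# One key per row, built from the date and time columns
make_key <- function(x) paste(x$X.DATE, x$X.TIME)

# a) drop redundant rows: keep the first row for each date/time key
#    (assumed meaning of "redundant")
dedup <- lapply(dfs, function(x) x[!duplicated(make_key(x)), , drop = FALSE])

# b) keep only the keys that occur in every data frame
common  <- Reduce(intersect, lapply(dedup, make_key))
aligned <- lapply(dedup, function(x) x[make_key(x) %in% common, , drop = FALSE])

# Every element of 'aligned' now has the same date/time rows
sapply(aligned, nrow)
```

The same code applies unchanged to a list of 12 data frames, since
Reduce(intersect, ...) works on any number of key vectors.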
A.K.




----- Original Message -----
From: Adeel Amin <adeel.amin at gmail.com>
To: r-help at r-project.org
Cc: 
Sent: Thursday, May 2, 2013 2:28 AM
Subject: [R] R issue with unequal large data frames with multiple columns

I'm a bit of an amateur R programmer.  I can do simple R scenarios but my
handle on complex grammatical issues isn't steady.

I have 12 CSV files that I've read into data frames.  Each has 8 columns and
over 2000000 rows.  The rows in each data frame are keyed by a date column
and a time column:

X.DATE and then X.TIME

X.DATE is in the format MMDDYYYY and X.TIME is in the format HHMM.  The issue
is that even though each data frame begins and ends with the same X.DATE and
X.TIME values, each data frame has a different number of rows.  One may have
as many as 100000 more rows than another.

I want to do two things:

1) I want to extract a certain portion of data depending on date and time
(easy)

2) In lock step with number 1, I want to eliminate rows from each data
frame that are a) redundant or b) do not appear in the other data sets.

When step 2 is done, all the time/date data within all 12 dataframes will
be the same.

Suggestions?  Thanks R Community --

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


