[R] How might I work with a data.frame where each physical row represents several logical rows?

Thu Jan 5 05:33:46 CET 2017

> On Jan 4, 2017, at 7:40 PM, David Wolfskill <r at catwhisker.org> wrote:
> 
> I have (time series) data extracted from a repository that is stored
> such that each record is for an hour, but each record contains an
> ordered set of values throughout the hour.  In the following exmaple,
> I'll show sets of 4, depicting 0, 15, 30, and 45 minutes after each
> "start" point (respectively):
> 
> test_data <- structure(list(start = c(1482793200L, 1482793200L, 1482793200L, 
> 1482793200L, 1482793200L, 1482793200L, 1482793200L, 1482793200L, 
> 1482793200L, 1482793200L, 1482793200L, 1482793200L, 1482793200L, 
> 1482793200L, 1482793200L), hostname = c("c001.example.net", "c001.example.net", 
> "c001.example.net", "c001.example.net", "c001.example.net", "c001.example.net", 
> "c001.example.net", "c001.example.net", "c001.example.net", "c001.example.net", 
> "c001.example.net", "c161.example.net", "c161.example.net", "c161.example.net", 
> "c161.example.net"), mtype = c("health", "health", "net", "net", 
> "net", "net", "net", "net", "net", "sys", "sys", "net", "sys", 
> "sys", "sys"), limit_type = c("fill", "serve", "", "", "", "", 
> "", "", "", "", "", "", "", "", ""), hw = c(1.16, 1.16, 1.16, 
> 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.21, 1.21, 1.21, 
> 1.21), fw = c("2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", 
> "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", 
> "2017Q1.1.1", "2017Q1.1.1", "2016Q4.2.13", "2016Q4.2.13", "2016Q4.2.13", 
> "2016Q4.2.13"), tcp_state = c("", "", "", "", "closed", "closing", 
> "fin_wait_2", "last_ack", "syn_rcvd", "", "", "", "", "", ""), 
>    value_type = c("limit", "limit", "", "", "", "", "", "", 
>    "", "", "", "", "", "", ""), nic = c("all", "all", "", "", 
>    "", "", "", "", "", "", "", "mce0", "", "", ""), name = c("in_download_window", 
>    "in_download_window", "tcpOutSegs", "tcpRetransSegs", "tcp_connection_count", 
>    "tcp_connection_count", "tcp_connection_count", "tcp_connection_count", 
>    "tcp_connection_count", "CpuSystem", "CpuUser", "HCOutOctets", 
>    "CpuIdle", "CpuSystem", "CpuUser"), values = c("[0.0, 0.0, 0.0, 0.0]", 
>    "[0.0, 0.0, 0.0, 0.0]", "[260410.94547698632, 258469.54433635762, 260579.2186617577, 258763.2815145043]", 
>    "[18436.311524062934, 18248.952271420356, 18201.62259198662, 17818.39529178736]", 
>    "[5.0, 3.0, 3.0, 3.0]", "[3.0, 3.0, 2.0, 2.0]", "[670.0, 677.0, 685.0, 729.0]", 
>    "[1162.0, 1192.0, 1148.0, 1110.0]", "[25.0, 60.0, 71.0, 33.0]", 
>    "[11.0, 10.0, 11.0, 10.0]", "[2.0, 2.0, 2.0, 2.0]", "[7.873191635959294E9, 7.7377184658927E9, 7.876630519328283E9, 7.714521544912713E9]", 
>    "[70.0, 70.0, 70.0, 70.0]", "[27.0, 26.0, 27.0, 26.0]", "[4.0, 4.0, 4.0, 4.0]"
>    )), .Names = c("start", "hostname", "mtype", "limit_type", 
> "hw", "fw", "tcp_state", "value_type", "nic", "name", "values"
> ), class = "data.frame", row.names = c(NA, -15L))
> 
> 
> So of the 15 rows in the above example, row 8 (which depicts the TCP
> connection counts in the "last ACK" state) has the values:
> 
> * 1162.0
> * 1192.0
> * 1148.0
> * 1110.0
> 
> It seems to me that what is wanted is for each of the existing rows to be
> replaced by a set of 4 (in this case) rows, where the other columns are
> the same (save for "start", which is a timestamp, and should be adjusted
> for the respective times).

Perhaps something like this:

# function to read the values in 'values':
 parse_values <- function(x) {scan(text= gsub( "\\[|\\]","",x), sep=",") }

# the apply function reads line-by-line
 new_dat <- apply(test_data, 1, function(d) data.frame( as.list(d[!names(d)  %in% "values"]), nvals <- parse_values(d['values']) ) )
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items

# Could suppress the report from scan by adding quiet = TRUE
# now take this list of 4 line data.frames and "rbind" them
# If you wanted these to remain character you would use stringsAsFactors=FALSE in the data.frame call
> new_df <- do.call("rbind", new_dat)
> head(new_df)
       start         hostname  mtype limit_type   hw         fw tcp_state value_type nic
1 1482793200 c001.example.net health       fill 1.16 2017Q1.1.1                limit all
2 1482793200 c001.example.net health       fill 1.16 2017Q1.1.1                limit all
3 1482793200 c001.example.net health       fill 1.16 2017Q1.1.1                limit all
4 1482793200 c001.example.net health       fill 1.16 2017Q1.1.1                limit all
5 1482793200 c001.example.net health      serve 1.16 2017Q1.1.1                limit all
6 1482793200 c001.example.net health      serve 1.16 2017Q1.1.1                limit all
                name nvals....parse_values.d..values...
1 in_download_window                                  0
2 in_download_window                                  0
3 in_download_window                                  0
4 in_download_window                                  0
5 in_download_window                                  0
6 in_download_window                                  0

str(new_df)
'data.frame':	60 obs. of  11 variables:
 $ start                             : Factor w/ 1 level "1482793200": 1 1 1 1 1 1 1 1 1 1 ...
 $ hostname                          : Factor w/ 2 levels "c001.example.net",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ mtype                             : Factor w/ 3 levels "health","net",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ limit_type                        : Factor w/ 3 levels "fill","serve",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ hw                                : Factor w/ 2 levels "1.16","1.21": 1 1 1 1 1 1 1 1 1 1 ...
 $ fw                                : Factor w/ 2 levels "2017Q1.1.1","2016Q4.2.13": 1 1 1 1 1 1 1 1 1 1 ...
 $ tcp_state                         : Factor w/ 6 levels "","closed","closing",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value_type                        : Factor w/ 2 levels "limit","": 1 1 1 1 1 1 1 1 2 2 ...
 $ nic                               : Factor w/ 3 levels "all","","mce0": 1 1 1 1 1 1 1 1 2 2 ...
 $ name                              : Factor w/ 8 levels "in_download_window",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ nvals....parse_values.d..values...: num  0 0 0 0 0 ...

> 

> I'm fairly sure I can write code to do that, but it would end up being
> something like Perl implemented in R, which seems fairly grotesque: I
> can't help but think that there ought to be a ... more elegant approach
> in R (which is why I am sking for help).
> 
> (I will also end up collecting all of the records for a given timestamp
> and hostname, and creating one very wide record with all of the data
> from the set of records thus found.  I already have (yes, Perl) code to
> do this -- though if there's a reasonable way to avoid that, I'm
> interested.)

I thought you wanted the data in long form.
> 
> Once that's all done, I'll be examining various columns, subsetting
> by attributes of the systems being compared -- but I already have
> code to do that (that makes use of a different -- and rather more
> fragile -- approach for extracting the data from its repository).
> 
> Thanks!
> 
> Peace,
> david
> -- 
> David H. Wolfskill				r at catwhisker.org
> Epistemology for post-truthers: How do we select parts of reality to ignore?
> 
> See http://www.catwhisker.org/~david/publickey.gpg for my public key.
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA