[R] How might I work with a data.frame where each physical row represents several logical rows?
David Winsemius
dwinsemius at comcast.net
Thu Jan 5 05:33:46 CET 2017
> On Jan 4, 2017, at 7:40 PM, David Wolfskill <r at catwhisker.org> wrote:
>
> I have (time series) data extracted from a repository that is stored
> such that each record is for an hour, but each record contains an
> ordered set of values throughout the hour. In the following exmaple,
> I'll show sets of 4, depicting 0, 15, 30, and 45 minutes after each
> "start" point (respectively):
>
> test_data <- structure(list(start = c(1482793200L, 1482793200L, 1482793200L,
> 1482793200L, 1482793200L, 1482793200L, 1482793200L, 1482793200L,
> 1482793200L, 1482793200L, 1482793200L, 1482793200L, 1482793200L,
> 1482793200L, 1482793200L), hostname = c("c001.example.net", "c001.example.net",
> "c001.example.net", "c001.example.net", "c001.example.net", "c001.example.net",
> "c001.example.net", "c001.example.net", "c001.example.net", "c001.example.net",
> "c001.example.net", "c161.example.net", "c161.example.net", "c161.example.net",
> "c161.example.net"), mtype = c("health", "health", "net", "net",
> "net", "net", "net", "net", "net", "sys", "sys", "net", "sys",
> "sys", "sys"), limit_type = c("fill", "serve", "", "", "", "",
> "", "", "", "", "", "", "", "", ""), hw = c(1.16, 1.16, 1.16,
> 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.21, 1.21, 1.21,
> 1.21), fw = c("2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1",
> "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1",
> "2017Q1.1.1", "2017Q1.1.1", "2016Q4.2.13", "2016Q4.2.13", "2016Q4.2.13",
> "2016Q4.2.13"), tcp_state = c("", "", "", "", "closed", "closing",
> "fin_wait_2", "last_ack", "syn_rcvd", "", "", "", "", "", ""),
> value_type = c("limit", "limit", "", "", "", "", "", "",
> "", "", "", "", "", "", ""), nic = c("all", "all", "", "",
> "", "", "", "", "", "", "", "mce0", "", "", ""), name = c("in_download_window",
> "in_download_window", "tcpOutSegs", "tcpRetransSegs", "tcp_connection_count",
> "tcp_connection_count", "tcp_connection_count", "tcp_connection_count",
> "tcp_connection_count", "CpuSystem", "CpuUser", "HCOutOctets",
> "CpuIdle", "CpuSystem", "CpuUser"), values = c("[0.0, 0.0, 0.0, 0.0]",
> "[0.0, 0.0, 0.0, 0.0]", "[260410.94547698632, 258469.54433635762, 260579.2186617577, 258763.2815145043]",
> "[18436.311524062934, 18248.952271420356, 18201.62259198662, 17818.39529178736]",
> "[5.0, 3.0, 3.0, 3.0]", "[3.0, 3.0, 2.0, 2.0]", "[670.0, 677.0, 685.0, 729.0]",
> "[1162.0, 1192.0, 1148.0, 1110.0]", "[25.0, 60.0, 71.0, 33.0]",
> "[11.0, 10.0, 11.0, 10.0]", "[2.0, 2.0, 2.0, 2.0]", "[7.873191635959294E9, 7.7377184658927E9, 7.876630519328283E9, 7.714521544912713E9]",
> "[70.0, 70.0, 70.0, 70.0]", "[27.0, 26.0, 27.0, 26.0]", "[4.0, 4.0, 4.0, 4.0]"
> )), .Names = c("start", "hostname", "mtype", "limit_type",
> "hw", "fw", "tcp_state", "value_type", "nic", "name", "values"
> ), class = "data.frame", row.names = c(NA, -15L))
>
>
> So of the 15 rows in the above example, row 8 (which depicts the TCP
> connection counts in the "last ACK" state) has the values:
>
> * 1162.0
> * 1192.0
> * 1148.0
> * 1110.0
>
> It seems to me that what is wanted is for each of the existing rows to be
> replaced by a set of 4 (in this case) rows, where the other columns are
> the same (save for "start", which is a timestamp, and should be adjusted
> for the respective times).
Perhaps something like this:
# function to read the values in 'values':
parse_values <- function(x) {scan(text= gsub( "\\[|\\]","",x), sep=",") }
# the apply function reads line-by-line
new_dat <- apply(test_data, 1, function(d) data.frame( as.list(d[!names(d) %in% "values"]), nvals <- parse_values(d['values']) ) )
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
# Could suppress the report from scan by adding quiet = TRUE
# now take this list of 4 line data.frames and "rbind" them
# If you wanted these to remain character you would use stringsAsFactors=FALSE in the data.frame call
> new_df <- do.call("rbind", new_dat)
> head(new_df)
start hostname mtype limit_type hw fw tcp_state value_type nic
1 1482793200 c001.example.net health fill 1.16 2017Q1.1.1 limit all
2 1482793200 c001.example.net health fill 1.16 2017Q1.1.1 limit all
3 1482793200 c001.example.net health fill 1.16 2017Q1.1.1 limit all
4 1482793200 c001.example.net health fill 1.16 2017Q1.1.1 limit all
5 1482793200 c001.example.net health serve 1.16 2017Q1.1.1 limit all
6 1482793200 c001.example.net health serve 1.16 2017Q1.1.1 limit all
name nvals....parse_values.d..values...
1 in_download_window 0
2 in_download_window 0
3 in_download_window 0
4 in_download_window 0
5 in_download_window 0
6 in_download_window 0
str(new_df)
'data.frame': 60 obs. of 11 variables:
$ start : Factor w/ 1 level "1482793200": 1 1 1 1 1 1 1 1 1 1 ...
$ hostname : Factor w/ 2 levels "c001.example.net",..: 1 1 1 1 1 1 1 1 1 1 ...
$ mtype : Factor w/ 3 levels "health","net",..: 1 1 1 1 1 1 1 1 2 2 ...
$ limit_type : Factor w/ 3 levels "fill","serve",..: 1 1 1 1 2 2 2 2 3 3 ...
$ hw : Factor w/ 2 levels "1.16","1.21": 1 1 1 1 1 1 1 1 1 1 ...
$ fw : Factor w/ 2 levels "2017Q1.1.1","2016Q4.2.13": 1 1 1 1 1 1 1 1 1 1 ...
$ tcp_state : Factor w/ 6 levels "","closed","closing",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value_type : Factor w/ 2 levels "limit","": 1 1 1 1 1 1 1 1 2 2 ...
$ nic : Factor w/ 3 levels "all","","mce0": 1 1 1 1 1 1 1 1 2 2 ...
$ name : Factor w/ 8 levels "in_download_window",..: 1 1 1 1 1 1 1 1 2 2 ...
$ nvals....parse_values.d..values...: num 0 0 0 0 0 ...
>
> I'm fairly sure I can write code to do that, but it would end up being
> something like Perl implemented in R, which seems fairly grotesque: I
> can't help but think that there ought to be a ... more elegant approach
> in R (which is why I am sking for help).
>
> (I will also end up collecting all of the records for a given timestamp
> and hostname, and creating one very wide record with all of the data
> from the set of records thus found. I already have (yes, Perl) code to
> do this -- though if there's a reasonable way to avoid that, I'm
> interested.)
I thought you wanted the data in long form.
>
> Once that's all done, I'll be examining various columns, subsetting
> by attributes of the systems being compared -- but I already have
> code to do that (that makes use of a different -- and rather more
> fragile -- approach for extracting the data from its repository).
>
> Thanks!
>
> Peace,
> david
> --
> David H. Wolfskill r at catwhisker.org
> Epistemology for post-truthers: How do we select parts of reality to ignore?
>
> See http://www.catwhisker.org/~david/publickey.gpg for my public key.
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius
Alameda, CA, USA
More information about the R-help
mailing list