[R] How might I work with a data.frame where each physical row represents several logical rows?

David Winsemius dwinsemius at comcast.net
Thu Jan 5 06:45:10 CET 2017


> On Jan 4, 2017, at 8:51 PM, David Wolfskill <r at catwhisker.org> wrote:
> 
> On Wed, Jan 04, 2017 at 08:33:46PM -0800, David Winsemius wrote:
>> ...
>> Perhaps something like this:
>> 
>> # function to read the values in 'values':
>> parse_values <- function(x) {scan(text= gsub( "\\[|\\]","",x), sep=",") }
>> 
>> # the apply function reads line-by-line
>> new_dat <- apply(test_data, 1, function(d) data.frame( as.list(d[!names(d)  %in% "values"]), nvals <- parse_values(d['values']) ) )
> 
> Hmmm.... OK; that looks a lot better than the stuff that was coming to
> my mind -- thanks! :-)
> 
>> ...
>> # Could suppress the report from scan by adding quiet = TRUE
>> # now take this list of 4 line data.frames and "rbind" them
>> # If you wanted these to remain character you would use stringsAsFactors=FALSE in the data.frame call
>>> new_df <- do.call("rbind", new_dat)
> 
> Aye.
> 
>> ...
>>> (I will also end up collecting all of the records for a given timestamp
>>> and hostname, and creating one very wide record with all of the data
>>> from the set of records thus found.  I already have (yes, Perl) code to
>>> do this -- though if there's a reasonable way to avoid that, I'm
>>> interested.)
>> 
>> I thought you wanted the data in long form.
> 
> Sorry; I'm not understanding what you mean: My background is a lot more
> toward systems administration than statistical analysis.
> 
> The repository I'm using has a rather large number of individual metrics
> from a given server -- each provided on a separate row.  (That's why one
> of the columns is called "name" -- it provides the (base) "name" of the
> metric that corresponds to the "values" on the given row.)  I'll plan to
> assemble the rows for a given server & timestamp into a single row --
> thuse, I would have the tcp_connection_count for the "last ACK" state
> and for the "fin_wait_2" state, as well as CpuSystem, CpuUser, CpuIdle,
> ... for the given server & timestamp on a single row (eventually).

If you wanted it in wide form, you could just join the individual id columns to a named list made from the parsed 'values'.

> new_dat <- apply(test_data, 1, function(d) data.frame( as.list(d[!names(d)  %in% "values"]), as.list( setNames( parse_values(d['values']), paste0( "V", 1:4) ) ) ) )
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
> new_df <- do.call("rbind", new_dat)
> 
> str(new_df)
'data.frame':	15 obs. of  14 variables:
 $ start     : Factor w/ 1 level "1482793200": 1 1 1 1 1 1 1 1 1 1 ...
 $ hostname  : Factor w/ 2 levels "c001.example.net",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ mtype     : Factor w/ 3 levels "health","net",..: 1 1 2 2 2 2 2 2 2 3 ...
 $ limit_type: Factor w/ 3 levels "fill","serve",..: 1 2 3 3 3 3 3 3 3 3 ...
 $ hw        : Factor w/ 2 levels "1.16","1.21": 1 1 1 1 1 1 1 1 1 1 ...
 $ fw        : Factor w/ 2 levels "2017Q1.1.1","2016Q4.2.13": 1 1 1 1 1 1 1 1 1 1 ...
 $ tcp_state : Factor w/ 6 levels "","closed","closing",..: 1 1 1 1 2 3 4 5 6 1 ...
 $ value_type: Factor w/ 2 levels "limit","": 1 1 2 2 2 2 2 2 2 2 ...
 $ nic       : Factor w/ 3 levels "all","","mce0": 1 1 2 2 2 2 2 2 2 2 ...
 $ name      : Factor w/ 8 levels "in_download_window",..: 1 1 2 3 4 4 4 4 4 5 ...
 $ V1        : num  0 0 260411 18436 5 ...
 $ V2        : num  0 0 258470 18249 3 ...
 $ V3        : num  0 0 260579 18202 3 ...
 $ V4        : num  0 0 258763 17818 3 ...



> 
>> ...
> 
> Thanks again!
> 
> Peace,
> david
> -- 
> David H. Wolfskill				r at catwhisker.org
> Epistemology for post-truthers: How do we select parts of reality to ignore?
> 
> See http://www.catwhisker.org/~david/publickey.gpg for my public key.

David Winsemius
Alameda, CA, USA



More information about the R-help mailing list