[R] How might I work with a data.frame where each physical row represents several logical rows?
David Winsemius
dwinsemius at comcast.net
Thu Jan 5 06:45:10 CET 2017
> On Jan 4, 2017, at 8:51 PM, David Wolfskill <r at catwhisker.org> wrote:
>
> On Wed, Jan 04, 2017 at 08:33:46PM -0800, David Winsemius wrote:
>> ...
>> Perhaps something like this:
>>
>> # function to read the values in 'values':
>> parse_values <- function(x) {scan(text= gsub( "\\[|\\]","",x), sep=",") }
>>
>> # the apply function reads line-by-line
>> new_dat <- apply(test_data, 1, function(d) data.frame( as.list(d[!names(d) %in% "values"]), nvals <- parse_values(d['values']) ) )
>
> Hmmm.... OK; that looks a lot better than the stuff that was coming to
> my mind -- thanks! :-)
>
>> ...
>> # Could suppress the report from scan by adding quiet = TRUE
>> # now take this list of 4 line data.frames and "rbind" them
>> # If you wanted these to remain character you would use stringsAsFactors=FALSE in the data.frame call
>>> new_df <- do.call("rbind", new_dat)
>
> Aye.
>
>> ...
>>> (I will also end up collecting all of the records for a given timestamp
>>> and hostname, and creating one very wide record with all of the data
>>> from the set of records thus found. I already have (yes, Perl) code to
>>> do this -- though if there's a reasonable way to avoid that, I'm
>>> interested.)
>>
>> I thought you wanted the data in long form.
>
> Sorry; I'm not understanding what you mean: My background is a lot more
> toward systems administration than statistical analysis.
>
> The repository I'm using has a rather large number of individual metrics
> from a given server -- each provided on a separate row. (That's why one
> of the columns is called "name" -- it provides the (base) "name" of the
> metric that corresponds to the "values" on the given row.) I'll plan to
> assemble the rows for a given server & timestamp into a single row --
> thuse, I would have the tcp_connection_count for the "last ACK" state
> and for the "fin_wait_2" state, as well as CpuSystem, CpuUser, CpuIdle,
> ... for the given server & timestamp on a single row (eventually).
If you wanted it in wide form, you could just join the individual id columns to a named list made from the parsed 'values'.
> new_dat <- apply(test_data, 1, function(d) data.frame( as.list(d[!names(d) %in% "values"]), as.list( setNames( parse_values(d['values']), paste0( "V", 1:4) ) ) ) )
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
> new_df <- do.call("rbind", new_dat)
>
> str(new_df)
'data.frame': 15 obs. of 14 variables:
$ start : Factor w/ 1 level "1482793200": 1 1 1 1 1 1 1 1 1 1 ...
$ hostname : Factor w/ 2 levels "c001.example.net",..: 1 1 1 1 1 1 1 1 1 1 ...
$ mtype : Factor w/ 3 levels "health","net",..: 1 1 2 2 2 2 2 2 2 3 ...
$ limit_type: Factor w/ 3 levels "fill","serve",..: 1 2 3 3 3 3 3 3 3 3 ...
$ hw : Factor w/ 2 levels "1.16","1.21": 1 1 1 1 1 1 1 1 1 1 ...
$ fw : Factor w/ 2 levels "2017Q1.1.1","2016Q4.2.13": 1 1 1 1 1 1 1 1 1 1 ...
$ tcp_state : Factor w/ 6 levels "","closed","closing",..: 1 1 1 1 2 3 4 5 6 1 ...
$ value_type: Factor w/ 2 levels "limit","": 1 1 2 2 2 2 2 2 2 2 ...
$ nic : Factor w/ 3 levels "all","","mce0": 1 1 2 2 2 2 2 2 2 2 ...
$ name : Factor w/ 8 levels "in_download_window",..: 1 1 2 3 4 4 4 4 4 5 ...
$ V1 : num 0 0 260411 18436 5 ...
$ V2 : num 0 0 258470 18249 3 ...
$ V3 : num 0 0 260579 18202 3 ...
$ V4 : num 0 0 258763 17818 3 ...
>
>> ...
>
> Thanks again!
>
> Peace,
> david
> --
> David H. Wolfskill r at catwhisker.org
> Epistemology for post-truthers: How do we select parts of reality to ignore?
>
> See http://www.catwhisker.org/~david/publickey.gpg for my public key.
David Winsemius
Alameda, CA, USA
More information about the R-help
mailing list