[R] How might I work with a data.frame where each physical row represents several logical rows?
Jeff Newmiller
jdnewmil at dcn.davis.ca.us
Thu Jan 5 08:29:16 CET 2017
Below is my interpretation of one way to achieve your stated goal. I don't
know what analysis you plan to perform, but this seems unlikely to be be
my preferred analysis format (I think I would probably analyze subsets of
the records related to specific parts of the transactions).
library(dplyr)
library(tidyr)
fifteenminutes <- 15 * 60
# for education, show intermediate results
# strip out square brackets
View( test_data
%>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
)
# split the single column into multiple columns
View( test_data
%>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
%>% separate( values, paste0( "value", 0:3 ), ", *" )
)
# pull separate value columns into one column called value, with a new
# column vcol to hold the name of the original column
View( test_data
%>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
%>% separate( values, paste0( "value", 0:3 ), ", *" )
%>% gather( vcol, value, c( value0, value1, value2, value3 ) )
)
# create a timestamp column for the individual values
View( test_data
%>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
%>% separate( values, paste0( "value", 0:3 ), ", *" )
%>% gather( vcol, value, c( value0, value1, value2, value3 ) )
%>% mutate( timestamp = start
+ fifteenminutes
* as.numeric( sub( "value", "", vcol ) ) )
)
# remove the old vcol column now that timestamp column is created
View( test_data
%>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
%>% separate( values, paste0( "value", 0:3 ), ", *" )
%>% gather( vcol, value, c( value0, value1, value2, value3 ) )
%>% mutate( timestamp = start
+ fifteenminutes
* as.numeric( sub( "value", "", vcol ) ) )
%>% select( -vcol )
)
# unite several columns that currently distinguish various rows
View( test_data
%>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
%>% separate( values, paste0( "value", 0:3 ), ", *" )
%>% gather( vcol, value, c( value0, value1, value2, value3 ) )
%>% mutate( timestamp = start
+ fifteenminutes
* as.numeric( sub( "value", "", vcol ) ) )
%>% select( -vcol )
%>% unite( mname, mtype, nic, tcp_state, limit_type, value_type, name )
)
# spread values out into separate columns
test_data2 <- ( test_data
%>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
%>% separate( values, paste0( "value", 0:3 ), ", *" )
%>% gather( vcol, value, c( value0, value1, value2, value3 )
)
%>% mutate( timestamp = start + fifteenminutes * as.numeric(
sub( "value", "", vcol ) ) )
%>% select( -vcol )
%>% unite( mname, mtype, nic, tcp_state, limit_type,
value_type, name )
%>% spread( mname, value )
)
View( test_data2 )
On Wed, 4 Jan 2017, David Wolfskill wrote:
> On Wed, Jan 04, 2017 at 08:33:46PM -0800, David Winsemius wrote:
>> ...
>> Perhaps something like this:
>>
>> # function to read the values in 'values':
>> parse_values <- function(x) {scan(text= gsub( "\\[|\\]","",x), sep=",") }
>>
>> # the apply function reads line-by-line
>> new_dat <- apply(test_data, 1, function(d) data.frame( as.list(d[!names(d) %in% "values"]), nvals <- parse_values(d['values']) ) )
>
> Hmmm.... OK; that looks a lot better than the stuff that was coming to
> my mind -- thanks! :-)
>
>> ...
>> # Could suppress the report from scan by adding quiet = TRUE
>> # now take this list of 4 line data.frames and "rbind" them
>> # If you wanted these to remain character you would use stringsAsFactors=FALSE in the data.frame call
>>> new_df <- do.call("rbind", new_dat)
>
> Aye.
>
>> ...
>>> (I will also end up collecting all of the records for a given timestamp
>>> and hostname, and creating one very wide record with all of the data
>>> from the set of records thus found. I already have (yes, Perl) code to
>>> do this -- though if there's a reasonable way to avoid that, I'm
>>> interested.)
>>
>> I thought you wanted the data in long form.
>
> Sorry; I'm not understanding what you mean: My background is a lot more
> toward systems administration than statistical analysis.
>
> The repository I'm using has a rather large number of individual metrics
> from a given server -- each provided on a separate row. (That's why one
> of the columns is called "name" -- it provides the (base) "name" of the
> metric that corresponds to the "values" on the given row.) I'll plan to
> assemble the rows for a given server & timestamp into a single row --
> thuse, I would have the tcp_connection_count for the "last ACK" state
> and for the "fin_wait_2" state, as well as CpuSystem, CpuUser, CpuIdle,
> ... for the given server & timestamp on a single row (eventually).
>
>> ...
>
> Thanks again!
>
> Peace,
> david
> --
> David H. Wolfskill r at catwhisker.org
> Epistemology for post-truthers: How do we select parts of reality to ignore?
>
> See http://www.catwhisker.org/~david/publickey.gpg for my public key.
>
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
More information about the R-help
mailing list