[R] data.frame: data-driven column selections that vary by row??

John Kane jrkrideau at inbox.com
Tue Mar 31 17:11:28 CEST 2015


I think we need some data and code 
Reproducibility
https://github.com/hadley/devtools/wiki/Reproducibility
 http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example



John Kane
Kingston ON Canada


> -----Original Message-----
> From: r at catwhisker.org
> Sent: Mon, 30 Mar 2015 06:50:59 -0700
> To: r-help at r-project.org
> Subject: [R] data.frame: data-driven column selections that vary by row??
> 
> Sorry if that's confusing: I'm probably confused. :-(
> 
> I am collecting and trying to analyze data regarding performance of
> computer systems.
> 
> After extracting the data from its repository, I have created and
> used a Perl script to generate a (relatively) simple CSV, each
> record of which contains:
> * a POSIXct timestamp
> * a hostname
> * a collection of metrics for the interval identified by the timestamp,
>   and specific to the host in question, as well as some factors to
>   group the hosts (e.g., whether it's in a "control" vs. a "test"
>   group; a broad categorization of how the host is provisioned; which
>   version of the software it was running at the time...).  (Each
>   metric and factor is in a uniquely-named column.)
> 
> As extracted from the repository, there were several records for each
> such hostname/timestamp pair -- e.g., there would be separate records
> for:
> * Input bandwidth utilization for network interface 1
> * Output bandwidth utilization for network interface 1
> * Input bandwidth utilization for network interface 2
> * Output bandwidth utilization for network interface 2
> 
> (And the same field would be used for each of these -- the
> interpretation being driven by the content of other fields in teh
> record.)
> 
> Working with the data as described (immediately) above directly in R
> seemed... daunting, at best: thus the excursion into Perl.
> 
> And for some of the data, what I have works well enough.
> 
> But now I also want to analyze information from disk drives, and things
> get messy (as far as I can see).
> 
> First, each disk drive has a collection of 17 metrics (such as
> "busy_pct", "kb_per_transfer_read", and "transfers_per_second_write"),
> as well as a factor ("dev_type").  Each also has a device name that is
> unique within the host where it resides (e.g. "da1", "da2", "da3"....).
> (The "dev_type" factor identifies whether the drive is a solid-state
> device or a spinning disk.)
> 
> I have thus made the corresponding columns unique by pasting the drive
> name and the name of the metric (or factor), separating the two with
> "_" (e.g. "da7_busy_pct"; "ada0_mb_per_second_write";
> "ada4_queue_length").  I am not certain that's the best thing I could
> have done -- and I'm open to changing the approach.
> 
> The challenge for me is that different (classes of) machines are
> provisioned differently; some consequennces of that:
> * While da1 may be a spinning disk on host A, that has no bearing on
>   whether or not the "da1" on host B is a spinning disk or an SSD.
> * Host C may not even have a "da1" device.
> * Host D may be of a type that normally has a "da1," but in this case,
>   the drive has failed and has been disabled (so host D won't report
>   anything about "da1").
> 
> (I'm not too bothered about the "non-reporting" case, but cite it so we
> all know about it.)
> 
> I expect I will want to be using groupings:
> * All disk devices -- this one is easy.
> * All SSD devices (excluding spinning disks).
> * All spinning disks (excluding SSDs).
> 
> I'm having trouble with the latter two (though, certainly, if I solve
> one, the other is also solved).
> 
> Also, for some  of the metrics, I will want to sum them; for others,
> I will want to do other things -- find minima or maxima, or average
> them.  So pre-calculating such aggregates in the Perl script isn't
> something that appeals to me.
> 
> Finally (as far as complications go), I'm trying to write the code in
> such a way that if we deploy a new configuration of machine that has
> (say) twice as many drives as the biggest one we presently deploy, the
> code Just Works -- I shouldn't need to update the code merely to adapt
> to another hardware configuration.
> 
> I have been able to write a function that takes the data.frame obtained
> by reading the above-cited CSV, and generates a data.frame with a row
> for each host, and depicts the "dev_type" for each device for that host;
> here's an abbreviated (and slightly redacted) copy of its output to
> illustrate some of the above:
> 
>        ada0 ada1 ada2 ada3 ada4 ada5 da30 da31 da32 da33 da34 da35 da36
> da3
> host_A  ssd  ssd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd
> hdd
> host_B  ssd  ssd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd
> hdd
> host_G  ssd  ssd  ssd  ssd  ssd  ssd
> ssd
> host_H  ssd  ssd  ssd  ssd  ssd  ssd
> ssd
> host_M  ssd  ssd  ssd  ssd  ssd  ssd
> ssd
> host_N  ssd  ssd  ssd  ssd  ssd  ssd
> ssd
> 
> (That function is written with the explicit assumption(!) that for the
> period covered by a given set of input data, a given host's
> configuration remains static: we won't have drives changing type
> mid-stream.)
> 
> So the point of this lengthy(!) note is to ask if there's a
> somewhat-sane way to be able to group the metrics for the "ssd" devices
> (for example), given the above.
> 
> (So far, the least obnoxious way that comes to mind is to actually
> create 2 columns for each device metric: one for the device if it's an
> "ssd";l the other for "hdd" -- so instead of columns such as:
> * da3_busy_pct
> * da3_dev_type
> * da3_kb_per_transfer_read
> * da36_cam_timeouts
> * da36_dev_type
> * da36_mb_per_second_read
> 
> I would have:
> * da3_hdd_busy_pct
> * da3_ssd_busy_pct
> * da3_hdd_dev_type
> * da3_ssd_dev_type
> * da3_hdd_kb_per_transfer_read
> * da3_ssd_kb_per_transfer_read
> * da36_hdd_cam_timeouts
> * da36_ssd_cam_timeouts
> * da36_hdd_dev_type
> * da36_ssd_dev_type
> * da36_hdd_mb_per_second_read
> * da36_ssd_mb_per_second_read
> 
> and no more than half of those would actually be populated (depending on
> the content of "dev_type" when the Perl script is creating the CSV).
> 
> That seems rather hackish, though.
> 
> Thank you in advance for any insight.
> 
> Peace,
> david
> --
> David H. Wolfskill				r at catwhisker.org
> Those who murder in the name of God or prophet are blasphemous cowards.
> 
> See http://www.catwhisker.org/~david/publickey.gpg for my public key.

____________________________________________________________
Can't remember your password? Do you need a strong and secure password?
Use Password manager! It stores your passwords & protects your account.



More information about the R-help mailing list