[R] data.frame: data-driven column selections that vary by row??
David Wolfskill
r at catwhisker.org
Mon Mar 30 15:50:59 CEST 2015
Sorry if that's confusing: I'm probably confused. :-(
I am collecting and trying to analyze data regarding performance of
computer systems.
After extracting the data from its repository, I have created and
used a Perl script to generate a (relatively) simple CSV, each
record of which contains:
* a POSIXct timestamp
* a hostname
* a collection of metrics for the interval identified by the timestamp,
and specific to the host in question, as well as some factors to
group the hosts (e.g., whether it's in a "control" vs. a "test"
group; a broad categorization of how the host is provisioned; which
version of the software it was running at the time...). (Each
metric and factor is in a uniquely-named column.)
As extracted from the repository, there were several records for each
such hostname/timestamp pair -- e.g., there would be separate records
for:
* Input bandwidth utilization for network interface 1
* Output bandwidth utilization for network interface 1
* Input bandwidth utilization for network interface 2
* Output bandwidth utilization for network interface 2
(And the same field would be used for each of these -- the
interpretation being driven by the content of other fields in teh
record.)
Working with the data as described (immediately) above directly in R
seemed... daunting, at best: thus the excursion into Perl.
And for some of the data, what I have works well enough.
But now I also want to analyze information from disk drives, and things
get messy (as far as I can see).
First, each disk drive has a collection of 17 metrics (such as
"busy_pct", "kb_per_transfer_read", and "transfers_per_second_write"),
as well as a factor ("dev_type"). Each also has a device name that is
unique within the host where it resides (e.g. "da1", "da2", "da3"....).
(The "dev_type" factor identifies whether the drive is a solid-state
device or a spinning disk.)
I have thus made the corresponding columns unique by pasting the drive
name and the name of the metric (or factor), separating the two with
"_" (e.g. "da7_busy_pct"; "ada0_mb_per_second_write";
"ada4_queue_length"). I am not certain that's the best thing I could
have done -- and I'm open to changing the approach.
The challenge for me is that different (classes of) machines are
provisioned differently; some consequennces of that:
* While da1 may be a spinning disk on host A, that has no bearing on
whether or not the "da1" on host B is a spinning disk or an SSD.
* Host C may not even have a "da1" device.
* Host D may be of a type that normally has a "da1," but in this case,
the drive has failed and has been disabled (so host D won't report
anything about "da1").
(I'm not too bothered about the "non-reporting" case, but cite it so we
all know about it.)
I expect I will want to be using groupings:
* All disk devices -- this one is easy.
* All SSD devices (excluding spinning disks).
* All spinning disks (excluding SSDs).
I'm having trouble with the latter two (though, certainly, if I solve
one, the other is also solved).
Also, for some of the metrics, I will want to sum them; for others,
I will want to do other things -- find minima or maxima, or average
them. So pre-calculating such aggregates in the Perl script isn't
something that appeals to me.
Finally (as far as complications go), I'm trying to write the code in
such a way that if we deploy a new configuration of machine that has
(say) twice as many drives as the biggest one we presently deploy, the
code Just Works -- I shouldn't need to update the code merely to adapt
to another hardware configuration.
I have been able to write a function that takes the data.frame obtained
by reading the above-cited CSV, and generates a data.frame with a row
for each host, and depicts the "dev_type" for each device for that host;
here's an abbreviated (and slightly redacted) copy of its output to
illustrate some of the above:
ada0 ada1 ada2 ada3 ada4 ada5 da30 da31 da32 da33 da34 da35 da36 da3
host_A ssd ssd hdd hdd hdd hdd hdd hdd hdd hdd hdd hdd hdd hdd
host_B ssd ssd hdd hdd hdd hdd hdd hdd hdd hdd hdd hdd hdd hdd
host_G ssd ssd ssd ssd ssd ssd ssd
host_H ssd ssd ssd ssd ssd ssd ssd
host_M ssd ssd ssd ssd ssd ssd ssd
host_N ssd ssd ssd ssd ssd ssd ssd
(That function is written with the explicit assumption(!) that for the
period covered by a given set of input data, a given host's
configuration remains static: we won't have drives changing type
mid-stream.)
So the point of this lengthy(!) note is to ask if there's a
somewhat-sane way to be able to group the metrics for the "ssd" devices
(for example), given the above.
(So far, the least obnoxious way that comes to mind is to actually
create 2 columns for each device metric: one for the device if it's an
"ssd";l the other for "hdd" -- so instead of columns such as:
* da3_busy_pct
* da3_dev_type
* da3_kb_per_transfer_read
* da36_cam_timeouts
* da36_dev_type
* da36_mb_per_second_read
I would have:
* da3_hdd_busy_pct
* da3_ssd_busy_pct
* da3_hdd_dev_type
* da3_ssd_dev_type
* da3_hdd_kb_per_transfer_read
* da3_ssd_kb_per_transfer_read
* da36_hdd_cam_timeouts
* da36_ssd_cam_timeouts
* da36_hdd_dev_type
* da36_ssd_dev_type
* da36_hdd_mb_per_second_read
* da36_ssd_mb_per_second_read
and no more than half of those would actually be populated (depending on
the content of "dev_type" when the Perl script is creating the CSV).
That seems rather hackish, though.
Thank you in advance for any insight.
Peace,
david
--
David H. Wolfskill r at catwhisker.org
Those who murder in the name of God or prophet are blasphemous cowards.
See http://www.catwhisker.org/~david/publickey.gpg for my public key.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 949 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20150330/984d379b/attachment.bin>
More information about the R-help
mailing list