[R] Optimization problem: selecting independent rows to maximize the mean

Wed Mar 1 21:40:27 CET 2006

Dear R community,

I have a dataframe with 500,000 rows and 102 columns. The rows
represent spatial polygons, some of which overlap others (i.e., not
all rows are independent of each other).

Given a particular row, the first column contains a unique "RowID".
The second column contains the "Variable" of interest. The remaining
100 columns ("Overlap1" ... "Overlap100") each contain the RowID of a
row that overlaps this one (if a row overlaps fewer than 100 other
rows, its remaining "Overlap" columns contain NA).
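
To make the layout concrete, here is a toy version with 6 rows and 3
overlap columns instead of 100. The values are invented, and the names
"polys" and "neighbours" are just placeholders I made up for this
example. It also shows one possible way to turn the overlap columns
into a per-row list of overlapping RowIDs:

```r
# Toy version of the layout: 6 polygons, 3 overlap columns instead of 100.
# All values are invented for illustration.
polys <- data.frame(
  RowID    = 1:6,
  Variable = c(9.1, 8.7, 8.5, 7.9, 7.2, 6.4),
  Overlap1 = c(2, 1, 2, NA, 6, 5),
  Overlap2 = c(NA, 3, NA, NA, NA, NA),
  Overlap3 = NA_integer_
)

# Pull the overlap columns into a numeric matrix, then keep the non-NA
# IDs of each row as one list element, named by RowID.
overlap_cols <- grep("^Overlap", names(polys))
m <- as.matrix(polys[, overlap_cols])
neighbours <- lapply(seq_len(nrow(m)), function(i) {
  x <- m[i, ]
  unname(x[!is.na(x)])
})
names(neighbours) <- polys$RowID

neighbours[["2"]]   # rows overlapping row 2: 1 and 3
neighbours[["4"]]   # row 4 overlaps nothing: empty vector
```

In this toy data, rows 1 and 2 overlap, rows 2 and 3 overlap, rows 5 and
6 overlap, and row 4 is independent of everything else.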

Here's the problem: I need to select the subset of 500 independent
(mutually non-overlapping) rows that maximizes the mean and minimizes
the standard deviation of "Variable".

Clearly this requires iterative selection and comparison of rows,
because each newly-selected row must be compared to rows already
selected to ensure it does not overlap them. At each step, a row
already selected might be removed from the subset if it can be
replaced with another that increases the mean and/or reduces the
standard deviation.
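
The simplest version of this I can picture is a single greedy pass:
take rows in decreasing order of "Variable" and skip any row that
overlaps one already chosen. The rough sketch below does only that. It
ignores the standard deviation entirely, never swaps a chosen row back
out, and is not guaranteed to find the best subset. The function name
"greedy_select" and the toy data are made up for illustration.

```r
# Greedy sketch (mean only, no swapping): walk the rows from largest to
# smallest value and keep each row that does not overlap a kept row.
greedy_select <- function(ids, values, neighbours, k) {
  blocked <- logical(length(ids))   # TRUE once a row is kept or overlaps a kept row
  chosen  <- integer(0)             # positions of the kept rows
  for (i in order(values, decreasing = TRUE)) {
    if (length(chosen) >= k) break
    if (blocked[i]) next
    chosen <- c(chosen, i)
    blocked[i] <- TRUE
    hit <- match(neighbours[[i]], ids)    # positions of the overlapping rows
    blocked[hit[!is.na(hit)]] <- TRUE     # ignore IDs not present in 'ids'
  }
  ids[chosen]
}

# Invented toy data: rows 1-2, 2-3 and 5-6 overlap; row 4 overlaps nothing.
ids        <- 1:6
values     <- c(9.1, 8.7, 8.5, 7.9, 7.2, 6.4)
neighbours <- list(2, c(1, 3), 2, numeric(0), 6, 5)

picked <- greedy_select(ids, values, neighbours, k = 3)
picked                          # rows 1, 3, 4
mean(values[match(picked, ids)])
```

Row 2 has the second-largest value but is skipped because it overlaps
row 1, which was already taken.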

The above description is a simplification of my problem, but it's a start.

As I am new to R (and programming in general) I'm not sure how to
start thinking about this, or even where to look. I'd appreciate any
ideas that might help.

Thank you, Mark