[R] subsetting large data frames.
Marc Schwartz
marc_schwartz at comcast.net
Sun Dec 7 20:04:18 CET 2008
on 12/07/2008 11:16 AM hesicaia wrote:
> Hi all,
> I have a question regarding subsetting of large data frames. I have two
> data frames “catches” and “tows” and they both have the same 30 variables
> (columns). I would like to select rows in the data frame “tows” where all 5
> specific variables are NOT matched in “catches. That is to say, the
> combination of these 5 variables is unique. One or more of the variables
> could be the same but the combination would be unique. This is confusing to
> explain so here is a short example to explain what I am trying to explain:
>
> Example data catches:
>
> Row Cruise Order Townumber Towtype Ship Netlocation Var1 Var2
> 1 22 1 4 A B S X1 X2
> 2 22 1 4 A B S X1 X2
> 3 22 1 4 BL AM S X1 X2
> 4 22 1 4 BL AM S X1 X2
> 5 260 1 4 BL B S X1 X2
> 6 260 1 4 BL B S X1
> X2
>
> Example data tows:
>
> Row Cruise Order Townumber Towtype Ship Netlocation Var1 Var2
> 1 22 1 4 A B S X1 X2
> 2 400 1 4 BL AM S X1 X2
> 3 260 1 4 BL B S X1 X2
> 4 260 10 10 BL B S X1 X2
> 5 22 99 4 BL B S X1 X2
>
> I would want to select rows 2, 4, and 5 from “tows” due to the fact that the
> same collection of “cruise”, ”order”, ”townumber”, ”towtype”, ”ship”, and
> ”netlocation” are not found in “catches”. All rows in data set “tows” are
> unique. Clear as mud? Sorry I couldn’t provide real data, but these datasets
> are quite large.
>
> So far I have tried:
>
> New<-tows[(tows$cruise != catches$cruise) & (tows$order != catches$order) &
> (tows$townumber != catches$townumber) & (tows$towtype != catches$towtype) &
> (tows$ship != catches$ship) & (tows$netlocation != catches$netlocation),]
>
> But this didn’t work.
> Thanks for your time and help (in advance).
> Dan.
Your statement above won't work, as there are a different number of rows
in each dataframe.
Such comparisons would be made on a row-by-row basis. Thus rather than
looking for any non-matching combinations across rows, you would be only
comparing row 1 with row 1, row 2 with row 2 and so on.
More generally, subset() is a better approach when using complicated
logicals in a subsetting operation.
With this situation, one approach would be to use merge() to check for
rows that do have matches in the relevant columns. Then take the Row
values (assuming that these are unique) that do not end up in the
merge()d dataset.
So, the first step is to use merge() to link the two dataframes based
upon the matching values and force all of the rows in tows to be
included via 'all.x = TRUE':
> merge(tows, catches, by = c("Cruise", "Order", "Townumber", "Towtype",
"Ship", "Netlocation"),
all.x = TRUE)
Cruise Order Townumber Towtype Ship Netlocation Row.x Var1.x Var2.x
1 22 1 4 A B S 1 X1 X2
2 22 1 4 A B S 1 X1 X2
3 22 99 4 BL B S 5 X1 X2
4 260 1 4 BL B S 3 X1 X2
5 260 1 4 BL B S 3 X1 X2
6 260 10 10 BL B S 4 X1 X2
7 400 1 4 BL AM S 2 X1 X2
Row.y Var1.y Var2.y
1 1 X1 X2
2 2 X1 X2
3 NA <NA> <NA>
4 5 X1 X2
5 6 X1 X2
6 NA <NA> <NA>
7 NA <NA> <NA>
Note that 'Row', since there is the same column in each dataframe, has a
'.x' and '.y' appended to denote the initial source dataframe.
Note further, that Row.y has NA values, for the rows in catches that did
not have matches in tows.
We can adjust the above to take only these two columns:
DF <- merge(tows, catches,
by = c("Cruise", "Order", "Townumber", "Towtype",
"Ship", "Netlocation"),
all.x = TRUE)[, c("Row.x", "Row.y")]
> DF
Row.x Row.y
1 1 1
2 1 2
3 5 NA
4 3 5
5 3 6
6 4 NA
7 2 NA
Now, get the values of Row.x as a vector, where Row.y is NA:
> subset(DF, is.na(Row.y))[[1]]
[1] 5 4 2
Thus:
> tows[subset(DF, is.na(Row.y))[[1]], ]
Row Cruise Order Townumber Towtype Ship Netlocation Var1 Var2
5 5 22 99 4 BL B S X1 X2
4 4 260 10 10 BL B S X1 X2
2 2 400 1 4 BL AM S X1 X2
See ?merge and ?subset
HTH,
Marc Schwartz
More information about the R-help
mailing list