[R] problem with duplicated function
Curtis Burkhalter
curtisburkhalter at gmail.com
Sun May 24 23:34:13 CEST 2015
Hello everyone,
I have two very large data frames (~1 million rows x 5 columns), two
columns of which are lat/long coordinates. The data frames are named
'data07' and 'data08'. data08 has a few more sampling points than data07,
so I want to subset data08 to the points whose lat/long coordinates also
occur in data07, so that both have the same number of data points.
Here are the associated data structures:
str(data07)
'data.frame': 969109 obs. of 5 variables:
 $ cell    : int 710228 715545 720690 720824 695611 700490 700626 705371 705507 710363 ...
 $ prN     : int 288 276 286 304 258 257 264 272 286 316 ...
 $ Location: Factor w/ 32 levels " ","Blacks_Fork",..: 24 24 24 24 24 24 24 24 24 24 ...
 $ Xcor    : num -111 -111 -111 -111 -111 ...
 $ Ycor    : num 41.7 41.7 41.7 41.7 41.8 ...
str(data08)
'data.frame': 969810 obs. of 5 variables:
 $ cell    : int 705528 710321 710456 715677 720762 720896 699953 700635 700771 705664 ...
 $ prN     : int 293 281 299 278 276 266 282 255 287 280 ...
 $ Location: Factor w/ 31 levels "Blacks_Fork",..: 23 23 23 23 23 23 23 23 23 23 ...
 $ Xcor    : num -111 -111 -111 -111 -111 ...
 $ Ycor    : num 41.8 41.7 41.7 41.7 41.7 ...
I've tried the following code to do this:
tt <- rbind(data07, data08)
# marks all duplicate rows in data08 from the last 2 cols,
# which correspond to the lat/long
tt.dup <- duplicated(tt[, 4:5])
# remove all data07 entries (first n)
tt.dup <- tt.dup[-seq_len(nrow(data07))]
# index only TRUE/duplicated elements from data08
test <- data08[tt.dup, ]
When I run the code, 'tt.dup' is FALSE for every entry, which I know
can't be right.
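One thing that may be worth ruling out is that the coordinates only look
identical when printed. The sketch below uses made-up toy values (not the
real data) to show how two numbers can display the same at R's default
print precision and still fail an exact comparison, which is what
duplicated() relies on. The choice of 4 decimals for rounding is an
assumption for illustration only.

```r
# Toy values, NOT taken from data07/data08: they differ only in the
# 7th decimal place.
x07 <- -111.0440001
x08 <- -111.0440002

print(x07)   # displayed with 7 significant digits by default
print(x08)   # displays the same as x07

# Exact comparison sees the hidden difference.
same_exact <- x07 == x08
dup_exact  <- duplicated(c(x07, x08))

# After rounding to a fixed number of decimals (4 is an assumption),
# the second value is flagged as a duplicate of the first.
dup_round <- duplicated(round(c(x07, x08), 4))

same_exact
dup_exact
dup_round
```

If something like this is happening in the real data, printing a few
coordinates with print(data07$Xcor[1:5], digits = 15) should reveal the
extra digits.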
Here's a small subset of the data so that you can see exactly where there
are duplicates (marked with asterisks):
data07[1:10,]
cell prN Location Xcor Ycor
710229 *710228 288 Sage -111.044 41.7403*
715546 *715545 276 Sage -111.044 41.7245*
720691 *720690 286 Sage -111.044 41.7131*
720825 *720824 304 Sage -111.044 41.7109*
695612 695611 258 Sage -111.043 41.7766
700491 700490 257 Sage -111.043 41.7653
700627 700626 264 Sage -111.043 41.7630
705372 705371 272 Sage -111.043 41.7517
705508 705507 286 Sage -111.043 41.7495
710364 710363 316 Sage -111.043 41.7381
data08[1:10,]
cell prN Location Xcor Ycor
705529 705528 293 Sage -111.044 41.7517
710322 *710321 281 Sage -111.044 41.7403*
710457 710456 299 Sage -111.044 41.7381
715678 *715677 278 Sage -111.044 41.7245*
720763 *720762 276 Sage -111.044 41.7131*
720897 *720896 266 Sage -111.044 41.7109*
699954 699953 282 Sage -111.043 41.7767
700636 700635 255 Sage -111.043 41.7653
700772 700771 287 Sage -111.043 41.7631
705665 705664 280 Sage -111.043 41.7495
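As a rough sketch of an alternative to the rbind/duplicated approach, the
rows of data08 could be matched against data07 through a text key built
from rounded coordinates. The example below types in a few rows from the
printouts above; 'coord_key', 's07' and 's08' are hypothetical names, and
the 3 and 4 decimal places are assumptions based on the printed precision,
not on the true resolution of the data.

```r
# A few rows from the printouts above, typed in by hand
# (only cell and the coordinate columns are kept).
s07 <- data.frame(
  cell = c(710228, 715545, 720690, 720824, 695611),
  Xcor = c(-111.044, -111.044, -111.044, -111.044, -111.043),
  Ycor = c(41.7403, 41.7245, 41.7131, 41.7109, 41.7766)
)
s08 <- data.frame(
  cell = c(705528, 710321, 710456, 715677, 720762, 720896),
  Xcor = rep(-111.044, 6),
  Ycor = c(41.7517, 41.7403, 41.7381, 41.7245, 41.7131, 41.7109)
)

# Hypothetical helper: one text key per row, with the coordinates
# formatted to a fixed number of decimals (an assumption, see above).
coord_key <- function(d) sprintf("%.3f_%.4f", d$Xcor, d$Ycor)

# TRUE for rows of s08 whose coordinate key also occurs in s07.
in07 <- coord_key(s08) %in% coord_key(s07)

# Keep only the matching rows of s08.
s08_matched <- s08[in07, ]
s08_matched$cell
# 710321 715677 720762 720896
```

On this small sample the cells kept are the ones marked with asterisks in
the data08 printout. Whether the same key works on the full data depends
on how many decimals the coordinates really agree to.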
If anyone has any suggestions as to where I might be going wrong I'd
greatly appreciate it.
Thank you
--
Curtis Burkhalter
Postdoctoral Research Associate, Audubon Rockies
https://sites.google.com/site/curtisburkhalter/