[R] problem with duplicated function
Bert Gunter
gunter.berton at gene.com
Sun May 24 23:55:43 CEST 2015
I have NOT looked at your code in detail -- I might have if you had
used dput() to make available small subsets of your data frames that
exhibited the problems. However, the following, from ?duplicated,
sounds like it may be relevant:
"When used on a data frame with more than one column, or an array or
matrix when comparing dimensions of length greater than one, this
tests for identity of character representations. This will catch
people who unwisely rely on exact equality of floating-point numbers!"
Cheers,
Bert
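
P.S. To illustrate the passage from ?duplicated quoted above, here is a
minimal sketch on made-up data. The names d07 and d08, the coordinate
values, and the choice of 4 decimal places are all assumptions for
illustration, not taken from your data frames. The idea is that coordinates
which print identically can still differ in later digits, so duplicated()
flags nothing until the comparison is made at an explicit precision.

```r
# Hypothetical toy data: coordinates that agree to 4 decimals but not exactly.
d07 <- data.frame(Xcor = c(-111.0441234, -111.0432345),
                  Ycor = c(41.7403111, 41.7766222))
d08 <- data.frame(Xcor = c(-111.0441239, -111.0439999),
                  Ycor = c(41.7403114, 41.7517333))

tt <- rbind(d07, d08)

# Comparison on the stored values: no row is flagged, although the first
# rows of d07 and d08 look the same when printed with default digits.
dup_raw <- duplicated(tt)

# Compare on a key built from coordinates formatted to 4 decimal places
# (the 4 is an assumption; use the precision your data actually supports).
key <- sprintf("%.4f_%.4f", tt$Xcor, tt$Ycor)
dup_key <- duplicated(key)

# Keep only the flags belonging to the d08 rows, then subset d08.
in_d07 <- dup_key[-seq_len(nrow(d07))]
matched <- d08[in_d07, ]
```

Note that this also flags a d08 row that merely repeats an earlier d08 row.
If that matters, building the keys separately for each data frame and using
key08 %in% key07 may be the safer test.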
Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374
"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll
On Sun, May 24, 2015 at 2:34 PM, Curtis Burkhalter
<curtisburkhalter at gmail.com> wrote:
> Hello everyone,
>
> I have two very large dataframes (~1 million rows x 5 columns), of which
> two of the columns are lat/long coordinates. The names of the dataframes
> are 'data07' and 'data08'. data08 has a few more sampling points than
> data07, so I want to subset data08 so that it has the same number of data
> points as data07 using the unique lat/long coordinates.
>
> Here are the associated data structures:
>
> *str(data07)*
> 'data.frame': 969109 obs. of 5 variables:
> $ cell : int 710228 715545 720690 720824 695611 700490 700626 705371 705507 710363 ...
> $ prN : int 288 276 286 304 258 257 264 272 286 316 ...
> $ Location: Factor w/ 32 levels " ","Blacks_Fork",..: 24 24 24 24 24 24 24 24 24 24 ...
> $ Xcor : num -111 -111 -111 -111 -111 ...
> $ Ycor : num 41.7 41.7 41.7 41.7 41.8 ...
>
> *str(data08)*
> 'data.frame': 969810 obs. of 5 variables:
> $ cell : int 705528 710321 710456 715677 720762 720896 699953 700635 700771 705664 ...
> $ prN : int 293 281 299 278 276 266 282 255 287 280 ...
> $ Location: Factor w/ 31 levels "Blacks_Fork",..: 23 23 23 23 23 23 23 23 23 23 ...
> $ Xcor : num -111 -111 -111 -111 -111 ...
> $ Ycor : num 41.8 41.7 41.7 41.7 41.7 ...
>
> I've tried using the following code to accomplish my problem:
>
> tt <- rbind(data07, data08)
>
> tt.dup <- duplicated(tt[,4:5]) # marks all duplicate rows in data08 from
> # the last 2 cols, which correspond to the lat/long
>
> tt.dup <- tt.dup[-seq_len(nrow(data07))] # remove all data07 entries (first n)
>
> test=data08[tt.dup, ] # index only TRUE/duplicated elements from data08
>
> When I run the code 'tt.dup' is FALSE for all entries, which I know isn't
> true.
>
> Here's a small subset of the data so that you can see exactly where there
> are duplicates
>
> data07[1:10,]
> cell prN Location Xcor Ycor
> 710229 *710228 288 Sage -111.044 41.7403*
> 715546 *715545 276 Sage -111.044 41.7245*
> 720691 *720690 286 Sage -111.044 41.7131*
> 720825 *720824 304 Sage -111.044 41.7109*
> 695612 695611 258 Sage -111.043 41.7766
> 700491 700490 257 Sage -111.043 41.7653
> 700627 700626 264 Sage -111.043 41.7630
> 705372 705371 272 Sage -111.043 41.7517
> 705508 705507 286 Sage -111.043 41.7495
> 710364 710363 316 Sage -111.043 41.7381
>
> data08[1:10,]
> cell prN Location Xcor Ycor
> 705529 705528 293 Sage -111.044 41.7517
> 710322 *710321 281 Sage -111.044 41.7403*
> 710457 710456 299 Sage -111.044 41.7381
> 715678 *715677 278 Sage -111.044 41.7245*
> 720763 *720762 276 Sage -111.044 41.7131*
> 720897 *720896 266 Sage -111.044 41.7109*
> 699954 699953 282 Sage -111.043 41.7767
> 700636 700635 255 Sage -111.043 41.7653
> 700772 700771 287 Sage -111.043 41.7631
> 705665 705664 280 Sage -111.043 41.7495
>
>
> If anyone has any suggestions as to where I might be going wrong I'd
> greatly appreciate it.
>
> Thank you
>
>
>
>
> --
> Curtis Burkhalter
> Postdoctoral Research Associate, Audubon Rockies
>
> https://sites.google.com/site/curtisburkhalter/
>