[R] problem with duplicated function

Rolf Turner r.turner at auckland.ac.nz
Mon May 25 00:35:56 CEST 2015


On 25/05/15 09:34, Curtis Burkhalter wrote:
> Hello everyone,
>
> I have two very large dataframes (~1 million rows x 5 columns), of which
> two of the columns are lat/long coordinates. The names of the dataframes
> are 'data07' and 'data08'. data08 has a few more sampling points than
> data07, so I want to subset data08 so that it has the same number of data
> points as data07 using the unique lat/long coordinates.
>
> Here are the associated data structures:
>
> str(data07)
> 'data.frame':   969109 obs. of  5 variables:
>   $ cell    : int  710228 715545 720690 720824 695611 700490 700626 705371 705507 710363 ...
>   $ prN     : int  288 276 286 304 258 257 264 272 286 316 ...
>   $ Location: Factor w/ 32 levels " ","Blacks_Fork",..: 24 24 24 24 24 24 24 24 24 24 ...
>   $ Xcor    : num  -111 -111 -111 -111 -111 ...
>   $ Ycor    : num  41.7 41.7 41.7 41.7 41.8 ...
>
> str(data08)
> 'data.frame':   969810 obs. of  5 variables:
>   $ cell    : int  705528 710321 710456 715677 720762 720896 699953 700635 700771 705664 ...
>   $ prN     : int  293 281 299 278 276 266 282 255 287 280 ...
>   $ Location: Factor w/ 31 levels "Blacks_Fork",..: 23 23 23 23 23 23 23 23 23 23 ...
>   $ Xcor    : num  -111 -111 -111 -111 -111 ...
>   $ Ycor    : num  41.8 41.7 41.7 41.7 41.7 ...
>
> I've tried using the following code to do this:
>
> tt <- rbind(data07, data08)
>
> tt.dup <- duplicated(tt[,4:5]) # marks all duplicate rows in data08 from the
>                                # last 2 cols, which correspond to the lat/long


Using the two 10-row subsets you give below, I get tt.dup to be:

>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
> [13] FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE

>
> tt.dup <- tt.dup[-seq_len(nrow(data07))] # remove all data07 entries (first n)

This just throws away the first 10 entries of tt.dup, leaving

>  [1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE

>
> test=ddata08[tt.dup, ] # index only TRUE/duplicated elements from data08
        ^ (typo: presumably data08)

This selects rows c(2,4,5,6,8,10) of data08.
>
> When I run the code 'tt.dup' is FALSE for all entries, which I know isn't
> true.

Only 4 of the entries of tt.dup are FALSE; 6 are TRUE.  I don't 
understand why you think that they are all FALSE.
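
For anyone following along, the result above can be reproduced from the two 10-row subsets quoted further down. This is only a minimal sketch: just the Xcor and Ycor columns are typed in (so duplicated() is applied to the whole data frame rather than to columns 4:5), and the values are the printed, rounded ones, which may not be exactly what is stored in the real data.

```r
# Coordinates copied from the data07[1:10,] and data08[1:10,] printouts.
# Only the two coordinate columns are reproduced; values are as printed.
data07 <- data.frame(
  Xcor = c(rep(-111.044, 4), rep(-111.043, 6)),
  Ycor = c(41.7403, 41.7245, 41.7131, 41.7109, 41.7766,
           41.7653, 41.7630, 41.7517, 41.7495, 41.7381)
)
data08 <- data.frame(
  Xcor = c(rep(-111.044, 6), rep(-111.043, 4)),
  Ycor = c(41.7517, 41.7403, 41.7381, 41.7245, 41.7131,
           41.7109, 41.7767, 41.7653, 41.7631, 41.7495)
)

tt <- rbind(data07, data08)
tt.dup <- duplicated(tt)                   # both columns are coordinates here
tt.dup <- tt.dup[-seq_len(nrow(data07))]   # keep only the data08 part
which(tt.dup)                              # rows of data08 also found in data07
test <- data08[tt.dup, ]                   # note: data08, not ddata08
```

With these inputs which(tt.dup) gives 2 4 5 6 8 10, matching the six TRUE entries listed above.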

Perhaps your subsets do not accurately reflect the actual nature of your 
data.
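
One way the full data could behave differently from the printed subsets (this is a guess, not something established in this thread) is that the stored coordinates carry more decimal places than print() shows, so that values which look equal are not exactly equal. The sketch below uses two made-up coordinates to illustrate the effect; the helper key() and the choice of 4 decimal places are illustrative assumptions, not part of the original code.

```r
# Hypothetical values: they print the same at 4 decimals but are not identical.
a <- data.frame(Xcor = -111.0441, Ycor = 41.74031)
b <- data.frame(Xcor = -111.0441, Ycor = 41.74029)

duplicated(rbind(a, b))        # FALSE FALSE: exact comparison sees two points

# Build a text key from rounded coordinates; 4 digits is an assumed tolerance.
key <- function(d, digits = 4) {
  paste(round(d$Xcor, digits), round(d$Ycor, digits))
}
b[key(b) %in% key(a), ]        # the row of b is now matched to a
```

Rounding is a blunt tool: two values that fall on either side of a rounding boundary will still fail to match, so this is better treated as a diagnostic than as a fix.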

cheers,

Rolf Turner

>
> Here's a small subset of the data so that you can see exactly where there
> are duplicates
>
> data07[1:10,]
>                   cell prN Location     Xcor    Ycor
> 710229 *710228 288     Sage -111.044 41.7403*
> 715546 *715545 276     Sage -111.044 41.7245*
> 720691 *720690 286     Sage -111.044 41.7131*
> 720825 *720824 304     Sage -111.044 41.7109*
> 695612 695611 258     Sage -111.043 41.7766
> 700491 700490 257     Sage -111.043 41.7653
> 700627 700626 264     Sage -111.043 41.7630
> 705372 705371 272     Sage -111.043 41.7517
> 705508 705507 286     Sage -111.043 41.7495
> 710364 710363 316     Sage -111.043 41.7381
>
>   data08[1:10,]
>                   cell prN Location     Xcor    Ycor
> 705529 705528 293     Sage -111.044 41.7517
> 710322 *710321 281     Sage -111.044 41.7403*
> 710457 710456 299     Sage -111.044 41.7381
> 715678 *715677 278     Sage -111.044 41.7245*
> 720763 *720762 276     Sage -111.044 41.7131*
> 720897 *720896 266     Sage -111.044 41.7109*
> 699954 699953 282     Sage -111.043 41.7767
> 700636 700635 255     Sage -111.043 41.7653
> 700772 700771 287     Sage -111.043 41.7631
> 705665 705664 280     Sage -111.043 41.7495
>
>
> If anyone has any suggestions as to where I might be going wrong I'd
> greatly appreciate it.
>
> Thank you


-- 
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
Home phone: +64-9-480-4619


