[R] Newbie wants to compare 2 huge RDSs row by row.

Marsh Hardy ARA/RISK mhardy at ara.com
Sun Jan 28 17:22:20 CET 2018


Thanks, I think I've found the most succinct expression of differences in two data.frames...

length(which( rowSums( x1 != x2 ) > 0))

gives a count of the # of records in two data.frames that do not match.
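
One caveat with this count: if either data.frame contains NA, x1 != x2 is NA in that cell, rowSums() returns NA for the row, and which() silently drops it, so rows with missing values are never counted. A minimal sketch of an NA-aware variant follows, assuming x1 and x2 have the same dimensions and column order. The toy data and the names neq, miss and mismatch_rows are made up for illustration.

```r
# Toy data (hypothetical): row 2 differs in X, row 3 has an NA on one side
# only, row 4 has an NA in the same cell on both sides.
x1 <- data.frame(X = c(1, 2, NA, 4), Y = c("A", "B", "C", NA),
                 stringsAsFactors = FALSE)
x2 <- data.frame(X = c(1, 5, 3, 4), Y = c("A", "B", "C", NA),
                 stringsAsFactors = FALSE)

neq  <- x1 != x2          # logical matrix, NA wherever either side is NA
miss <- is.na(neq)

# Treat an NA on one side only as a mismatch, an NA on both sides as a match.
neq[miss] <- (is.na(x1) != is.na(x2))[miss]

mismatch_rows <- which(rowSums(neq) > 0)
length(mismatch_rows)     # count of rows that do not match
```

Whether an NA on both sides should count as a match is a judgment call; the sketch treats it as one.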

// 
________________________________________
From: Henrik Bengtsson [henrik.bengtsson at gmail.com]
Sent: Sunday, January 28, 2018 11:12 AM
To: Ulrik Stervbo
Cc: Marsh Hardy ARA/RISK; r-help at r-project.org
Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.

The diffobj package (https://cran.r-project.org/package=diffobj) is
really helpful here.  It provides "diff" functions diffPrint(),
diffStr(), and diffChr() to compare two objects 'x' and 'y' and provide
neat colorized summary output.

Example:

> iris2 <- iris
> iris2[122:125,4] <- iris2[122:125,4] + 0.1

> diffobj::diffPrint(iris2, iris)
< iris2
> iris
@@ 121,8 / 121,8 @@
~     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
  120          6.0         2.2          5.0         1.5  virginica
  121          6.9         3.2          5.7         2.3  virginica
< 122          5.6         2.8          4.9         2.1  virginica
> 122          5.6         2.8          4.9         2.0  virginica
< 123          7.7         2.8          6.7         2.1  virginica
> 123          7.7         2.8          6.7         2.0  virginica
< 124          6.3         2.7          4.9         1.9  virginica
> 124          6.3         2.7          4.9         1.8  virginica
< 125          6.7         3.3          5.7         2.2  virginica
> 125          6.7         3.3          5.7         2.1  virginica
  126          7.2         3.2          6.0         1.8  virginica
  127          6.2         2.8          4.8         1.8  virginica

What's not shown here is that the colored output (supported by many
terminals these days) also highlights exactly which elements in those
rows differ.

/Henrik

On Sun, Jan 28, 2018 at 12:17 AM, Ulrik Stervbo <ulrik.stervbo at gmail.com> wrote:
> The anti_join from the package dplyr might also be handy.
>
> install.packages("dplyr")
> library(dplyr)
> anti_join(x1, x2)  # rows of x1 that have no match in x2
>
> You can get help on any function with ?function.name, so
> ?anti_join will bring up help - and examples - for the anti_join
> function.
>
> It might be worth testing your approach on a small subset of the data. That
> makes it easier for you to follow what happens and evaluate the outcome.
>
> HTH
> Ulrik
>
> Marsh Hardy ARA/RISK <mhardy at ara.com> wrote on Sun, Jan 28, 2018, 04:14:
>
>> Cool, looks like that'd do it, almost as if converting an entire record to
>> a character string and comparing strings.
>>
>> ________________________________________
>> From: William Dunlap [wdunlap at tibco.com]
>> Sent: Saturday, January 27, 2018 4:57 PM
>> To: Marsh Hardy ARA/RISK
>> Cc: Ulrik Stervbo; Eric Berger; r-help at r-project.org
>> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
>>
>> If your two objects have class "data.frame" (look at class(objectName))
>> and they
>> both have the same number of columns and the same order of columns and the
>> column types match closely enough (use all.equal(x1, x2) for that), then
>> you can try
>>      which( rowSums( x1 != x2 ) > 0)
>> E.g.,
>> > x1 <- data.frame(X=1:5, Y=rep(c("A","B"),c(3,2)))
>> > x2 <- data.frame(X=c(1,2,-3,-4,5), Y=rep(c("A","B"),c(2,3)))
>> > x1
>>   X Y
>> 1 1 A
>> 2 2 A
>> 3 3 A
>> 4 4 B
>> 5 5 B
>> > x2
>>    X Y
>> 1  1 A
>> 2  2 A
>> 3 -3 B
>> 4 -4 B
>> 5  5 B
>> > which( rowSums( x1 != x2 ) > 0)
>> [1] 3 4
>>
>> If you want to allow small numeric differences but require exact
>> character matches, you will have to get a bit fancier.  Splitting the
>> data.frames into character and numeric parts and comparing each works
>> well.
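
The split described above could be sketched as follows. This is only an illustration, not Bill's code: the helper name rows_that_differ and the default tolerance are hypothetical, and it assumes both data.frames have the same dimensions and column order.

```r
# Hypothetical helper: indices of rows where x1 and x2 differ, allowing an
# absolute tolerance `tol` on numeric columns and requiring exact matches
# on all other columns (compared as character strings).
rows_that_differ <- function(x1, x2, tol = 1e-8) {
  stopifnot(identical(dim(x1), dim(x2)))
  differs <- matrix(FALSE, nrow = nrow(x1), ncol = ncol(x1))
  for (j in seq_along(x1)) {
    a <- x1[[j]]
    b <- x2[[j]]
    if (is.numeric(a) && is.numeric(b)) {
      d <- abs(a - b) > tol
    } else {
      d <- as.character(a) != as.character(b)
    }
    # An NA on one side only is a difference; an NA on both sides is not.
    miss <- is.na(d)
    d[miss] <- (is.na(a) != is.na(b))[miss]
    differs[, j] <- d
  }
  which(rowSums(differs) > 0)
}

# Usage with the x1/x2 example from this message, plus a tiny numeric
# perturbation in row 1 that should be ignored.
x1 <- data.frame(X = 1:5, Y = rep(c("A", "B"), c(3, 2)))
x2 <- data.frame(X = c(1 + 1e-12, 2, -3, -4, 5), Y = rep(c("A", "B"), c(2, 3)))
rows_that_differ(x1, x2)
```

An absolute tolerance is the simplest choice; a relative tolerance, as all.equal() uses, may suit columns with very different scales better.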
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Sat, Jan 27, 2018 at 1:18 PM, Marsh Hardy ARA/RISK <mhardy at ara.com> wrote:
>> Hi Guys, I apologize for my rank & utter newness at R.
>>
>> I used summary() and found about 95 variables, both character and numeric,
>> all with "Length:368842", which I assume is the # of records.
>>
>> I'd like to know the record number (row #?) of any record where the data
>> doesn't match in the 2 files of what should be the same output.
>>
>> Thanks in advance, M.
>>
>> //
>> ________________________________________
>> From: Ulrik Stervbo [ulrik.stervbo at gmail.com]
>> Sent: Saturday, January 27, 2018 10:00 AM
>> To: Eric Berger
>> Cc: Marsh Hardy ARA/RISK; r-help at r-project.org
>> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
>>
>> Also, it will be easier to provide helpful information if you'd describe
>> what in your data you want to compare and what you hope to get out of the
>> comparison.
>>
>> Best wishes,
>> Ulrik
>>
>> Eric Berger <ericjberger at gmail.com> wrote on Sat, Jan 27, 2018, 08:18:
>> Hi Marsh,
>> An RDS is not a data structure such as a data.frame. It can be anything.
>> For example if I want to save my objects a, b, c I could do:
>> > saveRDS( list(a,b,c), file="tmp.RDS")
>> Then read them back later with
>> > myList <- readRDS( "tmp.RDS" )
>>
>> Do you have additional information about your "RDSs" ?
>>
>> Eric
>>
>>
>> On Sat, Jan 27, 2018 at 6:54 AM, Marsh Hardy ARA/RISK <mhardy at ara.com> wrote:
>>
>> > Each RDS is 40 MBs. What's a slick code to compare them row by row, IDing
>> > row numbers with mismatches?
>> >
>> > Thanks in advance.
>> >
>> > //
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>


