[R] Row exclude

Avi Gross avigross at verizon.net
Sun Jan 30 04:56:46 CET 2022


Val,

In this special case, unique() is extra overhead of sorts.

In a normal positive scenario, dat[c(1,2,3,2,3,1), ] would return duplicate rows as many times as they are indexed.

In the similar but different negative scenario, dat[-c(1,2,3,2,3,1), ], the behaviour is different: you can subtract a row once and then subtract it again and again with no further effect. It is like building a set and then removing values from it if they exist, and doing nothing if they don't.
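
For instance, a tiny toy data frame (not your dat1, just an illustration) shows the difference:

d <- data.frame(x = 1:4, y = letters[1:4])
d[c(1, 2, 2, 1), ]     # positive: rows 1 and 2 each come back twice
d[-c(1, 2, 2, 1), ]    # negative: rows 1 and 2 are dropped just once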

Now in your 7 row case, there was one problem in each of three rows with no overlap, but the following will remove only rows 2 and 3 no matter how often they appear:

> dat1[-c(2,3,3,2,3,2,3,2,3,2),]
   Name   Age Weight
1  Alex    20    13X
4  John   3BC    175
5  Katy    35    160
6 Jack3    34    140

Order and even redundancies do not matter, so no unique is needed. I suspect the overhead of a few duplicates is minor. 

But having said that, I believe the implementation of unique() does not use tons of resources, as it returns the results in the same order it sees them and does not do a sort:

> unique(c(6,2,3,3,2,3,2,3,2,3,2,6))
[1] 6 2 3

So in the positive case, where you often want one copy of everything, it may make sense to use it and even to sort the result afterward.
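
For instance, building on the example above:

> sort(unique(c(6,2,3,3,2,3,2,3,2,3,2,6)))
[1] 2 3 6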

You were not clear about your need to handle large amounts of data, so some suggested solutions may scale up better than others. You still have not shared whether you plan on using more columns, or whether any columns are meant to be ignored or checked in yet another way. You also did not specify what, if anything, should be done if a column entry is NA, or possibly a few other things like Inf.
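
If those cases matter, the usual checks would be something like this (just an illustration, not part of anyone's suggested solution):

x <- c(20, NA, Inf, 35)
is.na(x)        # TRUE only for the NA
is.finite(x)    # FALSE for both the NA and the Inf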

The suggestions you got are not offered as working well if you add additional requirements and constraints. As another example, in R you can enter numbers, including integers, in several other ways:

> mixed <- c(111, 222L, 0x29A, 1E3)
> mixed
[1]  111  222  666 1000

I changed your dat1 to look like this:

dat1 <- read.table(text = "Name, Age, Weight
Alex,  20,  13
Bob,   25,  222L
Carol, 24,  0x29A
John,  3BC, 1E3
Katy,  35,  160
Jack3, 34,  140",
  sep = ",", header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)

I also stripped whitespace and it looks like this:

> dat1
   Name Age Weight
1  Alex  20     13
2   Bob  25   222L
3 Carol  24  0x29A
4  John 3BC    1E3
5  Katy  35    160
6 Jack3  34    140

Now look at column 3, Weight. Since it is all read in as character, it retained those constructs, but will R be able to convert them if I ask for as.integer()?

> as.integer(dat1$Weight)
[1]   13   NA  666 1000  160  140
Warning message:
NAs introduced by coercion 

It looks like adding an L fails but the hexadecimal notation works and so does the scientific notation.

So you have every right to insist that the only things entered in a numeric column are digits from 0 to 9. Realistically, anyone entering the forms above, as well as numbers like 3.14, would otherwise work fine and, if converted to an integer, would survive, albeit the latter would be truncated. Writing a regular expression that matches all of these is not straightforward but can be done. But the suggestions people made assume you are restricting it to standard decimal notation, and that is fine.
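
For what it is worth, the decimal-only check that assumption implies could look something like this sketch (the pattern and the ok_weight name are just for illustration):

ok_weight <- grepl("^[0-9]+$", dat1$Weight)    # TRUE only for plain digits
dat1$Weight[!ok_weight]                        # "222L" "0x29A" "1E3"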

Now on to size. If you have a million rows in your data, the various algorithms proposed vary in what they do in between, and some build data structures of integers or logicals with up to or close to a million entries. But the nature of your requirement turns out to be amenable to working in smaller batches if that makes you happy. You could operate on something like a thousand rows at a time, calculate which rows to keep, and set them aside. Repeat a thousand at a time until done, merging what you are keeping along the way. The memory used for each batch is limited and reused regularly, and garbage collection deals with the rest. You do not even have to do the concatenation incrementally if you simply keep track of your offset at the start of each batch, add that offset back to the vector of indices of rows to keep, and keep extending that vector. At the end, you can use it to index all the data and shrink it.
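
A rough sketch of that batching idea, assuming a hypothetical keep_rows() that returns the indices of the good rows within a batch:

batch_size <- 1000
keep <- integer(0)
n <- nrow(dat)
for (offset in seq(0, n - 1, by = batch_size)) {
  rows  <- (offset + 1):min(offset + batch_size, n)
  batch <- dat[rows, ]
  good  <- keep_rows(batch)        # hypothetical check, returns indices within this batch
  keep  <- c(keep, offset + good)  # add the offset back to get row numbers in dat
}
result <- dat[keep, ]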

There are many such variations if you want to trade off memory or CPU time, although you can probably handle amounts like a million rows, but not billions, on most machines. For much larger sets of data, you could read in lines (or read everything at once but process one row at a time) and selectively write the good rows back to disk or to a database, one at a time or in small batches. Then you can delete all current uses of memory and read the cleaned data in again, this time asking read.table to read certain columns as integers, or even smaller integers, and the text column as character. If lots of rows are being skipped, you now have a smaller memory footprint for the additional operations that presumably follow.
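
The re-read step might look something like this, assuming the kept rows were written out to a hypothetical clean.csv with the same three columns:

dat2 <- read.table("clean.csv", sep = ",", header = TRUE,
                   colClasses = c("character", "integer", "integer"),
                   stringsAsFactors = FALSE)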

Good luck.



-----Original Message-----
From: Val <valkremk at gmail.com>
To: David Carlson <dcarlson at tamu.edu>
Cc: r-help at R-project.org (r-help at r-project.org) <r-help at r-project.org>
Sent: Sat, Jan 29, 2022 9:32 pm
Subject: Re: [R] Row exclude

Thank you David for your help.

I just have one question on this. What is the purpose of  using the
"unique" function on this?
  (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])

I got the same result without using it.
       (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ])

My concern is when I am applying this for the large data set the "unique"
function may consume resources(time  and memory).

Thank you.
