[R] Drop matching lines from readLines

Mike Marchywka marchywka at hotmail.com
Thu Oct 14 13:05:00 CEST 2010

----------------------------------------
> From: santosh.srinivas at gmail.com
> To: r-help at r-project.org
> Date: Thu, 14 Oct 2010 11:27:57 +0530
> Subject: [R] Drop matching lines from readLines
>
> Dear R-group,
>
> I have some noise in my text file (coding issues!) ... I imported a 200 MB
> text file using readlines
> Used grep to find the lines with the error?
>
> What is the easiest way to drop those lines? I plan to write back the
> "cleaned" data set to my base file.

For text processing I generally use utilities external to R, although
there may be R alternatives that work better for you. You mention grep;
I have suggested sed before as a general way to fix formatting problems,
and there is also "uniq" on Linux or Cygwin, which drops adjacent
duplicate lines. I have gotten into the habit of using these tools for a
variety of data manipulation tasks, so that only clean data is fed into R.

$ echo -e a bc\\na bc
a bc
a bc

$ echo -e a bc\\na bc | uniq
a bc
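
For the original question (dropping the lines that grep found), something
like the following may be enough. This is only a sketch: the file names
and the pattern 'ERR' are made up for illustration, so substitute
whatever pattern located the bad lines in your file.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmp=$(mktemp -d)

# Hypothetical sample data: lines containing "ERR" stand in for the noise.
printf 'good 1\nERR bad line\ngood 2\n' > "$tmp/data.txt"

# grep -v prints only the lines that do NOT match the pattern.
grep -v 'ERR' "$tmp/data.txt" > "$tmp/cleaned_grep.txt"

# sed equivalent: delete (d) every line that matches the pattern.
sed '/ERR/d' "$tmp/data.txt" > "$tmp/cleaned_sed.txt"

cat "$tmp/cleaned_grep.txt"
```

Writing to a new file and checking it before replacing the original is
probably safer than editing a 200 MB file in place.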

$ uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.

Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences
  -d, --repeated        only print duplicate lines
  -D, --all-repeated[=delimit-method]  print all duplicate lines
                        delimit-method={none(default),prepend,separate}
                        Delimiting is done with blank lines
  -f, --skip-fields=N   avoid comparing the first N fields
  -i, --ignore-case     ignore differences in case when comparing
  -s, --skip-chars=N    avoid comparing the first N characters
  -u, --unique          only print unique lines
  -z, --zero-terminated  end lines with 0 byte, not newline
  -w, --check-chars=N   compare no more than N characters in lines
      --help     display this help and exit
      --version  output version information and exit

A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters.  Fields are skipped before chars.

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use `sort -u' without `uniq'.
Also, comparisons honor the rules specified by `LC_COLLATE'.
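
To illustrate the note above about adjacency, here is a small sketch with
made-up input in which the duplicate lines are separated by another line.
Plain uniq leaves all three lines alone, while sorting first removes the
repeat. Keep in mind that sorting reorders the lines, which may not be
acceptable if the order of your data matters.

```shell
# Scratch directory and hypothetical input with non-adjacent duplicates.
tmp=$(mktemp -d)
printf 'a bc\nx yz\na bc\n' > "$tmp/in.txt"

# uniq alone: the two "a bc" lines are not adjacent, so nothing is dropped.
plain=$(uniq "$tmp/in.txt")

# Sort first so duplicates become adjacent, then filter them.
sorted=$(sort "$tmp/in.txt" | uniq)

# sort -u does the same thing in a single step.
sorted_u=$(sort -u "$tmp/in.txt")

printf '%s\n' "$sorted"
```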

>
> Thanks.
