[R] Drop matching lines from readLines
Bert Gunter
gunter.berton at gene.com
Thu Oct 14 17:55:41 CEST 2010
If I understand correctly, the poster knows what regex error pattern
to look for, in which case (modulo memory capacity -- but 200 MB should
not be a problem, I think) isn't something like

cleanData <- dirtyData[!grepl("errorPatternregex", dirtyData)]

sufficient?
Cheers,
Bert
On Thu, Oct 14, 2010 at 4:05 AM, Mike Marchywka <marchywka at hotmail.com> wrote:
>
> ----------------------------------------
>> From: santosh.srinivas at gmail.com
>> To: r-help at r-project.org
>> Date: Thu, 14 Oct 2010 11:27:57 +0530
>> Subject: [R] Drop matching lines from readLines
>>
>> Dear R-group,
>>
>> I have some noise in my text file (encoding issues!). I imported a 200 MB
>> text file using readLines() and used grep() to find the lines with the
>> error.
>>
>> What is the easiest way to drop those lines? I plan to write the
>> "cleaned" data set back to my base file.
>
> Generally for text processing, I've been using utilities external to R,
> although there may be R alternatives that work better for you. You
> mention grep; I've suggested sed as a general way to fix formatting issues,
> and there is also a utility called "uniq" on Linux or Cygwin.
> I have gotten into the habit of using these for a variety of data
> manipulation tasks, and only feeding clean data into R.
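
That pre-cleaning step can be sketched in the shell; `dirty.txt` and the
pattern `ERROR` below are placeholders for the poster's actual file and
error regex:

```shell
# Build a small sample file: two good lines around one bad one.
printf 'good line 1\nERROR: bad encoding\ngood line 2\n' > dirty.txt

# grep -v drops every line matching the pattern; write the survivors
# to a new file rather than overwriting the original in place.
grep -v 'ERROR' dirty.txt > clean.txt

cat clean.txt
# good line 1
# good line 2
```

Writing to a separate `clean.txt` keeps the original intact in case the
pattern turns out to match more than intended.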
>
> $ echo -e a bc\\na bc
> a bc
> a bc
>
> $ echo -e a bc\\na bc | uniq
> a bc
>
> $ uniq --help
> Usage: uniq [OPTION]... [INPUT [OUTPUT]]
> Filter adjacent matching lines from INPUT (or standard input),
> writing to OUTPUT (or standard output).
>
> With no options, matching lines are merged to the first occurrence.
>
> Mandatory arguments to long options are mandatory for short options too.
> -c, --count prefix lines by the number of occurrences
> -d, --repeated only print duplicate lines
> -D, --all-repeated[=delimit-method] print all duplicate lines
> delimit-method={none(default),prepend,separate}
> Delimiting is done with blank lines
> -f, --skip-fields=N avoid comparing the first N fields
> -i, --ignore-case ignore differences in case when comparing
> -s, --skip-chars=N avoid comparing the first N characters
> -u, --unique only print unique lines
> -z, --zero-terminated end lines with 0 byte, not newline
> -w, --check-chars=N compare no more than N characters in lines
> --help display this help and exit
> --version output version information and exit
>
> A field is a run of blanks (usually spaces and/or TABs), then non-blank
> characters. Fields are skipped before chars.
>
> Note: 'uniq' does not detect repeated lines unless they are adjacent.
> You may want to sort the input first, or use `sort -u' without `uniq'.
> Also, comparisons honor the rules specified by `LC_COLLATE'.
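
The adjacency caveat in that note is easy to reproduce:

```shell
# uniq only merges *adjacent* duplicates, so a non-adjacent repeat survives:
printf 'a\nb\na\n' | uniq
# a
# b
# a

# Sorting first brings duplicates together; sort -u does both steps at once:
printf 'a\nb\na\n' | sort -u
# a
# b
```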
>
>>
>> Thanks.
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Bert Gunter
Genentech Nonclinical Biostatistics