[R] reading a text file with a stray carriage return

Joerg van den Hoff j.van_den_hoff at fzd.de
Thu Mar 8 13:14:32 CET 2007


On Thu, Mar 08, 2007 at 10:11:06AM -0000, Ted Harding wrote:
> On 08-Mar-07 jim holtman wrote:
> > How do you define a carriage return in the middle of a line
> > if a carriage return is also used to delimit a line?  One of the
> > things you can do is to use 'count.fields' to determine the
> > number of fields in each line. For those lines that are not the
> > right length, you could combine them together with a 'paste'
> > command when you write them out.
> 
>... 
> cat funnyfile.csv | awk 'BEGIN{FS=","}
>   {nline = NR; nfields = NF}
>   {if(nfields != 3){print nline " " nfields}}'

I think this might be more 'awk like':

cat funnyfile.csv | awk 'BEGIN{FS=","} NF != 3 {print NR, NF}'

(separating the "pattern" from the "action", I mean; it's also slightly
faster, though not by much). And since everyone runs into the same problems,
here's my fieldcounter/reporter (the actual awk program, apart from reporting
the results, is a single line -- cute...).
#################################################################################
#!/bin/sh
#scan input files and output statistics of number of fields.
#suitable mainly to check 'rectangular' data files for deviations
#from correct number of columns.

FS='[ \t]+'  #default FS
COMMENT='#'

v1=F
v2=c
valop=$v1:$v2:
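#parse the command line switches: -F sets the field separator, -c the comment character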
while getopts $valop option; do
   if   [ "$option" = "$v1" ]; then
      FS="$OPTARG"
   elif   [ "$option" = "$v2" ]; then
      COMMENT="$OPTARG"
   else
      echo ""
      echo "usage: fieldcount [-$v1 field_separator] [-$v2 comment_char] file ..."
      echo "default FS is white space"
      echo "default comment_char is #"
      echo ""
      exit
   fi
done
shift `expr $OPTIND - 1`

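#the remaining arguments are the data files to scan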
for i do
   echo ""
   echo "file $i (using FS = '$FS'):"
   awk '
      BEGIN {
         FS = "'"$FS"'"
      }
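      #count how many records have each number of fields; comment lines are skipped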
      !/^'"$COMMENT"'/ {fc[NF]++}
      END {
         for (nf in fc) {
            if (fc[nf] > 1) s1 = "s"; else s1 = ""
            if (nf > 1)     s2 = "s"; else s2 = ""
            print fc[nf], "record" s1, "with", nf, "field" s2
         }
      }
   ' "$i" | sort -rn
   echo  ""
done
#################################################################################
Here, you could use it like this (white space is the default field separator):

fieldcount -F"," funnyfile1.csv funnyfile2.csv ...

It's handy for huge files, since it outputs cumulative results sorted in
decreasing order of frequency of occurrence (the bad guys should be infrequent
and therefore end up at the bottom, contrary to real-world experience...).

One could of course tweak this to report only the records deviating from some
expected field count (say 3), by adding a further command-line switch and
splicing it into the awk script the same way the other shell variables ($FS
and $COMMENT) are used.
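
For example, something along these lines (an untested sketch; the -n switch,
the EXPECT default and the name "fieldcheck" are just made up for
illustration):
#################################################################################
#!/bin/sh
#fieldcheck: report only the records whose field count differs from an
#expected value.

FS='[ \t]+'  #default FS
COMMENT='#'  #default comment character
EXPECT=3     #default expected field count

while getopts F:c:n: option; do
   case $option in
      F) FS="$OPTARG" ;;
      c) COMMENT="$OPTARG" ;;
      n) EXPECT="$OPTARG" ;;
      *) echo "usage: fieldcheck [-F field_separator] [-c comment_char] [-n expected_fields] file ..."
         exit 1 ;;
   esac
done
shift `expr $OPTIND - 1`

for i do
   #again, the shell variables are spliced into the awk program text
   awk '
      BEGIN {
         FS = "'"$FS"'"
      }
      !/^'"$COMMENT"'/ && NF != '"$EXPECT"' {
         print FILENAME ": record " NR " has " NF " fields"
      }
   ' "$i"
done
#################################################################################
You would then call it with something like

fieldcheck -F"," -n3 funnyfile1.csv funnyfile2.csv ...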

joerg

> 
> 
> > On 3/7/07, Walter R. Paczkowski <dataanalytics at earthlink.net> wrote:
> >>
> >>
> >>   Hi,
> >>   I'm hoping someone has a suggestion for handling a simple problem.
> >>   A client gave me a comma separated value file (call it x.csv)
> >>   that has an id and name and address for about 25,000 people
> >>   (25,000 records).
> >>   I used read.table to read it, but then discovered that there are
> >>   stray carriage returns on several records. This plays havoc with
> >>   read.table since it starts a new input line when it sees the
> >>   carriage return.
> >>   In short, the read is all wrong.
> >>   I thought I could write a simple function to parse a line and
> >>   write it back out, character by character. If a carriage
> >>   return is found, it would simply be ignored on the writing
> >>   back out part. But how do I identify a carriage return? What is
> >>   the code or symbol? Is there any easier way to rid the file
> >>   of carriage returns in the middle of the input lines?
> >>   Any help is appreciated.
> >>   Walt Paczkowski
>


