[R] Need help to read the data file like this

David Winsemius dwinsemius at comcast.net
Sun Jul 14 19:57:46 CEST 2013


On Jul 14, 2013, at 9:48 AM, Houhou Li wrote:

> Hi,
> 
> I have several really big data files in csv format like this: the first line is the header; the second to fourth lines hold info about the file and are the lines I need to skip (the data in lines 2-4 do not correspond to the variable names in the header); the real data begins on the fifth line; and the last line is not a data line but the string "Done" instead of a normal EOF character. All data is numeric. I tried read.table(), read.csv() with colClasses="numeric", and scan(), but couldn't make them work. Can anyone help me? How can I get rid of the last line "Done" automatically? I would like to do this in an R script, not by formatting in Excel and then reading back into R. Thank you very much. Here is an example of the data:

Deleting the last line in Excel would not make sense unless this is already data in Excel. Better would be to use a text editor, which is less likely to corrupt the data.

> 
> Tag,X,Y,BlobRegion,swaths,fr_int_20,fr_int_60,i60,RawTothgt,RawHtlc,RawRad20,RawRad40,RawRad60,RawRad80,CCV,BlobPerim,n_pts,n_pts_i255,vts,vts2,vtg,home,sum_ht,sum_ht_sq,dcch,dcch2,nb_ccv,n_nb,nb_sum_hts,nb_sum_hts2,z_tip_dist,nb_MassLen,n_f_rtns20,n_f_rtns60,max_fl_pt_count,loreyrawht,p00ile_cm,p25ile_cm,p50ile_cm,p75ile_cm,iq25,iq50,iq75,mean_intns
> 01_24_2013.001,SF12
>         5413
>    509627.82,  4869704.98,   509999.83,  4869999.98
> 123,509692.55,4869856.64,18,0,80.53,81.03,84,36.2100,17.1521,4.0359,4.0359,3.8881,2.9217,1737.13,31.42,210,210,0.828,0.955,0.281,28.50,5746.46,163727.12,0.764,1.000,1147.23,33,769.16,19024.42,0.01,0.09,174,163,174,34.90,140,2369,2849,3157,33,81,110,71.59
> 159,509679.19,4869855.54,18,0,77.62,78.97,75,30.4000,11.2000,2.5319,2.5129,2.3365,1.8315,3248.82,21.42,90,90,0.877,0.936,0.589,22.91,2000.74,46861.45,0.691,0.999,1772.06,14,365.47,10233.32,0.04,0.68,81,66,81,33.29,905,1869,2272,2633,55,82,98,71.62

Read the first line with readLines() using n=1, saving it as 'colnams'.
Read the data with dat <- read.table( ...  using skip=4, sep=",", and fill=TRUE.
Delete the last row, which holds "Done" and a large number of NAs.
Then assign the names: names(dat) <- scan(text=colnams, what=character(0), sep="," )

(Tested. Expected results achieved.)
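
The steps above might be put together roughly as follows. This is only a sketch: the temporary file, its shortened four-column contents, and the final conversion step are assumptions added for illustration and are not part of the original data or the tested code. The conversion is there because the "Done" entry is likely to make the first column come in as character (or factor) rather than numeric.

```r
# Hypothetical stand-in for one of the real files (only 4 columns, made up for the sketch).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Tag,X,Y,BlobRegion",
  "01_24_2013.001,SF12",
  "        5413",
  "   509627.82,  4869704.98,   509999.83,  4869999.98",
  "123,509692.55,4869856.64,18",
  "159,509679.19,4869855.54,18",
  "Done"
), path)

# Step 1: the header is the first line of the file.
colnams <- readLines(path, n = 1)

# Step 2: skip the header plus the three info lines.
# fill = TRUE pads the short "Done" row with NAs instead of raising an error.
dat <- read.table(path, skip = 4, sep = ",", fill = TRUE,
                  stringsAsFactors = FALSE)

# Step 3: drop the trailing row that holds "Done".
dat <- dat[-nrow(dat), ]

# Step 4: split the saved header line on commas and use it for the names.
names(dat) <- scan(text = colnams, what = character(0), sep = ",", quiet = TRUE)

# Assumed extra step: force every column back to numeric.
dat[] <- lapply(dat, function(x) as.numeric(as.character(x)))

str(dat)
```

For the real files, 'path' would be replaced by the actual file name; wrapping the steps in a function and calling it over a vector of file names would handle several files in one go.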
-- 
David


David Winsemius
Alameda, CA, USA


