[R] Preprocessing troublesome files in R - looking for some perl like functionality
Duncan Murdoch
murdoch at stats.uwo.ca
Thu Jun 2 16:28:14 CEST 2005
Andy Bunn wrote:
> Hi all:
>
> I have acquired hundreds of data files that I need to preprocess to get them
> usable in R. The files are fixed width (to a point) and contain 1 to 3 lines
> of header, followed by a variable number of fixed width data lines (that I
> can read with read.fwf). I want to read through the files and remove every
> _line_ where the characters in columns 83-86 do not equal "STD". If I can do that
> and store it in a text file, then I can get the data I need using read.fwf.
> I can't figure out how to do this because of the irregularity of the header
> info buried in the file. It seems like the kind of thing perl or emacs would
> be good at but I'd like to do it all in R if possible. Any pointers
> appreciated.
It seems to me that a couple of passes through read.fwf might work. On
the first pass, define three fields: one spanning columns 1 to 82, a
second spanning columns 83 to 86, and a third running from column 87 to
the longest possible line width, all read as class "character". Read
each file using this format, select rows based on the second field, and
write out the selected lines, or use the result as input to a
textConnection for the second pass.
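
As a rough illustration of the select-and-write-out step, here is a minimal sketch. It swaps the first read.fwf pass for readLines() and substr(), which avoids having to guess the longest possible line width, and then hands the surviving lines to read.fwf through a textConnection. The helper name keep_tagged_lines, its arguments, and the synthetic demo lines are assumptions made for illustration; they are not taken from the real files.

```r
# Hypothetical helper: keep only the lines whose columns first..last
# (83-86 by default) contain `tag` once surrounding blanks are trimmed.
keep_tagged_lines <- function(path, tag = "STD", first = 83L, last = 86L) {
  lines <- readLines(path, warn = FALSE)
  # substr() returns "" for lines shorter than `first`, so short lines
  # fail the comparison and are dropped rather than causing an error.
  field <- trimws(substr(lines, first, last))
  lines[field == tag]
}

# Synthetic demo data: 82 characters of left-justified text plus a tag.
# This is NOT the real file layout, just enough to exercise the helper.
pad <- function(text, tag) sprintf("%-82s%s", text, tag)
demo <- tempfile(fileext = ".txt")
writeLines(c(pad("header A", "RAW"), pad("data 1", "RAW"),
             pad("header B", "STD"), pad("data 2", "STD"),
             "short line"),
           demo)

kept <- keep_tagged_lines(demo)

# Either write the selected lines to a new text file ...
out <- tempfile(fileext = ".txt")
writeLines(kept, out)

# ... or feed them straight to read.fwf via a textConnection.
# The widths here match the synthetic demo only.
con <- textConnection(kept)
dat <- read.fwf(con, widths = c(82, 4), colClasses = "character")
close(con)
```

For the real files, the widths passed to read.fwf on the second pass would need to match the actual data layout. Any header line that carries the same tag in columns 83-86 will also survive the filter and may need to be dropped separately.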
Duncan Murdoch
>
> -Andy
>
> R > version
> _
> platform i386-pc-mingw32
> arch i386
> os mingw32
> system i386, mingw32
> status
> major 2
> minor 1.0
> year 2005
> month 04
> day 18
> language R
>
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> This is a snippet of one of the data files:
>
> 929 2 Russia Dahurian larch 150 6946-11249 1830 1990 -
> RAW
> RUSS061830 568 11122 1 806 1 843 2 862 3 902 31244 3 986 31210
> 31074 3 RAW
> RUSS0618401369 4 937 41154 4 869 4 702 4 716 4 972 4 682 5 878 5
> 582 5 RAW
> 929 2 Russia Dahurian larch 150 6946-11249 1830 1990 -
> STD
> RUSS061830 568 11122 1 806 1 843 2 862 3 902 31244 3 986 31210
> 31074 3 STD
> RUSS0618401369 4 937 41154 4 869 4 702 4 716 4 972 4 682 5 878 5
> 582 5 STD
> RUSS0619701158 26 906 26 954 26 746 26 629 26 858 261268 261345 261102
> 261298 26 STD
> RUSS061980 483 26 780 26 995 261273 261391 26 996 261621 26 878 261418 26
> 514 26 STD
> RUSS0619901071 269990 09990 09990 09990 09990 09990 09990 09990
> 09990 0 STD
> 929 2 Russia Dahurian larch 150 6946-11249 1830 1990 -
> RES
> RUSS061830 604 11215 1 889 1 828 2 909 3 982 31294 3 947 31091
> 31030 3 RES
> RUSS0618401290 4 858 41057 4 917 4 712 4 824 41077 4 709 5 911 5
> 747 5 RES
> RUSS061850 873 5 994 51179 71040 71028 7 923 71120 7 846 101146 11
> 854 13 RES
> RUSS0618601609 141209 16 780 16 758 171238 171191 17 858 17 903 17 930 18
> 334 18 RES
> 929 2 Russia Dahurian larch 150 6946-11249 1830 1990 -
> ARS
> RUSS061850 873 5 994 51179 71040 71028 7 923 71120 7 846 101146 11
> 854 13 ARS
> RUSS0618601609 141209 16 780 16 758 171238 171191 17 858 17 903 17 930 18
> 334 18 ARS
>