[R] reading fixed width format data with 2 types of lines

Charles C. Berry cberry at tajo.ucsd.edu
Thu Aug 12 22:59:26 CEST 2010


On Thu, 12 Aug 2010, Tim Gruene wrote:

> I don't know if it's elegant enough for you, but you could split the file into
> two files with 'grep "^3" file > file_3' and 'grep "^4" file > file_4'
> and then read them in separately.
>

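Along the same lines, but all in R (untested):

# Read every line once, then feed each record type to read.fwf()
# through its own text connection.
original.lines <- readLines( filename )

# Records that start with "3"
tcon.3 <- textConnection( grep( "^3", original.lines, value=TRUE ))
res.3 <- read.fwf( tcon.3, <etc> )
close(tcon.3)

# Records that start with "4"
tcon.4 <- textConnection( grep( "^4", original.lines, value=TRUE ))
res.4 <- read.fwf( tcon.4, <etc> )
close(tcon.4)

# Drop the raw text once both data frames exist
rm( original.lines )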
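
To make the idea concrete, here is a minimal sketch using four of the sample records from the question. The widths and column names are made up for illustration (the real layout is not given in this thread), and read.type() is just a hypothetical helper that wraps the grep/textConnection/read.fwf steps above.

```r
# Sample records copied from the original question (two of each type).
original.lines <- c(
  "3A00206546L070049016090045    99  1015002      001001008010004002004007003   001",
  "3A00206546L070049006090030    99  1029001002001001006014002",
  "4A00176546L068047090010111000606516400150010000001501063   065914",
  "4A00176546L06804709001011100040761600000000         1092   095614"
)

# Hypothetical layouts: substitute the real widths and names for each record type.
widths.3 <- c(1, 9, 7)
names.3  <- c("rectype", "id", "field3")
widths.4 <- c(1, 9, 4, 3)
names.4  <- c("rectype", "id", "field4a", "field4b")

# Hypothetical helper: keep the lines with the given prefix and parse them.
read.type <- function(lines, prefix, widths, col.names) {
  tcon <- textConnection(grep(paste0("^", prefix), lines, value = TRUE))
  on.exit(close(tcon))
  # colClasses = "character" keeps leading zeros in the codes.
  read.fwf(tcon, widths = widths, col.names = col.names,
           colClasses = "character")
}

res.3 <- read.type(original.lines, "3", widths.3, names.3)
res.4 <- read.type(original.lines, "4", widths.4, names.4)
```

Characters beyond the summed widths are simply ignored here, so a real layout would need widths covering every field you want to keep.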

Or skip the readLines() step and use

tcon.3 <- pipe(paste("grep '^3'", shQuote(filename)))

...

I think you can use 'findstr.exe' on Windows in lieu of grep.
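
A rough sketch of the pipe() variant, assuming a Unix-like system with grep on the PATH. The temporary file, the widths, and the column names are placeholders for illustration only. Since the pipe is not yet open when it is handed to read.fwf(), read.fwf() opens it and closes it again itself, so no explicit close() is needed here.

```r
# Write a few of the sample records from the question to a temporary file.
filename <- tempfile(fileext = ".txt")
writeLines(c(
  "3A00206546L070049016090045    99  1015002      001001008010004002004007003   001",
  "4A00176546L068047090010111000606516400150010000001501063   065914",
  "4A00196546L098000100010111001706214400005010000000051062   065914"
), filename)

# Hypothetical layout: the real widths are not given in the thread.
widths.4 <- c(1, 9, 4)
names.4  <- c("rectype", "id", "field4")

# shQuote() protects file names containing spaces or shell metacharacters.
tcon.4 <- pipe(paste("grep '^4'", shQuote(filename)))
res.4 <- read.fwf(tcon.4, widths = widths.4, col.names = names.4,
                  colClasses = "character")

# Remove the temporary file.
unlink(filename)
```

This avoids holding the whole file in R as a character vector, at the cost of depending on an external program.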

HTH,

Chuck




> Tim
>
> On Thu, Aug 12, 2010 at 01:57:19PM -0400, Denis Chabot wrote:
>> Hi,
>>
>> I know how to read fixed width format data with read.fwf, but suddenly I need to read in a large number of old fwf files with 2 types of lines. Lines that begin with "3" in first column carry one set of variables, and lines that begin with "4" carry another set, like this:
>>
>> …
>> 3A00206546L070049016090045    99  1015002      001001008010004002004007003   001
>> 3A00206546L070049006090030    99  1029001002001001006014002
>> 3A00206546L070049002290004    99  1015            001001
>> 3A00206546L070049001692559049033  1015                                 018036024
>> 3A00206546L070049002290004    99  1001                                       002
>> 4A00176546L068047090010111000606516400150010000001501063   065914
>> 4A00176546L06804709001011100040761600000000         1092   095614
>> 4A00196546L098000100010111001706214400005010000000051062   065914
>> 4A00176546L06804709001011100050591300000000         1062   065914
>> 4A00196546L098000100010111002604721400020010000000201042   046114
>> 4A00196546L098000100010111002504221400005012000000051042   046114
>> 4A00196546L098000100010111002903721400050012200000501032   036214
>> …
>>
>> I have searched for tricks to do this, but I must not have used the right keywords: I found nothing.
>>
>> I suppose I could read the entire file as a single character variable for each line, then subset the lines that begin with 3, save them in an ASCII file to be reopened with a read.fwf call, and do the same with the lines that begin with 4. But this seems neither elegant nor efficient. Is there a better method?
>>
>> Thanks in advance,
>>
>>
>> Denis Chabot
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> -- 
> --
> Tim Gruene
> Institut fuer anorganische Chemie
> Tammannstr. 4
> D-37077 Goettingen
>
> GPG Key ID = A46BEE1A
>
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
