[R] Text Input from a Non Delimited File

Mon Feb 10 00:45:25 CET 2014

On 14-02-09 5:56 PM, Burhan ul haq wrote:
> Hi,
>
> Minor Additions:
>
> The original file was as follows:
>
> ##  -------------------------------------------------------------------
> GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
> 1 10038 Carl Allwood M Sutton & Ashfield Harriers 02:38:40 1 02:38:40
> 2 10098 Adam Holland M Votwo/USN 02:41:25 2 02:41:25
> 3 13007 Pumlani Bangani M 02:43:23 3 02:43:23
> 4 10028 Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
> 5 10187 Peter Stockdale M 02:45:26 5 02:45:25
> 6 10064 Jared Bethell M Harlow RC 02:46:43 6 02:46:40
> 7 13003 Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44
> 8 13009 Rod Harris M 02:47:47 8 02:47:45
> 9 10033 Carl Sommer M Huncote Harriers 02:47:59 9 02:47:58
> 10 10037 Peter Swaine M Charnwood AC 02:49:28 10 02:49:27
> 11 10048 Pavel Toropov M 02:50:41 11 02:50:41
> 12 10008 Derek Dunne M 45 Treasury Running Club 02:51:42 12 02:51:40
> 13 10044 Matthew Nutt M Scunthorpe 02:52:20 13 02:52:15
> 14 10380 Ludovic Renou M 02:53:37 14 02:53:34
> 15 10056 Alex Keenan M 02:53:48 15 02:53:47
> ##  -------------------------------------------------------------------
>
> Available here:
> http://www.coltishalljaguars.co.uk/wp-content/uploads/2011/09/Robin-hood2011.pdf
>
> I am able to match a single entry with the regular expression:
> ^(\d+),(\d+),( )(.)*(M |F )(.)*(\d{2}):(\d{2}):(\d{2})( )(\d{1,})(
> )(\d{2}):(\d{2}):(\d{2})
>
> But I am unable to get the back-references to work so that I can insert
> commas to delimit the text.
>
> I believe regular expressions pertain to R as much as they do to
> Sublime, but please let me know if I should be posting this to the
> Sublime forum instead.
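
On the back-reference problem: in (.)* the star repeats a one-character group, so \1 refers only to the last character that group matched, not to the whole prefix. Writing (.*) puts the repetition inside the group. The following is a minimal sketch in R, not a tested solution for the full file. It assumes perl = TRUE, and the names x, bad and good are illustrative only. The sample line is copied from the thread.

```r
# One line in the partly comma-delimited form shown in the thread.
x <- "1,10038, Carl Allwood M Sutton & Ashfield Harriers 02:38:40 1 02:38:40"

# (.)* repeats a one-character group, so \1 holds only the last character
# it matched (here the space in front of the GunTime).
bad <- sub("(.)*(\\d{2}:\\d{2}:\\d{2})( )", "\\1,\\2,\\3", x, perl = TRUE)

# (.*) puts the repetition inside the group, so \1 holds the whole prefix.
good <- sub("(.*) (\\d{2}:\\d{2}:\\d{2}) ", "\\1,\\2,", x, perl = TRUE)

print(bad)
print(good)
```

The first call reproduces the truncated output reported in the original message. The second keeps the text in front of the GunTime and puts commas on either side of it.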

I would do the field extraction in R.  Read the file using readLines(), 
then use regular expressions to extract the fields one at a time.  You 
could identify them all in one RE, but why not break it down into 
simpler problems?

By field extraction, I mean things like this:

lines <- readLines(...)
# GunPos: everything before the first comma
field1 <- sub(",.*", "", lines)
# RaceNo: the run of digits enclosed by commas
field2 <- sub(".*,(\\d+),.*", "\\1", lines)

etc.
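
The one-field-at-a-time idea above could be carried through for every column roughly as follows. This is only a sketch under stated assumptions. It works on the original space-separated lines (three rows copied from the thread and typed inline, rather than read with readLines()). It assumes the first stand-alone M or F after the name is the gender, and that a number directly after the gender is the Cat column. A name with a stand-alone initial "M" or "F", or a club whose name starts with a number, would be split wrongly. The variable names body, time, middle and rest are illustrative only.

```r
# Three sample rows from the thread, in the original space-separated form.
lines <- c(
  "GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime",
  "1 10038 Carl Allwood M Sutton & Ashfield Harriers 02:38:40 1 02:38:40",
  "3 13007 Pumlani Bangani M 02:43:23 3 02:43:23",
  "7 13003 Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44"
)
body <- lines[-1]                      # drop the header line
time <- "\\d{2}:\\d{2}:\\d{2}"         # hh:mm:ss

# Fixed-format fields at the two ends of each line.
GunPos   <- sub("^(\\d+) .*", "\\1", body)
RaceNo   <- sub("^\\d+ (\\d+) .*", "\\1", body)
ChipTime <- sub(".* ", "", body)
ChipPos  <- sub(paste0(".* (\\d+) ", time, "$"), "\\1", body)
GunTime  <- sub(paste0(".* (", time, ") \\d+ ", time, "$"), "\\1", body)

# What is left in the middle is "Name Gender [Cat] [Club]".
middle <- sub("^\\d+ \\d+ ", "", body)
middle <- sub(paste0(" ", time, " \\d+ ", time, "$"), "", middle)

# Assumption: the first stand-alone M or F marks the gender column.
Name   <- sub(" [MF]( .*)?$", "", middle)
Gender <- substring(middle, nchar(Name) + 2, nchar(Name) + 2)
rest   <- substring(middle, nchar(Name) + 4)   # text after "Name G "

# Assumption: a leading number in the remainder is the Cat column.
Cat  <- ifelse(grepl("^\\d+( |$)", rest), sub("^(\\d+).*", "\\1", rest), "")
Club <- sub("^\\d+( |$)", "", rest)

results <- data.frame(GunPos = as.integer(GunPos), RaceNo = as.integer(RaceNo),
                      Name, Gender, Cat, Club, GunTime,
                      ChipPos = as.integer(ChipPos), ChipTime,
                      stringsAsFactors = FALSE)
print(results)
```

Working inward from both ends keeps each regular expression short, and the awkward part (names and clubs of varying length) is left to the last step, where it is easiest to inspect. The result could then be written out with write.csv() if a CSV file is wanted.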

Duncan Murdoch

>
>
>
> \\Cheers
>
>
> On Mon, Feb 10, 2014 at 3:48 AM, Burhan ul haq <ulhaqz at gmail.com> wrote:
>> Hi,
>>
>> I am trying to read in a file, which is not delimited by any specific
>> characters.
>>
>> Something as follows:
>> ##  -------------------------------------------------------------------
>> GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
>> 1,10038, Carl Allwood M Sutton & Ashfield Harriers 02:38:40 1 02:38:40
>> 2,10098, Adam Holland M Votwo/USN 02:41:25 2 02:41:25
>> 3,13007, Pumlani Bangani M 02:43:23 3 02:43:23
>> 4,10028, Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
>> 5,10187, Peter Stockdale M 02:45:26 5 02:45:25
>> 6,10064, Jared Bethell M Harlow RC 02:46:43 6 02:46:40
>> 7,13003, Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44
>> 8,13009, Rod Harris M 02:47:47 8 02:47:45
>> 9,10033, Carl Sommer M Huncote Harriers 02:47:59 9 02:47:58
>> 10,10037, Peter Swaine M Charnwood AC 02:49:28 10 02:49:27
>> 11,10048, Pavel Toropov M 02:50:41 11 02:50:41
>> 12,10008, Derek Dunne M 45 Treasury Running Club 02:51:42 12 02:51:40
>> 13,10044, Matthew Nutt M Scunthorpe 02:52:20 13 02:52:15
>> 14,10380, Ludovic Renou M 02:53:37 14 02:53:34
>> 15,10056, Alex Keenan M 02:53:48 15 02:53:47
>> ##  -------------------------------------------------------------------
>>
>>
>> As I failed to read it in via R or Excel, I used a text editor with
>> regular expressions (Sublime, to be exact). I was trying to convert it
>> to CSV format, and succeeded in putting commas after the first two
>> entries, as follows:
>>
>> ##  -------------------------------------------------------------------
>> GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
>> 1,10038, Carl Allwood ,M ,Sutton & Ashfield Harriers 02:38:40 1 02:38:40
>> 2,10098, Adam Holland ,M ,Votwo/USN 02:41:25 2 02:41:25
>> 3,13007, Pumlani Bangani ,M ,02:43:23 3 02:43:23
>> 4,10028, Anthony Jackson ,M ,Sittingbourne Striders 02:44:39 4 02:44:39
>> 5,10187, Peter Stockdale ,M ,02:45:26 5 02:45:25
>> 6,10064, Jared Bethell ,M ,Harlow RC 02:46:43 6 02:46:40
>> 7,13003, Sarah Harris ,F ,35 Long Eaton RC 02:47:47 7 02:47:44
>> 8,13009, Rod Harris ,M ,02:47:47 8 02:47:45
>> 9,10033, Carl Sommer ,M ,Huncote Harriers 02:47:59 9 02:47:58
>> 10,10037, Peter Swaine ,M ,Charnwood AC 02:49:28 10 02:49:27
>> 11,10048, Pavel Toropov ,M ,02:50:41 11 02:50:41
>> 12,10008, Derek Dunne ,M ,45 Treasury Running Club 02:51:42 12 02:51:40
>> 13,10044, Matthew Nutt ,M ,Scunthorpe 02:52:20 13 02:52:15
>> 14,10380, Ludovic Renou ,M ,02:53:37 14 02:53:34
>> 15,10056, Alex Keenan ,M ,02:53:48 15 02:53:47
>> ##  -------------------------------------------------------------------
>>
>> I am stuck after that. I tried to search for the expression:
>> (.)*(\d{2}:\d{2}:\d{2})( )
>> and replace it with: \1,\2,\3, with the result:
>>
>> ##  -------------------------------------------------------------------
>> GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
>> ,02:38:40, 1 02:38:40
>>   ,02:41:25, 2 02:41:25
>> ##  -------------------------------------------------------------------
>>
>> How do I fix the regular expression here? If you examine the later
>> entries, some names contain a hyphen or have three parts, so other
>> approaches do not work well.
>>
>> Secondly, is there a better way to handle this problem? The original
>> input file is in PDF format. I copied the text and made a txt file out
>> of it.
>>
>> The input txt file is attached.
>>
>> Thanks in advance for any suggestions.
>>
>> \\Cheers