[R] Huge Dataset Dates Span two Lines

David Winsemius dwinsemius at comcast.net
Thu Jan 8 22:41:22 CET 2015


On Jan 8, 2015, at 10:20 AM, DVL wrote:

> I'm trying to import a many gigabyte .txt file to analyze. It is asterisk
> delimited. I'm having an issue with the date field in the dataset. In the
> first 165 lines dates are listed as :
> YYYY-MM-DD HH:MM:SS
> 
> Then on the 166th line and in other places the date spans two lines: 
> YYYY-MM-DD
> HH:MM:SS
> 
> This causes a problem because R thinks it has reached the end of a row in
> the table. How can I solve this?

It would probably be easiest to edit the file in a text editor. I suppose you could also read the file in with readLines() and do all the work in R, but that sounds a bit more painful than the first option. If the problems are exactly as you describe, this could be an untested outline of a solution:

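dat <- readLines("/pat/fil.ext")
# mark lines that consist only of the date fragment (YYYY-MM-DD is 10 characters)
marks <- nchar(dat) == 10
# or, if other fields precede the date, mark lines that end in a bare date
marks <- grepl("[0-9]{4}-[0-9]{2}-[0-9]{2}$", dat)
# the continuation (HH:MM:SS) lines are the ones immediately after the marked lines
nxt <- c(FALSE, head(marks, -1))
# append the continuation lines to the broken fragments
dat[ marks ] <- paste(dat[ marks ], dat[ nxt ] )
final <- dat[ !nxt ] # remove the continuation lines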
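To make the idea concrete, here is one way the outline above might be worked through on a made-up four-line vector. This is only a sketch: the field layout, the column names (id, label, stamp) and the regular expression used to spot broken lines are assumptions for illustration, not anything known about the real file.

```r
# Toy stand-in for the file contents; the field layout is an assumption.
dat <- c("1*a*2015-01-08 10:20:00",
         "2*b*2015-01-08",
         "10:21:00",
         "3*c*2015-01-08 10:22:00")

# Lines that end in a bare date are the broken fragments.
marks <- grepl("[0-9]{4}-[0-9]{2}-[0-9]{2}$", dat)
# A fragment on the very last line has no continuation, so do not mark it.
marks[length(marks)] <- FALSE
# The continuation lines sit immediately after the fragments.
nxt <- c(FALSE, head(marks, -1))

# Rejoin each fragment with its continuation, then drop the continuations.
dat[marks] <- paste(dat[marks], dat[nxt])
final <- dat[!nxt]

# Parse the repaired lines as an asterisk-delimited table
# (the column names here are hypothetical).
df <- read.table(text = final, sep = "*", stringsAsFactors = FALSE,
                 col.names = c("id", "label", "stamp"))
df$stamp <- as.POSIXct(df$stamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

With the real file, dat would come from readLines() as in the outline, and the repaired lines could be handed to read.table() in the same way. It assumes every broken date is followed directly by its time on the next line; if the file breaks in other ways, the marking step would need to change.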

> View this message in context: http://r.789695.n4.nabble.com/Huge-Dataset-Dates-Span-two-Lines-tp4701523.html
> Sent from the R help mailing list archive at Nabble.com.
> 

Nabble is not the Rhelp Archive, and it also suppresses this message, which you should be sure to read:
*______________________________________________
*R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
*https://stat.ethz.ch/mailman/listinfo/r-help
*PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
*and provide commented, minimal, self-contained, reproducible code.

-- 
David Winsemius
Alameda, CA, USA


