[R] How to parse a really silly date with lubridate

@vi@e@gross m@iii@g oii gm@ii@com @vi@e@gross m@iii@g oii gm@ii@com
Wed Jul 13 20:36:56 CEST 2022


Others have supplied ways to do this using various date management functions. But I want to add another option that may make sense if the dates are not all quite as predictable.

You can use your own regular expressions in R as in most languages, that try to match each entry to one or more patterns you supply. When it matches a pattern, you can capture various bits of what you find and recombine them into any valid date format you want or convert them into some date type.

You just showed two pseudo-date-time entries:

	 1 9. Jul 2022 at 11:39
	 2 10. Jul 2022 at 01:58

It looks like the first item in each is a sequence number that can be ignored. But I assume after the ninth, it takes up ever more space as it becomes 10 then 100 and so on. It would thus be of varying width. So, in English, you are looking for a pattern like:

- beginning of the line/entry
- one or more digits in 0-9 followed by what looks like a space, but maybe arbitrary whitespace.
- capture one or two more digits if followed by a period, but not the period.
- capture what looks like a three-letter character string representing a month as a "word" but not what follows.
- whitespace followed by "at" followed by more whitespace can be ignored but must be present.
- capture a two digit number perhaps including the colon and another two-digit number, or two parts without the colon.
- match the end of the entry

If it looks like the above, and you captured the parts needed, you can reconstitute the parts, albeit perhaps first checking to see if the month names are valid (which can be handled by a pattern part that looks like "(Jan|Feb|...|Dec)" and of course making sure the other parts also fit requirements such as days as integers being between 1 and 31 and similar issues about time. My guess is that you don't really want to validate that deeply, let alone do the validation within the regular expression.

But when your data ends up having multiple ways of saying the date that you can enumerate and tell apart, this can be a way to parse them and identify what it matches and so on.

In the above case, full regular expressions may be overkill nd regular string functions might suffice.  There are many ways, including something as simple as this:

dt <- "1 9. Jul 2022 at 11:39"
dt_parts <- strsplit(dt, " ")

Unfortunately, this returns a vector of parts with a decimal point still in place:

> dt_parts
[1] "1"     "9."    "Jul"   "2022"  "at"    "11:39"

You can now ignore  the parts in the first fifth positions as the sequence number and "at" are meaningless and use these:

the_day <- dt_parts[2]
the_month <- dt_parts[3]
the_year <- dt_parts[4]
the_time <- dt_parts[6]

Of course, you now may want further processing using one of many ways to turn something like "9." (note there may be a "12." too) or even make it an integer instead of a character string. You may want to change the dates into numerals with something like this:

month_conversion <- list("Jan"="01", Feb="02", Mar="03", Apr="04", May="05", Jun="06", 
                         Jul="07", Aug="08", Sep="09", Oct="10", Nov="11", Dec="12")
month_numeric_string <- month_conversion[the_month]

My point is that if your format is consistent, you can pick out the parts you want as strings or make them decimals or whatever you need and use string functions, such as paste0() to recombine them into the format you want. 

Of course if you can tell some date-related function how to do it for you, that is obviously simpler and better for you.

And, obviously, it would be even nicer if you could get them to not give dates in this format!

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Rui Barradas
Sent: Wednesday, July 13, 2022 10:48 AM
To: Dr Eberhard Lisse <nospam using lisse.NA>; r-help using r-project.org
Subject: Re: [R] How to parse a really silly date with lubridate


With lubridate, note in what sequence your datetime elements occur and use the appropriate function.

d <- c('9. Jul 2022 at 11:39', '10. Jul 2022 at 01:58')
#> [1] "2022-07-09 11:39:00 UTC" "2022-07-10 01:58:00 UTC"

Hope this helps,

Rui Barradas

Às 14:40 de 13/07/2022, Dr Eberhard Lisse escreveu:
> Hi,
> I have data file which generated by an otherwise very nice (diabetes
> log) app, but exports dates really silly.
> After reading the enclosed mwe.csv into R like so
>       MWE <- read_delim('mwe.csv', delim = ';') %>%
>          select(Date) %>%
>          print()
> this comes out as:
>       # A tibble: 2 × 1
>      Date
>      <chr>
>       1 9. Jul 2022 at 11:39
>       2 10. Jul 2022 at 01:58
> No matter what I try I am not able to parse this inside R to get at
> proper dates (I have loaded tidyverse and lubridate).
> I can easily do somethig
>       csvq  -d ';' -t '%e. %b %Y at %H:%i' \
>          'SELECT Date as oridate,
>              DATETIME_FORMAT(Date, "%Y-%m-%d %H:%m") AS date
>          FROM mwe'
>       +-----------------------+------------------+
>       |        oridate        |       date       |
>       +-----------------------+------------------+
>       | 9. Jul 2022 at 11:39  | 2022-07-09 11:07 |
>       | 10. Jul 2022 at 01:58 | 2022-07-10 01:07 |
>       +-----------------------+------------------+
> and hence could easily do something like
>       csvq  -d ';' -t '%e. %b %Y at %H:%i' \
>        'ALTER mwe
>        SET Date = DATETIME_FORMAT(Date, "%Y-%m-%d %H:%m")'
> but would rather like to be able to do it inside R and would therefor
> appreciate any advice in this regard.
> greetings, el

R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

More information about the R-help mailing list