[R] Problems with time formats when importing data using readHTMLTable
Cristina Silva
csilva at ipma.pt
Tue Jul 11 19:00:20 CEST 2017
Dear David,
Thank you very much. You have solved my problem, not of importing in
the right format but the date conversion. It works for me, and if I
specify the timezone tz='GMT', I get exactly the same time of the
webpage.
Sincerely yours,
Cristina
On Tue, 11 Jul 2017 07:15:43 -0700, David Winsemius wrote:
>> On Jul 11, 2017, at 6:25 AM, Cristina Silva <csilva at ipma.pt> wrote:
>>
>> Dear Jeff,
>>
>> I am sorry, I didn't notice that it was not plain text. I hope that
>> it is now in the correct format. I explain the problem again, now with
>> more detais.
>> I am collecting the track positions of our research vessel from
>> www.marinetraffic.com. In the page, the data appear in a table:
>>
>> Timestamp Source Speed (kn) Latitude (°) Longitude (°)
>> Course (°) Show on Map
>> 2017-07-11 10:57 (UTC) Terr-AIS 8.6 37.36228 -9.176811
>> 200
>> 2017-07-11 10:54 (UTC) Terr-AIS 8.4 37.36875 -9.174048
>> 200
>> 2017-07-11 09:50 (UTC) Terr-AIS 8 37.51499 -9.184502
>> 149
>> 2017-07-11 09:08 (UTC) Terr-AIS 7.7 37.60513 -9.228263
>> 169
>> 2017-07-11 09:05 (UTC) Terr-AIS 8 37.61168 -9.229627
>> 167
>> 2017-07-11 09:03 (UTC) Terr-AIS 6.9 37.61626 -9.230456
>> 132
>> 2017-07-11 08:59 (UTC) Terr-AIS 3.4 37.61758 -9.237928
>> 101
>> 2017-07-11 08:57 (UTC) Terr-AIS 3.6 37.61808 -9.240235
>> 116
>> 2017-07-11 08:47 (UTC) Terr-AIS 3.3 37.62267 -9.249762
>> 112
>> 2017-07-11 08:44 (UTC) Terr-AIS 3.2 37.6241 -9.25348
>> 115
>> 2017-07-11 08:40 (UTC) Terr-AIS 3.3 37.62557 -9.257295
>> 112
>> 2017-07-11 08:32 (UTC) Terr-AIS 3.4 37.62903 -9.266028
>> 111
>> 2017-07-11 08:28 (UTC) Terr-AIS 3.3 37.63051 -9.269725
>> 114
>> 2017-07-11 08:25 (UTC) Terr-AIS 3.3 37.63207 -9.273642
>> 120
>> 2017-07-11 08:20 (UTC) Terr-AIS 3.6 37.63395 -9.278254
>> 118
>> 2017-07-11 08:16 (UTC) Terr-AIS 3.3 37.63586 -9.282853
>> 116
>> 2017-07-11 08:08 (UTC) Terr-AIS 3.2 37.6392 -9.291157
>> 118
>> 2017-07-11 07:38 (UTC) Sat-AIS
>> 2017-07-11 06:58 (UTC) Terr-AIS 3 37.68459 -9.321658
>> 159
>> 2017-07-11 06:53 (UTC) Terr-AIS 3.3 37.68881 -9.324172
>> 150
>>
>> To get this table I had to copy manually from the webpage, paste in
>> a text file, delete information on figures, links, etc. and then
>> import to R. This works well, I read the csv file and have no problems
>> with date formats. But instead of 20 records, I have to import e.g.
>> 500 records, i.e. 10 pages of 50 records each, or more. I want to
>> import these data in a more automated and quick procedure. I have done
>> this before with no problems, but probably the webpage changed some
>> permissions.
>>
>> Here is the script used to get the data directly from the webpage
>> into R, for a sample of 20 records (10 per page):
>>
>> library(XML)
>> x <- list()
>> for (i in 1:2)
>> {
>> x[i]<-
>> readHTMLTable(paste('http://www.marinetraffic.com/en/ais/index/positions/all/shipid:318358/mmsi:263601000/shipname:NORUEGA/per_page:10/page:',
>> i, sep=''))
>> }
>>
>> ais <- do.call('rbind', x)
>> ais <- ais[,-7]
>>
>> and I got the following table:
>>
>>> ais
>> Timestamp Source
>> Speed (kn) Latitude (°) Longitude (°) Course (°)
>> 1 149977066014997706604 minutes ago Terr-AIS
>> 8.6 37.36228 -9.176811 200
>
> There's undoubtedly a way to extract the time info from this HTML
> code (but I'm not the one to provide that solution.) This is the
> first
> entry in the webpage source when viewed right now:
>
> <td>
> <time class="toggle_time utc_time hide-me formatTime"
> data-overrideAbbreviation="empty" data-time="1499780737"
> data-timezone="0"><span>1499780737</span></time><time
> class="toggle_time my_time hide-me formatTime"
> data-overrideAbbreviation="MyT" data-time="1499780737"
> data-timezone="MyT"><span>1499780737</span></time><time
> class="toggle_time dif_time hide"><span>6 minutes
> ago</span></time> </td>
>
> So the POSIX-style time is givne twice (with no separator and then an
> offset in minutes is calculated. If I were given the records as you
> have them now I would extract the leading 10 digits and convert to
> POSIXct
>
> as.POSIXct(as.numeric(gsub("(^.{10}).+","", Timestamp)),
> origin="1970-01-01")
>
> as.POSIXct(as.numeric(gsub("(^.{10}).+","\\1", ais$Timestamp)),
> origin="1970-01-01")
> [1] "2017-07-11 07:09:57 PDT" "2017-07-11 06:45:37 PDT" "2017-07-11
> 06:40:34 PDT" "2017-07-11 06:34:46 PDT" "2017-07-11 06:32:26 PDT"
> [6] "2017-07-11 06:29:34 PDT" "2017-07-11 06:26:37 PDT" "2017-07-11
> 06:20:24 PDT" "2017-07-11 06:17:18 PDT" "2017-07-11 06:12:37 PDT"
> [11] "2017-07-11 06:08:24 PDT" "2017-07-11 06:05:04 PDT" "2017-07-11
> 06:03:27 PDT" "2017-07-11 05:59:37 PDT" "2017-07-11 05:54:37 PDT"
> [16] "2017-07-11 05:50:48 PDT" "2017-07-11 05:44:53 PDT" "2017-07-11
> 05:29:17 PDT" "2017-07-11 05:26:44 PDT" "2017-07-11 04:30:08 PDT"
>
> -- David.
>
>
>
>> 2 149977048714997704876 minutes ago Terr-AIS
>> 8.4 37.36875 -9.174048 200
>> 3 149976661414997666141 hour, 11 minutes ago Terr-AIS
>> 8.0 37.51499 -9.184502 149
>> 4 149976410714997641071 hour, 53 minutes ago Terr-AIS
>> 7.7 37.60513 -9.228263 169
>> 5 149976392714997639271 hour, 56 minutes ago Terr-AIS
>> 8.0 37.61168 -9.229627 167
>> 6 149976378014997637801 hour, 58 minutes ago Terr-AIS
>> 6.9 37.61626 -9.230456 132
>> 7 149976354014997635402 hours, 2 minutes ago Terr-AIS
>> 3.4 37.61758 -9.237928 101
>> 8 149976342014997634202 hours, 4 minutes ago Terr-AIS
>> 3.6 37.61808 -9.240235 116
>> 9 149976286114997628612 hours, 14 minutes ago Terr-AIS
>> 3.3 37.62267 -9.249762 112
>> 10 149976264714997626472 hours, 17 minutes ago Terr-AIS
>> 3.2 37.6241 -9.25348 115
>> 11 149976243014997624302 hours, 21 minutes ago Terr-AIS
>> 3.3 37.62557 -9.257295 112
>> 12 149976193714997619372 hours, 29 minutes ago Terr-AIS
>> 3.4 37.62903 -9.266028 111
>> 13 149976172814997617282 hours, 32 minutes ago Terr-AIS
>> 3.3 37.63051 -9.269725 114
>> 14 149976150714997615072 hours, 36 minutes ago Terr-AIS
>> 3.3 37.63207 -9.273642 120
>> 15 149976124714997612472 hours, 40 minutes ago Terr-AIS
>> 3.6 37.63395 -9.278254 118
>> 16 149976098714997609872 hours, 45 minutes ago Terr-AIS
>> 3.3 37.63586 -9.282853 116
>> 17 149976051014997605102 hours, 53 minutes ago Terr-AIS
>> 3.2 37.6392 -9.291157 118
>> 18 149975870414997587043 hours, 23 minutes ago Sat-AIS Add to SAT
>> Fleet for undelayed data <NA> <NA> <NA>
>> 19 149975631414997563144 hours, 3 minutes ago Terr-AIS
>> 3.0 37.68459 -9.321658 159
>> 20 149975598814997559884 hours, 8 minutes ago Terr-AIS
>> 3.3 37.68881 -9.324172 150
>>
>> The records correspond exactly to the first table. I have tried to
>> change the colClasses to character, but the results are always the
>> same. My question is: how to change the Timestamp to a POSIXlt POSIXct
>> format, either in the importing phase or within R?
>>
>> I hope that the question is clear now.
>>
>> Cristina
>>
>>
>> On Mon, 10 Jul 2017 09:18:05 -0700, Jeff Newmiller wrote:
>>> Not reproducible. [1][2][3] If our answers don't seem to apply to
>>> your situation, it will likely be because you did not explain your
>>> question clearly.
>>>
>>> Not plain text. This is a plain text mailing list, and the
>>> best-case
>>> scenario when you let your email program send HTML is that what you
>>> saw is not what we see (worst case is your email is scrambled on
>>> our
>>> end).
>>>
>>> Have you read the documentation for the function you are using? In
>>> particular, what about the colClasses argument? If you don't let
>>> readHTMLTable guess what the format is (have it read in as
>>> character
>>> data) then you have a fighting chance to get it right yourself,
>>> e.g.
>>>
>>> as.POSIXct( "2017-07-10 14:04 (UTC)", format="%Y-%m-%d %H:%M
>>> (UTC)",
>>> tz="UTC" )
>>>
>>> -----
>>>
>>> [1]
>>>
>>> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>>>
>>> [2] http://adv-r.had.co.nz/Reproducibility.html
>>>
>>> [3] https://cran.r-project.org/web/packages/reprex/index.html
>>> --
>>> Sent from my phone. Please excuse my brevity.
>>>
>>> On July 10, 2017 8:31:30 AM PDT, Cristina Silva <csilva at ipma.pt>
>>> wrote:
>>>> Hi,
>>>>
>>>> I am extracting positions data from the marine traffic website.
>>>> The
>>>> table has a "Timestamp" column which, in the browser, appears with
>>>> the
>>>> format yyyy-mm-dd HH:MM (UTC), e.g. 2017-07-10 14:04 (UTC).
>>>>
>>>> When I import the table, the same date "2017-07-10 14:04 (UTC)"
>>>> appears
>>>>
>>>> as "1499696500149969650021 minutes ago", This is the more recent
>>>> date
>>>> and time. Older records, as e.g. "2017-07-09 17:02 (UTC)" appear
>>>> as
>>>> e.g.
>>>> "1499619726149961972621 hours, 59 minutes ago".
>>>>
>>>> I don't know how to convert these data to the time formats used in
>>>> R
>>>> (POSIXct).
>>>>
>>>> The script is very simple and worked before:
>>>>
>>>> library(XML)
>>>> x <- readHTMLTable('url')
>>>>
>>>> where the 'url' is the link to the website with the specification
>>>> of
>>>> the
>>>> vessel.
>>>>
>>>> I appreciate any help.
>>>>
>>>> Cristina
>>>>
>
> David Winsemius
> Alameda, CA, USA
--
Cristina Silva
Divisão de Modelação e Gestão de Recursos Pesqueiros
Avenida de Brasília
1449-006 Lisboa
@: csilva at ipma.pt
#: +351 213027096
More information about the R-help
mailing list