[R] import file formatted RFC-822

Barry Rowlingson b.rowlingson at lancaster.ac.uk
Tue Apr 13 19:54:26 CEST 2010


On Tue, Apr 13, 2010 at 6:26 PM, Sebastian Kruk <residuo.solow at gmail.com> wrote:
> Dear R-list users:
>
> I would like to import a database of web robots,
> http://www.robotstxt.org/db/all.txt, it's formatted as RFC-822, how can
> I do it?

 RFC822 looks very much like R's package DESCRIPTION files, which are
read with read.dcf because they conform to the 'Debian Control File'
format. So I tried read.dcf on a local copy of all.txt:

 > robots = read.dcf("all.txt")
 > dim(robots)
 [1] 298  38

 so that's a character matrix, with one row per robot and one column
per field:

 > dimnames(robots)
[[1]]
NULL

[[2]]
 [1] "robot-id"                  "robot-name"
 [3] "robot-cover-url"           "robot-details-url"
 [5] "robot-owner-name"          "robot-owner-url"
 [7] "robot-owner-email"         "robot-status"
 [9] "robot-purpose"             "robot-type"
[11] "robot-platform"            "robot-availability"
[13] "robot-exclusion"           "robot-exclusion-useragent"
[15] "robot-noindex"             "robot-host"
[17] "robot-from"                "robot-useragent"
[19] "robot-language"            "robot-description"
[21] "robot-history"             "robot-environment"
[23] "modified-date"             "modified-by"
[25] "robot-nofollow"            "robot-owner-name2"
[27] "robot-owner-url2"          "robot-owner-email2"
[29] "robot-owner-name3"         "robot-owner-name4"
[31] "robot-environment1"        "robot-environment2"
[33] "robot-purpose1"            "robot-purpose2"
[35] "robot-purpose3"            "robot-platform1"
[37] "robot-description1"        "robot-description2"
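
 If only a few of those columns are of interest, read.dcf also takes a
'fields' argument. The sketch below is not run against all.txt: it
assumes a small made-up file (two invented records written to a
temporary file) just to show the call, and the object names are
hypothetical.

```r
# Made-up two-record file in "Field: value" layout, for illustration only.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "robot-id: alpha",
  "robot-name: Alpha Bot",
  "robot-status: active",
  "",
  "robot-id: beta",
  "robot-name: Beta Bot",
  "robot-status: retired"
), tmp)

# Keep only the named fields; columns come back in the order requested.
wanted <- c("robot-id", "robot-status")
few <- read.dcf(tmp, fields = wanted)
colnames(few)
few[, "robot-status"]
```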

 and it pads out the columns so every row has an entry for every field
seen anywhere in the file, with NA where a field doesn't exist in the
record for that robot.
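
 The padding is easy to check on a toy file. This is only a sketch: the
two records are invented, the temporary file stands in for all.txt, and
the conversion to a data frame is my suggestion rather than something
read.dcf does for you.

```r
# Two made-up records; the second one has no "robot-status" field.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "robot-id: alpha",
  "robot-name: Alpha Bot",
  "robot-status: active",
  "",
  "robot-id: beta",
  "robot-name: Beta Bot"
), tmp)

toy <- read.dcf(tmp)
dim(toy)                       # one row per record, one column per field
is.na(toy[2, "robot-status"])  # the missing field is padded with NA

# A data frame is often handier than a character matrix;
# use [[ ]] because the hyphenated names are not syntactic.
toy_df <- as.data.frame(toy, stringsAsFactors = FALSE)
toy_df[["robot-name"]]
```

 The same as.data.frame() step should work on the full 'robots' matrix.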

 Sorted?

Barry


