Barry Rowlingson b.rowlingson at lancaster.ac.uk
Tue Apr 13 19:54:26 CEST 2010

On Tue, Apr 13, 2010 at 6:26 PM, Sebastian Kruk <residuo.solow at gmail.com> wrote:
> Dear R-list users:
> I would like to import a database of web robots,
> http://www.robotstxt.org/db/all.txt, it's formatted RFC-822, how can
> I do it?

 RFC822 looks very much like R's package DESCRIPTION files, which are
read in with read.dcf because they conform to the 'Debian Control
File' format. So I tried read.dcf on it:

 > robots = read.dcf("all.txt")
 > dim(robots)
 [1] 298  38

 so that's a character matrix, with one row per robot and one column
per field name:

 > colnames(robots)

 [1] "robot-id"                  "robot-name"
 [3] "robot-cover-url"           "robot-details-url"
 [5] "robot-owner-name"          "robot-owner-url"
 [7] "robot-owner-email"         "robot-status"
 [9] "robot-purpose"             "robot-type"
[11] "robot-platform"            "robot-availability"
[13] "robot-exclusion"           "robot-exclusion-useragent"
[15] "robot-noindex"             "robot-host"
[17] "robot-from"                "robot-useragent"
[19] "robot-language"            "robot-description"
[21] "robot-history"             "robot-environment"
[23] "modified-date"             "modified-by"
[25] "robot-nofollow"            "robot-owner-name2"
[27] "robot-owner-url2"          "robot-owner-email2"
[29] "robot-owner-name3"         "robot-owner-name4"
[31] "robot-environment1"        "robot-environment2"
[33] "robot-purpose1"            "robot-purpose2"
[35] "robot-purpose3"            "robot-platform1"
[37] "robot-description1"        "robot-description2"

 and I guess it pads out the columns with NA so every row has an entry
for every possible field, even if that field doesn't exist in the
record for that robot.
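
 To check the padding without downloading anything, here is a minimal
sketch on two made-up records (the ids, names and temporary file are
invented for illustration and are not taken from all.txt). The second
record has no robot-status line, so that cell should come back as NA:

```r
# Two hypothetical records in DCF/RFC822 style, separated by a blank line.
# The second record deliberately omits the robot-status field.
txt <- c("robot-id: alpha",
         "robot-name: Alpha Bot",
         "robot-status: active",
         "",
         "robot-id: beta",
         "robot-name: Beta Bot")

# read.dcf reads from a file or connection, so write the sample to a temp file
tmp <- tempfile(fileext = ".txt")
writeLines(txt, tmp)

demo <- read.dcf(tmp)

dim(demo)                         # rows = records, columns = distinct fields
colnames(demo)                    # the field names found across all records
is.na(demo[2, "robot-status"])    # missing field in record 2 is NA

# Count how many records actually supply each field
colSums(!is.na(demo))

# A data frame is often handier than a character matrix for later work
demo_df <- as.data.frame(demo, stringsAsFactors = FALSE)

unlink(tmp)
```

 The same colSums(!is.na(...)) call on the real matrix should show which
of the 38 fields are rarely filled in.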



