[R] import file formatted RFC-822
Sebastian Kruk
residuo.solow at gmail.com
Wed Apr 14 19:20:39 CEST 2010
I have a problem: in a few cases "robot-exclusion-useragent" has two or
more values. Is there a way to handle this? For example, the robot
askjeeves has three names.
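
One possible way to keep multiple values, sketched under assumptions: read.dcf has an `all` argument, and with `all = TRUE` it gathers every occurrence of a field within a record instead of keeping only the last one. The two records below are hypothetical stand-ins for all.txt, and the sketch assumes the extra names appear as repeated "robot-exclusion-useragent" lines, which may not match how the real file stores them.

```r
# Hypothetical two-record sample standing in for all.txt; the values are
# made up, and the real file may store multiple values differently.
sample_text <- c(
  "robot-id: alpha",
  "robot-exclusion-useragent: alpha",
  "robot-exclusion-useragent: alphabot",
  "",
  "robot-id: beta",
  "robot-exclusion-useragent: beta"
)

con <- textConnection(sample_text)
# all = TRUE gathers every occurrence of a field within a record
# and returns a data frame rather than a character matrix
robots <- read.dcf(con, all = TRUE)
close(con)

# A field that repeats in some record comes back as a list column
agents <- robots[["robot-exclusion-useragent"]]
n_agents <- sapply(agents, length)

# Collapse each record's values into one string per robot
agents_flat <- sapply(agents, paste, collapse = ", ")
```

If the values instead sit on a single line separated by commas, splitting that one string with strsplit would be the analogous step.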
2010/4/13 Barry Rowlingson <b.rowlingson at lancaster.ac.uk>:
> On Tue, Apr 13, 2010 at 6:26 PM, Sebastian Kruk <residuo.solow at gmail.com> wrote:
>> Dear R-list users:
>>
>> I would like to import a database of web robots,
>> http://www.robotstxt.org/db/all.txt. It is formatted as RFC-822. How can
>> I do it?
>
> RFC822 looks very much like R's package DESCRIPTION files, and they
> are read in using read.dcf because they are conformant to 'Debian
> Control File' format. So I tried read.dcf on it:
>
> > robots = read.dcf("all.txt")
> > dim(robots)
> [1] 298 38
>
> so that's a matrix:
>
> > dimnames(robots)
> [[1]]
> NULL
>
> [[2]]
> [1] "robot-id" "robot-name"
> [3] "robot-cover-url" "robot-details-url"
> [5] "robot-owner-name" "robot-owner-url"
> [7] "robot-owner-email" "robot-status"
> [9] "robot-purpose" "robot-type"
> [11] "robot-platform" "robot-availability"
> [13] "robot-exclusion" "robot-exclusion-useragent"
> [15] "robot-noindex" "robot-host"
> [17] "robot-from" "robot-useragent"
> [19] "robot-language" "robot-description"
> [21] "robot-history" "robot-environment"
> [23] "modified-date" "modified-by"
> [25] "robot-nofollow" "robot-owner-name2"
> [27] "robot-owner-url2" "robot-owner-email2"
> [29] "robot-owner-name3" "robot-owner-name4"
> [31] "robot-environment1" "robot-environment2"
> [33] "robot-purpose1" "robot-purpose2"
> [35] "robot-purpose3" "robot-platform1"
> [37] "robot-description1" "robot-description2"
>
> and I guess it pads out the columns so every row has every possible
> variable value even if it doesn't exist in the record for that robot.
>
> Sorted?
>
> Barry
>
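
Barry's guess about padding can be checked on a small case. The sketch below uses two hypothetical records with different fields (not taken from all.txt) and shows that the default read.dcf call returns one column per field seen anywhere in the input, with NA where a record lacks that field.

```r
# Hypothetical records: each one is missing a field the other has.
sample_text <- c(
  "robot-id: alpha",
  "robot-name: Alpha Bot",
  "",
  "robot-id: beta",
  "robot-status: active"
)

con <- textConnection(sample_text)
m <- read.dcf(con)   # default all = FALSE returns a character matrix
close(con)

dim(m)                               # 2 records, 3 distinct fields
missing_name   <- is.na(m[2, "robot-name"])    # beta has no robot-name
missing_status <- is.na(m[1, "robot-status"])  # alpha has no robot-status
```

Under this reading, the 38 columns in the real result would be the union of all field names used across the 298 records.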