[R] grep help needed
Denis Chabot
chabotd at globetrotter.net
Tue Jul 26 04:46:13 CEST 2005
Hi,
In another thread ("PBSmapping and shapefiles") I asked for an easy
way to read "shapefiles" and transform them into data that PBSmapping
could use. One person is exploring some ways of doing this, but it is
possible I'll have to do this "manually".
With package "maptools" I am able to extract the information I need
from a shapefile but it is formatted like this:
[[1]]
[,1] [,2]
[1,] -55.99805 51.68817
[2,] -56.00222 51.68911
[3,] -56.01694 51.68911
[4,] -56.03781 51.68606
[5,] -56.04639 51.68759
[6,] -56.04637 51.69445
[7,] -56.03777 51.70207
[8,] -56.02301 51.70892
[9,] -56.01317 51.71578
[10,] -56.00330 51.73481
[11,] -55.99805 51.73840
attr(,"pstart")
attr(,"pstart")$from
[1] 1
attr(,"pstart")$to
[1] 11
attr(,"nParts")
[1] 1
attr(,"shpID")
[1] NA
[[2]]
[,1] [,2]
[1,] -57.76294 50.88770
[2,] -57.76292 50.88693
[3,] -57.76033 50.88163
[4,] -57.75668 50.88091
[5,] -57.75551 50.88169
[6,] -57.75562 50.88550
[7,] -57.75932 50.88775
[8,] -57.76294 50.88770
attr(,"pstart")
attr(,"pstart")$from
[1] 1
attr(,"pstart")$to
[1] 8
attr(,"nParts")
[1] 1
attr(,"shpID")
[1] NA
I do not quite understand the structure of this data object (a list of
matrices with attributes, I think), but at this point I resorted to
printing it to the console and importing that text into Excel for
further cleaning, which is easy enough. I'd like to complete the
process within R, both to save time and to circumvent Excel's limit of
around 64000 lines. But I have a hard time figuring out how to clean
up this text in R.
What I need to produce for PBSmapping is a file where each block of
coordinates shares one ID number, called PID, and a variable POS
indicates the position of each coordinate within a "shape". All other
lines must disappear. So the above would become:
PID POS X Y
1 1 -55.99805 51.68817
1 2 -56.00222 51.68911
1 3 -56.01694 51.68911
1 4 -56.03781 51.68606
1 5 -56.04639 51.68759
1 6 -56.04637 51.69445
1 7 -56.03777 51.70207
1 8 -56.02301 51.70892
1 9 -56.01317 51.71578
1 10 -56.00330 51.73481
1 11 -55.99805 51.73840
2 1 -57.76294 50.88770
2 2 -57.76292 50.88693
2 3 -57.76033 50.88163
2 4 -57.75668 50.88091
2 5 -57.75551 50.88169
2 6 -57.75562 50.88550
2 7 -57.75932 50.88775
2 8 -57.76294 50.88770
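(If it helps to see the goal in code: the table above can be built directly from the maptools list, skipping the console/Excel round trip, assuming the object really is a plain list of two-column coordinate matrices as printed. The name shp and the truncated data below are mine, just for illustration.)

```r
# shp mocks the maptools result: one two-column coordinate matrix per
# polygon (truncated here to the first rows of the two polygons above).
shp <- list(
  matrix(c(-55.99805, 51.68817,
           -56.00222, 51.68911,
           -56.01694, 51.68911), ncol = 2, byrow = TRUE),
  matrix(c(-57.76294, 50.88770,
           -57.76292, 50.88693), ncol = 2, byrow = TRUE)
)

# One data frame per polygon: PID is the list index, POS the row number.
pieces <- lapply(seq_along(shp), function(i)
  data.frame(PID = i, POS = seq_len(nrow(shp[[i]])),
             X = shp[[i]][, 1], Y = shp[[i]][, 2]))
out <- do.call(rbind, pieces)
```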
First I imported this text file into R:

test <- read.csv2("test file.txt", header = FALSE, sep = ";",
                  colClasses = "character")

I used sep = ";" to ensure there would be only one variable in this
file, as it contains no ";".
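(As an aside, readLines() reads a text file into one character string per line, so the sep = ";" trick, and read.csv2 altogether, could be skipped. A minimal sketch, with a temporary file standing in for "test file.txt":)

```r
# A temporary file stands in for "test file.txt"; a few lines of the
# console dump are enough to show the idea.
tmp <- tempfile(fileext = ".txt")
writeLines(c("[[1]]",
             "[1,] -55.99805 51.68817",
             "attr(,\"pstart\")"), tmp)

txt <- readLines(tmp)  # one character string per line, no sep needed
```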
To remove lines that do not contain coordinates, I used the fact that
longitudes are expressed as negative numbers, so with my very limited
knowledge of grep searches, I thought of this, which is probably not
the best way to go:

b <- grep("-", test$V1, fixed = TRUE)

My first attempt passed a whole vector as the pattern
(a <- rep("-", length(test$V1)); b <- grep(a, test$V1)), which gave the
warning "the condition has length > 1 and only the first element will
be used in: if (is.na(pattern))". grep() expects a single pattern
string; the vector version still worked because only its first "-" was
used, but the one-string call above avoids the warning.
e <- test$V1[b]

(grep() already returns the matching row indices, so the
seq()/%in% detour I first wrote is unnecessary; it also used c as a
variable name, which masks the built-in c() function.)
Partial victory: now I only have lines that look like
[1,] -57.76294 50.88770
But I don't know how to go further: the number in square brackets can
be used for variable POS, after removing the square brackets and the
comma, but this requires a better knowledge of grep than I have.
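(One possible sub()/strsplit() recipe for lines of that shape; the regular expressions below are my guess at the exact console format printed above, and the names e and pts are just for illustration:)

```r
e <- c("[1,] -57.76294 50.88770",
       "[2,] -57.76292 50.88693")

# POS: the digits between "[" and ",]" at the start of each line
POS <- as.integer(sub("^\\[([0-9]+),\\].*$", "\\1", e))

# X and Y: drop the "[n,]" prefix, then split the remainder on spaces
rest <- sub("^\\[[0-9]+,\\] *", "", e)
xy <- do.call(rbind, lapply(strsplit(rest, " +"), as.numeric))
pts <- data.frame(POS = POS, X = xy[, 1], Y = xy[, 2])
```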
Furthermore, I don't know how to add a PID (polygon ID) variable,
i.e. all lines of a polygon must have the same ID, as in the example
above (each time POS == 1, a new polygon starts and PID must be
incremented by 1; PID stays constant for lines where POS != 1).
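(That incrementing rule maps directly onto cumsum(): counting the restarts of POS at 1 gives the PID. A minimal sketch on made-up POS values:)

```r
POS <- c(1, 2, 3, 1, 2, 1, 2, 3, 4)  # three polygons of 3, 2 and 4 points

# Each TRUE in (POS == 1) marks the start of a new polygon, so the
# running count of those starts is exactly the PID.
PID <- cumsum(POS == 1)
```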
Any help will be much appreciated.
Sincerely,
Denis Chabot