[R] reading multiple text files from web
Kruti Pandya
kp1005 at gmail.com
Thu Mar 12 07:08:31 CET 2015
I am trying to extract the “OS Vendor” and “OS Name” fields from the
following text file online.
http://spec.org/jEnterprise2010/results/res2013q3/jEnterprise2010-20130904-00045.txt
My goal is to extract these two attributes from all the text files
linked from the page given below and put them in a data frame as follows.
OS Vendor            OS Name
Oracle Corporation   Oracle Solaris 11.1 64-bit SRU 10.5
Text files link :
https://www.spec.org/jEnterprise2010/results/jEnterprise2010.html
I got a list of all the text files from the HTML page. I am trying to
write a function that picks one link at a time from getlinks,
extracts the attributes, and puts them in a data frame. I do not know
how to read the files from the getlinks object that contains the links.
I tried converting getlinks to a data frame via as.data.frame(getlinks),
but that got rid of the quotes that I need in order to read the links
one by one. Also, once I get the attributes, how do I put them side by
side in a data frame?
###code#########
install.packages(c("RCurl","XML"))
library(bitops)
library(RCurl)
library(XML)
webpage = htmlParse("http://spec.org/jEnterprise2010/results/jEnterprise2010.html",error=function(...){},
useInternalNodes = TRUE)
links <- xpathSApply(webpage, "//a/@href")
# Escape the dot and anchor at the end so only names ending in .txt match
getlinks <- links[grep("\\.txt$", links)]
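On the quotes: the quotes printed around each element of getlinks are only how R displays character strings, they are not part of the data, so nothing needs to be preserved. Below is a minimal sketch of turning the hrefs into full URLs that readLines() can open. It assumes the index page uses relative paths, which is not verified here. The `hrefs` sample and the `to_urls` helper are hypothetical stand-ins for illustration.

```r
# Hypothetical stand-in for the result of xpathSApply(webpage, "//a/@href")
hrefs <- c(
  href = "res2013q3/jEnterprise2010-20130904-00045.txt",
  href = "res2013q3/jEnterprise2010-20130904-00045.html",
  href = "http://spec.org/jEnterprise2010/results/res2013q3/jEnterprise2010-20130904-00045.txt"
)

# to_urls() is an illustrative helper, not part of RCurl or XML
to_urls <- function(hrefs, base) {
  hrefs <- unname(as.character(hrefs))            # plain character vector
  txt <- hrefs[grepl("\\.txt$", hrefs)]           # keep only links ending in .txt
  absolute <- grepl("^https?://", txt)            # already a full URL?
  txt[!absolute] <- paste0(base, txt[!absolute])  # prefix the relative paths
  unique(txt)                                     # drop duplicate links
}

urls <- to_urls(hrefs, "http://spec.org/jEnterprise2010/results/")
urls[1]  # a single element like this can be passed straight to readLines()
```

Keeping the links as a character vector, rather than a data frame, means each element can be handed to readLines() directly.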
######### function to read all text files and extract attributes##########
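readfiles <- function(x) {
  a <- readLines(x)
  # Header of the section that describes the SUT hardware
  sm <- "Java EE AppServer & Database Server HW (SUT hardware)"
  s <- grep(sm, a, fixed = TRUE)[1]
  # The first unindented line after the header starts the next section
  e <- grep("^\\S", a[-(1:s)])[1]
  block <- a[(s + 1):(s + e - 1)]
  # A function returns only its last expression, so return both matches together
  c(vendor = grep("OS Vendor", block, fixed = TRUE, value = TRUE)[1],
    name   = grep("OS Name", block, fixed = TRUE, value = TRUE)[1])
}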
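One way to apply such a function to every link and get the attributes side by side is to return a one-row data frame per file and stack the rows. The sketch below runs offline on temporary files because readLines() accepts local paths as well as URLs. The sample report content, its layout, and the `read_os` name are assumptions made for illustration, not real SPEC data.

```r
# Hypothetical sample mimicking the assumed report layout
sample_report <- c(
  "Java EE AppServer & Database Server HW (SUT hardware)",
  "    Hardware Vendor:   Example Corp",
  "    OS Vendor:         Example Corp",
  "    OS Name:           Example OS 1.0",
  "Next Section",
  "    OS Name:           Should not be picked"
)

# Two temporary files stand in for the downloaded reports
paths <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
for (p in paths) writeLines(sample_report, p)

# Illustrative variant of readfiles() that returns a one-row data frame
read_os <- function(x) {
  a <- readLines(x, warn = FALSE)
  s <- grep("Java EE AppServer & Database Server HW (SUT hardware)",
            a, fixed = TRUE)[1]
  e <- grep("^\\S", a[-(1:s)])[1]        # next unindented line ends the section
  block <- a[(s + 1):(s + e - 1)]
  data.frame(
    file      = basename(x),
    OS.Vendor = grep("OS Vendor", block, fixed = TRUE, value = TRUE)[1],
    OS.Name   = grep("OS Name", block, fixed = TRUE, value = TRUE)[1],
    stringsAsFactors = FALSE
  )
}

# One row per file, stacked into a single data frame
result <- do.call(rbind, lapply(paths, read_os))
```

With real links, a file that lacks the section header would make `s` come back as NA and the function fail, so wrapping the call in tryCatch() may be worthwhile.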
######### For a single file I was able to extract the attributes #########
txt1<-readLines("http://spec.org/jEnterprise2010/results/res2013q3/jEnterprise2010-20130904-00045.txt")
#Get the OS Vendor and OS Name
sm<- "Java EE AppServer & Database Server HW (SUT hardware)"
s<-grep(sm,txt1, fixed=TRUE)
e<-grep("^\\S",txt1[-(1:s)])[1]
grep("OS Vendor", txt1[(s+1):(s+e-1)], fixed=T, value=T)[1]
grep("OS Name", txt1[(s+1):(s+e-1)], fixed=T, value=T)[1]
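The two grep() calls return whole lines, label included. The sketch below shows one way to keep only the values and place them side by side. It assumes each line has a "Label: value" layout, which is not confirmed by the post. The sample lines and the `field_value` helper are hypothetical.

```r
# Hypothetical matched lines, assuming a "Label: value" layout
vendor_line <- "    OS Vendor:                 Oracle Corporation"
name_line   <- "    OS Name:                   Oracle Solaris 11.1 64-bit SRU 10.5"

# field_value() is an illustrative helper: drop everything up to and
# including the first colon, then trim the surrounding whitespace
field_value <- function(line) {
  trimws(sub("^[^:]*:", "", line))
}

# check.names = FALSE keeps the space in the column names
os <- data.frame(
  "OS Vendor" = field_value(vendor_line),
  "OS Name"   = field_value(name_line),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
```

Splitting on the first colon only means a value that itself contains a colon is kept whole.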
I would appreciate any help!
Thanks.