[R] reading multiple text files from web

Kruti Pandya kp1005 at gmail.com
Thu Mar 12 07:08:31 CET 2015


I am trying to extract the “OS Vendor” and “OS Name” fields from the
following text file online.

http://spec.org/jEnterprise2010/results/res2013q3/jEnterprise2010-20130904-00045.txt


My goal is to extract these two attributes from all the text files
available from the link given below and put them in a data frame as follows.

OS Vendor             OS Name
Oracle Corporation    Oracle Solaris 11.1 64-bit SRU 10.5"k

Link to the text files:
https://www.spec.org/jEnterprise2010/results/jEnterprise2010.html


I got a list of all the text file links from the HTML page (the
getlinks object in the code below). I am trying to write a function
that takes one link at a time from getlinks, extracts the two
attributes, and puts them in a data frame. I do not know how to read
the files from the getlinks object that contains the links. I tried
converting getlinks to a data frame via as.data.frame(getlinks), but
that got rid of the quotes that I need in order to read the links one
by one. Also, once I have the attributes, how do I put them side by
side in a data frame?
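On the quotes: a small sketch, using made-up file names rather than the
real links, suggesting that the quotes are only how R prints character
strings and are not part of the values. If that holds, the elements of
the character vector can be handed to a reader one at a time without any
conversion.

```r
# Hypothetical stand-in for the getlinks vector; the file names are made up
getlinks <- c("res2013q3/example-00045.txt", "res2013q3/example-00046.txt")

print(getlinks[1])       # print() shows the string with quotes
cat(getlinks[1], "\n")   # cat() shows the same value without quotes

# The data frame column holds the same strings (as a factor in older R versions)
df <- as.data.frame(getlinks)
same <- identical(as.character(df[[1]]), getlinks)

# Each element is already a single string that could be passed to readLines()
for (u in getlinks) {
  stopifnot(is.character(u), length(u) == 1)
}
```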


###code#########

install.packages(c("RCurl","XML"))

library(bitops)

library(RCurl)

library(XML)

webpage = htmlParse("http://spec.org/jEnterprise2010/results/jEnterprise2010.html",error=function(...){},
useInternalNodes = TRUE)

links <- xpathSApply(webpage, "//a/@href")

# Escape the dot and anchor at the end so only names ending in ".txt" match
getlinks <- links[grep("\\.txt$", links)]
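A minimal sketch of the link filtering step, run on made-up href values
instead of the real page. It assumes that some links on the page may be
relative to the results directory, which I have not verified; the base
URL below is only inferred from the two addresses quoted earlier.

```r
# Hypothetical hrefs standing in for the output of xpathSApply();
# these are not the real page contents
links <- c("res2013q3/example-00045.txt",
           "res2013q3/example-00045.html",
           "http://spec.org/other/example-00046.txt",
           "notes_txt_old.html")

# "\\.txt$" matches a literal ".txt" at the end of the string;
# the pattern ".txt" alone would also match "_txt" in the last entry
getlinks <- unname(links[grepl("\\.txt$", links)])

# Assumed base URL for relative links (inferred, not confirmed)
base <- "http://spec.org/jEnterprise2010/results/"

# Leave absolute URLs alone and prefix the relative ones
is_absolute <- grepl("^https?://", getlinks)
urls <- ifelse(is_absolute, getlinks, paste0(base, getlinks))
```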

######### function to read all text files and extract attributes##########

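readfiles <- function(x) {
  a <- readLines(x)

  # Header of the section that holds the OS fields; it must be a single
  # line, otherwise the fixed-string match below finds nothing
  sm <- "Java EE AppServer & Database Server HW (SUT hardware)"
  s <- grep(sm, a, fixed = TRUE)[1]

  # The first non-indented line after the header ends the section
  e <- grep("^\\S", a[-(1:s)])[1]
  block <- a[(s + 1):(s + e - 1)]

  # Return both matches together; a function only returns its last
  # expression, so two separate grep() calls would lose the first result
  c(vendor = grep("OS Vendor", block, fixed = TRUE, value = TRUE)[1],
    name   = grep("OS Name", block, fixed = TRUE, value = TRUE)[1])
}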

######### For a single file I was able to extract the attributes #########

txt1<-readLines("http://spec.org/jEnterprise2010/results/res2013q3/jEnterprise2010-20130904-00045.txt")

#Get the OS Vendor and OS Name

sm<- "Java EE AppServer & Database Server HW (SUT hardware)"

s<-grep(sm,txt1, fixed=TRUE)

e<-grep("^\\S",txt1[-(1:s)])[1]

grep("OS Vendor", txt1[(s+1):(s+e-1)], fixed=T, value=T)[1]


grep("OS Name", txt1[(s+1):(s+e-1)], fixed=T, value=T)[1]
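One possible way to go from the single-file steps above to a data frame
is sketched here. It is not tested against the real files: the sample
lines are made up, the "Label: value" layout inside the section is an
assumption, and the helper names get_field and extract_os are
hypothetical. Each file yields a one-row data frame, and the rows are
stacked with do.call(rbind, ...).

```r
# Hypothetical helper: return the text after a label (and an optional colon)
get_field <- function(block, label) {
  hit <- grep(label, block, fixed = TRUE, value = TRUE)[1]
  if (is.na(hit)) return(NA_character_)
  pos <- regexpr(label, hit, fixed = TRUE)
  rest <- substring(hit, pos + nchar(label))
  trimws(sub("^\\s*:", "", rest))
}

# Hypothetical helper: one-row data frame from the lines of one file
extract_os <- function(lines) {
  sm <- "Java EE AppServer & Database Server HW (SUT hardware)"
  s <- grep(sm, lines, fixed = TRUE)[1]
  if (is.na(s)) {
    return(data.frame(OS.Vendor = NA_character_, OS.Name = NA_character_,
                      stringsAsFactors = FALSE))
  }
  rest <- lines[-seq_len(s)]
  e <- grep("^\\S", rest)[1]            # first non-indented line ends the section
  block <- if (is.na(e)) rest else rest[seq_len(e - 1)]
  data.frame(OS.Vendor = get_field(block, "OS Vendor"),
             OS.Name   = get_field(block, "OS Name"),
             stringsAsFactors = FALSE)
}

# Made-up file contents standing in for readLines(url); layout is assumed
sample_files <- list(
  c("Intro line",
    "Java EE AppServer & Database Server HW (SUT hardware)",
    "  OS Vendor: Oracle Corporation",
    "  OS Name: Oracle Solaris 11.1",
    "Next section"),
  c("Java EE AppServer & Database Server HW (SUT hardware)",
    "  OS Vendor: Example Vendor",
    "  OS Name: Example OS 1.0",
    "Next section",
    "  OS Vendor: Should Not Be Picked")
)

# One row per file, stacked into a single data frame
result <- do.call(rbind, lapply(sample_files, extract_os))
print(result)
```

With real links, the lapply() call would presumably become
lapply(getlinks, function(u) extract_os(readLines(u))), provided the
elements of getlinks are complete URLs.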


I would appreciate any help!


Thanks.
