[R] rowspan and readHTMLTable

Chris Stubben stubben at lanl.gov
Wed May 8 18:52:51 CEST 2013


Sorry to answer my own question - I guess here's one way to read this 
table.  Other suggestions are still welcome.

Chris

------

library(XML)

x <- htmlParse("<table>
<tr><td rowspan=2>ab</td><td>X</td></tr>
<tr><td rowspan=2>YZ</td></tr>
<tr><td>c</td></tr>
</table>", asText = TRUE)

# split by rows
z <- getNodeSet(x, "//tr")

# create empty data.frame - probably not the best solution...
# (the 2 columns are hard-coded for this example table)
t1 <- data.frame(matrix(NA, nrow = length(z), ncol = 2))

for (i in seq_along(z)) {
   # rowspan of each cell in this row; defaults to 1 when the attribute is absent
   rowspan <- as.numeric(xpathSApply(z[[i]], ".//td", xmlGetAttr, "rowspan", 1))
   val <- xpathSApply(z[[i]], ".//td", xmlValue)

   # fill values into empty cells
   n <- which(is.na(t1[i, ]))
   t1[i, n] <- val

   # repeat each spanning value down its column
   for (j in which(rowspan > 1)) {
      t1[(i + 1):(i + rowspan[j] - 1), n[j]] <- val[j]
   }
}


t1
  X1 X2
1 ab  X
2 ab YZ
3  c YZ
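
The fill step can also be tried without any HTML parsing. What follows is only a sketch: fill_rowspan and its input layout (one list element per row, holding the cell text and rowspans) are made up for illustration and are not part of the XML package. It uses a character matrix instead of a pre-built data.frame, clips spans at the last row, and still ignores colspan.

```r
# Hypothetical helper: `rows` has one element per <tr>, each holding the cell
# texts (`val`) and their rowspans (`rowspan`) in document order.
fill_rowspan <- function(rows, ncol) {
  out <- matrix(NA_character_, nrow = length(rows), ncol = ncol)
  for (i in seq_along(rows)) {
    val  <- rows[[i]]$val
    span <- rows[[i]]$rowspan
    # columns in this row not already filled by a rowspan from an earlier row
    free <- which(is.na(out[i, ]))[seq_along(val)]
    out[i, free] <- val
    for (j in which(span > 1)) {
      # clip the span at the last row of the table
      last <- min(i + span[j] - 1, nrow(out))
      if (last > i) out[(i + 1):last, free[j]] <- val[j]
    }
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Same layout as the example table above
rows <- list(
  list(val = c("ab", "X"), rowspan = c(2, 1)),
  list(val = "YZ",         rowspan = 2),
  list(val = "c",          rowspan = 1)
)
t2 <- fill_rowspan(rows, ncol = 2)
t2
```

The columns come back named V1 and V2 rather than X1 and X2, since the data.frame is built from a plain matrix.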


If you are interested, I used this code in the pmcTable function at 
https://github.com/cstubben/pubmed.  To get Table 1 from PMC3544749, this now works:

doc <- pmc("PMC3544749")  # downloads XML from OAI service
t1 <- pmcTable(doc, 1)    # parse table... also saves caption and footnotes to attributes
t1[1:4, 1:4]
                          Category Gen Name Rv number                                      Description
1 Lipids and Fatty Acid Metabolism     kasB    Rv2246 3-oxoacyl-[acyl-carrier protein] synthase 2 kasb
2           Mycolic acid synthesis    mmaA4   Rv0642c                  Methoxy mycolic acid synthase 4
3           Mycolic acid synthesis     pcaA   Rv0470c    Mycolic acid synthase (cyclopropane synthase)
4           Mycolic acid synthesis     pcaA   Rv0470c    Mycolic acid synthase (cyclopropane synthase)
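
For anyone unfamiliar with keeping extra information in attributes, here is a rough base R illustration of the idea. The attribute names "caption" and "footnotes" and the placeholder strings are assumptions for this sketch; the names pmcTable actually uses are not shown in this message.

```r
# Small stand-in table; the attribute names below are hypothetical
tab <- data.frame(Gene = c("kasB", "mmaA4"),
                  Rv   = c("Rv2246", "Rv0642c"),
                  stringsAsFactors = FALSE)

# attach metadata to the data.frame itself
attr(tab, "caption")   <- "Example caption"
attr(tab, "footnotes") <- c("a: first note", "b: second note")

# read it back later
attr(tab, "caption")
attributes(tab)$footnotes
```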




-- 

Chris Stubben

Los Alamos National Lab
Bioscience Division
MS M888
Los Alamos, NM 87545


