[R] Extracting desired numbers from complicated lines of web pages
jim holtman
jholtman at gmail.com
Sun Aug 5 20:27:42 CEST 2012
Try the code below. Grouping the results by 'userid', which you may
well need, is left as an exercise for the reader; if you do group
them, you will probably want to check for non-existent values. Also,
for the last line you did not say whether there are only those three
values or whether there could be more.
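If the last line can carry more than three labels, one option is to pull out every "number Label" pair rather than naming each label in advance. The following is only a minimal sketch under that assumption: the sample string `votes` is a shortened, hypothetical stand-in for line [7], and `votes_df` is a name introduced here for illustration.

```r
# Hypothetical sample, shortened from line [7] of the original post
votes <- "Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>"

# Find every "<digits> <Capitalised word>" pair, however many there are
pairs <- regmatches(votes, gregexpr("[0-9]+ [[:upper:]][[:alpha:]]+", votes))[[1]]

# Split each pair into its numeric value and its label
votes_df <- data.frame(
  Value    = as.integer(sub(" .*", "", pairs)),
  Variable = sub("^[0-9]+ ", "", pairs),
  stringsAsFactors = FALSE
)
votes_df
```

This only handles single-word labels; a multi-word label such as "Local Photos" would need a wider pattern.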
input <- readLines(textConnection('
+ [1] "\t\t\t<li id=\"friendCount\"><a href=\"/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ\">108 Friends</a></li>"
+
+ [2] "\t\t\t<li id=\"reviewCount\"><a href=\"/user_details_reviews_self?userid=--T8djg0nrb_yMMMA3Y0jQ\">151 Reviews</a></li>"
+
+ [3] "\t\t\t\t<li id=\"updatesCount\">5 Review Updates</li>"
+
+ [4] "\t\t\t\t<li id=\"ftrCount\"><a href=\"/user_details_reviews_self?review_filter=first&userid=--T8djg0nrb_yMMMA3Y0jQ\">1 First</a></li>"
+
+ [5] "\t\t\t\t<li id=\"fanCount\">2 Fans</li>"
+
+ [6] "\t\t\t\t<li id=\"localPhotoCount\"><a href=\"/user_local_photos?userid=--T8djg0nrb_yMMMA3Y0jQ\">54 Local Photos</a></li>"
+
+ [7] <p id="review_votes" class="smaller"><img src="http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif" alt=""> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>
+
+ [[alternative HTML version deleted]]'))
>
> # extract the data by brute force and then break apart into a dataframe
> count <- lapply(input, function(.line){
+ if (grepl('[0-9]+ Friends', .line))
+ return(sub(".*>([0-9]+) (Friends).*", "\\1:\\2", .line))
+ if (grepl("[0-9]+ Reviews", .line))
+ return(sub(".*>([0-9]+) (Reviews).*", "\\1:\\2", .line))
+ if (grepl("[0-9]+ Review Update", .line))
+ return(sub(".*>([0-9]+) (Review Update).*", "\\1:\\2", .line))
+ if (grepl("[0-9]+ First", .line))
+ return(sub(".*>([0-9]+) (First).*", "\\1:\\2", .line))
+ if (grepl("[0-9]+ Fans", .line))
+ return(sub(".*>([0-9]+) (Fans).*", "\\1:\\2", .line))
+ if (grepl("[0-9]+ Local Photos", .line))
+ return(sub(".*>([0-9]+) (Local Photos).*", "\\1:\\2", .line))
+ if (grepl("[0-9]+ Useful", .line))
+ return(c( # vector with multiple values
+ sub(".* ([0-9]+) (Useful).*", "\\1:\\2", .line)
+ , sub(".* ([0-9]+) (Funny).*", "\\1:\\2", .line)
+ , sub(".* ([0-9]+) (Cool).*", "\\1:\\2", .line)
+ ))
+ return(NULL)
+ })
>
> # create dataframe
> df <- data.frame(do.call(rbind, strsplit(unlist(count), ":")))
> names(df) <- c("Value", "Variable")
> df
Value Variable
1 108 Friends
2 151 Reviews
3 5 Review Update
4 1 First
5 2 Fans
6 54 Local Photos
7 2022 Useful
8 1591 Funny
9 1756 Cool
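The result above only contains the labels that were actually found. To get an NA row for a label that never appears, as the original question asks, one approach is to loop over a fixed list of expected labels instead of over the input lines. This is a minimal sketch, not the method used above: the shortened `lines` vector, the `labels` list, and the helper `find_value` are all hypothetical names introduced for illustration, and the "First" line is deliberately left out.

```r
# Hypothetical input, shortened from the post; the "First" line is omitted
lines <- c(
  "<li id=\"friendCount\"><a href=\"/user_details_friends\">108 Friends</a></li>",
  "<li id=\"reviewCount\"><a href=\"/user_details_reviews_self\">151 Reviews</a></li>",
  "<li id=\"updatesCount\">5 Review Updates</li>",
  "<li id=\"fanCount\">2 Fans</li>"
)

# Labels we expect to see; any label not found yields NA
labels <- c("Friends", "Reviews", "Review Updates", "First", "Fans")

# Return the number preceding `label` in the first matching line, else NA
find_value <- function(label, lines) {
  pattern <- paste0("([0-9]+) ", label)
  hit <- regmatches(lines, regexpr(pattern, lines))
  if (length(hit) == 0) return(NA_integer_)
  as.integer(sub(pattern, "\\1", hit[1]))
}

result <- data.frame(
  Value    = vapply(labels, find_value, integer(1), lines = lines,
                    USE.NAMES = FALSE),
  Variable = labels,
  stringsAsFactors = FALSE
)
result  # the "First" row has Value NA because no line mentions it
```

Because the loop runs over the labels, every expected label gets exactly one row, which also makes it easier to line results up across several users.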
On Sun, Aug 5, 2012 at 11:16 AM, Shelby McIntyre <smcintyremobile at me.com> wrote:
> I need to extract the indicated (bold & underlined) numbers from lines coming off web pages.
>
> Of course I don't know ahead of time the location or length of the number. What I do know
> is the tag "Friends", and "Reviews", etc. In fact, it would be good to end up with
>
> Value Variable
> 108 Friends
> 151 Reviews
> 5 Review Updates
> NA First <-- assuming here that "First" did not show up on any line
> etc.
>
> Of particular trouble is line [7] which requires extracting 3 numbers 2022 (Useful), 1591 (Funny) and 1756 (Cool).
> ============== Extraction problem lines ===========
>
> [1] "\t\t\t<li id=\"friendCount\"><a href=\"/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ\">108 Friends</a></li>"
>
> [2] "\t\t\t<li id=\"reviewCount\"><a href=\"/user_details_reviews_self?userid=--T8djg0nrb_yMMMA3Y0jQ\">151 Reviews</a></li>"
>
> [3] "\t\t\t\t<li id=\"updatesCount\">5 Review Updates</li>"
>
> [4] "\t\t\t\t<li id=\"ftrCount\"><a href=\"/user_details_reviews_self?review_filter=first&userid=--T8djg0nrb_yMMMA3Y0jQ\">1 First</a></li>"
>
> [5] "\t\t\t\t<li id=\"fanCount\">2 Fans</li>"
>
> [6] "\t\t\t\t<li id=\"localPhotoCount\"><a href=\"/user_local_photos?userid=--T8djg0nrb_yMMMA3Y0jQ\">54 Local Photos</a></li>"
>
> [7] <p id="review_votes" class="smaller"><img src="http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif" alt=""> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.