[R] Parsing aspects of a url path in R

arun smartpink111 at yahoo.com
Thu Mar 6 22:01:24 CET 2014


Hi,

In addition, you could also do:

gsub(".*www\\.([[:alnum:]]+\\.[[:alnum:]]+).*","\\1",url)
#[1] "mdd.com"    "mdd.com"    "mdd.com"    "genius.com" "google.com"
 gsub(".*www\\.([[:alnum:]]+\\.[[:alnum:]]+).*","\\1",url2)
#[1] "mdd.com"    "mdd.com"    "mdd.edu"    "genius.gov" "google.com"

gsub(".*www\\.[[:alnum:]]+\\.[[:alnum:]]+","",url)
#[1] "/food/pizza/index.html"     "/build-your-own/index.html"
#[3] "/special-deals.html"        "/find-a-location.html"     
#[5] "/hello.html"               
 gsub(".*www\\.[[:alnum:]]+\\.[[:alnum:]]+","",url2)
#[1] "/food/pizza/index.html"     "/build-your-own/index.html"
#[3] "/special-deals.html"        "/find-a-location.html"     
#[5] "/hello.html"            

A.K.


On Thursday, March 6, 2014 3:50 PM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
There are many ways to do this. Here's a simple version and a slightly
fancier version:


url = c("http://www.mdd.com/food/pizza/index.html",
"http://www.mdd.com/build-your-own/index.html",
"http://www.mdd.com/special-deals.html",
"http://www.genius.com/find-a-location.html",
"http://www.google.com/hello.html")


url2 = c("http://www.mdd.com/food/pizza/index.html",
"https://www.mdd.com/build-your-own/index.html",
"http://www.mdd.edu/special-deals.html",
"http://www.genius.gov/find-a-location.html",
"http://www.google.com/hello.html")


parse1 <- function(x) {
    # will work for https as well as http
    x <- sub("^http[s]*:\\/\\/", "", x)
    x <- sub("^www\\.", "", x)
    strsplit(x, "/")[[1]][1]
}

parse2 <- function(x) {
    # if you're sure it will always be .com
    strsplit(x, "\\.com")[[1]][2]
}

parse2a <- function(x) {
    # one way to split at any three-letter extension
    # assumes !S! won't appear in the URLs
    x <- sub("\\.[a-z]{3,3}\\/", "!S!\\/", x)
    strsplit(x, "!S!")[[1]][2]
}
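
As a quick worked example of what parse2a() does to one of the non-.com addresses:

parse2a("http://www.mdd.edu/special-deals.html")
# ".edu/" is first rewritten to "!S!/", so this should return "/special-deals.html"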

sapply(url, parse1)
sapply(url, parse2)

sapply(url2, parse1)
sapply(url2, parse2a)
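
Note that parse2() really does require the extension to be .com: for the .edu and .gov
entries in url2 there is no "\\.com" to split on, so strsplit() returns the whole string
as a single element and the [[1]][2] index comes back NA:

sapply(url2, parse2)
# should give NA for the www.mdd.edu and www.genius.gov URLs

That is why parse2a() first rewrites any three-letter extension to the "!S!" marker
before splitting.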


Sarah

On Thu, Mar 6, 2014 at 12:23 PM, Abraham Mathew <abmathewks at gmail.com> wrote:
> Let's say that I have the following character vector with a series of url
> strings. I'm interested in extracting some information from each string.
>
> url = c("http://www.mdd.com/food/pizza/index.html", "
> http://www.mdd.com/build-your-own/index.html",
>         "http://www.mdd.com/special-deals.html", "
> http://www.genius.com/find-a-location.html",
>         "http://www.google.com/hello.html")
>
> - First, I want to extract the domain name followed by .com. After
> struggling with this for a while, reading some regular expression
> tutorials, and reading through Stack Overflow, I came up with the following
> solution. Perfect!
>
>> parser <- function(x) gsub("www\\.", "",
> +     sapply(strsplit(gsub("http://", "", x), "/"), "[[", 1))
>> parser(url)
> [1] "mdd.com"    "mdd.com"    "mdd.com"    "genius.com" "google.com"
>
> - Second, I want to extract everything after .com in the original url.
> Unfortunately, I don't know the proper regular expression to assign in
> order to get the desired result. Can anyone help?
>
> Output should be
> /food/pizza/index.html
> /build-your-own/index.html
> /special-deals.html
>
> If anyone has a solution using the stringr package, that'd be of interest
> also.
>
>
> Thanks.
>

-- 
Sarah Goslee
http://www.functionaldiversity.org

