[R] regex - extracting src url

Omar André Gonzáles Díaz oma.gonzales at gmail.com
Tue Mar 22 05:44:22 CET 2016


Hi, I have a data frame with a column that contains HTML, like this:

<IMG SRC="
https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"
BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">


I need to get this:


https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?


I've got this so far:


https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\"
BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement


This is the code I've used:

carreras_normal$Impression.Tag..image. <-
  gsub("<img.+?src=[\"'](.*?)[\"'].*?>", "\\1",
       carreras_normal$Impression.Tag..image.,
       ignore.case = TRUE)



*But I still need to get rid of this part:*


https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=
?*\" BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement*
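
One possible way around this, offered only as a sketch: capture the URL with a negated character class instead of a lazy `(.*?)`, so the capture cannot run past the closing quote, and match the whole string with perl = TRUE so everything outside the capture is dropped. This assumes each cell holds a single <img> tag and that the URL itself contains no quote characters. The names `tag`, `pattern` and `src` are made up for the illustration.

```r
# Sample value copied from the post (a single element; the real column has many).
tag <- paste0(
  "<IMG SRC=\"",
  "https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/",
  "B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;",
  "ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\"",
  " BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement\">"
)

# (?i) ignores case, (?s) lets "." match newlines as well.
# ([^\"']*) captures everything after the opening quote up to the next quote.
# The leading ^.*? and trailing .*$ consume the rest of the string,
# so only the captured URL is left after the substitution.
pattern <- "(?is)^.*?<img[^>]*?src\\s*=\\s*[\"']([^\"']*)[\"'].*$"

# trimws() drops any whitespace or line breaks around the URL.
src <- trimws(sub(pattern, "\\1", tag, perl = TRUE))
print(src)

# The same call should work on the whole column, for example:
# carreras_normal$Impression.Tag..image. <-
#   trimws(sub(pattern, "\\1", carreras_normal$Impression.Tag..image., perl = TRUE))
```

Strings that do not match the pattern are returned unchanged by sub(), so it may be worth checking the result for leftover "<img" text afterwards.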


Thank you for your help.

Omar Gonzáles.



