[R] regex - extracting src url
Omar André Gonzáles Díaz
oma.gonzales at gmail.com
Tue Mar 22 05:44:22 CET 2016
Hi,I have a DF with a column with "html", like this:
<IMG SRC="
https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"
BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">
I need to get this:
https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=
?
I've got this so far:
https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\"
BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement
With this is the code I've used:
carreras_normal$Impression.Tag..image. <-
gsub("<img.+?src=[\"'](.*?)[\"'].*?>","\\1",carreras_normal$Impression.Tag..image.,
ignore.case = T)
*But I still need to use get rid of this part:*
https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=
?*\" BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement*
Thank you for your help.
Omar Gonzáles.
[[alternative HTML version deleted]]
More information about the R-help
mailing list