# Extract URLs with regexp

## Homework Statement

I have a file that contains lines like the following:
Code:
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.google.com/">Google</a></strong></font></div></td>
The file has already been partly processed from an HTML file, but I want the final output to be the following:
Code:
http://www.yahoo.com/
http://www.google.com/
I am using sed to edit the file line by line and substitute.

## Homework Equations

Nothing much here

## The Attempt at a Solution

My idea was to use '*http' to match anything in front of http and then replace it with an empty string. This didn't actually match anything, which also ruled out a similar idea to match and delete everything after the .com/ portion. I also tried '<td>*="' to remove the portion before http, again replacing it with an empty string. Any help or hints would be appreciated, thanks.

nvn
Homework Helper
Trentonx: Try "^.*http", instead of "*http". And then try "\">.*\$", to try to match everything after the URL. Try it, and let us know whether or not it works, since I have not tested it.
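To see how the suggested patterns differ from the original attempt, it may help to run each one on a single sample line. This is only a sketch, assuming a POSIX-style sed with basic regular expressions; the variable names are illustrative, and the sample line is a shortened stand-in for the real input, not a line from the file.

```shell
# Shortened sample line (illustrative; not the full line from the file).
line='<td><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></td>'

# A leading * in a basic regular expression is a literal asterisk,
# so '*http' looks for the text "*http" and the line is left unchanged.
unchanged=$(printf '%s\n' "$line" | sed 's/*http//')

# '^.*http' matches from the start of the line through "http".
# Note that this removes "http" itself as well.
front_removed=$(printf '%s\n' "$line" | sed 's/^.*http//')

# '">.*$' matches from the quote that closes the URL to the end of the line.
back_removed=$(printf '%s\n' "$line" | sed 's/">.*$//')

printf '%s\n' "$unchanged" "$front_removed" "$back_removed"
```

On this shortened sample the third command leaves the tags in front of the URL in place, so it only shows the trailing match in isolation.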

That worked with a little modification. I realized I wanted to match only up to the http, so as not to remove it. I used '^.*="' which somehow didn't match the other identical expressions in the file. So now, how does it do it? The ^ is an anchor to the start of a line, and the * is a wildcard, but what does the . do? You used it in both expressions, so it is likely useful to know.
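For reference, the modified expression and the trailing one can be chained in a single sed call. This is a minimal sketch, assuming a POSIX-style sed; the `input` and `urls` variable names are illustrative, and the sample lines are the two input lines with the forum's link markup left out. In a regular expression, `.` matches any single character and `*` repeats the preceding item zero or more times, so `.*` means "any run of characters". The match is greedy, which is why `^.*="` runs all the way to the last `="` on the line rather than stopping at the first one.

```shell
# Two sample lines in the same shape as the input file.
input='<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.google.com/">Google</a></strong></font></div></td>'

# First substitution: delete from the start of the line through the last ="
# (.* is greedy, so it extends to the last place where =" can still match).
# Second substitution: delete from the "> that closes the URL to the end of the line.
urls=$(printf '%s\n' "$input" | sed -e 's/^.*="//' -e 's/">.*$//')

printf '%s\n' "$urls"
```

The order of the two expressions matters: on the full line, `">` also appears right after `"center"` and `"2"`, so running the second substitution first would cut the line off well before the URL.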

nvn