Extract URLs with regexp - Homework Solution

  • Thread starter: Trentonx
AI Thread Summary
The discussion focuses on extracting URLs from HTML-like lines using regular expressions in sed. The user initially attempts to match and remove unwanted text before and after the URLs but encounters issues with their regex patterns. Suggestions include using "^.*http" to match everything before the URL and "\">.*$" to remove text after it. Clarifications about regex components, such as the meaning of the period (.) as a wildcard for any character, are provided. The conversation emphasizes refining regex patterns to achieve the desired output of clean URLs.
Trentonx

Homework Statement


I have a file that contains lines like the following:
Code:
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.google.com/">Google</a></strong></font></div></td>
The file has already been partially processed from an HTML file, but I want the final output to be the following:
Code:
http://www.yahoo.com/
http://www.google.com/

I am using sed to edit the file line by line and substitute.

Homework Equations


Nothing much here

The Attempt at a Solution


My idea was to use '*http' to match anything in front of http and then replace it with an empty string. This didn't actually match anything, which also ruled out a similar idea for matching and deleting everything after the .com/ portion. I also tried '<td>*="' to remove the portion before http, again replacing it with an empty string. Any help or hints would be appreciated, thanks.
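
A short sketch of why neither attempt matches, assuming a POSIX-style sed and a shortened sample line (the shortening is an assumption made here for readability). In a basic regular expression, a * at the very start of the pattern is a literal asterisk, and elsewhere it repeats only the single item before it.

```shell
# Sample line, shortened from the thread's input (the shortening is an assumption).
line='<td><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></td>'

# A leading * in a basic regular expression is a literal asterisk,
# so this only matches the text "*http", which never occurs in the line.
out_star=$(printf '%s\n' "$line" | sed 's/*http//')

# Here * applies to the preceding ">", so the pattern means:
# "<td", then zero or more ">", then ="  (that sequence never occurs either).
out_td=$(printf '%s\n' "$line" | sed 's/<td>*="//')

# Both commands leave the line unchanged.
printf '%s\n%s\n' "$out_star" "$out_td"
```

To match "any run of characters", the * needs something to repeat, which is what the period in ".*" provides.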
 
Trentonx: Try "^.*http" instead of "*http". Then try "\">.*$" to match everything after the URL. Let us know whether or not it works, since I have not tested it.
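
As a rough illustration of what these two patterns do when both matches are replaced with an empty string, here is a minimal sketch on a shortened sample line (the shortening and the variable names are assumptions, not part of the thread). Note that the first pattern includes "http" itself, so that text is deleted along with the prefix.

```shell
# Shortened sample line in the thread's format (the shortening is an assumption).
line='<td><divalign="center"><ahref="http://www.yahoo.com/">Yahoo!</a></div></td>'

# 1st expression: delete everything from the start of the line through "http".
# 2nd expression: delete the first "> that remains and everything after it.
result=$(printf '%s\n' "$line" | sed -e 's/^.*http//' -e 's/">.*$//')

printf '%s\n' "$result"
# prints: ://www.yahoo.com/
```

The order of the two expressions matters in this sketch: run on the untouched line, the second pattern would match the earlier "> after align="center" and cut the line off before the URL.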
 
That worked with a little modification. I realized I wanted to match right up to the http, so as not to remove it. I used '^.*="', which somehow didn't match the other identical sequences in the line. So how does it work? The ^ is an anchor to the start of a line, and the * is a wildcard, but what does the . do? You used it in both expressions, so it is likely useful to know.
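
One likely explanation is greedy matching: the period matches any single character, the * repeats it zero or more times, and sed extends ".*" as far as it can, so '^.*="' runs through to the last =" on the line. The sketch below shows this on a shortened sample line, plus a hypothetical line (not from the thread) where an extra attribute after the URL makes the same command remove too much.

```shell
# Shortened sample line in the thread's format (the shortening is an assumption).
line='<td><divalign="center"><fontcolor="#0000ff"size="2"><ahref="http://www.yahoo.com/">Yahoo!</a></td>'

# .* is greedy: it runs to the LAST =" on the line, which here is the one before the URL.
kept=$(printf '%s\n' "$line" | sed 's/^.*="//')
printf '%s\n' "$kept"
# prints: http://www.yahoo.com/">Yahoo!</a></td>

# Hypothetical line with one more attribute after the URL: the last =" now sits
# after the URL, so the same command removes the URL as well.
fragile='<td><ahref="http://www.yahoo.com/"target="_blank">Yahoo!</a></td>'
lost=$(printf '%s\n' "$fragile" | sed 's/^.*="//')
printf '%s\n' "$lost"
# prints: _blank">Yahoo!</a></td>
```

So the pattern works on these particular lines because the link attribute happens to be the last one on each line.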
 
Trentonx: What you used does not look reliable to me, because it could also match the other "=\"" sequences in the line. Instead, try "s/^.*http/http/" and see if that works (untested). Let us know. A period (.) matches any single character.
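
Putting the pieces together, a minimal sketch of the whole substitution might look like the following. It assumes the two sample lines from the problem statement with the forum markup removed, one URL per line, and illustrative variable names; a real run would read the file directly instead of a shell variable.

```shell
# The two sample lines from the problem statement (forum markup removed).
input='<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.google.com/">Google</a></strong></font></div></td>'

# 1st expression: replace everything up to and including "http" with just "http".
# 2nd expression: delete the first "> that remains and everything after it.
urls=$(printf '%s\n' "$input" | sed -e 's/^.*http/http/' -e 's/">.*$//')

printf '%s\n' "$urls"
# prints:
# http://www.yahoo.com/
# http://www.google.com/
```

With a file on disk, the same two expressions could be passed as `sed -e 's/^.*http/http/' -e 's/">.*$//' FILE`, where FILE stands in for the actual file name.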
 
