
Extract URLs with regexp

  • Thread starter Trentonx
  • #1
39
0

Homework Statement


I have a file that contains lines like the following:
Code:
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.google.com/">Google</a></strong></font></div></td>
It has already been partly processed from an HTML file, but I want the final output to be the following:
Code:
http://www.yahoo.com/
http://www.google.com/
I am using sed to edit the file line by line and substitute.

Homework Equations


Nothing much here


The Attempt at a Solution


My idea was to use '*http' to match anything in front of http and then replace it with an empty string. This didn't actually match anything, which also ruled out a similar idea to match and delete everything after the .com/ portion. I also tried '<td>*="' to remove the portion before http and again replace it with an empty string. Any help or hints would be appreciated, thanks.

Answers and Replies

  • #2
nvn
Science Advisor
Homework Helper
2,128
32
Trentonx: Try "^.*http", instead of "*http". And then try "\">.*$", to try to match everything after the URL. Try it, and let us know whether or not it works, since I have not tested it.
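A minimal sketch of how those two patterns could be combined in one sed call, assuming the sample lines from the first post with the forum markup residue removed. The variable names (yahoo, google, urls) are made up for illustration. The order of the two substitutions matters here: the prefix has to go first, because the sequence "> also appears earlier on each line, and sed replaces the leftmost match.

```shell
# Sample lines from the first post (hypothetical variable names).
yahoo='<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>'
google='<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.google.com/">Google</a></strong></font></div></td>'

# 1) Replace everything from the start of the line through "http" with "http".
# 2) Delete from the first remaining "> through the end of the line.
urls=$(printf '%s\n' "$yahoo" "$google" |
  sed -e 's/^.*http/http/' -e 's/">.*$//')

printf '%s\n' "$urls"
# prints:
# http://www.yahoo.com/
# http://www.google.com/
```

On a real file, the same two expressions would be given to sed with the file name as the last argument instead of piping the lines in.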
 
  • #3
39
0
That worked with a little modification. I realized I wanted to match right up to the http, so as not to remove it. I used '^.*="', which somehow didn't match the other identical sequences on the line. So now, how does it do it? The ^ is an anchor to the start of a line, and the * is a wildcard, but what does the . do? You used it in both expressions, so it is likely useful to know.
 
  • #4
nvn
Science Advisor
Homework Helper
2,128
32
Trentonx: What you used looks unreliable to me, because the line contains other occurrences of "=\"" that it could match. Instead, try "s/^.*http/http/" and see whether that works (untested). Let us know. A period (.) matches any single character.
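A short sketch of what is probably going on with '^.*="', assuming the sample line from the first post. In sed, * is not a wildcard by itself: it means "zero or more of the preceding item", so .* means "any run of characters". That run is greedy, so ^.*=" extends to the last =" on the line, which in this data happens to be the one just before the URL. It would stop working if a =" appeared after the URL. The variable names below are made up for illustration.

```shell
line='<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>'

# Greedy match: .* runs to the LAST =" on the line (the one in ahref=").
greedy=$(printf '%s\n' "$line" | sed 's/^.*="//')
printf '%s\n' "$greedy"
# prints: http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>

# A lone period matches exactly one character, whatever it is.
dots=$(printf 'cat cut c-t\n' | sed 's/c.t/X/g')
printf '%s\n' "$dots"
# prints: X X X
```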
 
