Extract URLs with regexp - Homework Solution

  • Thread starter: Trentonx

Discussion Overview

The discussion revolves around extracting URLs from a processed HTML file using regular expressions (regexp) in the context of a homework assignment. Participants explore various approaches to achieve the desired output of URLs while addressing challenges encountered in their attempts.

Discussion Character

  • Homework-related
  • Technical explanation
  • Exploratory

Main Points Raised

  • The original poster (OP), Trentonx, attempts to use sed to extract URLs but struggles with the regex patterns, specifically with matching and deleting the portions of each line before and after the URL.
  • A respondent suggests using the pattern "^.*http" to match everything up to the URL and "\">.*$" to match everything after the URL, while noting that the suggestion is untested.
  • The OP modifies the suggestion and uses '^.*="' to match right up to the start of the URL, and asks how the regex components work, particularly the role of the period (.).
  • The respondent questions the OP's modified approach, arguing that it should not work reliably, and proposes the alternative "s/^.*http/http/" for testing, while clarifying that the period (.) represents any character.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the most effective regex pattern to use, and there are competing suggestions regarding the correct approach. The discussion remains unresolved as participants continue to explore different regex options.

Contextual Notes

There are limitations in the regex patterns discussed, including potential issues with matching unintended characters and the need for further testing of proposed solutions.

Trentonx

Homework Statement


I have a file that contains lines like the following:
Code:
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.google.com/">Google</a></strong></font></div></td>
It has already been partially processed from an HTML file, but I want the final output to be the following:
Code:
http://www.yahoo.com/
http://www.google.com/

I am using sed to edit the file line by line and substitute.

Homework Equations


Nothing much here

The Attempt at a Solution


My idea was to use '*http' to match anything in front of http and then replace it with an empty string. This didn't actually match anything, which also ruled out a similar idea to match and delete everything after the .com/ portion. I also tried '<td>*="' to remove the portion before http, again replacing it with an empty string. Any help or hints would be appreciated, thanks.
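A minimal sketch of why neither attempt matches, assuming sed's default basic regular expressions. The sample line is shortened from the input above for readability, and the variable names are only illustrative. In this syntax, * is not a shell-style wildcard: it repeats the preceding item, and a leading * is taken as a literal asterisk.

```shell
# Sample line, shortened from the thread's input (the shortening is an assumption).
line='<td><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></td>'

# A leading * in a basic regular expression is a literal asterisk,
# so '*http' looks for the text "*http", which the line does not contain.
attempt1=$(printf '%s\n' "$line" | sed 's/*http//')

# '<td>*="' means "<td", then zero or more ">", then '="'.
# In the line, "<td>" is followed by "<strong>", not '="', so nothing matches.
attempt2=$(printf '%s\n' "$line" | sed 's/<td>*="//')

# Both attempts leave the line unchanged.
printf '%s\n%s\n' "$attempt1" "$attempt2"
```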
 
Trentonx: Try "^.*http" instead of "*http". Then try "\">.*$" to match everything after the URL. Let us know whether or not it works, since I have not tested it.
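As a rough illustration, the two suggested patterns can be tried as sed substitutions with an empty replacement, as the original attempt intended. The shortened sample line and the variable names are assumptions made for the sketch.

```shell
# Sample line, shortened from the thread's input (the shortening is an assumption).
line='<td><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></td>'

# ^.*http : from the start of the line, any run of characters, up to and including "http".
# ">.*$   : from the first '">' through the end of the line.
# Replacing both with nothing removes the "http" prefix along with the markup.
result=$(printf '%s\n' "$line" | sed -e 's/^.*http//' -e 's/">.*$//')

printf '%s\n' "$result"   # prints ://www.yahoo.com/
```

Because "http" is part of the first match, it is deleted together with everything before it, so the replacement text has to put it back if the full URL is wanted.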
 
That worked with a little modification. I realized I wanted to match right up to the http, so as not to remove it. I used '^.*="', which somehow didn't match the other occurrences of the same sequence in the file. So now, how does it do it? The ^ is an anchor to the start of a line, and the * is a wildcard, but what does the . do? You used it in both expressions, so it is likely useful to know.
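One possible explanation, sketched under the assumption that the href attribute is the last ="  sequence on each line: the period matches any single character, * repeats the preceding item, and the combination .* is greedy, so it runs as far right as it can while still letting the rest of the pattern match. The shortened sample line is an assumption.

```shell
# Sample line with two '="' sequences, shortened from the thread's input.
line='<td><divalign="center"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></td>'

# "." matches any single character and "*" repeats the preceding item zero or more
# times, so ".*" matches any run of characters. The match is greedy: ^.*=" extends
# to the LAST '="' on the line, which here is the one in ahref=".
result=$(printf '%s\n' "$line" | sed 's/^.*="//')
printf '%s\n' "$result"   # prints http://www.yahoo.com/">Yahoo!</a></strong></td>

# A single "." consumes exactly one character: "h..p" matches "http".
printf 'http\n' | sed 's/h..p/MATCH/'   # prints MATCH
```

If a line had another attribute after the link, the greedy match would run past the URL, so this pattern depends on the layout of the input.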
 
Trentonx: It seems that what you used should not work, because it could also match the other "=\"" sequences on the line, so it is not reliable. Instead, try "s/^.*http/http/" and see if that works (untested). Let us know. A period (.) means any single character.
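Putting the suggestions in this thread together, a minimal sketch of the full extraction might look like the following. The function name extract_urls is hypothetical, the sample lines are shortened from the input above, and the sketch assumes one link per line with "http" appearing only once on that line.

```shell
# Hypothetical helper name; reads lines on stdin and writes one URL per line.
extract_urls() {
  # 1) Replace everything up to and including "http" with "http" itself.
  # 2) Delete from the first remaining '">' to the end of the line.
  sed -e 's/^.*http/http/' -e 's/">.*$//'
}

urls=$(printf '%s\n' \
  '<td><divalign="center"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></div></td>' \
  '<td><divalign="center"><strong><ahref="http://www.google.com/">Google</a></strong></div></td>' \
  | extract_urls)

printf '%s\n' "$urls"
```

The order of the two substitutions matters: '">' also occurs earlier in each line (after center"), so the leading markup has to be removed first for the second pattern to start at the end of the URL.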
 
