Extract URLs with regexp - Homework Solution

Trentonx · Jan 22, 2011

Homework Statement

I have a file that contains lines like the following:

Code:

<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="[PLAIN]http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>[/PLAIN] 
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="[PLAIN]http://www.google.com/">Google</a></strong></font></div></td>[/PLAIN]

It is already processed some from an html file, but I want the following to be the final output

Code:

http://www.yahoo.com/
http://www.google.com/

I am using sed to edit the file line by line and substitute.

Homework Equations

Nothing much here

The Attempt at a Solution

My idea was to say '*http' to match anything in front of http and then replace it with an empty string. This didn't actually match anything and negated a similar idea to match and delete everything after the .com/ portion. I also tried '<td>*="' to try and remove the portion before http and again replace with an empty string. Any help or hints would be appreciated, thanks

nvn · Jan 22, 2011

Trentonx: Try "^.*http", instead of "*http". And then try "\">.*$", to try to match everything after the URL. Try it, and let us know whether or not it works, since I have not tested it.

Trentonx · Jan 22, 2011

That worked with a little modification. I realized I wanted to match right before the http, so as not to remove it. I used '^.*="' which somehow dodn't match the other same expressions in the file. So now, how does it do it? The ^ is an anchor to the start of a line, and the * is a wildcard, but what does the . do? You used it in both expressions, so it is likely useful to know.

nvn · Jan 22, 2011

Trentonx: What you used should not be working, it seems, because it would match the other "=\"", and therefore, should not be reliable. Therefore, instead try "s/^.*http/http/", and see if that works (untested). Let us know. Period (.) means any character.

Extract URLs with regexp - Homework Solution

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

Homework Statement

Homework Equations

The Attempt at a Solution

Similar threads

Bending Stress and Shear Stress

Four L-shaped members: Mechanical Analysis Problem

Engineering Full bridge circuit with inductor and resistor

Engineering AM-AM and AM-PM graph generation in LTSpice

Engineering Enumeration vs Bisection Strategy

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight

Insights Relativator (Circular Slide-Rule): Simulated with Desmos - Insight

Insights Fixing Things Which Can Go Wrong With Complex Numbers