Extract URLs with regexp - Homework Solution

Trentonx · Jan 22, 2011

Homework Statement

I have a file that contains lines like the following:

Code:

<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.google.com/">Google</a></strong></font></div></td>

The file has already been partly processed from an HTML file, but I want the final output to be the following:

Code:

http://www.yahoo.com/
http://www.google.com/

I am using sed to edit the file line by line and substitute.

Homework Equations

Nothing much here

The Attempt at a Solution

My idea was to use '*http' to match anything in front of http and then replace it with an empty string. This didn't actually match anything, which also ruled out a similar idea for matching and deleting everything after the .com/ portion. I also tried '<td>*="' to remove the portion before http, again replacing it with an empty string. Any help or hints would be appreciated, thanks.

nvn · Jan 22, 2011

Trentonx: Try "^.*http", instead of "*http". And then try "\">.*$", to try to match everything after the URL. Try it, and let us know whether or not it works, since I have not tested it.
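As a rough sketch of how the two suggested expressions could be combined in one sed call: the lines in `sample` are shortened stand-ins for the poster's data, and the variable names `sample` and `urls` are assumptions made for illustration. Inside single quotes the double quote needs no backslash.

```shell
# Shortened stand-ins for the two sample lines (assumed for illustration).
sample='<td><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></td>
<td><strong><ahref="http://www.google.com/">Google</a></strong></td>'

# First expression: delete everything before "http", keeping "http" itself.
# Second expression: delete from the closing quote of the href to end of line.
urls=$(printf '%s\n' "$sample" | sed -e 's/^.*http/http/' -e 's/">.*$//')

printf '%s\n' "$urls"
```

This relies on "http" appearing only once per line, as it does in the sample; a line whose link text also contained "http" would need a tighter pattern.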

Trentonx · Jan 22, 2011

That worked with a little modification. I realized I wanted to match right before the http, so as not to remove it. I used '^.*="' which somehow didn't match the other identical expressions in the file. So now, how does it do it? The ^ is an anchor to the start of a line, and the * is a wildcard, but what does the . do? You used it in both expressions, so it is likely useful to know.

nvn · Jan 22, 2011

Trentonx: What you used should not be reliable, it seems, because it could also match the other "=\"" occurrences on the line. Instead, try "s/^.*http/http/" and see if that works (untested). Let us know. A period (.) matches any single character, and * repeats the preceding item zero or more times, so ".*" matches any run of characters.
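To illustrate the behavior discussed here, a hedged sketch on a shortened sample line (the `line` variable and its contents are stand-ins, not the poster's actual file). Because ".*" is greedy, '^.*="' extends to the last =" on the line, which in this data is the one in ahref=". It would cut too much if another attribute followed the URL. The second command shows why the original '*http' attempt matched nothing.

```shell
# Shortened stand-in for one sample line (assumed for illustration).
line='<td><divalign="center"><ahref="http://www.yahoo.com/">Yahoo!</a></div></td>'

# "." matches any single character; "*" repeats the preceding item zero or
# more times. ".*" is greedy, so '^.*="' runs to the LAST =" on the line.
after_last_eq=$(printf '%s\n' "$line" | sed 's/^.*="//')

# In a POSIX basic regular expression a leading "*" is a literal asterisk,
# so '*http' looks for the text "*http" and leaves this line unchanged.
unchanged=$(printf '%s\n' "$line" | sed 's/*http//')

printf '%s\n%s\n' "$after_last_eq" "$unchanged"
```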
