More efficient way to do this using Python?

jedishrfu · Jun 18, 2012

The best way is to use an XML parser that will do the HTML parsing and present you with a tree of document components, in which you can look up what you want: in this case, <a> tags and their href attribute values.

http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/
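A minimal sketch of the parser approach follows. It uses the standard library's html.parser rather than ElementTree, because real-world HTML is often not well-formed XML. The class name LinkCollector and the sample page string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Sample input, made up for this example; the second tag has extra spaces
page = '<p><a href="http://example.com/a">A</a> <a   href="http://example.com/b">B</a></p>'

collector = LinkCollector()
collector.feed(page)
links = collector.links
```

The parser does the tag and attribute splitting itself, so the extra spaces in the second tag need no special handling.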

rollcast · Jun 18, 2012

jedishrfu said:

The best way is to use an XML parser that will do the HTML parsing and present you with a tree of document components, in which you can look up what you want: in this case, <a> tags and their href attribute values.

http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/

Thanks for your reply.

The series should eventually teach you how to make a basic search engine in Python, so that parser could be useful later on.

However, what could I do to the code I have made to make it more efficient or better than it is now without vastly changing how it works?

Thanks
Al

jedishrfu · Jun 18, 2012

Well, one thing I see: what if the "<a href=" is actually "<a   href=" with extra spaces? Then your code will miss that URL.

Can you search on, say, just "http" to find your strings, and then search for the next matching quote? It's more direct and avoids the "<a href=" extra-spaces issue.
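A rough sketch of that idea, assuming every URL of interest sits inside double quotes and starts with http; the function name find_http_urls and the sample page are made up for illustration:

```python
def find_http_urls(page):
    """Return every double-quoted string in page that starts with http."""
    urls = []
    pos = 0
    while True:
        start = page.find('"http', pos)   # opening quote followed by http
        if start == -1:
            break
        end = page.find('"', start + 1)   # the next matching quote
        if end == -1:
            break
        urls.append(page[start + 1:end])  # text between the two quotes
        pos = end + 1                     # continue after the closing quote
    return urls


# Sample input, made up for this example; the second tag has extra spaces
page = '<a href="http://example.com/a">A</a> <a   href="http://example.com/b">B</a>'
urls = find_http_urls(page)
```

Because the search never looks at the "<a href=" part, spacing inside the tag does not matter. The trade-off is that it also picks up quoted http strings that are not links, such as image sources.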

As far as your code goes, it looks okay to me as a simple means of walking through the string of HTML and finding what you want.

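An alternative to XML parsing would be a regular expression that extracts the string you want, something along the lines of /["]http[:][^"]*["]/, to find all quoted strings that start with "http: and end with the next ". (Using [^"]* rather than .* stops the match at the closing quote instead of running on to the last quote on the line.)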

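import re

# quoted strings that start with "http: and stop at the next quote
httpPattern = r'"http:[^"]*"'

# get a list of strings matching httpPattern (page is the HTML string from your code)
httplist = re.findall(httpPattern, page)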
However, parsing HTML in this manner is always tedious and error-prone, which is why people tend to use XML parsing libraries; your teacher is probably trying to make that point with this exercise.
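To illustrate both the regex idea and why it gets tedious, here is a sketch of a pattern that tolerates the extra-spaces case mentioned earlier. The name href_pattern and the sample page are made up for illustration, and the pattern assumes href is the first attribute and is double-quoted.

```python
import re

# \s+ and \s* absorb extra whitespace; the group captures the text between the quotes
href_pattern = re.compile(r'<a\s+href\s*=\s*"([^"]*)"', re.IGNORECASE)

# Sample input, made up for this example; the second tag has extra spaces
page = '<a href="http://example.com/a">A</a> <a   href = "http://example.com/b">B</a>'
hrefs = href_pattern.findall(page)
```

Even this version misses tags where another attribute comes before href, or where the value is single-quoted. Each such case needs another tweak to the pattern, which is the kind of work a parser does for you.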
