1. Limited time only! Sign up for a free 30min personal tutor trial with Chegg Tutors
    Dismiss Notice
Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

More efficient way to do this using Python?

  1. Jun 18, 2012 #1
    1. The problem statement, all variables and given/known data
    Its not really a homework problem as its just for a problem from an online Python tutorial but I thought it would fit in better here.

    The problem is to find and then print out the all the web address links (clickable), from the source code of a web page.

    The problem comes with a chunk of html code for you to find the links in it.

    Html code:
    Code (Text):
    <html xmlns="http://www.w3.org/1999/xhtml"><br/> <head><br/><title>Udacity</title> <br/></head><br/><br/><body> <br/><h1>Udacity</h1><br/><br/> <p><b>Udacity</b> is a private institution of <a href="[PLAIN]http://www.wikipedia.org/wiki/Higher_education">[/PLAIN] [Broken][/PLAIN] [Broken] higher education founded by</a> <a href="[PLAIN]http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian[/PLAIN] [Broken][/PLAIN] [Broken] Thrun</a>, David Stavens, and Mike Sokolsky with the goal to provide university-level education that is "both high quality and low cost".<br/>It is the outgrowth of a free computer science class offered in 2011 through Stanford University. Currently, Udacity is working on its second course on building a search engine. Udacity was announced at the 2012 <a href="[PLAIN]http://www.wikipedia.org/wiki/Digital_Life_Design">Digital[/PLAIN] [Broken][/PLAIN] [Broken] Life Design</a> conference.</p><br/></body><br/></html>
    3. The attempt at a solution

    My code in python:

    Code (Text):
    page = '<html xmlns="http://www.w3.org/1999/xhtml"><br/> <head><br/><title>Udacity</title> <br/></head><br/><br/><body> <br/><h1>Udacity</h1><br/><br/> <p><b>Udacity</b> is a private institution of <a href="[PLAIN]http://www.wikipedia.org/wiki/Higher_education">[/PLAIN] [Broken][/PLAIN] [Broken] higher education founded by</a> <a href="[PLAIN]http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian[/PLAIN] [Broken][/PLAIN] [Broken] Thrun</a>, David Stavens, and Mike Sokolsky with the goal to provide university-level education that is "both high quality and low cost".<br/>It is the outgrowth of a free computer science class offered in 2011 through Stanford University. Currently, Udacity is working on its second course on building a search engine. Udacity was announced at the 2012 <a href="[PLAIN]http://www.wikipedia.org/wiki/Digital_Life_Design">Digital[/PLAIN] [Broken][/PLAIN] [Broken] Life Design</a> conference.</p><br/></body><br/></html>'
    finished = False
    start_link = 0
    end_link = 0
    previous_start_link = 0
    while (True):
        start_link = page.find("<a href=", start_link + 1) + 9
        if (start_link < previous_start_link):
            break
        end_link = page.find('">', start_link)
        url = page[start_link:end_link]
        print url
        previous_start_link = start_link
    However I realise that my implementation of a sort of "do-while" styled loop is quite chunky and probably could be refined along with my search method?

    Thanks
    Al
     
    Last edited by a moderator: May 6, 2017
  2. jcsd
  3. Jun 18, 2012 #2

    jedishrfu

    Staff: Mentor

  4. Jun 18, 2012 #3
    Thanks for your reply.

    The series eventually should teach you how to make a basic search engine in Python so that parser could be useful later on.

    However, what could I do to the code I have made to make it more efficient or better than it is now without vastly changing how it works?

    Thanks
    Al
     
  5. Jun 18, 2012 #4

    jedishrfu

    Staff: Mentor

    well one thing I see is what if the "<a href=" is actually "<a href=" with extra spaces then your code will miss that url>

    Can you search on say just http to find your strings? and then search for the next matching quote. Its more direct and avoids the "<a href=" extra spaces issue.

    As far as your code goes it looks okay to me as a simple means of walking through the string of html code and finding the what you want.

    An alternative to xml parsing would be a regular expression that extracted the string you wanted something along the lines of: /["]http[:].*["]/ to find all quoted strings that star with "http and end with "

    import re

    // get a list of strings matching the httpPattern
    httplist = re.findall ( htpPattern, page )

    However, parsing html in this manner is always tedious and error prone and that's why people tend to use xml parsing libraries and your teacher is probably trying the make that point with this exercise.
     
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook




Similar Discussions: More efficient way to do this using Python?
Loading...