Python More efficient way to do this using Python?

  • Thread starter Thread starter rollcast
  • Start date Start date
  • Tags Tags
    Python
AI Thread Summary
The discussion revolves around a Python coding challenge focused on extracting clickable web address links from a provided HTML snippet. The initial code attempts to locate and print URLs by searching for the `<a href=` tags, but the user acknowledges that the implementation could be improved for efficiency and robustness. Suggestions include using an XML parser to handle HTML more effectively, as it can create a structured document tree for easier navigation and extraction of `<a>` tags and their `href` attributes. Additionally, alternatives such as regular expressions are proposed to directly find URLs, although this method is noted to be error-prone. The conversation emphasizes the importance of using proper parsing techniques, especially in the context of building a search engine, which is the ultimate goal of the tutorial.
rollcast
Messages
403
Reaction score
0

Homework Statement


Its not really a homework problem as its just for a problem from an online Python tutorial but I thought it would fit in better here.

The problem is to find and then print out the all the web address links (clickable), from the source code of a web page.

The problem comes with a chunk of html code for you to find the links in it.

Html code:
Code:
<html xmlns="http://www.w3.org/1999/xhtml"><br/> <head><br/><title>Udacity</title> <br/></head><br/><br/><body> <br/><h1>Udacity</h1><br/><br/> <p><b>Udacity</b> is a private institution of <a href="[PLAIN]http://www.wikipedia.org/wiki/Higher_education">[/PLAIN] [/PLAIN]  higher education founded by</a> <a href="[PLAIN]http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian[/PLAIN] [/PLAIN]  Thrun</a>, David Stavens, and Mike Sokolsky with the goal to provide university-level education that is "both high quality and low cost".<br/>It is the outgrowth of a free computer science class offered in 2011 through Stanford University. Currently, Udacity is working on its second course on building a search engine. Udacity was announced at the 2012 <a href="[PLAIN]http://www.wikipedia.org/wiki/Digital_Life_Design">Digital[/PLAIN] [/PLAIN]  Life Design</a> conference.</p><br/></body><br/></html>

The Attempt at a Solution



My code in python:

Code:
page = '<html xmlns="http://www.w3.org/1999/xhtml"><br/> <head><br/><title>Udacity</title> <br/></head><br/><br/><body> <br/><h1>Udacity</h1><br/><br/> <p><b>Udacity</b> is a private institution of <a href="[PLAIN]http://www.wikipedia.org/wiki/Higher_education">[/PLAIN] [/PLAIN]  higher education founded by</a> <a href="[PLAIN]http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian[/PLAIN] [/PLAIN]  Thrun</a>, David Stavens, and Mike Sokolsky with the goal to provide university-level education that is "both high quality and low cost".<br/>It is the outgrowth of a free computer science class offered in 2011 through Stanford University. Currently, Udacity is working on its second course on building a search engine. Udacity was announced at the 2012 <a href="[PLAIN]http://www.wikipedia.org/wiki/Digital_Life_Design">Digital[/PLAIN] [/PLAIN]  Life Design</a> conference.</p><br/></body><br/></html>'
finished = False
start_link = 0
end_link = 0
previous_start_link = 0
while (True):
    start_link = page.find("<a href=", start_link + 1) + 9
    if (start_link < previous_start_link):
        break
    end_link = page.find('">', start_link)
    url = page[start_link:end_link]
    print url
    previous_start_link = start_link

However I realize that my implementation of a sort of "do-while" styled loop is quite chunky and probably could be refined along with my search method?

Thanks
Al
 
Last edited by a moderator:
Technology news on Phys.org
jedishrfu said:
the best way is to use an xml parser that will do the html parsing and present you with a tree of document components to look up what you want in this case <a> tags and what ther href attribute values are.

http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/

Thanks for your reply.

The series eventually should teach you how to make a basic search engine in Python so that parser could be useful later on.

However, what could I do to the code I have made to make it more efficient or better than it is now without vastly changing how it works?

Thanks
Al
 
well one thing I see is what if the "<a href=" is actually "<a href=" with extra spaces then your code will miss that url>

Can you search on say just http to find your strings? and then search for the next matching quote. Its more direct and avoids the "<a href=" extra spaces issue.

As far as your code goes it looks okay to me as a simple means of walking through the string of html code and finding the what you want.

An alternative to xml parsing would be a regular expression that extracted the string you wanted something along the lines of: /["]http[:].*["]/ to find all quoted strings that star with "http and end with "

import re

// get a list of strings matching the httpPattern
httplist = re.findall ( htpPattern, page )

However, parsing html in this manner is always tedious and error prone and that's why people tend to use xml parsing libraries and your teacher is probably trying the make that point with this exercise.
 
Dear Peeps I have posted a few questions about programing on this sectio of the PF forum. I want to ask you veterans how you folks learn program in assembly and about computer architecture for the x86 family. In addition to finish learning C, I am also reading the book From bits to Gates to C and Beyond. In the book, it uses the mini LC3 assembly language. I also have books on assembly programming and computer architecture. The few famous ones i have are Computer Organization and...
I had a Microsoft Technical interview this past Friday, the question I was asked was this : How do you find the middle value for a dataset that is too big to fit in RAM? I was not able to figure this out during the interview, but I have been look in this all weekend and I read something online that said it can be done at O(N) using something called the counting sort histogram algorithm ( I did not learn that in my advanced data structures and algorithms class). I have watched some youtube...

Similar threads

Replies
2
Views
2K
Replies
2
Views
1K
Replies
9
Views
2K
Replies
3
Views
2K
Replies
4
Views
3K
Replies
3
Views
9K
Back
Top