- #1
rollcast
- 408
- 0
Homework Statement
Its not really a homework problem as its just for a problem from an online Python tutorial but I thought it would fit in better here.
The problem is to find and then print out the all the web address links (clickable), from the source code of a web page.
The problem comes with a chunk of html code for you to find the links in it.
Html code:
Code:
<html xmlns="http://www.w3.org/1999/xhtml"><br/> <head><br/><title>Udacity</title> <br/></head><br/><br/><body> <br/><h1>Udacity</h1><br/><br/> <p><b>Udacity</b> is a private institution of <a href="[PLAIN]http://www.wikipedia.org/wiki/Higher_education">[/PLAIN] [/PLAIN] higher education founded by</a> <a href="[PLAIN]http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian[/PLAIN] [/PLAIN] Thrun</a>, David Stavens, and Mike Sokolsky with the goal to provide university-level education that is "both high quality and low cost".<br/>It is the outgrowth of a free computer science class offered in 2011 through Stanford University. Currently, Udacity is working on its second course on building a search engine. Udacity was announced at the 2012 <a href="[PLAIN]http://www.wikipedia.org/wiki/Digital_Life_Design">Digital[/PLAIN] [/PLAIN] Life Design</a> conference.</p><br/></body><br/></html>
The Attempt at a Solution
My code in python:
Code:
page = '<html xmlns="http://www.w3.org/1999/xhtml"><br/> <head><br/><title>Udacity</title> <br/></head><br/><br/><body> <br/><h1>Udacity</h1><br/><br/> <p><b>Udacity</b> is a private institution of <a href="[PLAIN]http://www.wikipedia.org/wiki/Higher_education">[/PLAIN] [/PLAIN] higher education founded by</a> <a href="[PLAIN]http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian[/PLAIN] [/PLAIN] Thrun</a>, David Stavens, and Mike Sokolsky with the goal to provide university-level education that is "both high quality and low cost".<br/>It is the outgrowth of a free computer science class offered in 2011 through Stanford University. Currently, Udacity is working on its second course on building a search engine. Udacity was announced at the 2012 <a href="[PLAIN]http://www.wikipedia.org/wiki/Digital_Life_Design">Digital[/PLAIN] [/PLAIN] Life Design</a> conference.</p><br/></body><br/></html>'
finished = False
start_link = 0
end_link = 0
previous_start_link = 0
while (True):
start_link = page.find("<a href=", start_link + 1) + 9
if (start_link < previous_start_link):
break
end_link = page.find('">', start_link)
url = page[start_link:end_link]
print url
previous_start_link = start_link
However I realize that my implementation of a sort of "do-while" styled loop is quite chunky and probably could be refined along with my search method?
Thanks
Al
Last edited by a moderator: