More efficient way to do this using Python?

  • Thread starter rollcast
  • Tags
    Python
In summary, the problem at hand is to find and print all the web address links in a given web page's source code. The solution provided is Python code that uses a "do-while"-style loop to search for and extract the links. However, a better approach would be to use an XML parser, which presents the markup as a tree and makes it easier to search for specific elements, such as the <a> tags and their href attribute values. Another alternative is to use regular expressions to extract the desired strings, but this can also be tedious and error-prone. The ultimate goal of the tutorial series is to teach how to build a basic search engine in Python.
  • #1
rollcast

Homework Statement


It's not really a homework problem, as it's just a problem from an online Python tutorial, but I thought it would fit in better here.

The problem is to find and print out all the web address links (clickable) from the source code of a web page.

The problem comes with a chunk of HTML code in which to find the links.

HTML code:
Code:
<html xmlns="http://www.w3.org/1999/xhtml"><br/> <head><br/><title>Udacity</title> <br/></head><br/><br/><body> <br/><h1>Udacity</h1><br/><br/> <p><b>Udacity</b> is a private institution of <a href="http://www.wikipedia.org/wiki/Higher_education">higher education founded by</a> <a href="http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian Thrun</a>, David Stavens, and Mike Sokolsky with the goal to provide university-level education that is "both high quality and low cost".<br/>It is the outgrowth of a free computer science class offered in 2011 through Stanford University. Currently, Udacity is working on its second course on building a search engine. Udacity was announced at the 2012 <a href="http://www.wikipedia.org/wiki/Digital_Life_Design">Digital Life Design</a> conference.</p><br/></body><br/></html>

The Attempt at a Solution



My code in python:

Code:
page = '<html xmlns="http://www.w3.org/1999/xhtml"><br/> <head><br/><title>Udacity</title> <br/></head><br/><br/><body> <br/><h1>Udacity</h1><br/><br/> <p><b>Udacity</b> is a private institution of <a href="http://www.wikipedia.org/wiki/Higher_education">higher education founded by</a> <a href="http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian Thrun</a>, David Stavens, and Mike Sokolsky with the goal to provide university-level education that is "both high quality and low cost".<br/>It is the outgrowth of a free computer science class offered in 2011 through Stanford University. Currently, Udacity is working on its second course on building a search engine. Udacity was announced at the 2012 <a href="http://www.wikipedia.org/wiki/Digital_Life_Design">Digital Life Design</a> conference.</p><br/></body><br/></html>'
finished = False
start_link = 0
end_link = 0
previous_start_link = 0
while (True):
    start_link = page.find("<a href=", start_link + 1) + 9
    if (start_link < previous_start_link):
        break
    end_link = page.find('">', start_link)
    url = page[start_link:end_link]
    print(url)
    previous_start_link = start_link

However, I realize that my implementation of a sort of "do-while"-style loop is quite clunky and could probably be refined, along with my search method.

Thanks
Al
 
  • #3
jedishrfu said:
The best way is to use an XML parser that will do the HTML parsing and present you with a tree of document components in which to look up what you want: in this case, the <a> tags and their href attribute values.

http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/

Thanks for your reply.

The series eventually should teach you how to make a basic search engine in Python so that parser could be useful later on.

However, what could I do to the code I have made to make it more efficient or better than it is now without vastly changing how it works?

Thanks
Al
 
  • #4
Well, one thing I see: what if the "<a href=" is actually "<a   href=" with extra spaces? Then your code will miss that URL.

Can you search on, say, just http to find your strings, and then search for the next matching quote? It's more direct and avoids the "<a href=" extra-spaces issue.

As far as your code goes, it looks okay to me as a simple means of walking through the string of HTML code and finding what you want.
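A minimal sketch of that kind of walk, searching for the quoted http prefix and then the next quote as suggested above. It tests the return value of find() against -1 instead of relying on offset arithmetic to end the loop. The function name get_links and the shortened sample string are assumptions made for illustration; they are not from the tutorial.

```python
def get_links(page):
    """Collect every double-quoted string in page that starts with http."""
    links = []
    position = 0
    while True:
        # Opening quote of the next quoted http string.
        start = page.find('"http', position)
        if start == -1:
            break  # find() returns -1 once nothing is left to match
        # The URL runs up to the next double quote.
        end = page.find('"', start + 1)
        if end == -1:
            break  # unterminated quote: stop rather than guess
        links.append(page[start + 1:end])
        position = end + 1
    return links


# Shortened sample (an assumption), with extra spaces in the second tag.
sample = ('<p><a href="http://www.wikipedia.org/wiki/Higher_education">higher education</a> '
          'and <a   href="http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian Thrun</a></p>')
for url in get_links(sample):
    print(url)
```

One caveat with this approach: on the full page it would also pick up the xmlns="http://www.w3.org/1999/xhtml" attribute, since that is a quoted string starting with http but not a link.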

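An alternative to XML parsing would be a regular expression that extracts the strings you want, something along the lines of /["]http[:].*?["]/, to find all quoted strings that start with "http and end with ". The .*? is non-greedy, so each match stops at the first closing quote instead of running on to the last quote in the page.

import re

# non-greedy pattern: a quoted string that starts with http:
httpPattern = r'["]http[:].*?["]'

# get a list of strings matching httpPattern
httplist = re.findall(httpPattern, page)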

However, parsing HTML in this manner is always tedious and error-prone, and that's why people tend to use XML parsing libraries. Your teacher is probably trying to make that point with this exercise.
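For comparison, here is a rough sketch of the parser route using the standard library's xml.etree.ElementTree. It only works because the tutorial page happens to be well-formed XHTML; ordinary HTML often is not. The function name get_links_xml and the shortened sample are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The page declares this default namespace, so ElementTree reports
# every tag as "{namespace}tagname".
XHTML_NS = "http://www.w3.org/1999/xhtml"


def get_links_xml(page):
    """Return the href value of every <a> element in a well-formed XHTML page."""
    root = ET.fromstring(page)
    links = []
    for anchor in root.iter("{%s}a" % XHTML_NS):
        href = anchor.get("href")
        if href is not None:
            links.append(href)
    return links


# Shortened sample (an assumption), with extra spaces in the second tag.
sample_xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p><a href="http://www.wikipedia.org/wiki/Higher_education">higher education</a> '
    '<a  href="http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian Thrun</a></p>'
    '</body></html>'
)
print(get_links_xml(sample_xhtml))
```

The extra spaces inside the second tag make no difference here, because the parser handles the tokenizing.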
 
  • #5
ison

I understand the importance of efficiency in problem-solving. In this case, using Python to find and print out web links from a webpage can definitely be done in a more efficient way.

One approach could be to use the BeautifulSoup library in Python, which is specifically designed for parsing HTML and XML files. It allows you to easily navigate through the HTML structure and extract the information you need, such as web links.
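BeautifulSoup is a third-party package, so as a stand-in the sketch below uses the standard library's html.parser module to show the same idea: let a parser find the <a> tags and read off their href attributes. This is a swapped-in technique, not BeautifulSoup's API. The names LinkCollector and get_links_html, and the sample string, are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records the href of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def get_links_html(page):
    collector = LinkCollector()
    collector.feed(page)
    collector.close()
    return collector.links


# Sample (an assumption): mixed case, extra spaces, and single quotes.
sample_html = ('<p><a href="http://www.wikipedia.org/wiki/Higher_education">higher education</a> '
               "<A  HREF='http://www.wikipedia.org/wiki/Sebastian_Thrun'>Sebastian Thrun</A></p>")
print(get_links_html(sample_html))
```

Unlike the XML route, this parser does not require the page to be well-formed.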

Another approach could be to use regular expressions to search for patterns in the HTML code that indicate a web link, and then extract the link using the re library in Python.
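As a hedged sketch of the regular-expression route, the pattern below captures the double-quoted href value of each <a> tag and tolerates extra whitespace and other attributes before href. The pattern and the name get_links_regex are assumptions, one of many possible choices. It ignores single-quoted or unquoted values and would also match an attribute whose name merely ends in href.

```python
import re

# <a, some whitespace, optionally other attributes, then href = "value".
# The parentheses capture only the value between the quotes.
HREF_PATTERN = re.compile(r'<a\s+[^>]*?href\s*=\s*"([^"]*)"', re.IGNORECASE)


def get_links_regex(page):
    """Return the captured href value from every matching <a> tag."""
    return HREF_PATTERN.findall(page)


# Sample (an assumption): extra spaces and an extra attribute before href.
sample_tags = ('<a   href="http://www.wikipedia.org/wiki/Higher_education">higher education</a> '
               '<a class="ref" href="http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian Thrun</a>')
print(get_links_regex(sample_tags))
```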

Both of these methods would likely be more efficient and concise than your current approach. I would recommend exploring and experimenting with different methods to find the most efficient solution for your problem.
 

