Why Does My Python URL Extraction Code Not Work?

Discussion Overview

The discussion revolves around troubleshooting a Python code snippet intended to extract a URL from an HTML link tag. Participants are examining the code's logic and identifying potential errors in its implementation.

Discussion Character

  • Technical explanation
  • Debugging

Main Points Raised

  • One participant describes the intended functionality of the code and notes that it is not working as expected, specifically that it only prints part of the URL.
  • Another participant suggests adding print statements to check the values of variables to ensure the code is functioning as intended.
  • A third participant points out that the issue lies in the use of extra space characters in the search strings for the find() function, which leads to incorrect results.
  • Another participant emphasizes the importance of learning to use a debugger, mentioning Python's built-in debugger module, pdb, and shares links to articles about it.
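
As a rough illustration of the debugger suggestion, here is a minimal sketch using the standard-library pdb module. The helper name first_url and its debug flag are hypothetical, not something from the thread; the breakpoint only triggers when debug=True, so the snippet runs straight through by default.

```python
import pdb


def first_url(page, debug=False):
    """Hypothetical helper: return the first quoted value after '<a href='."""
    start_link = page.find('<a href=')
    # str.find accepts an optional start index, so no intermediate slices are needed
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    if debug:
        # Drops into the (Pdb) prompt; try: p start_quote, p end_quote, then c to continue
        pdb.set_trace()
    return page[start_quote + 1:end_quote]


page = '<a href="http://udacity.com">Hello world</a>'
print(first_url(page))
```

Calling first_url(page, debug=True) pauses just before the return so the index variables can be inspected. On Python 3.7 and later, the built-in breakpoint() can be used in place of pdb.set_trace().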

Areas of Agreement / Disagreement

Participants generally agree that there are issues with the code, particularly regarding the search strings used in the find() function. However, there is no consensus on the best approach to resolve the problem, as different debugging strategies are suggested.

Contextual Notes

Limitations include a possible misunderstanding of how find() works (it matches its argument literally, whitespace included, and returns -1 when there is no match) and the need for debugging skills to trace what the code actually does.

doktorwho
Thread moved from a technical forum, so homework template is missing
I am supposed to write code that prints out the URL of the link given below. The URL is defined to start where the first " appears and end where the closing " appears, searching from start_link. It's actually the last project from Lecture 1 on Udacity and my first code... but it's wrong, haha.
Here it goes:
# Write Python code that assigns to the
# variable url a string that is the value
# of the first URL that appears in a link
# tag in the string page.
# Your code should print http://udacity.com
# Make sure that if page were changed to

# page = '<a href="http://udacity.com">Hello world</a>'

# that your code still assigns the same value to the variable 'url',
# and therefore still prints the same thing.

# page = contents of a web page
page =('<div id="top_bin"><div id="top_content" class="width960">'
'<div class="udacity float-left"><a href="http://udacity.com">')

start_link = page.find('<a href=')
new_page=page[start_link:]
num_ofstart=new_page.find(' " ')
new_page1=new_page[(num_ofstart+1):]
num_ofend=new_page1.find(' " ')
url=new_page[(num_ofstart):(num_ofend)]
print(url)
It prints out only "http://ud
What's wrong?
 
doktorwho said:
num_ofend=new_page1.find(' " ')
url=new_page[(num_ofstart):(num_ofend)]
In between these two lines add code to print out the values of new_page, num_ofstart, and num_ofend to make sure that you and your code are operating in sync.
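
A minimal sketch of that suggestion, assuming the code exactly as posted (search strings with the spaces included). The labels and the use of repr() are illustrative additions, chosen so that quotes and spaces show up clearly in the output.

```python
page = ('<div id="top_bin"><div id="top_content" class="width960">'
        '<div class="udacity float-left"><a href="http://udacity.com">')

start_link = page.find('<a href=')
new_page = page[start_link:]
num_ofstart = new_page.find(' " ')   # search string exactly as posted
new_page1 = new_page[(num_ofstart + 1):]
num_ofend = new_page1.find(' " ')    # search string exactly as posted

# Temporary debugging output, placed between the two quoted lines
print('new_page    =', repr(new_page))
print('num_ofstart =', num_ofstart)
print('num_ofend   =', num_ofend)

url = new_page[(num_ofstart):(num_ofend)]
print('url         =', repr(url))
```

If either index prints as -1, the corresponding find() call did not find its search string, which is the first thing to investigate.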
 
In addition to what @NascentOxygen said, you have a problem with these two lines of code:
Python:
num_ofstart=new_page.find(' " ')
.
.
.
num_ofend=new_page1.find(' " ')
In each case, the character you should be searching for is ". What you are actually doing is searching for <space>"<space>. In other words, in the argument to the find() function, you have extra space characters before and after the double-quote. The string you're searching in doesn't contain a substring of <space>"<space>, so both calls to find() are returning -1.
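
Here is a short sketch of the difference, assuming the same page string as in the original post. One caveat worth checking: with the spaces in place both calls return -1 and the final slice is empty, whereas the reported output "http://ud is what the same code produces when the search string is just ". In that case a second problem remains: num_ofend is an index into new_page1, but the slice is taken from new_page, and it starts on the quote itself. The last step below is only one possible fix, not necessarily what the course intends.

```python
page = ('<div id="top_bin"><div id="top_content" class="width960">'
        '<div class="udacity float-left"><a href="http://udacity.com">')
new_page = page[page.find('<a href='):]

# find() matches its argument literally, spaces included,
# and returns -1 when the substring is absent.
with_spaces = new_page.find(' " ')
without_spaces = new_page.find('"')
print(with_spaces, without_spaces)

# Searching for '"' alone, but keeping the original slicing:
num_ofstart = new_page.find('"')
new_page1 = new_page[num_ofstart + 1:]
num_ofend = new_page1.find('"')            # index within new_page1, not new_page
partial = new_page[num_ofstart:num_ofend]  # mixes the two index bases
print(partial)

# One possible fix: slice new_page1, which already starts after the opening quote.
url = new_page1[:num_ofend]
print(url)
```

Because the search starts from the first '<a href=', the same steps give the same url for the alternative page = '<a href="http://udacity.com">Hello world</a>' mentioned in the assignment comments.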

One more thing - when you post code, especially Python code, surround your code with code tags.
What I did above looks like this:
[code=python]
num_ofstart=new_page.find(' " ')
.
.
.
num_ofend=new_page1.find(' " ')
[/code]
 
