Why Does My Python URL Extraction Code Not Work?

doktorwho · Nov 3, 2016

I am supposed to write code that prints out the URL of the link given below. The URL is defined to start where the first " appears and end where the last " appears, starting from start_link. It's actually the last project from lecture 1 in Udacity and my first code.. but it's wrong haha
Here it goes:
# Write Python code that assigns to the
# variable url a string that is the value
# of the first URL that appears in a link
# tag in the string page.
# Your code should print http://udacity.com
# Make sure that if page were changed to

# page = '<a href="http://udacity.com">Hello world</a>'

# that your code still assigns the same value to the variable 'url',
# and therefore still prints the same thing.

# page = contents of a web page
page =('<div id="top_bin"><div id="top_content" class="width960">'
'<div class="udacity float-left"><a href="http://udacity.com">')

start_link = page.find('<a href=')
new_page=page[start_link:]
num_ofstart=new_page.find(' " ')
new_page1=new_page[(num_ofstart+1):]
num_ofend=new_page1.find(' " ')
url=new_page[(num_ofstart):(num_ofend)]
print(url)
It prints out only "http://ud
What's wrong?

NascentOxygen · Nov 3, 2016

doktorwho said:

num_ofend=new_page1.find(' " ')
➡
url=new_page[(num_ofstart):(num_ofend)]

In between these two lines add code to print out the values of new_page, num_ofstart, and num_ofend to make sure that you and your code are operating in sync.
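A minimal sketch of that suggestion, using the code exactly as posted (spaces inside the search strings included). The repr() call is an addition here, used only to make the string boundaries visible:

```python
page = ('<div id="top_bin"><div id="top_content" class="width960">'
        '<div class="udacity float-left"><a href="http://udacity.com">')

start_link = page.find('<a href=')
new_page = page[start_link:]
num_ofstart = new_page.find(' " ')
new_page1 = new_page[(num_ofstart + 1):]
num_ofend = new_page1.find(' " ')

# Inspect the intermediate values before they are used for slicing.
print('new_page    =', repr(new_page))
print('num_ofstart =', num_ofstart)
print('num_ofend   =', num_ofend)
```

If either index prints as -1, the corresponding find() call did not locate its search string.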

Mark44 · Nov 3, 2016

In addition to what @NascentOxygen said, you have a problem with these two lines of code:

Python:

num_ofstart=new_page.find(' " ')
.
.
.
num_ofend=new_page1.find(' " ')

In each case, the character you should be searching for is ". What you are actually doing is searching for <space>"<space>. In other words, in the argument to the find() function, you have extra space characters before and after the double-quote. The string you're searching in doesn't contain a substring of <space>"<space>, so both calls to find() are returning -1.
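A short illustration of the difference, assuming a stand-in string named tag (a hypothetical name) that holds just the link tag from the page:

```python
# Hypothetical stand-in for the part of the page that holds the link tag.
tag = '<a href="http://udacity.com">'

with_spaces = tag.find(' " ')  # space, double-quote, space: not present in tag
quote_only = tag.find('"')     # the double-quote character alone

print(with_spaces)  # -1, because find() returns -1 when nothing matches
print(quote_only)   # 8, the index of the first double-quote
```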

One more thing - when you post code, especially Python code, surround your code with code tags.
What I did above looks like this:

[code=python]
num_ofstart=new_page.find(' " ')
.
.
.
num_ofend=new_page1.find(' " ')
[/code]

Mark44 · Nov 3, 2016

When you're writing code, an important skill to develop is learning how to use a debugger. Python has a built-in debugger module, pdb. It's somewhat primitive, but it's useful enough. I wrote a couple of Insights articles on this debugger last year.
https://www.physicsforums.com/insights/simple-python-debugging-pdb-part-1/
https://www.physicsforums.com/insights/simple-python-debugging-pdb-part-2/

Using this debugger I was able to get your code working.
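The working version was not posted in the thread, so the following is only one possible sketch, not necessarily what was arrived at above. It searches for the double-quote without surrounding spaces and measures both quote positions in the same string. That second point matters: the reported output "http://ud appears consistent with searching for " alone while slicing new_page with num_ofend, an index that was measured inside new_page1. The names start_quote and end_quote are hypothetical.

```python
page = ('<div id="top_bin"><div id="top_content" class="width960">'
        '<div class="udacity float-left"><a href="http://udacity.com">')

start_link = page.find('<a href=')
# Opening quote: the first '"' at or after the start of the link tag.
start_quote = page.find('"', start_link)
# Closing quote: the next '"' after the opening one.
end_quote = page.find('"', start_quote + 1)
# Take the text between the two quotes, excluding the quotes themselves.
url = page[start_quote + 1:end_quote]
print(url)
```

Because both indices refer to positions in page, no offset correction is needed, and the same code handles the alternative page = '<a href="http://udacity.com">Hello world</a>' from the exercise.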
