Converting Relative URLs to Absolute URLs

  • Thread starter: Simfish
  • Tags: Absolute, Relative
SUMMARY

This discussion focuses on converting relative URLs to absolute URLs for web pages downloaded using tools like HTTrack or DownThemAll. Users face challenges when saving temporary PHP search pages, as the relative URLs do not point to accessible locations on an external server. The conversation highlights the need for a program that can parse HTML pages to convert relative URLs into absolute ones, with suggestions for programming languages such as C, Perl, and Python for custom solutions. Additionally, it mentions the importance of using mirroring programs that automatically handle URL conversions.

PREREQUISITES
  • Understanding of HTML structure and URL formats
  • Familiarity with web scraping tools like HTTrack and DownThemAll
  • Basic knowledge of programming languages such as C, Perl, or Python
  • Experience with webpage mirroring techniques
NEXT STEPS
  • Research how to use HTTrack's advanced options for URL rewriting
  • Learn about Python libraries for HTML parsing, such as Beautiful Soup
  • Explore Perl modules for URL manipulation, like URI (the successor to the older URI::URL interface)
  • Investigate other mirroring programs that support automatic URL conversion
USEFUL FOR

Web developers, data scrapers, and anyone involved in web archiving or content migration who needs to convert relative URLs to absolute URLs effectively.

Simfish
I'd like to do this for every page within a folder.

I downloaded a number of pages containing links to other pages that I want to download with a utility such as HTTrack or DownThemAll. The problem is that the links in those pages are all relative. I have to save the pages to an external server, since they are temporary PHP search pages that HTTrack could not mirror, and once they are there the relative links resolve against that server and point to pages on it that I cannot access.

For example, a link that should resolve to

http://the-scholars.com/viewtopic.php?t=10151

instead resolves to

http://students.washington.edu/achen89/kong/viewtopic.php?t=10151 (the location where I saved the search page).

So is there a program that converts all relative URLs in an HTML page to absolute URLs, so that I can then run HTTrack or DownThemAll on the saved page and mirror every link in it? Does the program have to be written in a particular language (C, Perl, Python)? I searched for one and found a script at perlmonks.com, but I couldn't turn it into a working .exe file (I have no experience compiling Perl).
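For what it's worth, the conversion asked about here can be sketched in a few lines of Python using only the standard library. This is a rough sketch, not a finished tool: the function names `absolutize` and `absolutize_folder` are made up for illustration, the base URL http://the-scholars.com/ is assumed to be the address the pages originally came from, and the files are assumed to be UTF-8. It uses a regular expression rather than a real HTML parser, so it can also touch `href=`/`src=` text that appears outside of tags (for instance inside scripts).

```python
import re
from pathlib import Path
from urllib.parse import urljoin

# Matches href="...", src='...' or action="..." (quoted values only).
# The lookbehind avoids rewriting attributes such as data-href.
_ATTR_RE = re.compile(
    r'''(?<![\w-])(?P<attr>href|src|action)\s*=\s*(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE | re.DOTALL,
)


def absolutize(page: str, base_url: str) -> str:
    """Return the HTML text with relative URLs resolved against base_url."""

    def _fix(match: re.Match) -> str:
        quote = match.group("q")
        # urljoin leaves URLs that are already absolute unchanged.
        absolute = urljoin(base_url, match.group("url"))
        return f'{match.group("attr")}={quote}{absolute}{quote}'

    return _ATTR_RE.sub(_fix, page)


def absolutize_folder(folder: str, base_url: str, pattern: str = "*.html") -> None:
    """Rewrite every file in folder that matches pattern, in place."""
    for path in Path(folder).glob(pattern):
        text = path.read_text(encoding="utf-8")  # assumption: UTF-8 files
        path.write_text(absolutize(text, base_url), encoding="utf-8")


if __name__ == "__main__":
    sample = '<a href="viewtopic.php?t=10151">topic</a>'
    print(absolutize(sample, "http://the-scholars.com/"))
```

Since Python is interpreted, there is nothing to compile into an .exe; the script runs directly with the Python interpreter. Calling `absolutize_folder` on the folder of saved pages rewrites them in place, so it is safer to try it on a copy first.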
 
On a side note, an alternative would be to "trick" the mirroring tool into treating http://the-scholars.com as the base directory of the URLs.
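One way this "trick" might be done is with the standard HTML `<base href="...">` element, which tells whatever reads the page to resolve relative links against a different address. The sketch below inserts such a tag right after the opening `<head>` tag of a saved page. The function name `add_base_tag` is made up for illustration, and it is an assumption that the mirroring tool in use honors `<base>` the way browsers do; that is worth checking on a single page before relying on it.

```python
import html
import re

# Opening <head> tag, with or without attributes (but not <header>).
_HEAD_RE = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def add_base_tag(page: str, base_url: str) -> str:
    """Insert <base href=...> after <head>, or prepend it if there is no <head>."""
    tag = f'<base href="{html.escape(base_url, quote=True)}">'
    new_page, count = _HEAD_RE.subn(lambda m: m.group(0) + tag, page, count=1)
    return new_page if count else tag + page


if __name__ == "__main__":
    saved = '<html><head><title>Search</title></head><body></body></html>'
    print(add_base_tag(saved, "http://the-scholars.com/"))
```

Compared with rewriting every link, this changes only one place per page and leaves the original relative URLs untouched.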
 
Look for webpage mirroring programs that support this. Any good mirroring program will do this automatically though.

I used to know of a good one that did this automatically, but I've long since forgotten about it. If I run across it somewhere, I'll let you know.
 