Is there any way to web-scrape a website that's down?

  • Context: Python 
  • Thread starter: Eclair_de_XII

Discussion Overview

The discussion revolves around the challenges and methods of web-scraping a website that is currently down, specifically focusing on the tkinter documentation site. Participants explore various alternatives, including archived versions of the site and other resources for tkinter information.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant notes that while archive sites have captured the main site and page directory, they have failed to capture specific pages related to tkinter objects.
  • Another participant asserts that if a site is down, it cannot be scraped directly, but suggests using the Internet Archive's Wayback Machine as a potential solution.
  • A participant shares a specific link to the Wayback Machine for the effbot site, indicating that while archives exist, they may not contain the desired instruction pages.
  • Several links to alternative resources and documentation for tkinter are provided by participants, indicating the existence of other valuable materials.
  • One participant clarifies that Flask is not a GUI library, which may be relevant to the discussion of tkinter alternatives.
  • A participant mentions finding a working mirror of the site through a Google search, expressing a sense of relief at discovering this resource.
  • There is a question raised about whether a shared link is a mirror or a Google cache, indicating some uncertainty about the nature of the resource.
  • Another participant suggests that the recovered site is a scrape from the Wayback Machine, but this remains unverified.

Areas of Agreement / Disagreement

Participants express differing views on the effectiveness of various methods for accessing downed websites, with some advocating for the Wayback Machine while others highlight the limitations of archived pages. The discussion remains unresolved regarding the best approach to retrieve specific content from the downed site.

Contextual Notes

Participants note limitations in the availability of specific pages on archive sites and the potential confusion between mirrors and cached versions of the site. There is also a lack of consensus on the reliability of the resources shared.

Eclair_de_XII
TL;DR
I used to go to effbot.org for documentation on tkinter. But now it seems to be down. Sometimes I thought about writing a web-scraping script to record all the pages explaining the widgets and what-not of tkinter, but I'm wondering if that is even possible. I cannot even access the pages normally.
I tried Google-searching the site and found several archive sites. Each one has archived the main site and the page directory, yes. But every single one seems to have failed to capture the pages on the tkinter objects. I confess that I had taken the site for granted. I'm aware of other tkinter documentation sites on the internet, and I am also aware that other GUI modules exist, like Flask; one user on here mentioned it to me once. All the same, I found effbot the most valuable for tkinter documentation.
 
Short answer: no. If you can't see it, how can you scrape it?

There is another way, though. Try the Internet Archive's Wayback Machine. They may have taken a snapshot of the site.

https://web.archive.org
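
To illustrate the Wayback Machine suggestion, here is a minimal Python sketch that asks the Internet Archive's availability endpoint for the snapshot closest to a given date. The helper names (`build_availability_url`, `closest_snapshot`, `lookup`) are made up for this example, and the sketch assumes the endpoint returns JSON with an `archived_snapshots` / `closest` entry. It only reports whether some snapshot exists; it does not guarantee that the archived page has the content you want.

```python
import json
import urllib.parse
import urllib.request

# Wayback Machine availability endpoint (returns JSON).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is optional, formatted YYYYMMDD[hhmmss]."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response):
    """Return the closest snapshot URL from a decoded response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(page_url, timestamp=None):
    """Query the endpoint over the network (hypothetical helper)."""
    query = build_availability_url(page_url, timestamp)
    with urllib.request.urlopen(query, timeout=30) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    # Prints a snapshot URL, or None if nothing was archived.
    print(lookup("effbot.org/tkinterbook/", "20200801"))
```

If `lookup` returns a URL, that snapshot can be fetched and parsed like any live page.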
 
https://web.archive.org/web/20200801000000*/effbot.org

I've found plenty of archives of the site, but the ones I have checked do not seem to have the instruction pages available. Frankly, it would be a bit of a hassle to check every single one; I'm considering using a web-scraping script to search for a working link. As mentioned earlier, the web archive seems to have the page directories but not the pages themselves. For example:

https://web.archive.org/web/20200703091947/http://effbot.org/tkinterbook
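
Rather than opening every archive by hand, a script along the following lines could list which pages under the directory were actually captured. This is only a sketch: it assumes the Wayback Machine's CDX search endpoint with `output=json`, where the first row of the response is a header naming the columns. The function names and the sample row in the usage note are illustrative, not taken from the thread. It may well return nothing if the pages were never captured with a 200 status.

```python
import json
import urllib.parse
import urllib.request

# Wayback Machine CDX search endpoint (lists individual captures).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(prefix):
    """Build a query for every capture whose URL starts with prefix."""
    params = {
        "url": prefix,
        "matchType": "prefix",       # match everything under the prefix
        "output": "json",
        "fl": "timestamp,original",  # only the fields needed below
        "filter": "statuscode:200",  # skip redirects and error captures
        "collapse": "urlkey",        # at most one capture per distinct URL
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def rows_to_snapshot_urls(rows):
    """Turn CDX JSON rows (header row first) into Wayback snapshot URLs."""
    if not rows:
        return []
    header, *records = rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in records]


def list_captures(prefix):
    """Query the CDX endpoint over the network (hypothetical helper)."""
    with urllib.request.urlopen(build_cdx_url(prefix), timeout=60) as resp:
        body = resp.read().decode("utf-8")
    return rows_to_snapshot_urls(json.loads(body) if body.strip() else [])


if __name__ == "__main__":
    for snapshot in list_captures("effbot.org/tkinterbook/"):
        print(snapshot)
```

For example, a response row such as `["20200703091947", "http://effbot.org/tkinterbook/button.htm"]` (illustrative data) would become `https://web.archive.org/web/20200703091947/http://effbot.org/tkinterbook/button.htm`. Each URL found this way still has to be opened to confirm that the archived page contains the real content and not an error page.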
 
Is that a mirror or a Google cache of the site (aka snapshot)?
 
jedishrfu said:
Is that a mirror or a Google cache of the site (aka snapshot)?
According to the message on the site it is a scrape from Wayback Machine.
 