Replace broken links with archived sources

AI Thread Summary
Replacing broken links in physics forums with archived sources from archive.org is feasible but presents significant challenges due to the scale of nearly 700,000 threads. Automating the process could lead to complications, such as false positives from servers that respond atypically, and concerns about overwhelming archive.org with requests. While a bot could potentially check for archived versions and replace broken links, it would require careful programming to avoid issues with snapshot dates and server limitations. An alternative suggestion is to provide users with a direct link to archive.org for manual retrieval of archived pages. Overall, while the idea has merit, practical implementation remains complex and resource-intensive.
Manasan3010
I've seen some broken external links on Physics Forums whose text has been changed to "BROKEN" by moderators. Is replacing the broken links with archived links (e.g. from archive.org) a bad idea?
 
No, it's not a bad idea, but it's a matter of scale. We have nearly 700k threads. We'd need an army to go through them, check the links, and replace the broken ones with archived links.
 
Greg Bernhardt said:
No, it's not a bad idea, but it's a matter of scale. We have nearly 700k threads. We'd need an army to go through them, check the links, and replace the broken ones with archived links.
Is there a bot changing the text of broken links to "broken"? If so, can't you make the bot check the availability of the link on archive.org through their API and route the link to the archived version?
 
Manasan3010 said:
Is there a bot changing the text of broken links to "broken"? If so, can't you make the bot check the availability of the broken link's page and route the link to the archived version?
That was automated, but a one-time thing. It's my understanding that archive.org doesn't archive everything and is organized by snapshot date. How would a bot know what date a page was archived on, if it was archived at all? Sure, it's likely programmatically possible, but it's a lot of work and we'd likely be blocked after sending archive.org hundreds of thousands of requests.
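On the snapshot-date question, a rough Python sketch may help. It assumes the Wayback Machine availability endpoint (`https://archive.org/wayback/available`), which takes a `url` and an optional `timestamp` and reports the closest snapshot, so the bot would pass the date of the forum post rather than know the archive date in advance. The function names and the one-second `delay` are illustrative assumptions, not anything PF runs, and the delay is a guess at politeness, not a documented rate limit.

```python
import json
import time
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the availability request URL for `url`.

    `timestamp` is an optional YYYYMMDD[hhmmss] string, e.g. the date of the
    forum post, so the snapshot closest to that date is requested.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urllib.parse.urlencode(params)


def closest_snapshot(payload):
    """Return (snapshot_url, timestamp) from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None  # the page was never archived, or the snapshot is unusable
    return closest["url"], closest["timestamp"]


def lookup(url, timestamp=None, delay=1.0):
    """Query the endpoint for one URL, sleeping first to keep the request rate low."""
    time.sleep(delay)  # assumed throttle; tune to whatever archive.org tolerates
    with urllib.request.urlopen(availability_query(url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `lookup("example.com", "20190101")` would return a snapshot URL and its timestamp, or `None` when nothing was archived. Even throttled, hundreds of thousands of lookups would take days, so the scale concern stands.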

Also, during that first run there were false positives. Servers can respond with nonstandard responses and confuse our simple checker. It's not something I want to rely on doing all the time.
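One way to cut down on false positives is to mark a link as broken only when the server explicitly says the page is gone, and to treat everything else as uncertain. The sketch below is a hypothetical checker, not the one PF used. The choice of 404 and 410 as the only "broken" statuses and the `User-Agent` string are assumptions.

```python
import urllib.error
import urllib.request

DEAD = {404, 410}  # the server explicitly reports the page as missing or gone


def classify(status):
    """Map an HTTP status (or None for a network error) to a verdict."""
    if status is None:
        return "unknown"  # timeout, DNS failure, TLS error: retry later
    if 200 <= status < 400:
        return "alive"
    if status in DEAD:
        return "broken"
    return "unknown"  # 403, 429, 5xx and the like are often bot-blocking or outages


def fetch_status(url, timeout=15):
    """Return the HTTP status for `url`, or None if no response was received."""
    req = urllib.request.Request(url, headers={"User-Agent": "link-checker-sketch"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return None  # covers URLError, timeouts, and connection failures
```

Links that come back "unknown" would be rechecked later or left for a human. This still misses pages that return 200 with an error message in the body, so it reduces false positives without eliminating wrong verdicts.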
 
Greg Bernhardt said:
That was automated, but a one-time thing. It's my understanding that archive.org doesn't archive everything and is organized by snapshot date. How would a bot know what date a page was archived on, if it was archived at all? Sure, it's likely programmatically possible, but it's a lot of work and we'd likely be blocked after sending archive.org hundreds of thousands of requests.
Maybe you don't need the date. You can make a bot that takes a link from PF and uses the search option in the Wayback Machine. If the search returns any results (i.e. not null), the bot copies the URL of the latest snapshot and places it in PF.
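The replacement step of such a bot could look roughly like this in Python. Both `is_broken` and `find_snapshot` are hypothetical callables supplied by the caller (for example, a link checker and a Wayback lookup that returns the latest snapshot URL or `None`). The URL regex is a deliberately simple assumption and would need adapting to PF's actual post markup.

```python
import re

# Simplistic URL matcher: stops at whitespace, quotes, and angle/square brackets.
URL_RE = re.compile(r"https?://[^\s<>\"\[\]]+")


def replace_broken_links(text, is_broken, find_snapshot):
    """Swap each broken URL in `text` for its archived copy, when one exists.

    `is_broken(url)` returns a bool; `find_snapshot(url)` returns the archived
    URL as a string, or None if the page was never archived.
    """
    def swap(match):
        raw = match.group(0)
        url = raw.rstrip(".,;:!?)")  # drop sentence punctuation stuck to the URL
        tail = raw[len(url):]
        if not is_broken(url):
            return raw
        # Leave the original link in place when no snapshot is found.
        return (find_snapshot(url) or url) + tail

    return URL_RE.sub(swap, text)
```

Passing the two callables in keeps the text rewriting separate from the network traffic, so the rewriting can be tested on sample posts without sending any requests to archive.org.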
 
Wrichik Basu said:
Maybe you don't need the date. You can make a bot that takes a link from PF and uses the search option in the Wayback Machine. If the search returns any results (i.e. not null), the bot copies the URL of the latest snapshot and places it in PF.

Let me know when it's ready :-p

The easiest solution: if a broken link is found, simply include a link to archive.org and users can do the rest :wink:
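That lighter option needs no requests to archive.org at all, since the link can be built from the original URL. A small Python sketch, assuming the Wayback Machine's `https://web.archive.org/web/*/<url>` address form, which lists the snapshots of a page. The function names and the output wording are made up for illustration.

```python
def wayback_search_link(url):
    """Return the Wayback Machine address that lists the snapshots of `url`."""
    return "https://web.archive.org/web/*/" + url


def annotate(url):
    """Render a broken link with a pointer to its archived copies next to it."""
    return f"{url} [BROKEN] (archived copies: {wayback_search_link(url)})"
```

The reader then picks a snapshot, so the bot never has to decide which date is the right one.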
 