Is there a way to mass-save posts?

  • Thread starter: Simfish

Discussion Overview

The discussion revolves around methods for mass-saving posts from online forums, particularly focusing on preserving content from specific URLs. Participants explore various tools and techniques for downloading forum threads and files without needing to save each item individually, addressing both technical challenges and limitations of existing solutions.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant inquires about methods to mass save posts from a specific forum, expressing concern about potential deletion of content.
  • Another participant suggests using download managers like DownThemAll and Spiderzilla, but notes limitations regarding how these tools handle forum URLs.
  • A question is raised about the applicability of these tools to databases, indicating uncertainty about their effectiveness in that context.
  • Participants discuss the challenges of saving large threads, mentioning the need for programming knowledge to automate the saving of multiple pages.
  • Concerns are expressed about download managers renaming URLs in a way that complicates access to saved threads, suggesting a need for better link management.
  • One participant shares a positive experience with Spiderzilla after adjusting its settings, while also noting issues with sites requiring login credentials.
  • Another participant mentions that Httrack can download sites while preserving the original link structure, but acknowledges that some links may still be missed.
  • There is a suggestion to create a public database from saved content, though concerns about searchability are raised.
  • A side note discusses the use of Google Desktop in conjunction with these tools, highlighting its limitations in searching deep web content.

Areas of Agreement / Disagreement

Participants express a range of views on the effectiveness of different tools and methods for mass-saving posts, with no consensus on a single best approach. Some participants find certain tools useful, while others highlight significant limitations and challenges.

Contextual Notes

Limitations include the potential for download managers to rename URLs, issues with accessing content behind logins, and the variability in effectiveness of different tools based on user settings and website structures.

Who May Find This Useful

Individuals interested in preserving online forum content, developers looking for solutions to automate data saving, and users of download management tools may find this discussion relevant.

Simfish
Gold Member
Without going to each of them individually?

So I want to preserve memories from my past, and mass save posts from http://aok.heavengames.com/cgi-bin/aokcgi/display.cgi?action=t&fn=3 . archive.org has stopped collecting data from it, and I fear that the forums may be deleted to save server space when they become inactive. Is there a way to do it? [in lieu of asking admins?]

Also, is there a way to mass save files in a directory?
http://www.acm.caltech.edu/~niles/acm95b/

Like that? Without individually right-clicking save as, etc?

And is there a way to see a directory like that when the webpage already has an index.html?
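For the directory case, here's a minimal sketch in Python of what tools like DownThemAll do: fetch the index page, collect every link, and download each target. The Caltech URL is the one from the question; the helper names and the output folder are my own, and the link filtering is deliberately naive (it grabs everything, including parent-directory links).

```python
# Sketch: bulk-download every file linked from a directory-style index page.
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base_url):
    """Return absolute URLs for every link found in `html`."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def save_all(index_url, dest="downloads"):
    """Download every linked file into `dest` (no filtering, no recursion)."""
    os.makedirs(dest, exist_ok=True)
    html = urllib.request.urlopen(index_url).read().decode("utf-8", "replace")
    for url in extract_links(html, index_url):
        filename = url.rstrip("/").rsplit("/", 1)[-1] or "index.html"
        urllib.request.urlretrieve(url, os.path.join(dest, filename))

if __name__ == "__main__":
    save_all("http://www.acm.caltech.edu/~niles/acm95b/")
```

This only goes one level deep, which is often exactly what you want for a file listing; recursive tools like Spiderzilla/httrack generalize the same idea.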
 
does that work on databases?
 
Wow, that's so awesome! Though a lot of the posts I saved came out in the form display.cgi_xxx.html (since display.cgi.html turned out to be the name given to each post).
 
What about massive threads like this though?
https://www.physicsforums.com/showthread.php?t=304
It only links to a small number of the pages in the entire thread. On the other hand, it's entirely possible to write a program that mass-saves the URLs with a for (x = 1; x <= n; x++) loop, as in...

https://www.physicsforums.com/showthread.php?t=304&page=2
https://www.physicsforums.com/showthread.php?t=304&page=3
...
https://www.physicsforums.com/showthread.php?t=304&page=13

But I don't know how to write such a program, though such an interface may already exist elsewhere.
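Such a program is short. Here is a sketch of the loop described above: the thread id (304) and page count (13) come from the post, while the function names and the local filename pattern are my own. Note that page 1 is the bare thread URL and only pages 2 and up carry the &page=x suffix, matching the URLs listed above.

```python
# Sketch: save every page of a paginated forum thread with a simple loop.
import urllib.request

def thread_page_urls(thread_id, last_page):
    """Page 1 is the bare thread URL; pages 2..last_page add &page=x."""
    base = "https://www.physicsforums.com/showthread.php?t=%d" % thread_id
    urls = [base]
    for x in range(2, last_page + 1):
        urls.append("%s&page=%d" % (base, x))
    return urls

def save_thread(thread_id, last_page):
    """Fetch each page and write it to a numbered local HTML file."""
    for page, url in enumerate(thread_page_urls(thread_id, last_page), start=1):
        data = urllib.request.urlopen(url).read()
        with open("thread_%d_page_%02d.html" % (thread_id, page), "wb") as f:
            f.write(data)

if __name__ == "__main__":
    save_thread(304, 13)
```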
 
Okay, the problem with applying such download managers to forums is that forum thread URLs all share one filename, like "https://www.physicsforums.com/showthread.php?t=158733" - so the download manager starts renaming showthread.php into showthread(1).php, showthread(2).php, etc. Then I can't access the threads as links from a central forum directory.

The solution would be a download manager that renames the saved files and also rewrites the main page so that its links stay in one-to-one correspondence with the renamed files. As it stands, the files get renamed but the links on the main page are never updated to match.
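The rewriting step itself is simple, assuming you record the original-URL-to-renamed-file mapping while downloading (no download manager I know of exposes this directly; the function below is a hypothetical post-processing sketch):

```python
# Sketch: rewrite an index page so each original thread URL points at the
# renamed local copy, restoring the one-to-one correspondence.
def rewrite_index(index_html, url_to_file):
    """Replace each original URL in `index_html` with its local filename."""
    # Replace longer URLs first, so a URL that is a prefix of another
    # (e.g. t=304 vs t=3040) is never clobbered by the shorter match.
    for url in sorted(url_to_file, key=len, reverse=True):
        index_html = index_html.replace(url, url_to_file[url])
    return index_html
```

For example, with the mapping {"showthread.php?t=158733": "showthread(1).php"}, every link to that thread on the saved main page would be redirected to the local showthread(1).php file.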

Spiderzilla addresses the problem, but only after downloading huge amounts of irrelevant data that goes beyond the mere links. Maybe there is an option to save links only up to one link deep that I should try...

hmm
http://forums.sjgames.com/showthread.php?t=24599&page=1&pp=10
 
YAY Spiderzilla is awesome once you set the mirroring depth to 2! The problem is though - what of websites that you need a log-in and password for?

http://www.httrack.com/html/step9_opt2.html

EDIT: Okay, httrack allows you to download sites using your browser's cookie settings. And because it arranges the downloaded site by the original site's relative link structure, URLs ending in showthread.php are saved as distinct files. That will work for forum CGI scripts at the very least.
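The same cookie trick works if you script the download yourself: copy the session cookie from the browser and send it with each request. The cookie name below (bbsessionhash) is only a placeholder for whatever your forum's login cookie is actually called.

```python
# Sketch: fetch a login-protected page by reusing a browser session cookie,
# the same idea as httrack's cookie option.
import urllib.request

def cookie_request(url, cookie):
    """Build a request that carries a Cookie header copied from the browser."""
    return urllib.request.Request(url, headers={"Cookie": cookie})

def fetch_with_cookie(url, cookie):
    """Fetch `url` as if logged in, using the pasted session cookie."""
    return urllib.request.urlopen(cookie_request(url, cookie)).read()

if __name__ == "__main__":
    page = fetch_with_cookie(
        "http://forums.sjgames.com/showthread.php?t=24599",
        "bbsessionhash=PASTE_VALUE_FROM_BROWSER")
```

Session cookies expire, so this only works while the browser login is still valid.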

It's not perfect though - a few links get skipped for some reason. When I right-click and open Page Info, then Links, in Firefox, a small minority of links still point to the original URLs [even with the "scan for duplicate URLs" feature disabled].

Related: http://forum.httrack.com/readmsg/8721/index.html
http://forum.httrack.com/search/index.html?js=1&lang=en&what=cgi&rule=EXACT

[at least I'm posting this just in case others google something similar - PF is listed pretty high on google]
 
The idea is great. Have you thought about creating a public database out of it? It would be like a complete FAQ section. But, again, searching for the right answer would be difficult.
On a side note - this is useful when used in conjunction with Google Desktop (Google can't search content in the deep web). Still, Google Desktop is known to skip over some folders for some reason (and I can't configure it to update itself).

I corrected the above-mentioned problem by using an older version of Httrack (the latest beta version has its bugs).

Httrack already has its forum FAQ. Personally, I find dynamic forum FAQs to be superior to static FAQs.
 
