How to download entire Forums in PF

  • Thread starter: Rainbows_
AI Thread Summary
Users are seeking ways to download entire forum archives for offline reading, specifically for the Special and General Relativity sections. HTTrack is mentioned as a potential tool, but users report issues with it detecting empty mirrors and concerns about legality and server bandwidth. There are warnings against excessive downloading, as it can strain server resources and is generally discouraged by site administrators. Some users express a desire for a paid option to access archived content, while others suggest manually saving threads as a more responsible approach. The discussion highlights the tension between user needs for information preservation and the operational limitations of forum servers.
Rainbows_
I want to download the entire Special and General Relativity forum archives so I can read them offline and run searches, as there are so many gems inside. What software should I use to download them? Manually saving each thread would take too long. Thanks.
 
rootone said:
Something like this should do what you want.
https://www.httrack.com/

Have you done it successfully? It says "HTTrack has detected that the mirror is empty".

Isn't this illegal or discouraged by website owners? If it is, then let's move this to a private conversation. If anyone has successfully downloaded an entire forum, please send me a private message if you don't want to share it publicly. Thanks.
 
Here's the error log:

HTTrack3.49-2+htsswf+htsjava launched on Thu, 27 Jul 2017 10:48:10 at https://www.physicsforums.com +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar
(winhttrack -qYC2%Ps2u1%s%uN0%I0p3DaK0H0%kf2A25000%f#f -F "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->" -%l "en, *" -Y https://www.physicsforums.com -O1 "d:\My Web Sites\p6" +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar )
Information, Warnings and Errors reported for this mirror:
note: the hts-log.txt file, and hts-cache folder, may contain sensitive information,
such as username/password authentication for websites mirrored in this project
do not share these files/folders if you want these information to remain private
10:48:11 Warning: Moved Permanently for www.physicsforums.com/robots.txt
10:48:11 Warning: Redirected link is identical because of 'URL Hack' option: www.physicsforums.com/robots.txt and https://www.physicsforums.com/robots.txt
10:48:11 Warning: Warning moved treated for www.physicsforums.com/robots.txt (real one is https://www.physicsforums.com/robots.txt)
10:48:11 Warning: Moved Permanently for www.physicsforums.com/
10:48:11 Warning: Redirected link is identical because of 'URL Hack' option: www.physicsforums.com/ and https://www.physicsforums.com/
10:48:11 Warning: File has moved from www.physicsforums.com/ to https://www.physicsforums.com/
10:48:11 Warning: No data seems to have been transferred during this session! : restoring previous one!
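The first requests in the log above are for robots.txt, the file a site uses to tell crawlers what they may fetch. As a minimal sketch of how a crawler consults such rules, the following uses Python's standard urllib.robotparser. The rules and the example.org URLs are made up for illustration; they are not the actual physicsforums.com robots.txt, and nothing here explains why the mirror came back empty.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
# These are NOT the real physicsforums.com robots.txt.
EXAMPLE_RULES = """\
User-agent: HTTrack
Disallow: /

User-agent: *
Disallow: /search/
""".splitlines()


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the example rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(EXAMPLE_RULES)  # parse in-memory lines, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("HTTrack", "SomeBrowser"):
        for url in ("https://example.org/threads/1", "https://example.org/search/x"):
            print(agent, url, is_allowed(agent, url))
```

A well-behaved crawler runs a check like this before every request and skips anything the rules disallow.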
 
No, it's not illegal; in fact, the original internet encouraged that sort of thing.
I haven't tried it on this site, but I have on others.
You can't download the database of a site, but you can download all the HTML, JavaScript, images, etc.
Downloading the actual database of a site is not something most site admins would want to agree to.
 
Please don't do this. It can kill our bandwidth.
 
  • Like
Likes davenn and symbolipoint
Greg Bernhardt said:
Please don't do this. It can kill our bandwidth.

Ok. And I think your robots are guarding the place to avoid any downloading... anyway. Hehe...
 
  • Like
Likes Greg Bernhardt
rootone said:
No, it's not illegal; in fact, the original internet encouraged that sort of thing.
I haven't tried it on this site, but I have on others.
You can't download the database of a site, but you can download all the HTML, JavaScript, images, etc.
Downloading the actual database of a site is not something most site admins would want to agree to.

You mean that even on other websites with forums, you can't download the messages either?

I hope there will be an option, even a paid one, for retrieving the archive collection. Maybe Greg Bernhardt can offer this someday?
 
Rainbows_ said:
I hope there will be an option, even a paid one, for retrieving the archive collection. Maybe Greg Bernhardt can offer this someday?
What prevents you from staying online?
 
  • #10
Greg Bernhardt said:
What prevents you from staying online?

Just for backup, in case the entire database gets wiped out, for example by an EMP from North Korea or some other unexpected event (like a CME burst).
 
  • #11
Rainbows_ said:
Just for backup, in case the entire database gets wiped out, for example by an EMP from North Korea or some other unexpected event (like a CME burst).
If something like that happens you have more important things to think about than loading up your backup of PF ;)
 
  • #12
Greg Bernhardt said:
If something like that happens you have more important things to think about than loading up your backup of PF ;)

Or just a virus or hack that could destroy the database (don't you worry about that?). The contents are gems, and they could recreate 21st-century physics if we were sent back to, say, the time of Newton :)
 
  • #13
Rainbows_ said:
Or just a virus or hack that could destroy the database (don't you worry about that?). The contents are gems, and they could recreate 21st-century physics if we were sent back to, say, the time of Newton :)
Don't worry, I have backups :)
 
  • Like
Likes Rainbows_
  • #14
We have 17,400 threads in the special relativity section, many of them with multiple pages. Downloading their HTML view would be many gigabytes of traffic (or even more if the script would just follow every link). They wouldn't be very useful as backup either, because they don't have all the relevant data, and they have it in a format not useful for backups.
Rainbows_ said:
You mean even in other web sites with forums.. you can't download the messages too?
I don't think any forum likes a huge amount of unnecessary extra traffic.
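For a rough sense of the "many gigabytes" figure, here is a back-of-the-envelope sketch. Only the thread count comes from the post above; the average pages per thread and the size of one page view are assumptions chosen for illustration, so the result is an order-of-magnitude guess, not a measurement.

```python
# Back-of-the-envelope estimate of the traffic a full HTML mirror would cause.
# Only THREADS comes from the post above; the other two numbers are assumptions.

THREADS = 17_400          # threads in the special relativity section (from the post)
PAGES_PER_THREAD = 2.0    # assumed average ("many of them with multiple pages")
KB_PER_PAGE = 150.0       # assumed size of one HTML page view, in kilobytes


def mirror_traffic_gb(threads: int, pages_per_thread: float, kb_per_page: float) -> float:
    """Return the total transfer in gigabytes (1 GB = 1,000,000 kB)."""
    return threads * pages_per_thread * kb_per_page / 1_000_000


if __name__ == "__main__":
    total = mirror_traffic_gb(THREADS, PAGES_PER_THREAD, KB_PER_PAGE)
    print(f"Estimated traffic: {total:.1f} GB")
```

With these assumed values the estimate is about 5 GB for the HTML alone. Images, scripts, and a crawler that follows every link would push it higher.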
 
  • Like
Likes symbolipoint
  • #15
mfb said:
We have 17,400 threads in the special relativity section, many of them with multiple pages. Downloading their HTML view would be many gigabytes of traffic (or even more if the script would just follow every link). They wouldn't be very useful as backup either, because they don't have all the relevant data, and they have it in a format not useful for backups. I don't think any forum likes a huge amount of unnecessary extra traffic.

I think the following would be reasonable.

Is there any script or software that opens each thread one by one, the way a person would, and then saves every page? This is not only for Physics Forums but for countless other forum sites out there.
 
  • #16
You can manually open every thread and manually save it if you like. It will take you something like a week - just for the relativity section.
 
  • #17
Rainbows_ said:
I think the following would be reasonable.

Is there any script or software that opens each thread one by one, the way a person would, and then saves every page? This is not only for Physics Forums but for countless other forum sites out there.

Yes plenty of programs exist. I thought you agreed you would not do this? You would use up a good chunk of our bandwidth that we pay for.
 
  • Like
Likes Vanadium 50
  • #18
Greg Bernhardt said:
Yes plenty of programs exist. I thought you agreed you would not do this? You would use up a good chunk of our bandwidth that we pay for.

There is no software that can do this. That's why mfb suggested manually saving the threads one by one for a week.
 
  • #19
I didn't suggest it. I said it is possible, but a bad idea.
 
  • #20
I think the OP should first try to contribute something to the forum rather than seeing how much he can get from it.
 
  • Like
Likes S.G. Janssens
  • #21
Charles Link said:
I think the OP should first try to contribute something to the forum rather than seeing how much he can get from it.

Yup. Anyway just install a bandwidth limiter so it can block any similar attempts by others in the future. I'm not very good with computers, and others may be clever enough to do it. And it's OK if this thread is deleted to avoid encouraging others. Thanks.
 
  • #22
Greg Bernhardt said:
Yes plenty of programs exist. I thought you agreed you would not do this? You would use up a good chunk of our bandwidth that we pay for.

By the way, just out of curiosity: do you have a fixed bandwidth allocation per month, say 3 gigabytes for all access, which forum retrieval software could exceed? Or is the allocation unlimited, and your only concern is that the site becomes very slow while people are downloading forums? In an age where 20 Mbps fiber connections exist, we can download gigabytes in less than 10 minutes, and if this happens at midnight, when most members are asleep, the effect won't be felt.

Well. Just asking. I believe in karma and I don't want you to shoulder additional cost (or lose money) for an excellent service.

(I thought this thread would be deleted.. but it's ok too if this thread would be visible only to the participants (of this thread) or become a private conversation due to some classified data within).
 
  • #23
Most websites other than giant corporations exist on what are called server farms.
I am pretty sure that is the case with PF.
The site owner pays a monthly or something fee to rent some of that server capacity.
There isn't any politics about it, you pay the server farm for a service, and they supply it,
(unless the site breaks rules of the server farm, like porn for instance in a lot of cases, or criminal activity)
Site admins do of course have rules for their own site, but on PF I only have seen threads deleted because of crackpot nonsense.
 
  • #24
Rainbows_ said:
By the way, just out of curiosity: do you have a fixed bandwidth allocation per month, say 3 gigabytes for all access, which forum retrieval software could exceed? Or is the allocation unlimited, and your only concern is that the site becomes very slow while people are downloading forums? In an age where 20 Mbps fiber connections exist, we can download gigabytes in less than 10 minutes, and if this happens at midnight, when most members are asleep, the effect won't be felt.
It's not about Mbps but about the total bandwidth served.
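The distinction between speed and total data served can be put in numbers. This is only a sketch: the 20 Mbps link speed is the figure from the quoted post, while the 5 GB archive size is an assumption made up for illustration.

```python
# Link speed versus total data served.
# 20 Mbps is the figure from the quoted post; ARCHIVE_GB is an assumption.

LINK_MBPS = 20.0    # megabits per second (from the quoted post)
ARCHIVE_GB = 5.0    # assumed size of the mirrored pages, in gigabytes


def download_minutes(size_gb: float, link_mbps: float) -> float:
    """Minutes needed to move size_gb gigabytes over a link_mbps link."""
    megabits = size_gb * 1000 * 8    # 1 GB = 1000 MB, 1 byte = 8 bits
    return megabits / link_mbps / 60


def gb_in_minutes(minutes: float, link_mbps: float) -> float:
    """Gigabytes that a link_mbps link can move in the given minutes."""
    return minutes * 60 * link_mbps / 8 / 1000


if __name__ == "__main__":
    print(f"10 minutes at 20 Mbps moves {gb_in_minutes(10, LINK_MBPS):.1f} GB")
    print(f"{ARCHIVE_GB:.0f} GB at 20 Mbps takes {download_minutes(ARCHIVE_GB, LINK_MBPS):.0f} minutes")
```

A faster link, or a download at midnight, only changes how long the transfer takes. The server still sends the same number of gigabytes, and that total is what counts against a hosting plan.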
 
  • #25
Rainbows_ said:
Anyway just install a bandwidth limiter...
From what I've seen, I think a bandwidth limiter is already installed.... :-p . :biggrin:

Rainbows1.jpg
 
  • Like
Likes dlgoff and Greg Bernhardt
  • #26
I thought you might...
Greg Bernhardt likes this.
I mean... I hoped you might.... :nb)
 
  • #27
Bandwidth edit 2.jpg


Hey, c'mon guys... that isn't funny.... :frown:
 
  • #28
I'd no longer save the entire website, if that is even possible, because I don't want Greg to lose money.

I just want to save all the messages of Arnold Neumaier, because he is the most brilliant and talented person on the net. The way he writes, and his mathematical equations, don't seem to have been written (or thought up) by a mere human, but rather by a harbinger of a new breed of human, and I think he could be a Nobel Prize recipient someday. So I'll just save each of his messages, though a script to browse the site and search for and save only Neumaier's messages would be helpful.
 
