Coin said:
Is the CGI talking to a database server, like mysql or something?
The CGIs are doing very little. They make a connection (through an internal firewall) to a Perl server on a separate machine. We don't believe the internal firewall is overloaded, and we know the Perl server(s) similarly aren't overloaded. We also know the Apache "MaxClients" isn't being hit (or whatever it's called in Apache-- I'm forgetting the exact name).
Coin said:
Can you replicate the bad behavior on demand, say by simulating heavy load? If so, does the problem go away if you temporarily switch to HTTP?
We can't replicate the problem. We've tried various browsers (clients with problems have been using MSIE 6,7,8, which we've tested), both inside and outside our network. Some clients complain about it, but most don't.
We can't really experiment, which is unfortunate-- although we're trying to coordinate with clients to figure it out.
Coin said:
What you are describing sounds like a pretty normal LAMP problem. Probably you are running out of some resource such as RAM or database client connections.
If it were a resource issue, I would expect that our internal staff (and other clients) would have similar problems during heavy loads. But it seems that particular clients are repeatedly having issues. It's very unclear at the moment, since details from customers are sketchy. A few of them HAVE suddenly "gotten better", but some are constantly experiencing issues.
We've been running the site for roughly 12 years, and we've had resource issues before that we could track, but this one seems very different. Recently (on the 11th), we changed:
1) The domain name. Shouldn't be a big deal.
2) The certificate. We now use TrustWave rather than Verisign. It's also a 256-byte key rather than a 128-byte key.
3) Some of the page references. It used to be that ALL the content was loaded from a single domain. Now, some of it is loaded from a SEPARATE domain (although it's all on the same actual webserver).
There are other things behind the scenes that we've changed (like which MySQL database we use, etc), but we can verify that those AREN'T causing the problem. The problem is visible before the MySQL connections are even established or used, and before any actual processing is done-- by the time the connection is received, the problem's already happening.
Honestly, I don't think any of the things that have changed ought to be causing a slowdown-- or, if they do, it shouldn't be taking 30+ seconds. Maybe a fraction of a second for the larger encryption, or some sort of strange browser config that raises security flags that we've got 2 different domains in the same page. So, I'm grasping at straws.
I will say that we've had various unexpected problems with TrustWave certificates. Some clients don't recognize them by default, and some automated software (Java mostly) similarly didn't recognize them and failed outright on page requests to us. Hence my distrust of the certificate, and my suspicion that it could be HTTPS-related. And given that I don't really understand the details of HTTPS, I'm curious what steps are involved so we can identify them.
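As a starting point for identifying those steps, here is a minimal sketch (Python, standard library only) that times the DNS lookup, TCP connect, TLS handshake, and first response byte of one HTTPS request separately. The function names and the host name in the usage note are placeholders, not anything from our setup, and a script like this will not reproduce whatever extra certificate checks a particular browser performs, so at best it narrows down which phase is slow from a given network.

```python
import socket
import ssl
import time


def time_https_phases(host, port=443, path="/", timeout=60.0):
    """Return elapsed seconds for each phase of a single HTTPS request."""
    phases = {}

    # Phase 1: name resolution.
    t0 = time.perf_counter()
    addr_info = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    phases["dns"] = time.perf_counter() - t0
    family, socktype, proto, _canon, sockaddr = addr_info[0]

    raw = socket.socket(family, socktype, proto)
    raw.settimeout(timeout)
    try:
        # Phase 2: plain TCP connection (no encryption yet).
        t0 = time.perf_counter()
        raw.connect(sockaddr)
        phases["tcp_connect"] = time.perf_counter() - t0

        # Phase 3: TLS handshake, including certificate verification
        # against this machine's default trust store.
        context = ssl.create_default_context()
        t0 = time.perf_counter()
        tls = context.wrap_socket(raw, server_hostname=host)
        phases["tls_handshake"] = time.perf_counter() - t0

        try:
            # Phase 4: send a request and wait for the first response byte.
            request = (
                f"GET {path} HTTP/1.1\r\n"
                f"Host: {host}\r\n"
                "Connection: close\r\n\r\n"
            )
            t0 = time.perf_counter()
            tls.sendall(request.encode("ascii"))
            tls.recv(1)
            phases["first_byte"] = time.perf_counter() - t0
        finally:
            tls.close()
    finally:
        raw.close()
    return phases


def slowest_phase(phases):
    """Return the (name, seconds) pair of the slowest phase."""
    return max(phases.items(), key=lambda item: item[1])
```

Run from an affected client's network, something like `print(slowest_phase(time_https_phases("www.example.com")))` would show where the time goes, assuming the delay shows up outside the browser at all.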
For instance, from what I gather, Apache recognizes a particular IP/port combination for an HTTPS key. If our client has 2 computers on the same network, and they're going after our website, they'll come across as the same IP, but (I think?) different dynamic ports. However, they'll recycle them at some point-- IIRC, ports only go up to 2^16 or so? So, if one computer logs on and gets one HTTPS key at 13:44 using port 12345, but then their buddy logs on at 13:48, and ALSO gets the SAME recycled port of 12345, then our Apache server can't tell the difference between them, and sends the incorrect HTTPS key. Now, I expect that their network guys do something to make sure this doesn't happen-- but I have no idea.
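For what it's worth, TCP port numbers are 16-bit values (0 to 65535), and a connection is identified by the full combination of source IP, source port, destination IP, and destination port, so two connections that are open at the same time cannot share all four. The sketch below (Python, loopback only; the function name is made up for illustration) opens two connections from the same IP to a local listener and returns the source address the server sees for each. It says nothing about how Apache stores TLS state, only about how simultaneous connections are told apart.

```python
import socket


def demo_distinct_connections():
    """Open two loopback connections and return the peers the server sees."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(2)
    server_addr = server.getsockname()

    # Two clients from the same IP, connected at the same time.
    clients = [socket.create_connection(server_addr) for _ in range(2)]
    accepted = [server.accept() for _ in clients]

    # Each accepted connection reports the client's (ip, source_port).
    peers = [peer for _conn, peer in accepted]

    for conn, _peer in accepted:
        conn.close()
    for client in clients:
        client.close()
    server.close()
    return peers


if __name__ == "__main__":
    for ip, port in demo_distinct_connections():
        print(ip, port)
```

A source port can only be handed out again after the earlier connection using it has gone away, which is why the server never sees two live connections with an identical IP/port pair.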
Similarly, how does Apache store these IP/port keys? Is there a limit to the number that it stores, or a timeout associated with them? Does it delete them on a client signal? (If so, what signal?) Could it be that since we are now hosting on an increased number of domains on the same webserver (same amount of TRAFFIC, mind you, just now diversified to multiple domains) that we're hitting some boundary on IP/port keys? Is the IP/port key stored differently depending on the domain that the client entered? (It darn well better, I guess!)
Anyway, I'm not really sure where to go at the moment-- the certificate seems like the most LIKELY candidate, but really, nothing I can think of OUGHT to be causing the problem. Each possibility I can come up with either doesn't fit, or doesn't seem likely in terms of causing a 30+ second delay. 30+ second delays are typically resource issues (waiting for an available slot or outright timing out), but there's nothing I can find on our end that would seem to indicate a resource issue. And of course, we've never observed any problems, and neither have most clients, so it makes me inclined to think it's a resource issue on THEIR end, but I still don't see how.
DaveE