A couple of things:
For any commercial product, typically there is no difference in the actual silicon or its design between commercial, industrial and military parts. The only difference is where the parts fell out (binned) at the end-of-line quality testing and what the product demand is for any given grade.
An example: when I worked for Intel we had some odd-ball distributions on the parts I worked on. To maximize profits, they invented a new "product grade" that had performance between commercial and industrial ratings because they were getting unusually high fall-outs in testing between these two. They charged extra above commercial but less than industrial for it.
Because you have to spend time and money on the more extensive testing for higher grades, a part that falls out "just before" qualifying for the "next higher" grade costs you money that you never recover (unless you play this trick) - you can sell the parts as the next lower grade because they still exceed that qualification, but you've invested as much as in the higher-grade parts. So they had something called "Express" class for a while, which was all those less-than-industrial fall-out parts.
For military parts that came from commercial part lines and are tested out, this is called QPL or Qualified Parts List testing. Basically if you can pass the "eye of the needle" of end-of-line testing, you have a mil spec part. This works for man-rated stuff, which is generally lower environmental stress (yes, 125C is low stress - we typically do reliability testing at 200C-300C), but it is generally not how you qualify for "Class S" (strategic/space rated) mil spec.
A classic example of how QPL can fail was a case I "FA-ed" years ago involving a design flaw in a system that accelerated the parts so much that every single one reached end-of-life as a result of the mil qualification testing combining with the flaw. It was discovered just before use but the only work around for aged parts was to never use them. Cost a pretty penny.
So instead you have to qualify the process akin to an ISO qualification (aka QML or Qualified Manufacturer List) - you have to prove you continuously control the processes at each step, not just that you can pass end-of-line testing. This is akin to the parametric process control that is pretty common on commercial lines but more stringent; there is typically more in-depth testing including using reliability testing as a process control parameter. This is where WLR (wafer level reliability) came from originally.
A lot of this parametric device test involves accelerated life testing both on-wafer (WLR) and packaged part (PLR). That's actually how we got our start at http://www.corewafer.com/ - we were on the teams involved in developing these standards and test methods (e.g. the JEDEC WLR test methods). We were involved in military class S design, quality and reliability testing. The Berlin Wall fell, we suddenly needed new jobs and the government sold us all the IP we'd been working with.
The confidence level doesn't improve with temperature. For hard failures, variance is generally worse with temperature because 1) the failures can become more non-centrally distributed (e.g. scale-free distributions have been used for end-of-life modeling with some success) and 2) the failure mechanism tends to change with greater acceleration.
Think of a piece of paper - a little heat, like in a desert, will accelerate aging slowly because the oxidation rate is gradually accelerated. Classic Arrhenius acceleration. Increase the temperature beyond, say, 451F (the kindling temperature of paper) and the aging and failure mechanism changes radically - you are burning. It is still oxidation, but it is not comparable to the slower process (though often correlatable), and it is now an "active", "feedback-driven" form of oxidation whose speed and time-to-completion (to failure) can vary radically with small parameter changes ("flap of a butterfly wing creates a hurricane").
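To put rough numbers on the "classic Arrhenius" regime, here is a minimal sketch of the standard Arrhenius acceleration factor, AF = exp((Ea/k) * (1/T_use - 1/T_stress)). The function name, the 0.7 eV activation energy and the 55C use temperature are illustrative assumptions on my part, not values from any particular qualification. The formula only holds while the failure mechanism stays the same; once you cross into "burning" territory it no longer applies.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(ea_ev, t_use_c, t_stress_c):
    """Acceleration factor of a stress temperature relative to a use temperature.

    ea_ev      -- activation energy in eV (assumed; depends on the mechanism)
    t_use_c    -- use temperature in degrees C
    t_stress_c -- stress temperature in degrees C
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


# Hypothetical inputs: Ea = 0.7 eV, 55C use condition.
for stress_c in (125, 200, 300):
    af = arrhenius_acceleration(0.7, 55, stress_c)
    print(f"{stress_c}C stress: acceleration factor ~ {af:.3g}")
```

The factor grows very quickly with temperature, which is exactly why it is tempting to push the stress higher, and why you have to check that you are still accelerating the same mechanism.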
Only adding more parts (and more points of stress acceleration) improves statistical confidence. This is part of why we have parallel test products - testing more parts in parallel gives you greater statistical significance in the same test-time interval and at different stress levels. You get better data both from the larger numbers and from having more curve-fit points for the life-time curve (each stress condition only provides a single point for the fit - which is an exponential fit minimally requiring at least 3 points over 3 orders of magnitude time).
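As a rough illustration of the life-time curve fit, the sketch below assumes temperature is the stress and that the model is Arrhenius, t = A * exp(Ea/(k*T)), so that ln(t) is a straight line in 1/(k*T). Each stress condition contributes exactly one point (for example a median time-to-failure), and the fit refuses to run with fewer than three. The function names and the numbers in the usage part are hypothetical; the "measured" lifetimes are generated from the model itself, not real data.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def fit_arrhenius(temps_c, lifetimes):
    """Least-squares fit of ln(lifetime) = ln(A) + Ea / (k*T).

    temps_c   -- one stress temperature (degrees C) per stress condition
    lifetimes -- one characteristic lifetime per stress condition
    Returns (ea_ev, ln_a).
    """
    if len(temps_c) != len(lifetimes):
        raise ValueError("need one lifetime per stress temperature")
    if len(temps_c) < 3:
        raise ValueError("need at least 3 stress conditions for the fit")
    x = [1.0 / (BOLTZMANN_EV_PER_K * (t + 273.15)) for t in temps_c]
    y = [math.log(life) for life in lifetimes]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ea_ev = sxy / sxx              # slope of the line is the activation energy
    ln_a = mean_y - ea_ev * mean_x  # intercept is ln of the pre-factor
    return ea_ev, ln_a


def predict_lifetime(ea_ev, ln_a, temp_c):
    """Extrapolate the fitted line to another temperature (same mechanism assumed)."""
    return math.exp(ln_a + ea_ev / (BOLTZMANN_EV_PER_K * (temp_c + 273.15)))


# Hypothetical usage: synthetic points generated from Ea = 0.7 eV, A = 1e-6 hours.
stress_temps_c = [200, 250, 300]
synthetic_lifetimes = [predict_lifetime(0.7, math.log(1e-6), t) for t in stress_temps_c]
ea_fit, ln_a_fit = fit_arrhenius(stress_temps_c, synthetic_lifetimes)
print(f"fitted Ea ~ {ea_fit:.3f} eV")
print(f"extrapolated lifetime at 55C ~ {predict_lifetime(ea_fit, ln_a_fit, 55):.3g} hours")
```

With real data each point carries its own scatter, so more parts per condition tightens each point and more stress conditions tightens the slope. That is the practical argument for testing in parallel.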