Computation Statistics - All times are Japan Standard Time (JST).

Start: 10:15 PM (JST), September 25, 2013

Unlike the grueling 371 days that were needed for our previous computation of 10 trillion digits, this time we were able to beat it in about 94 days. The computation was interrupted 5 times - all due to hard drive errors. Furthermore, a bug was discovered in the software which had to be fixed before the computation could be continued.

- 60.2 TiB of disk was needed to perform the computation.
- 9.20 TiB of disk was needed to store the compressed digit output.
- More disk space was needed to make backups of the checkpoints.

As with the previous computations, Shigeru Kondo built and maintained the hardware. This time he had a new machine based on the Intel Sandy Bridge processor line:

- 2 x Intel Xeon E5-2690 @ 2.9 GHz (16 physical cores, 32 hyperthreaded)
- 128 GB DDR3 @ 1600 MHz (8 x 16 GB, 8 channels)

And of course the obligatory pictures.

The math and overall methods were the same as our previous computations:

- The computation was done using the Chudnovsky formula.
- The verification was done using the BBP formulas.
- Modular hash checks were used to verify the final multiply and the radix conversion.

The hexadecimal digits of the main computation matched those of the BBP formula. The main computation was done using a new version of y-cruncher (version v0.6.2). The verification was done using the same BBP program from before.

Our previous computation of 10 trillion digits required 371 days. This time we were able to achieve 12.1 trillion digits in "only" 94 days. So it comes out to about a 5x improvement in computational speed - which is a lot for only 2 years. The reason for this was a combination of several things:

- Better hardware: 12-core Westmere -> 16-core Sandy Bridge.
- Much improved fault-tolerance in the software: 180 days -> 12 days lost to hardware failures.
- Numerous performance optimizations to the software.

Fault-tolerance is the most obvious improvement. A factor of 2 comes from just eliminating the impact of hard drive failures.

Y-cruncher has had checkpoint-restart capability since v0.5.4, where it was added specifically for our 5 trillion digit computation. This is the ability to save the state of the computation so that it can be resumed later if the computation gets interrupted. But this original implementation turned out to be horrifically insufficient for the 10 trillion digit run. The problem was that it only supported a linear computational model and thus restricted a computation to a small number of checkpoints.

The longer the computation, the longer the time between the checkpoints. In the 10 trillion digit computation, some of the checkpoints were spread out as much as 6 weeks apart. Meaning that, in order to make progress, the program had to run uninterrupted for up to 6 weeks at a time. While it's easy to keep a computer running for 6 weeks, the MTBF of the 24-hard-drive array was only about 2 weeks. And when the MTBF becomes shorter than the work unit, bad things happen: you basically get a downward spiral in efficiency. And this hit us hard - several of the largest work units required many attempts to get through. Ultimately, it was basically a disaster, since it nearly doubled the time needed to compute 10 trillion digits of Pi.

Y-cruncher v0.6.2 solves this problem by completely rewriting and redesigning the checkpoint-restart system to allow for finer-grained checkpointing. This was done by adding the ability to save and restore the execution stack. The result is that those 6-week gaps were reduced to mere hours - or at most a few days. This boost in reliability enabled us to sail through the 12.1 trillion digit computation without too many issues.

In this computation, there were 5 hard drive failures - 4 of which wasted very little computation time. The 5th and last failure happened during the verification of the radix conversion, which lacks checkpoints due to the in-place nature of the algorithm. So this final failure added about a week to the time that was lost.

There were a total of about 12 days with no useful computational work. This includes the time lost to hard drive failures as well as the time used to perform preemptive backups of the checkpoints.

If we assume that the fault-tolerance accounts for 2x of the 5x improvement, then the remaining 2.5x is from performance improvements. The new version of y-cruncher adds a lot of performance enhancements.
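To see why an MTBF shorter than a work unit is such a disaster, a rough back-of-the-envelope model helps. Assuming drive failures arrive as a Poisson process (an assumption of mine, not a claim from the post), the chance of surviving a run of length T with mean time between failures M is e^(-T/M), so the expected number of attempts is e^(T/M):

```python
import math

def expected_attempts(work_unit_weeks: float, mtbf_weeks: float) -> float:
    """Expected number of runs needed to complete one work unit without
    a failure, assuming failures follow a Poisson process."""
    p_success = math.exp(-work_unit_weeks / mtbf_weeks)  # P(no failure during the run)
    return 1.0 / p_success

# A 6-week work unit on an array with a 2-week MTBF:
print(round(expected_attempts(6, 2), 1))  # about 20 attempts on average
```

Under this simple model, a 6-week work unit against a 2-week MTBF succeeds only about 5% of the time per attempt - which is the "downward spiral" the text describes, and why shrinking checkpoint gaps to hours matters so much.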
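The key to the finer-grained checkpointing is making the "execution stack" ordinary data that can be serialized mid-recursion. Y-cruncher's actual C++ implementation is far more involved; this is only a toy sketch of the idea using a divide-and-conquer sum driven by an explicit stack of pending ranges:

```python
import pickle

# Toy sketch (not y-cruncher's actual code): because the pending work
# lives in an explicit stack rather than the native call stack, the whole
# state can be pickled at ANY step, so a checkpoint can land in the middle
# of the recursion instead of only between top-level passes.

def run(state: dict) -> tuple:
    checkpoint = pickle.dumps(state)
    while state["stack"]:
        lo, hi = state["stack"].pop()
        if hi - lo <= 1000:                   # base case: do the actual work
            state["total"] += sum(range(lo, hi))
        else:                                 # recursive case: split the range
            mid = (lo + hi) // 2
            state["stack"].append((mid, hi))
            state["stack"].append((lo, mid))
        checkpoint = pickle.dumps(state)      # in practice: write to disk
    return state["total"], checkpoint

state = {"stack": [(0, 10_000)], "total": 0}
total, _ = run(state)
print(total)  # 49995000, same as sum(range(10_000))
```

Restarting is then just unpickling the last saved state and calling `run` again - no need to replay the work that was already folded into `total`.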
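The post doesn't describe how the modular hash checks work internally, but the general technique is standard: an arithmetic identity such as c = a × b must also hold modulo any small prime, and computing residues is vastly cheaper than redoing the multiply. A minimal sketch, with the prime chosen arbitrarily for illustration:

```python
# Sketch of a modular hash check on a big multiplication. If c = a * b
# over big integers, then (a mod p) * (b mod p) mod p must equal c mod p.
# Verifying this with a machine-word-sized prime costs almost nothing
# compared to the multiply itself, and catches nearly all corruption of c.

P = 2**61 - 1  # a Mersenne prime; an arbitrary example choice

def passes_hash_check(a: int, b: int, c: int) -> bool:
    return (a % P) * (b % P) % P == c % P

a, b = 3**5000, 7**4000
assert passes_hash_check(a, b, a * b)          # the correct product passes
assert not passes_hash_check(a, b, a * b + 1)  # a corrupted product fails
print("modular hash check OK")
```

The same trick applies to the radix conversion: the value of the number must be unchanged modulo p before and after the conversion, so comparing the two residues validates the step without checkpoints.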
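The BBP verification works because the Bailey–Borwein–Plouffe formula can produce hexadecimal digits of Pi at an arbitrary position without computing any of the earlier digits, which makes it an independent spot-check of the main computation. The actual verification used a dedicated BBP program; this is only an illustrative sketch in double precision, trustworthy for just a handful of digits:

```python
# Illustrative BBP digit extraction (not the actual verification program).
# Pi = sum over k of 16^-k * (4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6)),
# so frac(16^n * Pi) is computable with cheap modular exponentiation.

def pi_hex_digits(n: int, count: int = 6) -> str:
    """Hex digits of Pi starting just after fractional position n."""
    def series(j: int) -> float:
        total = 0.0
        for k in range(n + 1):                       # exact part, via 3-arg pow
            total = (total + pow(16, n - k, 8 * k + j) / (8 * k + j)) % 1.0
        k = n + 1
        while True:                                  # rapidly vanishing tail
            term = 16.0 ** (n - k) / (8 * k + j)
            if term < 1e-17:
                break
            total += term
            k += 1
        return total

    x = (4 * series(1) - 2 * series(4) - series(5) - series(6)) % 1.0
    digits = ""
    for _ in range(count):
        x *= 16
        d = int(x)
        digits += "0123456789ABCDEF"[d]
        x -= d
    return digits

print(pi_hex_digits(0))  # Pi = 3.243F6A8885..., so this prints "243F6A"
```

Agreement between these independently computed hex digits and the tail of the main computation (before the radix conversion to decimal) is what makes the match mentioned above meaningful evidence of correctness.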