Frontier Supercomputer Hardware Failures: Is It a Technical Issue?

Testing of the cutting-edge exascale Frontier supercomputer at Oak Ridge National Laboratory (ORNL) has been plagued by “daily hardware failures.”

Frontier is the first Cray supercomputer to feature the upcoming AMD Epyc CPUs and Radeon Instinct GPUs. It is based on Cray’s new Shasta architecture and Slingshot interconnect, both of which were announced in 2019. Once the system is fully functional and available to researchers, it is expected to deliver 1.5 exaflops of performance.

Although China is suspected of operating a number of exascale systems that it has not submitted to the Top500 list, Frontier is officially the world’s fastest supercomputer and the first to break the exascale barrier.

According to InsideHPC, the current issues appear to revolve around Frontier’s stability when running extremely taxing workloads. Some of the issues center on AMD’s Instinct GPU accelerators, which handle the bulk of the system’s processing workload and are paired with AMD Epyc CPUs within the system’s blades.

The publication previously reported that Frontier’s HPE Cray Slingshot fabric had issues beginning in the fall of 2021 and continuing into the spring of 2022. Oak Ridge Leadership Computing Facility (OLCF) program director Justin Whitt said the problems are typical of those seen in the past when tuning and testing supercomputers at the facility.

Hardware failures are inevitable at this scale, Whitt said, so “we’re working through issues in hardware and making sure that we understand” what those are. “The mean time between failure on a system of this size is hours, not days,” he added. “You need to make sure you understand what those failures are and that there is no pattern to those failures that you need to worry about.”
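A rough back-of-the-envelope sketch shows why a mean time between failures measured in hours is unsurprising at this scale. It assumes that components fail independently at a constant rate, so the system-level figure is roughly the per-component figure divided by the number of components. The per-GPU value below is a made-up illustrative number, not a measured figure for Frontier or for any AMD or HPE part; only the GPU count comes from the reported system specifications.

```python
# Rough sketch: how system-level MTBF shrinks as the component count grows.
# Assumes independent components with constant (exponential) failure rates.

def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """MTBF of a system that fails whenever any one of n identical components fails."""
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


GPU_COUNT = 37_632                      # GPU count reported for Frontier
HYPOTHETICAL_GPU_MTBF_HOURS = 200_000   # illustrative assumption only (about 23 years)

if __name__ == "__main__":
    mtbf = system_mtbf_hours(HYPOTHETICAL_GPU_MTBF_HOURS, GPU_COUNT)
    print(f"Estimated system MTBF: {mtbf:.1f} hours")
```

Under these assumptions, even a part that would run for decades on its own yields a system-wide interruption every few hours once tens of thousands of them must all work at the same time, and that is before counting CPUs, memory, network links, and power and cooling hardware.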

According to Whitt, the problems are mostly associated with running extremely large jobs that tax the system to its limits. “It’s mostly issues of scale coupled with the breadth of applications,” he said, “and having all the hardware cooperate to make that happen.” He likened this stage to a capstone project for supercomputers and described it as the trickiest part to get to.

According to Whitt, “it would be outstanding” if the system could go without crashing for an entire day. “We’re not super far off target,” he said, adding that “our goal is still hours,” albeit a longer interval than Frontier currently manages between failures. The problems are not limited to the GPUs but span many other areas as well.

Whitt said he does not see much cause for concern about the AMD products right now. The problems are not too out of the ordinary, he said; they are the same sorts of startup issues that have affected other machines the facility has deployed.

The supercomputer is made up of 74 cabinets, each weighing 8,000 pounds. Its 9,408 HPE Cray EX nodes are each powered by a single AMD “Trento” 7A53 Epyc CPU and four AMD Instinct MI250X GPUs, for a total of 37,632 GPUs. The total number of cores in the system is 8,730,112. Peak power consumption is 40 MW, and the system occupies an area of 372 square meters (4,004 square feet).
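The figures quoted above can be cross-checked against one another with a few lines of arithmetic. This is only a consistency check on the reported numbers; the variable names are chosen here for illustration, and the square-foot conversion factor is the standard one rather than anything specific to the source.

```python
# Sanity-check the reported Frontier figures against each other.
NODES = 9_408
GPUS_PER_NODE = 4
CABINETS = 74
CABINET_WEIGHT_LB = 8_000
AREA_M2 = 372
SQFT_PER_M2 = 10.7639  # standard square meters to square feet conversion

total_gpus = NODES * GPUS_PER_NODE               # should match the reported GPU count
total_weight_lb = CABINETS * CABINET_WEIGHT_LB   # combined weight of all cabinets
area_sqft = AREA_M2 * SQFT_PER_M2                # footprint in square feet

if __name__ == "__main__":
    print(f"Total GPUs: {total_gpus:,}")
    print(f"Total cabinet weight: {total_weight_lb:,} lb")
    print(f"Footprint: {area_sqft:,.0f} sq ft")
```

The node and GPU counts agree with the reported total of 37,632 GPUs, and 372 square meters does round to 4,004 square feet.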

Aurora, whose installation also began late last year, will come after Frontier. Equipped with Intel’s Ponte Vecchio GPUs, it targets an ambitious 2 exaflops. The El Capitan supercomputer, powered by AMD, is expected to enter service in 2023 and is also slated to deliver 2 exaflops.