Ultra Ethernet Consortium publishes 1.0 specification, readies Ethernet for HPC, AI

In mid-2023, a host of big name networking vendors including Cisco, Arista, HPE and Intel got together to form the Ultra Ethernet Consortium (UEC). The goal is to make Ethernet better for the needs of AI and high-performance computing (HPC).

Now, nearly two years later, the UEC is delivering on its initial promise with the release of the first UEC 1.0 specification. The specification details enhancements to Ethernet that improve low-latency transport in high-throughput networking deployments. It includes a modern Remote Direct Memory Access (RDMA) approach, direct memory access implementations, transport protocols, and congestion control mechanisms.

“Originally, the initial companies that were looking to create an open approach to Ethernet were primarily focused on HPC, because at the time, that was where the gold standard for high performance networking was needed,” J Metz, chair of the steering committee, Ultra Ethernet Consortium, told Network World. “Only a few months after we formalized and founded, though, ChatGPT changed the world.”

Metz noted that, fortunately, UEC was prepared for the shift because AI was always part of the plan. However, it also made the group realize just how important its work was going to be.

“HPC has limited appeal; AI has widespread interest and focus,” he said. “It truly was the biggest shift from the moment we started to the moment we published.”

The significance of the 1.0 spec

A 1.0 version of a specification typically indicates a degree of stability that organizations and implementers will be able to rely upon.

Metz said that from his perspective, 1.0 is more than just a version of a standard doc. In his view it’s a milestone, because a group of companies and organizations have taken a full-stack approach to synchronizing a network to workload requirements. 

“Ethernet, a fantastic, general-purpose network designed to be as flexible as possible for as many different types of workloads as you can throw at it, was always assumed to be insufficient for the most demanding workloads because its flexibility worked against it,” Metz said. “The truth of the matter is that – as we found out in UEC – tuning Ethernet for specific workload requirements is hard. Hard, but not impossible.”

Tuning Ethernet requires knowing how and when to break the rules, especially around network layers. Metz said that UEC solves the challenge with open standards. It addresses the layer violations through coordination across the layer workgroups, and it addresses the problem of being a “net-new” protocol by working closely with ecosystem industry partners (such as SNIA, OCP, IEEE, DMTF and NVM Express).

“So, it’s more than just coming up with a specification, but rather developing a long-term framework for allowing end users the confidence that deploying UEC is not a one-off, isolated plan of action,” Metz said.

Congestion control at the core of UEC 

Among the key areas of innovation in the UEC 1.0 specification is a new mechanism for network congestion control, which is critical for AI workloads.

Metz explained that the UEC’s approach to congestion control does not rely on a lossless network as has traditionally been the case. It also introduces a new mode of operation where the receiver is able to limit sender transmissions as opposed to being passive. 

“This is critical for AI workloads as these primitives enable the construction of larger networks with better efficiency,” he said. “It’s a crucial element of reducing training and inference time.”
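The receiver-driven mode Metz describes can be illustrated with a toy credit scheme. The sketch below is not the UEC 1.0 transport; the class names, the `run_transfer` helper and the credit-per-buffer-slot rule are all assumptions made for illustration. It only shows the general idea of a receiver that limits how much a sender may transmit, rather than passively accepting whatever arrives.

```python
from collections import deque


class Receiver:
    """Hypothetical receiver that hands out credits instead of passively accepting data."""

    def __init__(self, buffer_slots: int) -> None:
        if buffer_slots <= 0:
            raise ValueError("buffer_slots must be positive")
        self.free_slots = buffer_slots
        self.delivered = []

    def grant(self, requested: int) -> int:
        """Grant at most as many credits as there are free buffer slots."""
        granted = min(requested, self.free_slots)
        self.free_slots -= granted
        return granted

    def receive(self, packet: int) -> None:
        """Store a packet in a slot that was reserved when the credit was granted."""
        self.delivered.append(packet)

    def drain(self, count: int) -> None:
        """Model the application consuming packets, which frees buffer slots."""
        self.free_slots += count


class Sender:
    """Hypothetical sender that may only transmit while it holds credits."""

    def __init__(self, packets) -> None:
        self.pending = deque(packets)
        self.credits = 0

    def send(self, receiver: Receiver) -> int:
        """Send one packet per held credit and return how many were sent."""
        sent = 0
        while self.pending and self.credits > 0:
            receiver.receive(self.pending.popleft())
            self.credits -= 1
            sent += 1
        return sent


def run_transfer(num_packets: int, buffer_slots: int):
    """Move num_packets to the receiver and return (rounds, delivered packets)."""
    receiver = Receiver(buffer_slots)
    sender = Sender(range(num_packets))
    rounds = 0
    while sender.pending:
        # The receiver, not the sender, decides how much may be in flight.
        sender.credits += receiver.grant(len(sender.pending))
        sent = sender.send(receiver)
        receiver.drain(sent)
        rounds += 1
    return rounds, receiver.delivered
```

In this toy model, `run_transfer(10, 4)` needs three rounds because the receiver never grants more than four credits at a time. Nothing is dropped, and the network itself does not have to be lossless for that to hold, because the sender never exceeds what the receiver has agreed to accept.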

Ethernet vs. InfiniBand: Is UEC the power boost?

InfiniBand has often been regarded as superior to Ethernet when it comes to HPC and AI, as it has better performance characteristics for those workloads. In many respects, UEC will dramatically level the playing field between the two rivals.

Metz noted that UEC takes a workload semantic approach.

“In a nutshell, that means that we take the workload and define the characteristics of the network that are necessary to tune the delivery of packets without requiring changes in the applications themselves,” he said. “Identifying the semantic requirements then turns into adjustments into the packet delivery system, and that in turn leads to the congestion requirements, the security requirements, the delivery ordering requirements, etc.”

On top of all that, he explained that UEC creates an environment in the network where the fabric end points aren’t just hardware-bound to a NIC port. Instead, UEC adds a major new capability that takes advantage of all paths in a network. State is only maintained for as long as a transaction exists, which reduces memory requirements and does not require new switching infrastructure.

Vendors embrace UEC

Over the last two years, support for UEC has spread to a growing number of networking vendors.

Among the original supporters is Arista Networks. Hugh Holbrook, chief development officer for Arista Networks, told Network World that from his perspective, the key deliverable of the 1.0 release is the specification of the new transport protocol. He noted that it is designed for future-looking AI and HPC requirements, including low tail latency, fast startup time, modern congestion control, and encryption. 

From a product perspective, Martin Hull, vice president and general manager of cloud and AI platforms at Arista, told Network World that his company’s portfolio is ready for the UEC 1.0 spec. 

“Arista will be supporting the UE 1.0 switching enhancements across our portfolio of Etherlink products, starting with the 7060X and 7800R initially,” Hull said.

Juniper Networks is also supporting the UEC effort. Amit Sanyal, head of data center product marketing at Juniper Networks, told Network World that Juniper is particularly excited about the UEC 1.0 specification’s ability to enable packet spraying at the switch level and reordering at the NIC. 

“This approach significantly improves network utilization using an open, standards-based method—capabilities that, until now, were only available in proprietary and closed systems,” Sanyal said.
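The combination Sanyal highlights, spraying packets across paths and restoring order at the receiving end, can be sketched in a few lines. This is a simplified illustration, not Juniper’s implementation or the UEC wire protocol: the function names, the round-robin spraying policy, the random arrival model and the sequence-number reorder buffer are all assumptions. The article does not say how a UEC NIC actually performs reordering.

```python
import random


def spray(packets, num_paths):
    """Spread (seq, payload) packets over the paths round-robin (assumed policy)."""
    paths = [[] for _ in range(num_paths)]
    for index, packet in enumerate(packets):
        paths[index % num_paths].append(packet)
    return paths


def arrive_out_of_order(paths, seed=0):
    """Interleave the paths randomly to mimic unequal per-path latency."""
    rng = random.Random(seed)
    queues = [list(path) for path in paths]
    arrivals = []
    while any(queues):
        queue = rng.choice([q for q in queues if q])
        arrivals.append(queue.pop(0))
    return arrivals


class ReorderBuffer:
    """Hypothetical receive-side buffer that releases payloads in sequence order."""

    def __init__(self):
        self.next_seq = 0
        self.held = {}

    def accept(self, seq, payload):
        """Hold the packet, then return every payload that is now deliverable in order."""
        self.held[seq] = payload
        ready = []
        while self.next_seq in self.held:
            ready.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return ready


def deliver(message, num_paths, seed=0):
    """Spray a message over num_paths and return the payloads as the receiver delivers them."""
    packets = list(enumerate(message))
    arrivals = arrive_out_of_order(spray(packets, num_paths), seed)
    buffer = ReorderBuffer()
    delivered = []
    for seq, payload in arrivals:
        delivered.extend(buffer.accept(seq, payload))
    return delivered
```

Calling `deliver(["a", "b", "c", "d", "e"], num_paths=3)` returns the payloads in their original order even though the packets took different paths and arrived interleaved. Using every path keeps links busy, and the cost of restoring order is moved to the end point.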

In terms of deployment, Sanyal said that Juniper is partnering with AMD on a jointly validated design that brings together Juniper’s high-performance switches with the UEC-ready AMD Pollara NIC.

What’s next for UEC 

According to Metz, UEC is just getting started. 

Metz said that four workgroups got started after the main 1.0 work began, each with its own initiatives that solidify and simplify deploying UEC. The workgroups cover storage, management, compliance and performance. He noted that all of them have projects under development aimed at improving ease of use, efficiency and simplified provisioning in the next stages.

UEC is also working on educational materials to help inform networking administrators about UEC technology and concepts. The group is also working with industry ecosystem partners.

“We have projects with OCP, NVM Express, SNIA, and more – with many more on the way to work on each layer – from the physical to the software,” Metz said. “We have no desire to attempt to be all things for all people, and are working with experts around the industry to solve those problems together.”
