CXL Brings Data Center-Sized Computing with the 3.0 Standard, Thinks Ahead to 4.0

A new version of a standard backed by major cloud providers and chip companies could change the way some of the world’s largest data centers and fastest supercomputers are built.

The CXL Consortium on Tuesday announced a new specification called CXL 3.0, also known as Compute Express Link 3.0, which removes more of the bottlenecks that slow down computation in enterprise computing and data centers.

The new specification provides a communication link between chips, memory, and storage in systems, and is twice as fast as its predecessor, CXL 2.0.

CXL 3.0 also has enhancements for more granular pooling and sharing of computing resources for applications such as artificial intelligence.

CXL 3.0 is about improving bandwidth and capacity, and can better provision and manage compute, memory, and storage resources, said Kurt Lender, co-chair of the CXL marketing task force, in an interview with HPCwire.

Hardware and cloud providers are coalescing around CXL, which has won out over competing interconnects. This week, OpenCAPI, an IBM-backed interconnect standard, merged with the CXL Consortium, following in the footsteps of Gen-Z, which did the same in 2020.

The consortium released the first CXL 1.0 spec in 2019 and quickly followed it with CXL 2.0, which is built on PCIe 5.0, found on a handful of chips such as Intel’s Sapphire Rapids and Nvidia’s Hopper GPU.

The CXL 3.0 specification is based on PCIe 6.0, which was finalized in January. CXL 3.0 has a data transfer rate of up to 64 gigatransfers per second, the same as PCIe 6.0.
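For a sense of scale, the raw numbers work out to roughly 128 GB/s in each direction for a full x16 link. The small C snippet below is just a back-of-the-envelope illustration; it ignores FLIT, CRC, and other protocol overhead, so usable bandwidth is somewhat lower.

```c
/* Back-of-the-envelope raw bandwidth of a x16 link at 64 GT/s
   (CXL 3.0 / PCIe 6.0). Ignores FLIT/CRC and protocol overhead. */
#include <stdio.h>

int main(void) {
    const double transfers_per_sec = 64e9; /* 64 GT/s per lane */
    const int lanes = 16;                  /* full-width x16 link */

    double bits_per_sec  = transfers_per_sec * lanes; /* ~1.024 Tb/s */
    double bytes_per_sec = bits_per_sec / 8.0;        /* ~128 GB/s per direction */

    printf("x16 @ 64 GT/s: ~%.0f GB/s per direction (raw)\n", bytes_per_sec / 1e9);
    return 0;
}
```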

The CXL interconnect can link chips, storage and memory that are near and far from each other, and that allows systems vendors to build data centers as one giant system, said Nathan Brookwood, principal analyst at Insight 64.

CXL’s ability to support expansion of memory, storage and processing on a disaggregated infrastructure gives the protocol a step up from rival standards, Brookwood said.

Data center infrastructures are moving to a decoupled structure to meet the growing processing and bandwidth needs of graphics and AI applications, which require large pools of memory and storage. AI and scientific computing systems also require processors beyond CPUs, and organizations are installing AI boxes, and in some cases quantum computers, for more power.

Feature progression from CXL 1.0 to CXL 3.0 (Source: CXL Consortium)

CXL 3.0 improves bandwidth and capacity through better fabric and switching technologies, Lender said.

“CXL 1.1 was more or less in the node, then with 2.0, you can expand a little more in the data center. And now you can cross racks, you can make decomposable or composable systems, with the… fabric technology that we’ve brought with CXL 3.0,” Lender said.

At the rack level, CPU or memory drawers can be made as separate systems, and enhancements in CXL 3.0 provide more flexibility and options for changing resources compared to previous CXL specifications.

Servers typically have a CPU, memory, and I/O, and may have limited room for physical expansion. In a disaggregated infrastructure, a cable can run to a separate memory tray over the CXL protocol, without relying on the ubiquitous DDR bus.

“You can break down or compose your data center however you want. You have the ability to move resources from one node to another, and you don’t have to do as much over-provisioning as we do today, especially with memory,” Lender said, adding that “it’s about being able to grow systems and kind of interconnect now through this fabric and through CXL.”
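On Linux systems that already support CXL memory expanders, the far memory is typically exposed as a CPU-less NUMA node, which is one concrete way software can reach a disaggregated memory tray today. The sketch below is a minimal illustration using libnuma; the assumption that the CXL memory appears as node 1 is mine, since node numbering varies by system.

```c
/* Minimal sketch: allocate a buffer directly on a CXL-attached memory node,
   assuming it is exposed to Linux as CPU-less NUMA node 1 (system-specific).
   Build with: gcc cxl_alloc.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA support not available on this system\n");
        return 1;
    }

    const int cxl_node = 1;          /* assumption: CXL expander shows up here */
    const size_t len = 64UL << 20;   /* 64 MiB */

    void *buf = numa_alloc_onnode(len, cxl_node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    memset(buf, 0, len);             /* touch the pages so they fault in on the far tier */
    printf("allocated %zu bytes on node %d\n", len, cxl_node);

    numa_free(buf, len);
    return 0;
}
```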

The CXL 3.0 protocol runs on the electricals of PCI-Express 6.0 and layers CXL’s own I/O and memory protocols on top. Improvements include support for new processors and endpoints that can take advantage of the added bandwidth. CXL 2.0 allowed only single-level switching, while 3.0 adds multi-level switching, which enables larger and more flexible fabrics.


“You can actually start to look at memory as storage: you could have hot memory and cold memory, and so on. It can have different levels and apps can take advantage of that,” Lender said.
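One way software can act on that kind of tiering today is by migrating pages between “hot” local DRAM and “cold” CXL-attached memory. The sketch below uses the Linux move_pages() call to demote a single page; treating node 0 as local DRAM and node 1 as the CXL tier is an assumption for illustration.

```c
/* Sketch: demote a "cold" page from local DRAM to a CXL memory tier with
   move_pages(). Node numbers (0 = DRAM, 1 = CXL) are assumptions.
   Build with: gcc demote.c -lnuma */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    long page_size = sysconf(_SC_PAGESIZE);
    void *cold = aligned_alloc(page_size, page_size);
    if (!cold) return 1;
    memset(cold, 0, page_size);      /* fault the page in on the local node */

    void *pages[1]  = { cold };
    int   nodes[1]  = { 1 };         /* assumed CXL memory node */
    int   status[1] = { -1 };

    /* pid 0 means the calling process; MPOL_MF_MOVE migrates the page if allowed. */
    long rc = move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
    if (rc < 0)
        perror("move_pages");
    else
        printf("page now reported on node %d\n", status[0]);

    free(cold);
    return 0;
}
```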

The protocol also accounts for the ever-changing infrastructure of data centers, giving more flexibility in how system administrators want to add and disaggregate processing units, memory, and storage. The new protocol opens up more channels and resources for new types of chips including SmartNICs, FPGAs, and IPUs that may require access to more memory and storage resources in data centers.

“HPC composable systems… you’re not limited by a box. HPC loves clusters these days. Y [with CXL 3.0] now you can do consistent, low-latency clustering. The growth and flexibility of those nodes is expanding rapidly,” Lender said.

The CXL 3.0 protocol can support up to 4,096 nodes and introduces the concept of memory sharing between different nodes. That is an improvement over the static configuration of earlier CXL versions, in which memory could be partitioned and attached to different hosts but could not be shared once allocated.

“Now we have sharing where multiple hosts can share a memory segment. Now you can see fast and efficient data movement between hosts if you need to, or if you have an AI type application that wants to transfer data from one CPU or one host to another,” said Lender.
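How operating systems will expose such shared segments to applications is still taking shape. As a purely illustrative sketch, if a pooled segment were surfaced to each host as a device-DAX character device, one host could publish data into it with an ordinary mmap; the /dev/dax0.0 path, the single-writer usage, and the absence of any synchronization here are all assumptions, not something the CXL 3.0 spec prescribes.

```c
/* Illustrative sketch: one host writes into a fabric-attached memory segment
   that is assumed to be exposed as a device-DAX node (/dev/dax0.0 is an
   assumed path). A peer host mapping the same segment could read the data
   without a CPU-mediated copy; coherence and synchronization are omitted. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t len = 2UL << 20;    /* one 2 MiB region */

    int fd = open("/dev/dax0.0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    char *region = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* "Publish" a message into the shared segment. */
    strcpy(region, "hello from host A");

    munmap(region, len);
    close(fd);
    return 0;
}
```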

CXL 3.0 also enables point-to-point connections between nodes and endpoints within a single domain, isolating that traffic so it moves only between the connected devices. That allows faster data transfers from accelerator to accelerator or from device to device, which is key to building a coherent system.

“If you think about some of the applications and then some of the GPUs and different accelerators, they want to pass information quickly, and now they have to go through the CPU. With CXL 3.0, they don’t have to go through the CPU like this, but the CPU is coherent, aware of what’s going on,” Lender said.

The pooling and allocation of memory resources is managed by software called the Fabric Manager. The software can sit anywhere in the system or on the hosts to control and allocate memory, but the changes could ultimately affect how software developers write applications.

“If you get to the level of tiers, and when you start getting all the different latencies in the switch, that’s where there’s going to have to be some application knowledge and application tuning. I think we certainly have that capability today,” Lender said.

It could be two to four years before companies start shipping CXL 3.0 products, and CPUs will need to support CXL 3.0, Lender said. Intel has built support for CXL 1.1 into its Sapphire Rapids chip, which is expected to begin shipping in volume later this year. The CXL 3.0 protocol is backward compatible with previous versions of the interconnect standard.

CXL products based on the older protocols are slowly coming to market. SK hynix this week showed its first samples of CXL memory based on DDR5 DRAM, and will begin manufacturing CXL memory modules in volume next year. Samsung also introduced CXL DRAM earlier this year.

While products based on the CXL 1.1 and 2.0 protocols are in a two- to three-year product release cycle, CXL 3.0 products could take a little longer as they require a more complex computing environment.

“CXL 3.0 might actually be a bit slower because of the work on the Fabric Manager software. These are not simple systems; when you start getting into fabrics, people will want to do proofs of concept and test the technology first. It will probably be three to four years,” Lender said.

Some companies started working on CXL 3.0 IP verification six to nine months ago and are fine-tuning their tools to the final specification, Lender said.

The CXL Consortium has a board meeting in October to discuss next steps, which could also involve CXL 4.0. The standards body for PCIe, the PCI Special Interest Group (PCI-SIG), announced last month that it is planning PCIe 7.0, which raises the data transfer rate to 128 gigatransfers per second, twice that of PCIe 6.0.

Lender was cautious about how PCIe 7.0 could potentially fit into a next-gen CXL 4.0. CXL has its own set of I/O, memory, and cache protocols.

“CXL sits on top of PCIe electrical components, so I cannot make any commitments or absolutely guarantee that [CXL 4.0] will run on 7.0. But that’s the intent: to use its electricals,” Lender said.

In that case, one of the principles of CXL 4.0 will be to double the bandwidth by moving to PCIe 7.0, but “beyond that, everything else is going to be what we do: more fabric or different settings,” Lender said.

CXL has moved at a fast pace, with three releases of the standard since its formation in 2019. There was confusion in the industry about the best high-speed, coherent I/O bus, but the focus has now coalesced around CXL.

“Now we have the fabric. There are pieces of Gen-Z and OpenCAPI that aren’t even in CXL 3.0, so will we incorporate them? Sure, we’ll do that kind of work in the future,” Lender said.
