Published on
Feb 13, 2024

Our Perspective on Data Centers and CXL

Compute Express Link (CXL) & The Future of Data Centers

[Research & Primers]

Data Centers Today

Data centers are large buildings full of networked computers that form the fundamental infrastructure layer of the world's digital economy. Often located on the outskirts of cities, these robust facilities provide large-scale organizations with the compute and networking power required to store and process their data and run their applications. Since data centers contain an organization's most critical and proprietary assets, it is imperative that they are secure, reliable, and efficient. Like the center on a football team, data centers are often overlooked relative to how vital they are; watch closely, though, and you realize they are the backbone of the offense, and that without them the team (in this case, our economy) cannot operate efficiently.

As large language models (LLMs) continue to take the world by storm and drive large and small enterprises alike toward mass AI adoption, they are creating unprecedented demand for compute resources. In addition to accommodating our growing data and internet streaming needs, data centers must also onboard the training and inference of multi-trillion-parameter foundation models. According to a McKinsey study, demand in the US market alone (measured by power consumption, a proxy for the number of servers a data center can house) is expected to reach 35 gigawatts (GW) by 2030, up from 17 GW in 2022. This mass data center buildout will be one of the most critical infrastructure development projects in recent U.S. history.

The Case for a Cohort of New Cloud Leaders

Legacy data centers will eventually be rendered obsolete: CPU-only servers cannot support the constant accelerated workloads that LLMs demand while also meeting uptime requirements, water-usage and power-sourcing constraints, and the strict service level agreements (SLAs) that buyers expect. An entirely new server architecture will be required across all data center tiers.

Our investment in CoreWeave last year was grounded in this very premise: that the AI revolution will produce a new leader in cloud services. The existing data center infrastructure will need to be reconfigured to accommodate the power and compute requirements of the future, but legacy operators will struggle to adapt quickly due to the complexity and cost of retrofitting. CoreWeave, through its innovative product architecture, first-mover advantage, signed data center capacity, and myriad strategic relationships (Nvidia, Microsoft, etc.), has already become the largest accelerated compute provider in the United States.


As hyperscalers gradually retrofit their data centers to match today's needs, there will be opportunities in the interim for new entrants that offer creative, efficient accelerated cloud offerings, like CoreWeave, Lambda Labs, Together AI, and The San Francisco Computer Company.

Memory as a Roadblock to Data Center Server Improvement

Despite the emergence of next-generation data center providers like CoreWeave and Lambda Labs, most of the landscape, including the hyperscalers, is still in the process of adapting. The core components of data centers are (1) servers & networking, (2) storage, (3) cooling, and (4) uninterruptible power supplies (UPS). While each of these components needs step-change improvements, perhaps the most important is the server architecture itself.

Bleeding-edge server architecture today revolves around the integration of Nvidia's flagship H100 graphics processing units (GPUs). Since these chips are so performant relative to the competition, Nvidia has become the industry-standard chip manufacturer for this stage of AI. To fully utilize these high-performance chips, the ancillary components within servers, such as the CPU, accelerators, memory, storage, and network interfaces, must also be upgraded. Improving server performance means solving the significant memory issues that data centers face today:

  1. Large latency gap between DRAM and solid-state drive (SSD) storage – If a processor exhausts its memory capacity, it must wait to retrieve data from the SSD. The time the processor spends idle significantly throttles performance (a rough order-of-magnitude comparison is sketched after this list).
  2. Core counts in multi-core processors are scaling faster than main memory channels – Once core count outstrips the available memory channels, each additional core is underutilized.
  3. Growth of accelerators with attached DRAM – More memory ends up stranded on individual devices, where the rest of the system cannot use it.
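
To make the first point concrete, here is a rough back-of-the-envelope comparison in Python. The latency figures are order-of-magnitude assumptions used purely for illustration, not measurements from any particular system:

```python
# Illustrative only: assumed order-of-magnitude access latencies, not measured values.
DRAM_LATENCY_NS = 100           # a typical DDR DRAM access is on the order of ~100 nanoseconds
NVME_SSD_LATENCY_NS = 100_000   # a typical NVMe SSD read is on the order of ~100 microseconds

# How much longer a processor waits when data is not in memory and must be
# fetched from the SSD instead.
gap = NVME_SSD_LATENCY_NS / DRAM_LATENCY_NS
print(f"Missing DRAM and hitting the SSD costs roughly {gap:,.0f}x more latency per access.")
# Prints: Missing DRAM and hitting the SSD costs roughly 1,000x more latency per access.
```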

This is easier said than done, as each component (CPU, memory, storage, accelerator, interconnect, etc.) is produced by a different manufacturer and cannot be fully optimized together out of the box.

A case study that speaks to the benefit of standardization in computer hardware is Apple's transition away from Intel processors in favor of its own dedicated silicon. With its debut M-series system on a chip (SoC) working in concert with in-house flash storage and multi-core CPUs, Apple was able to show a 100%+ performance improvement over the previous generation of Intel-powered MacBooks. This was a step-change improvement, rather than the incremental gains we were used to seeing in laptops.

In the context of data center servers, the near-term solution would be for leading manufacturers to agree on a server architecture standard in which every component can communicate with the others without friction – something modular, scalable, and relatively future-proof. Thankfully, such a standard already exists for server interconnects, called CXL, and it is a step in the right direction.

Understanding Compute Express Link (CXL)

The Compute Express Link (CXL) standard was introduced in 2019 by a consortium of leading computer component manufacturers including Intel, Google, Cisco, Nvidia, AMD, and Microsoft. The goal was to develop an open interconnect standard through which processors, expanded memory, and accelerators could communicate with low latency and maintain memory coherence, even within heterogeneous system architectures. In a server with CXL interconnects, the host CPU and external devices effectively share each other's memory, which solves the previously highlighted issue of servers exhausting their memory capacity.

CXL builds on top of the existing Peripheral Component Interconnect Express (PCIe) standard that dominates the industry today and extends its capabilities with three main protocols. In combination, these protocols allow every component in a server to utilize the others' memory and even expand total memory capacity when needed.

  1. CXL.io: Functionally similar to PCIe, the existing standard interface for motherboard components; used for device discovery and configuration.
  2. CXL.cache: Enables accelerators to coherently cache the host CPU's memory for added performance.
  3. CXL.memory: Allows the host CPU to access device-attached memory.

The illustrations below show how these protocols work together to facilitate memory sharing.

[Diagrams: how the CXL.io, CXL.cache, and CXL.memory protocols work together to share memory between the host CPU and attached devices]
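
For readers who prefer code to diagrams, here is a toy sketch of what a single, coherent memory space buys you: the host and an accelerator read and write the same addresses instead of copying buffers back and forth. The class below is a hypothetical stand-in for a CXL-attached memory pool, not an actual CXL API.

```python
# Toy illustration only: not real CXL semantics or a real driver interface.

class CoherentMemoryPool:
    def __init__(self):
        self._mem = {}

    def write(self, addr, value, who):
        # Any agent (host or device) can write; every agent sees the update.
        self._mem[addr] = value
        print(f"{who} wrote {value!r} at {addr:#x}")

    def read(self, addr, who):
        value = self._mem.get(addr)
        print(f"{who} read {value!r} at {addr:#x}")
        return value


pool = CoherentMemoryPool()
# CXL.memory-style: the host CPU writes directly into device-attached memory.
pool.write(0x1000, "model weights", who="host CPU")
# CXL.cache-style: the accelerator reads the same data coherently, with no copy step.
pool.read(0x1000, who="accelerator")
```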

The Evolution of CXL Interconnect Standards

Since 2019, Compute Express Link has evolved considerably, introducing step-change improvements in speed and functionality with every revision. Below is a summary:

CXL 1.0/1.1

  • Allows for only one host CPU.
  • Leverages PCIe 5.0 physical & electrical interface.
  • Allows data transfers at 32 GT/s in each direction over a 16-lane link.
  • CXL 1.x devices can only be utilized by one host processor at a time.

CXL 2.0

  • Introduces CXL switches that allow up to 16 host CPUs to simultaneously access all memory in the system.
  • CXL 2.0 devices can be utilized by multiple host processors at once.

CXL 3.1

  • Leverages PCIe 6.1 physical & electrical interface.
  • Increases data transfer speeds to 64 GT/s in each direction over a 16-lane link (a rough bandwidth calculation is sketched after this list).
  • Peer-to-peer memory access: devices can communicate with each other without involving the host CPU.
  • Allows memory allocations to be dynamically reconfigured without rebooting the host.
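
To put the raw transfer rates above in perspective, here is a rough bandwidth calculation in Python. The encoding-efficiency figures are simplifying assumptions rather than official CXL or PCIe numbers, so treat the results as ballpark values only.

```python
# Back-of-the-envelope bandwidth for the PCIe physical layers that CXL borrows.
# The encoding-efficiency values are rough approximations used for illustration.

def per_direction_bandwidth_gbytes(transfer_rate_gt_s: float, lanes: int,
                                    encoding_efficiency: float) -> float:
    """Approximate usable bandwidth, in GB/s, for one direction of a link."""
    raw_gbit_s = transfer_rate_gt_s * lanes          # each transfer moves one bit per lane
    usable_gbit_s = raw_gbit_s * encoding_efficiency
    return usable_gbit_s / 8                         # bits -> bytes

# CXL 1.x / 2.0 on a PCIe 5.0 x16 link (128b/130b encoding assumed)
print(per_direction_bandwidth_gbytes(32, 16, 128 / 130))   # ~63 GB/s per direction

# CXL 3.x on a PCIe 6.x x16 link (PAM4 flit mode; protocol overhead ignored here)
print(per_direction_bandwidth_gbytes(64, 16, 1.0))          # ~128 GB/s per direction
```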

Summarizing the Benefits of CXL-Enabled Servers

  1. Enables low-latency connectivity between all server components.
  2. Cache coherency ensures that the host processor and CXL devices (GPU, FPGA, storage, SmartNIC) all see the same data.
  3. Allows the host processor to tap into attached memory-expansion devices when it runs out of local capacity.
  4. Creates an "as needed" memory paradigm in which all memory in a CXL-enabled system can be fully utilized by the underlying host processors.
  5. All three CXL protocols are secured via Integrity and Data Encryption (IDE) which provides confidentiality, integrity, and replay protection.

We believe that the CXL interconnect standard will play a pivotal role in facilitating many more performance and efficiency breakthroughs for data centers moving forward. Standardizing cache-coherent interconnects lets each manufacturer in the value chain fully optimize its product without having to worry about compatibility with other components. The benefits of the CXL interconnect standard are clear – its rapid development (three generations since 2019) and industry-wide adoption are a testament to the tremendous value it has delivered so far.

Final Thoughts

MNH Thesis: The future of data centers will rely on cloud-based, modular infrastructure with more powerful CPUs, accelerators (GPUs), and memory devices that are stitched together with CXL interconnects within servers and InfiniBand interconnects across servers.

We are keen to keep a pulse on developments in the CXL standard. As compute requirements increase exponentially, there will be growing pressure on incumbents and new entrants alike to build clever new server components that take advantage of CXL interconnects. If you are a founder building switches, chips, or anything related to data center servers, please reach out to us. Our deep relationships with some of the largest tech manufacturers in the world (Samsung, SK, Toyota, and more) put us in a unique position to accelerate your path to commercialization.
