

More information on LIGHTMATTER NextPlatform article One Laser To Pump Up AI Interconnect Bandwidth By 10X

posted on Oct 17, 2024 12:12AM

https://www.nextplatform.com/2024/10/16/one-laser-to-pump-up-ai-interconnect-bandwidth-by-10x/

(images at link)

According to rumors, Nvidia is not expected to deliver optical interconnects for its GPU memory-lashing NVLink protocol until the “Rubin Ultra” GPU compute engine in 2027. And that means everyone designing an accelerator – particularly the ones being designed in house by the hyperscalers and cloud builders – is hoping to one-up Nvidia in AI compute by deploying optical interconnects ahead of Big Green to gain an edge.

The demand for optical interconnects is so high, given the immense bandwidth bottlenecks for accelerator-to-accelerator and accelerator to memory needs, that raising venture funding is not a problem. We are seeing more and more action on this front, and today we are going to talk about Xscape Photonics, an optical interconnect startup that is spinning out of research at Columbia University.

 

Columbia University, whether you know it or not, is a hotbed for interconnects and photonics.

Professors Al Gara and Norman Christ built a DSP-powered supercomputer with a proprietary interconnect to run quantum chromodynamics applications, which won the Gordon Bell Prize in 1998. This QCDSP system research laid the foundation for IBM’s BlueGene massively parallel supercomputers, of which Gara was the chief architect. (Gara moved to Intel, where he was also the architect of the “Aurora” supercomputer at Argonne National Laboratory, BlueGene’s presumed successor.)

A whole different group of researchers at Columbia work on silicon photonics, and many of them have teamed up to create Xscape Photonics. Keren Bergman, who runs the Lightwave Research Lab at the university, is one of the company’s co-founders and has been using photonics to drive down the energy to move data in systems. Co-founder Alex Gaeta, who is president of the startup and who was originally its chief executive officer, did foundational work in quantum and non-linear photonics, namely parametric amplifiers and optical frequency comb generators. Co-founder Michal Lipson invented some of the key photonics components, such as microring modulators and nanotaper couplers. Co-founder Yoshi Okawachi is an expert in a special kind of laser called an optical frequency comb.

 

When these Columbia researchers decided to go commercial with their optical interconnect ideas, it is interesting that they chose Vivek Raghunathan, a co-founder who did not come from Columbia, as chief executive officer at Xscape after Gaeta decided to scale back his responsibilities and return to his professorship at the university.

Raghunathan hails from MIT, where he got his degrees in materials science and engineering and where some peers at Xscape spent some time as well; he did six years as a research assistant at MIT on various silicon photonics projects and joined Intel in 2013 as a senior packaging R&D engineer at its foundry in Chandler, Arizona. Raghunathan rose through the ranks and led the creation of Intel’s first 100 Gb/sec Ethernet optical transceivers as well as working on its GPU-to-HBM interconnects. Raghunathan did a stint as an engineer at Rockley Photonics and then joined Broadcom in 2019, where he was the leader of the “Humboldt” co-packaged optics for the 25.6 Tb/sec variant of the Tomahawk 4 switch ASIC, which was deployed by Tencent and ByteDance in China. Raghunathan started the “Bailly” follow-on CPO project for the 51.2 Tb/sec Tomahawk 5 generation, but left before it was finished to join Xscape.

The big news this week is that Xscape raised $44 million in its Series A venture round, following a $13 million seed round after its founding in 2022. The financing was led by IAG Capital Partners. Interestingly, Altair, the creator of the HyperWorks computer-aided engineering tool, is an investor, and one of its founders, an alum of Columbia as well, is on the board for the engineering school. Cisco Investments, Fathom Fund, Kyra Ventures, LifeX Ventures, Nvidia, and Osage University Partners also participated in the round.

The Nvidia investment is interesting given the need to connect larger numbers of GPUs to each other than what the GPU giant has been able to do with its copper-based NVLink-NVSwitch interconnects in the GB200 NVL72 rackscale system announced in March using the “Blackwell” B100 GPU accelerators. With light pipes between the GPUs and possibly their memories, Nvidia could literally turn a datacenter into one giant, virtual GPU. And you can bet that this is precisely what Nvidia wants to do, and has hinted at with its NVSwitch with CPO concept design back in 2022.

The problem with AI accelerators, no matter what their architecture, is that once you go outside of the edges of a given device, the bandwidth between compute elements or memory starts to taper off, and rather quickly.

For any accelerator, the need to put HBM stacked memory very close to the compute engine using electrical signals means you can only pack so much in a given circumference around that chip. And you can only stack the memory so high to increase capacity, and even when you do, that does not increase bandwidth. Only faster memory and more memory ports can increase bandwidth. And because HBM is expensive and in short supply, we see the GPU accelerator roadmaps doing weird things to match the limited memory capacity and bandwidth that they can get against the sometimes overpowered GPUs that they have.

 

What it all comes down to, says Raghunathan, is the “escape velocity” of data coming out of an accelerator, which is where the name Xscape Photonics comes from. (Don’t get too literal with this.)

These are numbers we talk about often, but it is nice to have them all in one place to show the tapering, which is on the order of 160X when you look at a cluster of Nvidia GB200 hybrids, which have two “Blackwell” GB100 GPU accelerators for every one “Grace” CG100 Arm server CPU. This tapering in bandwidth is measured by comparing one of those GPUs against a 400 Gb/sec Quantum 2 InfiniBand port, which is commonly used to let a GPU talk to other GPUs in the cluster and outside of its own node.
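The 160X figure can be sanity-checked with back-of-the-envelope arithmetic. The numbers below are our illustrative assumptions, not from the article: roughly 8 TB/sec of HBM bandwidth on a Blackwell-class GPU versus a single 400 Gb/sec InfiniBand port.

```python
# Rough check of the ~160X bandwidth tapering figure.
# Assumed numbers (not from the article): ~8 TB/s of on-package HBM
# bandwidth per GPU versus one 400 Gb/s Quantum 2 InfiniBand port.

hbm_bandwidth_tbps = 8.0                 # TB/s of on-package memory bandwidth
ib_port_gbps = 400                       # Gb/s network port
ib_port_tbps = ib_port_gbps / 8 / 1000   # convert Gb/s -> TB/s (0.05 TB/s)

tapering = hbm_bandwidth_tbps / ib_port_tbps
print(f"Bandwidth tapering: {tapering:.0f}X")  # -> Bandwidth tapering: 160X
```

Different assumed HBM bandwidths will move the ratio around, but any reasonable choice lands in the low hundreds, which is the point of the chart.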

So what is the effect of this bandwidth tapering, which means data cannot move into and out of the GPU fast enough? Low utilization on very expensive devices.

 

For AI training and inference, Raghunathan cites data from Alexis Bjorlin, who used to run infrastructure for Meta Platforms but who has moved over to be general manager of the DGX Cloud at Nvidia. Take a look:

“For training, as a consequence, it happens that most of the GPUs as you continue scaling, the problem has shifted from GPU device level performance to a system level networking problem,” Raghunathan tells The Next Platform. “Depending upon the workload, you end up spending a lot of time in communication between the GPUs. In the chart that Meta showed, they talk about certain workloads where almost 60 percent of the time is being spent in networking. Similarly, when you’re thinking about inference, you’re looking at state of the art GPUs with anywhere between 30 percent to 40 percent utilization doing a ChatGPT search. This low GPU utilization is the fundamental problem that our customers want to solve as they continue buying billions and billions and billions of dollars of GPUs.”

This math is simple. At 50 percent utilization, a share of peak compute that is preordained by the limited bandwidth into and out of the GPU, the GPU effectively costs twice as much as you think it does, which means you wasted half of the money.

Now, to be fair, we doubt very much that CPU utilization averaged across the whole world is any higher than 50 percent. But the average CPU doesn’t cost $30,000, either. It is probably closer to $1,000 averaged across some 15 million servers a year. But that is still tens of billions of dollars a year that goes up the inefficiency chimney. The waste for GPUs is more than an order of magnitude more “lost” money, which is why everybody is freaking out.
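The cost arithmetic in the two paragraphs above is easy to make explicit. The $30,000 GPU price is the article's figure; the 50 percent utilization is the illustrative case it discusses.

```python
# At 50 percent utilization, each unit of *delivered* compute costs twice
# the sticker price, so half the money spent on the device is wasted.
# The $30,000 price comes from the article; utilization is illustrative.

def effective_cost(price_usd: float, utilization: float) -> float:
    """Sticker price divided by utilization = price per delivered unit."""
    return price_usd / utilization

gpu_price = 30_000
utilization = 0.50

print(effective_cost(gpu_price, utilization))  # -> 60000.0, i.e. 2X sticker
print(gpu_price * (1 - utilization))           # -> 15000.0 wasted per GPU
```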

“It is this tapering of the bandwidth that we really want to solve at Xscape Photonics,” says Raghunathan, echoing many of the comments we have heard from Ayar Labs, Lightmatter, Eliyan, Celestial AI, the members of the Ultra Accelerator Link consortium, and others. “And how do we solve that? We think converting all the electrical signals that escape out of the GPU directly to optical signals in the same package, and maximizing that as we connect a pool of GPUs and memory together, is the most cost efficient and energy efficient way of like scaling the GPU performance.”

The trick that the Xscape team has come up with is a laser that can drive multiple wavelengths at the same time out of a fiber – like up to 128 different colors, potentially representing a factor of 32X higher bandwidth than is available with lasers used in optical interconnects that drive four different colors. Moreover, says Raghunathan, the Xscape approach with its ChromX platform will use simpler modulation schemes, like NRZ, that do not impact latency the way that higher order modulation schemes like PAM-4 do and that have been used to boost the bandwidth of InfiniBand and Ethernet in recent years.
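The bandwidth claim above is linear arithmetic: aggregate fiber bandwidth scales with the number of wavelengths, so 128 colors against 4 colors at the same per-wavelength rate gives the 32X factor. The 100 Gb/sec lane rate below is an illustrative assumption, not an Xscape specification.

```python
# Aggregate bandwidth of a WDM link scales linearly with the number of
# wavelengths on the fiber. Lane rate here is illustrative only.

def link_bandwidth_gbps(wavelengths: int, lane_rate_gbps: float) -> float:
    return wavelengths * lane_rate_gbps

conventional = link_bandwidth_gbps(4, 100)    # four-color laser: 400 Gb/s
chromx_style = link_bandwidth_gbps(128, 100)  # 128-color laser: 12,800 Gb/s

print(chromx_style / conventional)  # -> 32.0, the factor cited above
```

Note that this scaling is independent of the modulation scheme; the article's point about NRZ versus PAM-4 is a separate latency trade-off, not a bandwidth one.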

 

Perhaps equally importantly, the ChromX photonics platform is programmable, so the number of wavelengths provided matches the need of a particular AI training or inference workload and the connectivity needs between accelerators and their HBM memory, all within a switched fabric infrastructure. The programmable laser is coming out first, and here is the concept:

This chart shows on the left the four wavelengths needed out of a laser used for a CWDM4 transceiver to create the interconnect for an AI training cluster.

In the middle are the four different wavelengths needed to make an LR4 optical transceiver that is commonly used when you have to use optical links to span two datacenters and synchronously link them so training can occur over both as if they were one larger datacenter.

And on the right is an inference engine with a switched accelerator and HBM memory complex, which is quite a bit different from what Nvidia has done with NVLink and NVSwitch, and which uses sixteen different wavelengths.

The different wavelengths correspond to the expected distances between the devices. Training is usually 2 kilometers or less between devices, according to Raghunathan, and the cross-datacenter edge use case is expected to be between 20 kilometers and 40 kilometers, though some people are talking about 10 kilometers to 20 kilometers. Inference has more wavelengths, and the distances between devices are expected to be between 10 meters and 200 meters, requiring much more bandwidth to make those devices hum at high efficiency.
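The three reach/wavelength pairings described above can be collected into a small lookup. The values paraphrase the article's stated targets; they are not ChromX specifications, and the profile names are our own labels.

```python
# Use case -> (reach, wavelength count), as described in the article.
# These are stated targets, not product specifications; names are ours.
chromx_profiles = {
    "training_cluster": {"reach_km": (0, 2),      "wavelengths": 4},
    "cross_datacenter": {"reach_km": (20, 40),    "wavelengths": 4},
    "inference_fabric": {"reach_km": (0.01, 0.2), "wavelengths": 16},
}

for name, p in chromx_profiles.items():
    lo, hi = p["reach_km"]
    print(f"{name}: {lo}-{hi} km, {p['wavelengths']} wavelengths")
```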

 

This latter bit, which is related to a disaggregated fabric architecture for compute and memory, and which we think is valid for both training and inference, is interesting. So let’s look at that:

With this architecture, the HBM memory is not attached to the GPUs, but is glued together in banks that would probably be implemented on physically distinct shelves in a rack or across entire racks. The GPUs – or indeed any kind of AI or HPC accelerator – are banked together so they can share local data in their caches in a coherency domain, but all of them are linked through a switch that cross-connects the accelerator pools with the memory pools. Each one of those pipes above is an optical link, and its properties can be programmed by the ChromX platform, with the right number of wavelengths at the right frequencies to meet bandwidth and distance (and therefore latency) requirements.

“Our technology pretty much unlocks that cost barrier and scale barrier and makes it extremely reliable because we need only one laser that will pump over a piece of silicon, and we can generate up to hundreds of wavelengths from a single device,” says Raghunathan. “We give a completely new vector of bandwidth scaling. The core IP is exclusively licensed from Columbia University and completely owned by us. The vision is to match in-package communication bandwidth to off-package communication escape bandwidth. And we think when we use our multicolor approach, we can match that so that giant datacenters – or multiple datacenters – behave as one big GPU.”

Right now, Xscape Photonics is not trying to make the network interfaces or switches that enable this disaggregated photonics fabric, but rather is trying to create the right low-powered, multi-color laser that others will want to buy to make these devices. Xscape has one laser doing all of these frequencies in a market where others have to use multiple lasers to accomplish the same thing. The idea is to reduce the overall power consumption of the interconnects for accelerators and their memories by 10X and also boost the bandwidth by 10X, which together reduce the energy per bit by 100X.
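The 100X figure follows directly from the definition of energy per bit as power divided by bandwidth: a 10X power cut compounds with a 10X bandwidth increase. A minimal sketch with illustrative baseline numbers:

```python
# Energy per bit = power / bandwidth. A 10X power reduction combined with
# a 10X bandwidth increase compounds to a 100X energy-per-bit reduction.
# Baseline values below are illustrative, not from the article.

baseline_power_w = 10.0   # interconnect power, watts
baseline_bw_bps = 1e12    # interconnect bandwidth, bits/sec

new_power_w = baseline_power_w / 10
new_bw_bps = baseline_bw_bps * 10

epb_old = baseline_power_w / baseline_bw_bps  # joules per bit
epb_new = new_power_w / new_bw_bps

print(epb_old / epb_new)  # -> 100.0
```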

 

It will be interesting to see who adopts this Xscape laser, and how.
