Copper's reach is shrinking, so Broadcom is strapping optics directly to GPUs
Posted on Aug 28, 2024, 10:34 AM
In modern AI systems, using PCIe to stitch together accelerators is already too slow. Nvidia and AMD use specialized interconnects like NVLink and Infinity Fabric for this reason – but at the 900-plus GB/sec these links push, copper will only carry you so far.
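For a sense of the gap, here's a rough back-of-the-envelope comparison – the PCIe Gen5 x16 figure is a commonly cited approximation and our own assumption, not something from Broadcom:

```python
# Rough per-accelerator bandwidth comparison, both figures bidirectional totals.
# PCIe Gen5 x16 is roughly 63 GB/s per direction (assumed approximation);
# the 900 GB/s NVLink figure is the one quoted above.
pcie_gen5_x16 = 2 * 63      # ~126 GB/s bidirectional, assumed
nvlink_total = 900          # GB/s, per the article

print(f"PCIe Gen5 x16: ~{pcie_gen5_x16} GB/s vs NVLink: {nvlink_total} GB/s "
      f"(~{nvlink_total / pcie_gen5_x16:.0f}x)")
```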
According to Manish Mehta, VP of marketing and operations for Broadcom's optical systems division, in high-end clusters copper can only carry you about three to five meters before the signal starts to break down. And as serializer-deserializers (SerDes) push beyond 200Gbit/sec, copper's reach is only going to get shorter.
The answer, as you might expect, is to ditch copper in favor of optics – even though that comes at the expense of increased power consumption. Nvidia has estimated that using optics rather than copper for its NVL72 systems would have required another 20 kilowatts per rack – on top of the 120 kilowatts for which they're already rated.
While the individual transceivers don't pull that much power – just 13 to 15 watts each, according to Mehta – that adds up pretty quickly when you're talking about multiple switches with 64 or 128 ports each. "If there is a need for scale-up to progress to higher reach and therefore optical connectivity, you're going to need 10x of that bandwidth, and that will just be unachievable with this type of paradigm," he explained during a speech at the Hot Chips conference this week.
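To put Mehta's point in numbers, here's a minimal sketch of the pluggable-optics power math using the 13-15 watt per-transceiver figure above; the switch radix and per-rack switch count are illustrative assumptions, not Broadcom's:

```python
# Back-of-the-envelope power estimate for pluggable optics in a single rack.
watts_per_transceiver = 14   # midpoint of the 13-15 W range Mehta cites
ports_per_switch = 128       # assumed switch radix, for illustration
switches_per_rack = 8        # assumed switch count per rack, for illustration

optics_kw = watts_per_transceiver * ports_per_switch * switches_per_rack / 1000
print(f"Pluggable optics alone: ~{optics_kw:.1f} kW per rack")
# ~14.3 kW - in the same ballpark as the extra 20 kW Nvidia estimated for an
# optical NVL72, and that's before any 10x jump in scale-up bandwidth.
```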
Instead, Broadcom is now experimenting with co-packaging the optics directly into the GPUs themselves.
Co-packaged optics (CPO) is something Broadcom has explored for several years. You may recall that back in 2022, the network giant showed off its Humboldt switches, which offered a 50/50 mix of traditional electrical and co-packaged optical interfaces.
A few months later, in early 2023, Broadcom demoed a second-generation CPO switch with twice the bandwidth at 51.2Tbit/sec, which stitched eight 6.4Tbit/sec optical engines to a Tomahawk 5 ASIC for 64 purely optical 800Gbit/sec ports. More importantly, Mehta claims that by doing so Broadcom was able to cut per-port power consumption to a third, at five watts per port.
Last year, Broadcom demonstrated a 51.2Tbit/sec CPO switch with 64 pure optical 800G ports
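The arithmetic behind those figures is easy to check – a quick sketch, using only the numbers quoted above:

```python
# Eight 6.4 Tbit/s optical engines versus 64 ports of 800 Gbit/s each.
engines, engine_tbps = 8, 6.4
ports, port_gbps = 64, 800

print(engines * engine_tbps)        # 51.2 Tbit/s from the optical engines
print(ports * port_gbps / 1000)     # 51.2 Tbit/s across the 800G ports

# Per-port power: 5 W with CPO versus the 13-15 W of a pluggable transceiver -
# roughly a third, as Mehta claims.
print(f"Optics power for the switch: {ports * 5} W vs ~{ports * 14} W pluggable")
```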
Broadcom's latest endeavor, disclosed at Hot Chips this week, has been to co-package one of these optical engines with a GPU, allowing for roughly 1.6TB/sec of total interconnect bandwidth – that's 6.4Tbit/sec, or 800GB/sec, in each direction – to each chiplet, while demonstrating "error free performance," Mehta explained. That puts it in the same ballpark as Nvidia's next-gen NVLink fabrics due to ship alongside its Blackwell generation, which will deliver 1.8TB/sec of total bandwidth to each GPU over copper.
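The unit conversion behind those numbers, as a quick sketch:

```python
# One 6.4 Tbit/s optical engine per direction, expressed in bytes, compared
# with the 1.8 TB/s (bidirectional total) quoted for next-gen NVLink.
engine_tbits = 6.4
per_direction_gbytes = engine_tbits * 1000 / 8      # 800 GB/s each way
total_tbytes = 2 * per_direction_gbytes / 1000      # 1.6 TB/s total

print(f"{per_direction_gbytes:.0f} GB/s per direction, "
      f"{total_tbytes:.1f} TB/s total vs NVLink's 1.8 TB/s")
```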
To test the viability of optically interconnected chips, Broadcom co-packaged the optics with a test chip designed to emulate a GPU
To be clear, there's no A100 or MI250X floating around with a Broadcom optical interconnect on it. Not that we're aware of, anyway. The GPU in Broadcom's experiments is actually just a test chip designed to emulate the real thing. To that end, it uses TSMC's chip-on-wafer-on-substrate (CoWoS) packaging tech to bond a pair of HBM stacks to the compute die. But while the chip logic and memory sit on a silicon interposer, Broadcom's optical engine is located on the substrate.
This is important because essentially every high-end accelerator that uses HBM relies on CoWoS-style advanced packaging – even if Broadcom's own optical chiplet doesn't need it.
According to Mehta, this kind of connectivity could support 512 GPUs in as few as eight racks, acting as a single scale-up system.
Broadcom argues that co-packaged optics could allow for large scale-up systems composed of hundreds of GPUs
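For scale, the 512-GPU figure works out as follows – the NVL72 comparison is ours, not Broadcom's:

```python
# 512 GPUs spread across eight racks implies 64 GPUs per rack - close to the
# 72 GPUs Nvidia packs into a single NVL72 rack, except the optical fabric
# lets the scale-up domain span all eight racks.
gpus, racks = 512, 8
print(f"{gpus // racks} GPUs per rack across {racks} racks")
```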
Now, you might be thinking, aren't Amazon, Google, Meta, and a whole host of datacenter operators deploying clusters of 10,000 or more GPUs already? They certainly are – but these clusters fall in the scale-out category. Work is distributed over comparatively slow Ethernet or InfiniBand networks to systems with at most eight GPUs.
What Mehta is talking about is scale-up systems, like Nvidia's NVL72. Only instead of 72 GPUs made to work as one big one, the fabric is fast enough and can reach far enough to have hundreds acting as one giant accelerator.
In addition to pushing the optical engines beyond 6.4Tbit/sec, Mehta also sees the potential to stitch multiple optical chiplets onto a single compute package.
If any of this sounds familiar, it's because Broadcom isn't the first to attach optical interconnects to a chip. Earlier this year, Intel revealed it had successfully co-packaged an optical chiplet capable of 4Tbit/sec of bidirectional bandwidth. And last year, Chipzilla showed off a similar concept using co-packaged optical chiplets developed by Ayar Labs.
A number of other silicon photonics startups have also emerged promising similar capabilities – including Lightmatter and Celestial AI, whose products are in various stages of development and production.
And while there's no Instinct GPU or APU from AMD with co-packaged optics that we know of just yet, this spring AMD CTO Mark Papermaster and SVP Sam Naffziger discussed the possibility of such a chip. ®