More DL
posted on Aug 27, 2024 04:05PM
While seeking out DL interviews, I rediscovered one that Nodrog surely posted before. After reading it (link and text at bottom), here's a tweak to my thinking from a previous post or two:
CAI's memory module has Samsung inside. The hyperscaler can choose between AMD or Nvidia GPUs for an AI system targeting specific applications. I expect AMD (a CAI investor) to be heavily involved with CAI. An AMD + Photonic Fabric system would reduce power, save money and outperform an equivalent native Nvidia system of today.
Nvidia would have to convince a hyperscaler to buy Nvidia GPUs for an Nvidia + Photonic Fabric system. It could be that lower-end (fewer FLOPS), less expensive AMD GPUs will be given preference.
-----------
https://www.eenewseurope.com/en/ceo-interview-celestial-ais-terabit-optical-interconnect/
David Lazovsky, CEO of Celestial AI, tells eeNews Europe about its terabit optical interconnect and analog technology to reduce the power consumption of high performance AI chips and chiplets.
Thermal issues are at the heart of the challenge for high performance AI chips in the data centres. Hyperscaler operators are looking at ways to deliver more performance in the same thermal and power envelope.
Celestial AI in Santa Clara, California, has developed an optical waveguide to distribute signals to the chiplet grating coupler, providing more data at lower power consumption without increasing the thermal profile or the cost.
Backed by the imec.xpand innovation fund as well as major US funds, the company says it is locking up the majority of photonic capacity in the next two years for its technology.
This is based on photonic integrated circuits (PICs) with an electroabsorption modulator (EAM) in the substrate and diodes implanted into the chip to capture the data. This changes the way chips are designed, rather than how they are manufactured.
Rather than being restricted by the size, location or performance of the I/O pads in a system on chip or a chiplet, the diodes pick up the optical data from the optical interconnect in the interposer.
“There is no need to innovate on the manufacturing process; with a thermally stable photonic interposer, data is delivered to any point on the die,” David Lazovsky, CEO of Celestial AI, tells eeNews Europe. “The system is optimised for power and latency. Energy is becoming the gate for more AI, and data transmission consumes 55 to 70% of the power, it’s not the compute.”
“This is key to the differentiation of the entire stack,” says Lazovsky. “This allows a high performance ASIC with 100s of W in an existing bridge technology like CoWoS or EMIB.”
The technology is being implemented in ASICs at 4 and 5nm with proven volume, he says. “The beauty is it is directly CMOS compatible. The one difference is one mask for germanium, it’s not capital intensive.”
“This enables us to use the photonic fabric as an architectural building block to deliver data to the point of consumption and to use the interposer for optical interconnect in a multichip module. A ring modulator is too thermally unstable, it’s an exercise in futility. This is because there are two different thermal time domains: fast transients from local heating coupling from the ASIC into the PIC, and the slower ambient variation.”
“We use an EAM with the same materials as conventional CMOS, with photodiodes made from germanium silicon. So we look at the thermal budget of the system to get the maximum temperature and peg the thermal operating window for the wavelengths of interest in the L band.
“The nice thing about the GeSi EAM is the 85 degree operating window, and we use a DC bias to boost that further. Having a thermally stable modulator opens up a number of advantages,” he said.
This also changes the way the chips are designed.
“The right way to design these systems is not to use someone else’s SerDes, so we design the SerDes and TIA, as well as the network convergence layer for protocol compatibility, to make it easy to use. What we end up with is an extremely short distance, 100um, and coupled with the low capacitance of the EAM of 50fF this results in very low power, and the copper is not a transmission line.”
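That 50fF figure is worth a sanity check. Here is a minimal sketch in Python of the classic C·V² dynamic-energy estimate; the ~0.7V drive swing is my assumption, not a figure from the interview:

    # Rough dynamic drive energy for the EAM: energy/bit ~ C * V^2 for a
    # simple NRZ driver. C is quoted (50 fF); the 0.7 V swing is assumed.
    C = 50e-15            # farads, EAM capacitance from the interview
    V = 0.7               # volts, assumed drive swing
    energy_pj = C * V * V / 1e-12
    print(f"~{energy_pj} pJ/bit")   # ~0.025 pJ/bit, close to the Gen 2 on-chip figure quoted later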
The design is purely analog, without the need for digital signal processing to compensate for noise or improve the quality of the signal.
“The inter-symbol interference (ISI) in the system is in the transimpedance amplifier (TIA) which we have also designed in 4nm. We have some of the best analog designers in the industry working with us.”
“The benefit of low noise through the electrical IC eliminates the need for a DSP to clean up the signal, and that is a power and area consumer. We are gated by the control electronics and we don’t have the DSP.”
This leads to a focus on high bandwidth memory systems with HBM3e and HBM4 devices.
The first generation provides bandwidth of 56Gb/s NRZ with four optical channels, but this isn’t limited to one set of interfaces.
“We have four 56G lanes bonded together for a 224Gb/s lane in Gen 1. The only reason to go to WDM is to reduce fibre cost. With Gen 2 we have 8 lanes.”
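A quick back-of-envelope check on those lane numbers, as a Python sketch; the Gen 2 aggregate is inferred by assuming its lanes bond the same way, which the interview doesn’t spell out:

    # Aggregate lane bandwidth from the quoted channel counts and rates.
    GEN1_CHANNELS, GEN1_GBPS = 4, 56     # 56G NRZ, quoted
    GEN2_CHANNELS, GEN2_GBPS = 8, 112    # 112G PAM4, quoted later in the piece
    print(f"Gen 1 bonded lane: {GEN1_CHANNELS * GEN1_GBPS} Gb/s")   # 224 Gb/s, matches the quote
    print(f"Gen 2 bonded lane: {GEN2_CHANNELS * GEN2_GBPS} Gb/s")   # 896 Gb/s, inferred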
The result is highly power efficient, which comes from controlling both ends of the link, he says.
“There is no more requirement for us to use a standard as we have both sides of the link, so that allows us to use the right modulator, and that’s in the L-band and C-band and we have built a war chest of IP.”
“We are working with the major OSATs on the interposer. We are in the process of scaling up capacity. We will consume a huge percentage of silicon photonics capacity in 2025 and 2026.”
“With Gen 1, 2.4pJ/bit is where we are today. Gen 2 at the end of 2025 will have 112Gb/s PAM4 with 8 wavelengths. This is 0.025pJ/bit in the chip and 0.7pJ/bit package to package.”
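To put those per-bit energies in perspective, a small sketch converting them to link power, assuming a fully loaded 224Gb/s lane and ignoring laser and control overheads:

    # Link power = energy/bit x bit rate, for one sustained 224 Gb/s lane.
    RATE = 224e9                                      # bits/s, Gen 1 bonded lane
    print(f"Gen 1: {RATE * 2.4e-12:.2f} W per lane")  # ~0.54 W at 2.4 pJ/bit
    print(f"Gen 2: {RATE * 0.7e-12:.2f} W per lane")  # ~0.16 W at 0.7 pJ/bit, same rate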
“Even for Gen 1 the bandwidth density is a terabit/s per mm squared as we can deliver data anywhere. We use the photonic fabric as an architectural toolbox to provide optical interconnectivity.”
There aren’t that many customers for this level of technology, but there is a lot of customisation.
“Four hyperscalers represent 70% of a $1tn market. As it turns out most customers want the same thing. We have multiple ways to provide connectivity to compute, whether that’s a UCIe or MaxPhy connection, or to interconnect processors chip to chip or point to point to replace PCIe.”
And it’s not just the processing but also the memory sub-system.
“We are building disaggregated memory where all of the switching is done electronically – we have a switch integrated in our memory modules,” he said. “It’s a 5nm ASIC that is integrated into a memory module with an HBM3e controller and PHY, a DDR5 controller and PHY to support up to 8 stacks, the photonic fabric and an 8Tbit/s electronic switch. This provides the ability to deploy the module as a router, or to configure whatever memory access they want: pure DDR, pure HBM, or HBM3 as a write-through cache to DDR. This gives the cost structure of DDR and the latency and bandwidth of HBM3, with 32 and 64 interleaved pseudo channels that can hide the latency.”
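The write-through arrangement he describes is straightforward to picture in code. A toy Python sketch, purely illustrative (the real module does this in a 5nm ASIC, and all the names here are hypothetical):

    # HBM as a write-through cache in front of DDR: writes land in both
    # tiers, reads are served from HBM when present, else filled from DDR.
    class WriteThroughMemory:
        def __init__(self):
            self.hbm = {}   # small, fast tier (stand-in for HBM3)
            self.ddr = {}   # large, cheap tier (stand-in for DDR5)

        def write(self, addr, value):
            self.hbm[addr] = value   # write-through keeps DDR always consistent
            self.ddr[addr] = value

        def read(self, addr):
            if addr in self.hbm:
                return self.hbm[addr]
            value = self.ddr[addr]   # miss: fetch from DDR...
            self.hbm[addr] = value   # ...and fill the HBM tier
            return value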
The module increases memory capacity per chiplet from 36GByte to 1440GByte while reducing the power consumption. “That would be 64.2pJ/bit over [the Nvidia] NVLink vs 6.2pJ/bit for the optical link, it’s a game changer for generative AI,” he said.
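Taking those pJ/bit figures at face value, a one-line estimate of what they mean at scale, assuming a sustained 1Tb/s of memory traffic (my number, not theirs):

    # Interconnect power per terabit/s of traffic, from the quoted figures.
    TBPS = 1e12
    nvlink_w  = 64.2e-12 * TBPS    # ~64 W per Tb/s, quoted for NVLink
    optical_w =  6.2e-12 * TBPS    # ~6 W per Tb/s, quoted for the optical link
    print(f"{nvlink_w:.1f} W vs {optical_w:.1f} W -> ~{nvlink_w / optical_w:.0f}x less link power")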
The ASIC is being built on a 5nm process. “4nm is a good sweet spot. For the memory module we are building it in 5nm as it is a sweet spot for power and cost.”
This approach also future-proofs the data centre resources by decoupling the memory from the compute, as it is the HBM capacity that is the bottleneck for power and cost.
“An Nvidia H100 has 80GB and that is $485/GByte of memory. The cost of memory in our memory appliance is under $10/GByte, so it’s disruptive from a memory, cost and energy standpoint,” he said.
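The arithmetic behind that claim, taking the quoted figures at face value:

    # Memory cost implied by the quote: H100 HBM at $485/GB vs the
    # disaggregated appliance at under $10/GB.
    h100_gb, h100_per_gb, module_per_gb = 80, 485, 10
    print(f"80 GB of HBM on an H100: ${h100_gb * h100_per_gb:,}")    # $38,800
    print(f"80 GB in the appliance:  ${h100_gb * module_per_gb:,}")  # $800, an upper bound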
“That is the guts of the technology. We are working with three to four hyperscalers and we have more demand than we can handle. We will have 120 to 160 people by the end of the year, and we partner with Broadcom, which keeps the hyperscalers comfortable,” he said.
“Going forward what our customers love is the roadmap. We are already using the interposer in the memory module and we have custom design services for the PIC and we can build it or they can build it.”
This has led to significant fundraising and a ‘unicorn’ startup valued in the billions of dollars.
“I have raised $340m through to series C, and in the last 18 months we raised $275m and that will carry us to positive operations on our current plans,” said Lazovsky. “We can use our supply chain relationships, and our customers are investing with us to get our technology designed into their accelerators, we are not funding that, they are, and that’s really important. The value creation is so significant that they are investing with us. The engagement model is just as crucial as the technology.”
“We have built a unicorn already. This is worth 100x, this is a big, big opportunity for an IPO. That’s definitely the way we are looking.”