Future Chips Will Be Hotter Than Ever

Original link: https://spectrum.ieee.org/hot-chips

The ever-increasing transistor density driven by Moore's Law has created serious heat-dissipation problems in modern CPUs and GPUs, affecting performance, power consumption, and chip lifetime. Dennard scaling, which once allowed voltage to drop as transistors shrank, has broken down, causing power density and heat generation to rise sharply. Today's cooling methods will not be enough for the tougher challenges posed by future chip technologies such as nanosheet transistors and CFETs. Proposed solutions range from advanced cooling techniques, such as microfluidic cooling, to system-level adaptations that manage heat dynamically, though these approaches often cost performance. One promising direction is to add functions to the backside of the chip, such as a backside power-delivery network (BSPDN), to reduce voltage and heat; these technologies, however, may also introduce new thermal challenges. The industry needs an interdisciplinary approach that combines new materials, transistor design, system control, packaging, and cooling. System technology co-optimization (STCO) emphasizes a holistic design process that unites systems, physical design, and process technology to effectively tackle the mounting thermal challenge.

A Hacker News thread discussed the IEEE article about future chips running hotter. One commenter, FirmwareBurner, listed past tech products that ran notoriously hot, such as the Pentium 4 and the PS3, implying that none of this is a new problem. TMWNN added the AMD Thunderbird to the list and noted that the cure is sometimes worse than the disease, citing the loud TRS-80 Model II, whose noise caused users physical discomfort. FirmwareBurner then jokingly suggested that modern computers should emulate hard-drive and floppy-drive sounds and add haptic feedback.

Original Article

For over 50 years now, egged on by the seeming inevitability of Moore’s Law, engineers have managed to double the number of transistors they can pack into the same area every two years. But while the industry was chasing logic density, an unwanted side effect became more prominent: heat.

In a system-on-chip (SoC) like today’s CPUs and GPUs, temperature affects performance, power consumption, and energy efficiency. Over time, excessive heat can slow the propagation of critical signals in a processor and lead to a permanent degradation of a chip’s performance. It also causes transistors to leak more current and as a result waste power. In turn, the increased power consumption cripples the energy efficiency of the chip, as more and more energy is required to perform the exact same tasks.

The root of the problem lies with the end of another law: Dennard scaling. This law states that as the linear dimensions of transistors shrink, voltage should decrease such that the total power consumption for a given area remains constant. Dennard scaling effectively ended in the mid-2000s at the point where any further reductions in voltage were not feasible without compromising the overall functionality of transistors. Consequently, while the density of logic circuits continued to grow, power density did as well, generating heat as a by-product.

As chips become increasingly compact and powerful, efficient heat dissipation will be crucial to maintaining their performance and longevity. To ensure this efficiency, we need a tool that can predict how new semiconductor technology—processes to make transistors, interconnects, and logic cells—changes the way heat is generated and removed. My research colleagues and I at Imec have developed just that. Our simulation framework uses industry-standard and open-source electronic design automation (EDA) tools, augmented with our in-house tool set, to rapidly explore the interaction between semiconductor technology and the systems built with it.

The results so far are inescapable: The thermal challenge is growing with each new technology node, and we’ll need new solutions, including new ways of designing chips and systems, if there’s any hope that they’ll be able to handle the heat.

The Limits of Cooling

Traditionally, an SoC is cooled by blowing air over a heat sink attached to its package. Some data centers have begun using liquid instead because it can absorb more heat than gas. Liquid coolants—typically water or a water-based mixture—may work well enough for the latest generation of high-performance chips such as Nvidia’s new AI GPUs, which reportedly consume an astounding 1,000 watts. But neither fans nor liquid coolers will be a match for the smaller-node technologies coming down the pipeline.

[Figure] Heat follows a complex path as it’s removed from a chip, but 95 percent of it exits through the heat sink. Imec

Take, for instance, nanosheet transistors and complementary field-effect transistors (CFETs). Leading chip manufacturers are already shifting to nanosheet devices, which swap the fin in today’s fin field-effect transistors for a stack of horizontal sheets of semiconductor. CFETs take that architecture to the extreme, vertically stacking more sheets and dividing them into two devices, thus placing two transistors in about the same footprint as one. Experts expect the semiconductor industry to introduce CFETs in the 2030s.

In our work, we looked at an upcoming version of the nanosheet called A10 (referring to a node of 10 angstroms, or 1 nanometer) and a version of the CFET called A5, which Imec projects will appear two generations after the A10. Simulations of our test designs showed that the power density in the A5 node is 12 to 15 percent higher than in the A10 node. This increased density will, in turn, lead to a projected temperature rise of 9 °C for the same operating voltage.
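
To put that number in perspective, a first-order linear thermal model (delta-T = R_th × P, in which the temperature rise above the coolant scales with power) is enough to recover it. In the sketch below, the baseline temperature rise is an illustrative assumption, not a value from our simulations:

```python
# A first-order check on the projected 9 °C: with a linear thermal model
# (delta_T = R_th * P), the extra rise scales with the power-density bump.
# The baseline rise is an illustrative assumption, not a simulated value.
BASELINE_RISE = 65.0  # deg C above coolant at the A10 node (assumed)

for density_increase in (0.12, 0.15):  # A5 vs. A10 power density, per the text
    extra = BASELINE_RISE * density_increase
    print(f"+{density_increase:.0%} power density -> about {extra:.1f} deg C more")
```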

[Figure] Complementary field-effect transistors will stack nanosheet transistors atop each other, increasing density and temperature. To operate at the same temperature as nanosheet transistors (A10 node), CFETs (A5 node) will have to run at a reduced voltage. Imec

Nine degrees might not seem like much. But in a data center, where hundreds of thousands to millions of chips are packed together, it can mean the difference between stable operation and thermal runaway—that dreaded feedback loop in which rising temperature increases leakage power, which increases temperature, which increases leakage power, and so on until, eventually, safety mechanisms must shut down the hardware to avoid permanent damage.
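
A toy model makes the feedback concrete. The sketch below iterates a temperature-dependent leakage model to a steady state; every constant in it is an illustrative assumption, not a measurement. Past a certain junction-to-ambient thermal resistance, there is no stable operating point at all:

```python
import math

T_AMBIENT = 45.0   # deg C, effective ambient at the heat sink (assumed)
P_DYNAMIC = 100.0  # W, switching power, taken as temperature-independent (assumed)
P_LEAK_REF = 20.0  # W, leakage at the reference temperature (assumed)
T_REF = 60.0       # deg C, reference temperature of the leakage model (assumed)
K_LEAK = 40.0      # deg C, sets how fast leakage grows with temperature (assumed)
T_MAX = 150.0      # deg C, junction limit; beyond this we call it runaway

def steady_state(r_th: float) -> float | None:
    """Iterate the power-temperature feedback; None means thermal runaway."""
    t = T_AMBIENT
    for _ in range(10_000):
        p_leak = P_LEAK_REF * math.exp((t - T_REF) / K_LEAK)  # leakage rises with T
        t_next = T_AMBIENT + r_th * (P_DYNAMIC + p_leak)      # T rises with power
        if t_next > T_MAX:
            return None            # no stable point: the loop feeds on itself
        if abs(t_next - t) < 1e-6:
            return t_next          # converged to a stable operating temperature
        t = t_next
    return t

for r_th in (0.20, 0.35, 0.50):    # junction-to-ambient resistance, deg C per W
    t = steady_state(r_th)
    print(f"R_th = {r_th:.2f} C/W ->",
          f"{t:.1f} deg C" if t is not None else "thermal runaway")
```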

Researchers are pursuing advanced alternatives to basic liquid and air cooling that may help mitigate this kind of extreme heat. Microfluidic cooling, for instance, uses tiny channels etched into a chip to circulate a liquid coolant inside the device. Other approaches include jet impingement, which involves spraying a gas or liquid at high velocity onto the chip’s surface, and immersion cooling, in which the entire printed circuit board is dunked in the coolant bath.

But even if these newer techniques come into play, relying solely on coolers to dissipate the extra heat will likely be impractical. That’s especially true for mobile systems, which are limited by size, weight, battery power, and the need to not cook their users. Data centers, meanwhile, face a different constraint: Because cooling is a building-wide infrastructure expense, it would cost too much and be too disruptive to update the cooling setup every time a new chip arrives.

Performance Versus Heat

Luckily, cooling technology isn’t the only way to stop chips from frying. A variety of system-level solutions can keep heat in check by dynamically adapting to changing thermal conditions.

One approach places thermal sensors around a chip. When the sensors detect a worrying rise in temperature, they signal a reduction in operating voltage and frequency—and thus power consumption—to counteract heating. But while such a scheme solves thermal issues, it might noticeably affect the chip’s performance. For example, the chip might always work poorly in hot environments, as anyone who’s ever left their smartphone in the sun can attest.
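
The sketch below shows the shape of such a scheme as a generic dynamic voltage and frequency scaling (DVFS) loop. It is not any vendor's governor; the operating points, thermal constants, and thresholds are hypothetical. Once the modeled temperature crosses the hot threshold, the controller steps down to a lower voltage-frequency pair and holds there:

```python
# A generic thermal-throttling (DVFS) loop. Operating points, thermal
# constants, and thresholds are hypothetical values for illustration.
OPP = [(0.65, 1.0), (0.75, 1.8), (0.85, 2.6), (0.95, 3.2)]  # (V, GHz) pairs
C_EFF = 25.0                       # nF, effective switched capacitance (assumed)
T_AMB, R_TH, TAU = 40.0, 0.8, 5.0  # deg C, deg C per W, seconds (assumed)
T_HOT, T_COOL = 95.0, 75.0         # throttle / recover thresholds (assumed)

def power(v: float, f_ghz: float) -> float:
    return C_EFF * v * v * f_ghz   # dynamic power P = C_eff * V^2 * f, in watts

temp, level, dt = T_AMB, len(OPP) - 1, 0.1  # start cold, at full speed
for step in range(600):                     # simulate 60 seconds
    v, f = OPP[level]
    # First-order thermal model: temperature relaxes toward its steady state.
    temp += dt / TAU * ((T_AMB + R_TH * power(v, f)) - temp)
    if temp > T_HOT and level > 0:
        level -= 1                 # sensor trips: drop voltage and frequency
    elif temp < T_COOL and level < len(OPP) - 1:
        level += 1                 # thermal headroom: restore performance
    if step % 100 == 0:
        print(f"t={step * dt:4.1f} s  T={temp:5.1f} C  V={v:.2f} V  f={f:.1f} GHz")
```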

Another approach, called thermal sprinting, is especially useful for multicore data-center CPUs. It is done by running a core until it overheats and then shifting operations to a second core while the first one cools down. This process maximizes the performance of a single thread, but it can cause delays when work must migrate between many cores for longer tasks. Thermal sprinting also reduces a chip’s overall throughput, as some portion of it will always be disabled while it cools.
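
A toy single-thread simulation, sketched below, captures the rhythm of sprinting; the heating rate, cooling time constant, and temperature limit are assumptions chosen only for illustration:

```python
# A toy simulation of thermal sprinting for one long-running thread.
# Heating rate, cooling time constant, and limit are illustrative assumptions.
N_CORES, T_AMB = 4, 40.0
T_LIMIT = 90.0   # deg C, migrate when the active core reaches this (assumed)
HEAT_RATE = 8.0  # deg C per second while a core is active (assumed)
TAU_COOL = 6.0   # seconds, cooling time constant for idle cores (assumed)
DT = 0.1         # simulation step, seconds

temps = [T_AMB] * N_CORES
active, migrations = 0, 0
for _ in range(1200):              # simulate 120 seconds of one long task
    for core in range(N_CORES):
        if core == active:
            temps[core] += HEAT_RATE * DT                         # active core heats
        else:
            temps[core] += (T_AMB - temps[core]) * DT / TAU_COOL  # idle cores cool
    if temps[active] >= T_LIMIT:
        active = min(range(N_CORES), key=temps.__getitem__)  # hop to coolest core
        migrations += 1            # each hop stalls the thread in practice

print(f"{migrations} migrations in 120 s; final core temps (deg C):",
      ", ".join(f"{t:.0f}" for t in temps))
```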

System-level solutions thus require a careful balancing act between heat and performance. To apply them effectively, SoC designers must have a comprehensive understanding of how power is distributed on a chip and where hot spots occur, where sensors should be placed and when they should trigger a voltage or frequency reduction, and how long it takes parts of the chip to cool off. Even the best chip designers, though, will soon need even more creative ways of managing heat.

Making Use of a Chip’s Backside

A promising pursuit involves adding new functions to the underside, or backside, of a wafer. This strategy mainly aims to improve power delivery and computational performance. But it might also help resolve some heat problems.

[Figure] New technologies can reduce the voltage that needs to be delivered to a multicore processor so that the chip maintains a minimum voltage while operating at an acceptable frequency. A backside power-delivery network does this by reducing resistance. Backside capacitors lower transient voltage losses. Backside integrated voltage regulators allow different cores to operate at different minimum voltages as needed. Imec

Imec foresees several backside technologies that may allow chips to operate at lower voltages, decreasing the amount of heat they generate. The first technology on the road map is the so-called backside power-delivery network (BSPDN), which does precisely what it sounds like: It moves power lines from the front of a chip to the back. All the advanced CMOS foundries plan to offer BSPDNs by the end of 2026. Early demonstrations show that they lessen resistance by bringing the power supply much closer to the transistors. Less resistance results in less voltage loss, which means the chip can run at a reduced input voltage. And when voltage is reduced, power density drops—and so, in turn, does temperature.
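
A back-of-the-envelope calculation illustrates the chain of effects: the supply must deliver the transistors' minimum voltage plus the IR drop across the delivery network, so cutting the network's resistance directly lowers the input voltage and, with it, the power. The resistances and load current below are illustrative assumptions, not foundry data:

```python
# Back-of-the-envelope IR-drop comparison. The network resistances and the
# load current are illustrative assumptions, not foundry data.
V_MIN = 0.55   # V, minimum voltage the transistors need (assumed)
I_LOAD = 80.0  # A, current drawn by the core region (assumed)

for label, r_pdn in [("frontside PDN", 1.5e-3),   # ohms (assumed)
                     ("backside PDN ", 0.5e-3)]:  # ohms (assumed)
    v_in = V_MIN + I_LOAD * r_pdn  # supply must cover V_min plus the IR drop
    p_in = v_in * I_LOAD           # power drawn at that input voltage
    print(f"{label}: R = {r_pdn * 1e3:.1f} mOhm -> "
          f"V_in = {v_in:.3f} V, P = {p_in:.1f} W")
```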

[Figure] By changing the materials within the path of heat removal, backside power-delivery technology could make hot spots on chips even hotter. Imec

After BSPDNs, manufacturers will likely begin adding capacitors with high energy-storage capacity to the backside as well. Large voltage swings caused by inductance in the printed circuit board and chip package can be particularly problematic in high-performance SoCs. Backside capacitors should help with this issue because their closer proximity to the transistors allows them to absorb voltage spikes and fluctuations more quickly. This arrangement would therefore enable chips to run at an even lower voltage—and temperature—than with BSPDNs alone.
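
The benefit can be estimated with the usual transient-droop relation, V = L × dI/dt: the closer the capacitance sits to the transistors, the smaller the current loop's inductance, and the smaller the voltage guard band the chip must carry. The load step, ramp time, and inductances below are illustrative assumptions:

```python
# Transient-droop estimate V = L * dI/dt. The load step, ramp time, and
# loop inductances are illustrative assumptions.
DI = 20.0  # A, sudden load step when a core wakes up (assumed)
DT = 5e-9  # s, how fast the current ramps (assumed)

for label, loop_l in [("package-level capacitors", 50e-12),  # 50 pH (assumed)
                      ("backside capacitors     ", 5e-12)]:  #  5 pH (assumed)
    droop = loop_l * DI / DT       # V = L * dI/dt
    print(f"{label}: L = {loop_l * 1e12:2.0f} pH -> droop ~ {droop * 1e3:3.0f} mV")
```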

Finally, chipmakers will introduce backside integrated voltage-regulator (IVR) circuits. This technology aims to curtail a chip’s voltage requirements further still through finer voltage tuning. An SoC for a smartphone, for example, commonly has 8 or more compute cores, but there’s no space on the chip for each to have its own discrete voltage regulator. Instead, one off-chip regulator typically manages the voltage of four cores together, regardless of whether all four are facing the same computational load. IVRs, on the other hand, would manage each core individually through a dedicated circuit, thereby improving energy efficiency. Placing them on the backside would save valuable space on the frontside.
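
A toy calculation shows where the saving comes from: a shared rail must be set to the voltage the neediest core requires, while per-core IVRs give each core only what its own load demands. The voltages, frequencies, and the simple P = C_eff × V² × f power model below are assumptions, and regulator conversion losses are ignored:

```python
# Why per-core IVRs save power: a shared rail serves the worst-case core,
# while IVRs match each core's own requirement. All values are assumptions,
# and regulator conversion losses are ignored.
core_vmin = [0.90, 0.60, 0.60, 0.55]  # V needed per core: one busy, three light
FREQ_GHZ = [3.0, 1.2, 1.2, 1.0]       # matching clock per core (assumed)
C_EFF = 10.0                          # nF, effective capacitance (assumed)

def core_power(v: float, f: float) -> float:
    return C_EFF * v * v * f          # P = C_eff * V^2 * f, in watts

shared_v = max(core_vmin)             # one rail: everyone gets the worst case
p_shared = sum(core_power(shared_v, f) for f in FREQ_GHZ)
p_ivr = sum(core_power(v, f) for v, f in zip(core_vmin, FREQ_GHZ))

print(f"shared rail at {shared_v:.2f} V: {p_shared:.1f} W")
print(f"per-core IVRs:              {p_ivr:.1f} W "
      f"({100 * (1 - p_ivr / p_shared):.0f}% less)")
```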

It is still unclear how backside technologies will affect heat management; demonstrations and simulations are needed to chart the effects. Adding new technology will often increase power density, and chip designers will need to consider the thermal consequences. In placing backside IVRs, for instance, will thermal issues improve if the IVRs are evenly distributed or if they are concentrated in specific areas, such as the center of each core and memory cache?

Recently, we showed that backside power delivery may introduce new thermal problems even as it solves old ones. The cause is the vanishingly thin layer of silicon that’s left when BSPDNs are created. In a frontside design, the silicon substrate can be as thick as 750 micrometers. Because silicon conducts heat well, this relatively bulky layer helps control hot spots by spreading heat from the transistors laterally. Adding backside technologies, however, requires thinning the substrate to about 1 micrometer to provide access to the transistors from the back. Sandwiched between two layers of wires and insulators, this slim silicon slice can no longer move heat effectively toward the sides. As a result, heat from hyperactive transistors can get trapped locally and forced upward toward the cooler, exacerbating hot spots.
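
A toy two-path model captures why this happens: heat from a hot spot can either go straight up through the stack or spread laterally through the silicon and exit over a wider area, and the lateral path's conductance scales with the silicon thickness. The resistances below are rough assumptions chosen to show the trend, not to reproduce the simulated result reported next:

```python
# A toy two-path hot-spot model. Heat leaves either straight up through the
# stack (R_VERT) or laterally through the silicon and then up over a wider
# area (r_lat + R_NEIGHBOR). All values are rough assumptions.
K_SI = 150.0       # W/(m*K), thermal conductivity of silicon
SPOT = 100e-6      # m, hot-spot width, modeled as a square (assumed)
R_VERT = 40.0      # K/W, vertical path directly above the hot spot (assumed)
R_NEIGHBOR = 10.0  # K/W, vertical path after heat has spread sideways (assumed)
P_SPOT = 1.5       # W, power dissipated in the hot spot (assumed)

for label, t_si in [("frontside, 750 um silicon", 750e-6),
                    ("backside,    1 um silicon", 1e-6)]:
    # Lateral resistance of a slab one spot-width long: R = L / (k * A).
    r_lat = SPOT / (K_SI * t_si * SPOT)
    # The two escape paths act in parallel.
    r_total = 1 / (1 / R_VERT + 1 / (r_lat + R_NEIGHBOR))
    print(f"{label}: R_lateral = {r_lat:7.1f} K/W -> "
          f"hot-spot rise = {P_SPOT * r_total:4.1f} K")
```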

Our simulation of an 80-core server SoC found that BSPDNs can raise hot-spot temperatures by as much as 14 °C. Design and technology tweaks—such as increasing the density of the metal on the backside—can improve the situation, but we will need more mitigation strategies to avoid it completely.

Preparing for “CMOS 2.0”

BSPDNs are part of a new paradigm of silicon logic technology that Imec is calling CMOS 2.0. This emerging era will also see advanced transistor architectures and specialized logic layers. The main purpose of these technologies is optimizing chip performance and power efficiency, but they might also offer thermal advantages, including improved heat dissipation.

In today’s CMOS chips, a single transistor drives signals to both nearby and faraway components, leading to inefficiencies. But what if there were two drive layers? One layer would handle long wires and buffer these connections with specialized transistors; the other would deal only with connections under 10 micrometers. Because the transistors in this second layer would be optimized for short connections, they could operate at a lower voltage, which again would reduce power density. How much, though, is still uncertain.
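
Whatever the final numbers turn out to be, the leverage comes from the quadratic dependence of dynamic power on voltage. As an illustration only, with both voltages assumed rather than projected:

```python
# Illustrative only: dynamic CMOS power scales as V^2 at a fixed frequency,
# so a lower-voltage drive layer for short connections saves power
# quadratically. Both voltages are assumptions, not projected node values.
V_STANDARD, V_SHORT = 0.70, 0.50  # volts (assumed)
ratio = (V_SHORT / V_STANDARD) ** 2
print(f"short-connection layer at {V_SHORT} V uses "
      f"{ratio:.0%} of the dynamic power it would at {V_STANDARD} V")
```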

[Figure] In the future, parts of chips will be made on their own silicon wafers using the appropriate process technology for each. They will then be 3D stacked to form SoCs that function better than those built using only one process technology. But engineers will have to carefully consider how heat flows through these new 3D structures. Imec

What is clear is that solving the industry’s heat problem will be an interdisciplinary effort. It’s unlikely that any one technology alone—whether that’s thermal-interface materials, transistors, system-control schemes, packaging, or coolers—will fix future chips’ thermal issues. We will need all of them. And with good simulation tools and analysis, we can begin to understand how much of each approach to apply and on what timeline. Although the thermal benefits of CMOS 2.0 technologies—specifically, backside functionalization and specialized logic—look promising, we will need to confirm these early projections and study the implications carefully. With backside technologies, for instance, we will need to know precisely how they alter heat generation and dissipation—and whether that creates more new problems than it solves.

Chip designers might be tempted to adopt new semiconductor technologies assuming that unforeseen heat issues can be handled later in software. That may be true, but only to an extent. Relying too heavily on software solutions would have a detrimental impact on a chip’s performance because these solutions are inherently imprecise. Fixing a single hot spot, for example, might require reducing the performance of a larger area that is otherwise not overheating. It will therefore be imperative that SoCs and the semiconductor technologies used to build them are designed hand in hand.

The good news is that more EDA products are adding features for advanced thermal analysis, including during early stages of chip design. Experts are also calling for a new method of chip development called system technology co-optimization. STCO aims to dissolve the rigid abstraction boundaries between systems, physical design, and process technology by considering them holistically. Deep specialists will need to reach outside their comfort zone to work with experts in other chip-engineering domains. We may not yet know precisely how to resolve the industry’s mounting thermal challenge, but we are optimistic that, with the right tools and collaborations, it can be done.
