Introduction
With the release of the NVIDIA Blackwell GPUs and RDNA 4-based Radeon 9000-series GPUs, we finally have consumer video cards that support the PCIe 5.0 standard. Although motherboards have supported it for some time now, storage was the only class of device that took advantage of it. This raises the question: What impact does PCIe 5.0’s increased bandwidth have on GPU performance in content creation applications?
PCI Express (abbreviated PCIe or PCI-e) is a technology used to connect various internal computer devices to the motherboard. The physical connectors and communication schema are used for drives, GPUs, and add-in cards like RAID or HBA cards and network cards. Since 2003, we have seen a variety of revisions and updates to the standard. Currently, the most common PCIe specification seen on new high-end motherboards is PCIe 5.0 x16, though often with some 4.0 lanes available.
The primary difference between PCI Express versions is transfer rate. A PCIe connection between devices has two defining features: the number of lanes and the PCIe version. Most slots on the motherboard have between four and sixteen lanes (x4, x8, or x16), with the occasional x1 or x2 slot. Each of these lanes has a maximum transfer rate, defined by the PCIe version. Since PCIe 3.0, each new version has doubled this transfer rate.
As an example, PCIe 5.0 supports up to 32 GT/s per lane. So, an x16 slot has 16 lanes, each at 32 GT/s, for a nominal maximum throughput of roughly 64 GB/s in each direction. If that same slot were using the PCIe 4.0 protocol, it would have 16 lanes at 16 GT/s for a throughput of up to 32 GB/s. Alternatively, you could achieve 32 GB/s with eight lanes (x8) at PCIe 5.0.
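The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are our own, and the "effective" figure assumes the 128b/130b line encoding used by PCIe 3.0 through 5.0, which is why real-world numbers land slightly under the round values quoted in this article.

```python
# Per-lane transfer rates in GT/s for the PCIe versions discussed in this article.
TRANSFER_RATE_GT_S = {"3.0": 8, "4.0": 16, "5.0": 32}

# PCIe 3.0 through 5.0 use 128b/130b encoding: 128 payload bits per 130 bits sent.
ENCODING_EFFICIENCY = 128 / 130


def nominal_bandwidth_gb_s(version: str, lanes: int) -> float:
    """One-direction bandwidth in GB/s, ignoring encoding overhead.

    Each transfer moves one bit per lane, so GT/s * lanes gives Gb/s,
    and dividing by 8 converts bits to bytes.
    """
    return TRANSFER_RATE_GT_S[version] * lanes / 8


def effective_bandwidth_gb_s(version: str, lanes: int) -> float:
    """The same figure after accounting for 128b/130b line encoding."""
    return nominal_bandwidth_gb_s(version, lanes) * ENCODING_EFFICIENCY


if __name__ == "__main__":
    for version, lanes in [("5.0", 16), ("4.0", 16), ("5.0", 8)]:
        print(
            f"PCIe {version} x{lanes}: "
            f"{nominal_bandwidth_gb_s(version, lanes):.0f} GB/s nominal, "
            f"{effective_bandwidth_gb_s(version, lanes):.1f} GB/s after encoding"
        )
```

Halving the lane count and doubling the per-lane rate cancel out, which is why 4.0 x16 and 5.0 x8 land on the same figure.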
At present, consumer desktop motherboards tend to feature limited “free” PCIe lanes. We tend to be disappointed with the quantity, connectivity, and placement of PCIe slots on motherboards. Although it depends on the specific price point being targeted, many boards will have a primary 5.0 x16 slot, and then only a few other slots, typically at 4.0 x4 or even 3.0 x1. The PCIe slot layout and support is one reason we like the ASUS ProArt boards we often carry.
However, this isn’t merely motherboard vendors being cheap. Instead of maximizing add-in card support, they are typically dedicating many of the available PCIe lanes (from the CPU or chipset) to additional features like M.2 slots, USB ports, and Ethernet/WiFi. The drawback to this is that once a GPU is installed, there may be no way to add a second at full bandwidth (if at all). Even our preferred boards require that the GPU be run at x8 if we want to install most add-in cards or additional GPUs.
Given that add-in cards (GPUs or otherwise) may need to run at lower-than-maximum bandwidth due to PCIe lane availability concerns, it is reasonable to ask what the cost is. How much performance is lost when video cards are operated at less than their maximum PCIe bandwidth?
An Illustration of the PCIe Slot Problem
[Image gallery: Gigabyte X870 Aorus Elite, ASUS TUF Gaming Z890-Plus, Gigabyte Z890 Aorus Elite, ASUS ProArt X870E Creator]
MSI MAG X870 TOMAHAWK WIFI | GIGABYTE X870 AORUS ELITE WIFI7 | ASUS ROG STRIX X870E-E GAMING WIFI | MSI B650 GAMING PLUS WIFI |
---|---|---|---|
5.0 x16, 3.0 x1 (x16 Form Factor), 4.0 x4 (x16 FF) | 5.0 x16, 4.0 x4 (x16 FF), 3.0 x2 (x16 FF) | 5.0 x16, 4.0 x4 (x16 FF) | 4.0 x16, 3.0 x1, 4.0 x4 (x16 FF) |

ASUS TUF GAMING Z890-PLUS WIFI | GIGABYTE Z890 AORUS ELITE WIFI7 | ASUS ProArt Z890-CREATOR WIFI | ASUS ProArt X870E-CREATOR WIFI |
---|---|---|---|
5.0 x16, 4.0 x1, 4.0 x1, 4.0 x4, 4.0 x4 (x16 FF) | 5.0 x16, 4.0 x4 (x16 FF), 4.0 x4 (x16 FF) | 5.0 x16, 5.0 x8 (reduces above to x8), 4.0 x4 (x16 FF) | 5.0 x16, 5.0 x8 (reduces above to x8), 4.0 x4 (x16 FF) |
To illustrate the current difficulties of multiple PCIe devices in modern consumer motherboards, we grabbed a handful of the best-selling AM5 and LGA 1851 compatible motherboards on Newegg, alongside our preferred ASUS ProArt X870 and Z890 Creator boards. Most of these aren’t the cheapest options available, but they also aren’t the most expensive. They are, arguably, the most popular for new PC builds, though.
The first thing we notice when looking at these is that, save for the ASUS TUF board, none have more than three PCIe expansion slots. Beyond the primary x16 slot, none of the remaining slots are faster than 4.0 x4, save for those on the ProArt boards. On many boards, one of the slots is even slower, at 3.0 x1 or x2. (Note that, while physically x16 in length, most of the non-primary slots are only electrically wired for x4 or fewer lanes.) The TUF board does offer a bevy of 4.0 x4 and 4.0 x1 slots, while the ProArts can do x16 in either of the top two slots, though both are limited to x8 when both are in use.
Of course, not everyone needs tons of add-in cards. For many users, a single GPU is the only add-in card they’ll use. But we have found that professionals frequently require a GPU plus at least one other add-in card. Based on this, we think the primary bandwidths to keep an eye on in the upcoming results are 5.0 x16, 5.0 x8, and 4.0 x4. For those on older motherboards considering a GPU upgrade, 3.0 x16 and x8 are also likely relevant.
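To put numbers on slot labels like the ones in the tables above, here is a minimal Python sketch that parses a label such as "4.0 x4 (x16 FF)" and returns its nominal bandwidth. The helper name and the regular expression are our own illustration, and the result ignores encoding overhead.

```python
import re

# Per-lane transfer rates in GT/s for the PCIe versions covered in this article.
TRANSFER_RATE_GT_S = {"3.0": 8, "4.0": 16, "5.0": 32}

# Matches the leading "<version> x<lanes>" part of a slot label;
# anything after it, such as "(x16 FF)", is ignored.
SLOT_PATTERN = re.compile(r"(\d\.\d)\s*x(\d+)")


def slot_bandwidth_gb_s(label: str) -> float:
    """Nominal one-direction bandwidth in GB/s for a label like '5.0 x8'."""
    match = SLOT_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"Unrecognized slot label: {label!r}")
    version, lanes = match.group(1), int(match.group(2))
    if version not in TRANSFER_RATE_GT_S:
        raise ValueError(f"Unsupported PCIe version: {version}")
    return TRANSFER_RATE_GT_S[version] * lanes / 8


if __name__ == "__main__":
    # The configurations called out as most relevant in the text.
    for label in ["5.0 x16", "5.0 x8", "4.0 x4", "3.0 x16", "3.0 x8"]:
        print(f"{label}: {slot_bandwidth_gb_s(label):.0f} GB/s")
```

Run against those five configurations, the 4.0 x4 secondary slot common on consumer boards comes out at the same 8 GB/s as 3.0 x8, an eighth of a full 5.0 x16 slot.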
Testing
Raw Results Tables
We choose our benchmarks to cover many workflows and tasks to provide a balanced look at the application and its hardware interactions. However, many users have more specialized workflows. Recognizing this, we like to provide individual results for benchmarks as well. If a specific area in an application comprises most of your work, examining those results will give a more accurate understanding of the performance disparities between components. Otherwise, we recommend skipping over this section and focusing on our more in-depth analysis in the following sections.
[Raw results tables, one per benchmark: After Effects, DaVinci Resolve, Unreal Engine, Rendering, LLM (llama)]
In the charts below, we have color-coded the bars by bandwidth, so that all configurations with the same bandwidth share a color. For example, 64 GB/s (5.0 x16) is dark green, while the 16 GB/s configurations (5.0 x4, 4.0 x8, and 3.0 x16) are dark blue.
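The bandwidth groupings behind that color-coding can be reproduced with a short Python sketch. The names here are hypothetical, and the lane counts assume the nine configurations discussed in this article (x16, x8, and x4 at PCIe 3.0, 4.0, and 5.0).

```python
from collections import defaultdict

# Per-lane transfer rates in GT/s, listed newest first so each group
# reads from the newest PCIe version to the oldest.
TRANSFER_RATE_GT_S = {"5.0": 32, "4.0": 16, "3.0": 8}
LANE_COUNTS = (16, 8, 4)


def group_by_bandwidth() -> dict[int, list[str]]:
    """Map nominal bandwidth in GB/s to the configurations that provide it."""
    groups = defaultdict(list)
    for version, rate in TRANSFER_RATE_GT_S.items():
        for lanes in LANE_COUNTS:
            # GT/s * lanes is Gb/s; integer-divide by 8 for GB/s.
            groups[rate * lanes // 8].append(f"{version} x{lanes}")
    # Order the result from highest bandwidth to lowest.
    return dict(sorted(groups.items(), reverse=True))


if __name__ == "__main__":
    for bandwidth, configs in group_by_bandwidth().items():
        print(f"{bandwidth:>2} GB/s: {', '.join(configs)}")
```

The nine configurations collapse into five bandwidth tiers (64, 32, 16, 8, and 4 GB/s), which is why several bars in each chart share a color.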
Video Editing / Motion Graphics: DaVinci Resolve Studio & After Effects
In both DaVinci Resolve and After Effects, we only included the “Overall” scores. This is because we saw little difference in the overall performance trends when we separated them by workflow. However, we have those in the raw results tables above if you want to see the specific performance scores for various workflows, such as 3D in After Effects or Intraframe media in Resolve.
Starting with DaVinci Resolve (Chart #1), we found that GPU PCIe bandwidth does noticeably affect overall performance. At the high end of the bandwidth spectrum, we see relatively similar performance from PCIe 5.0 x16, 5.0 x8, and 4.0 x16. We technically have the 5.0 x16 result ahead, but it is within what we would consider the margin of error for this type of testing. After those three, the next grouping is all the 16 GB/s combinations: 5.0 x4, 4.0 x8, and 3.0 x16. This cluster is about 90% as performant as the prior one. We don’t love a 10% performance reduction just from using a slower slot, but it is often acceptable. However, the next tier down isn’t: 3.0 x8 and 4.0 x4 were only 75% as fast as the full-bandwidth (5.0 x16) result. The slowest option, 3.0 x4, fared even worse, with only 54% of the full-bandwidth performance. While running a GPU in any of those combinations is likely rare, we definitely recommend against configuring a GPU at these bandwidths for DaVinci Resolve.
We see less overall effect in After Effects (Chart #2). Visually, unlike DaVinci Resolve, the bars are less clustered by color, and there is less of a stair-step pattern. The slowest three bandwidths are the slowest three results, though. Here, the results for 64 GB/s to 16 GB/s are all within the margin of error, essentially random. Once we drop to 8 GB/s with 3.0 x8, we are outside that margin (though only with respect to the grouping). Also at 8 GB/s, 4.0 x4 is slower than the higher-bandwidth results. Finally, 3.0 x4 is 10% slower than the configurations at 16 GB/s or greater. Our recommendation would be to worry less about PCIe bandwidth in After Effects, but to try to avoid a very low-bandwidth configuration like 3.0 x4.
Game Dev / Virtual Production: Unreal Engine
Our Unreal Engine benchmark results appear to fall somewhere between DaVinci Resolve and After Effects. Like the former, there is clear clustering of the bandwidths, but, like After Effects, there are not many distinct “steps.” 5.0 x16, x8, and x4, as well as 4.0 x16 and x8, and 3.0 x16, are all functionally identical. 3.0 x16 looks like it may be a touch slower than the rest, but it is just within the margin of error for this testing. However, we do see results outside of that margin for the lower bandwidths. 4.0 x4 and 3.0 x8 are 93% as fast as the 64 GB/s result, and 3.0 x4 trails with 90% of the performance.
Overall, none of these are huge differences in performance. As we discussed above, while a 10% performance hit isn’t great, it is also acceptable in some cases. We would urge caution when dropping a GPU to 4.0 x4 or below, but it may be a tradeoff worth making for multi-GPU or to facilitate add-in cards.
GPU Rendering: Blender & Octane
For this article, we tested with three rendering benchmarks: V-Ray, Blender, and Octane. However, our V-Ray results seemed particularly anomalous, so we haven’t included them in the charts, though they are in the results table above. In Blender and Octane, we see essentially no effect of bandwidth on performance. In the case of Blender, the total change from average is about 5%, while Octane is 2.5%. All the results are largely within the margin of error, and we can’t draw many conclusions. In this case, that means there is likely no effect. This makes sense as the scenes are all contained within GPU VRAM, and the loading time isn’t counted. Overall, there seems to be little to no downside to installing a GPU in a reduced-bandwidth situation for offline rendering applications.
AI: LLM (llama)
Finally, our llama.cpp benchmark looks at GPU performance in prompt processing and token generation. For both workflows, the results seem effectively random, with no discernible pattern. The overall difference in performance is also fairly small, about 6% for prompt processing. Due to this, we would generally say that bandwidth has little effect on AI performance. However, we would caution that our LLM benchmark is very small, and LLM setups frequently involve multiple GPUs or offloading some of the model to system RAM. In either of these cases, we expect that PCIe bandwidth could have a large effect on overall performance.
Does GPU PCIe Bandwidth Affect Content Creation Performance?
On modern motherboards, you often only get one PCIe slot at a full 5.0 x16 bandwidth. Additional slots may be 5.0 x8, but are likely much lower, at 4.0 x4 or below. Because of this, multi-GPU setups or configurations with add-in cards may find one or more GPUs with dramatically reduced PCIe bandwidth. Although most of the workflows we tested don’t show too much performance loss at 4.0 x4, that’s not true across the board.
In video editing/motion graphics, we saw the largest impact. PCIe 5.0 x16, x8, and 4.0 x16 were functionally equivalent. However, below that, we started to see some differences, especially in DaVinci Resolve. In that application, 3.0 x16 was 10% slower, and our typical-case 4.0 x4 was about 25% slower. These margins are reduced in After Effects, but still present. We recommend caution when configuring a system for video editing applications with multiple add-in cards, as reducing the number of lanes available to the GPU can have a measurable impact on performance.
Our Unreal Engine benchmark also showed performance impacts from PCIe bandwidth. However, the impacts are more minor. We only saw a noticeable hit once the bandwidth was reduced to 4.0 x4 (or equivalent), with an average fps drop of 7%. 3.0 x4 was slightly worse, at 10% slower than maximum bandwidth. While we are less concerned about this amount of lost performance, it should still be kept in mind.
Offline renderers and LLM benchmarks showed no impact from PCIe bandwidth on performance. This makes sense as both tend to load their work fully into GPU VRAM and crash if they can’t. There are some exceptions to this with LLMs, but operating out of system RAM is a huge slowdown. Thus, while reduced PCIe bandwidth may slow initial model or scene loading, it should have a negligible impact on performance after that. Our one note of caution here is that, in situations where you are pooling VRAM to fit a model, PCIe bandwidth may have a large effect. We were not able to test that here.
When we configure the systems we sell, we balance the need for maximum performance from components with the desire for add-in cards necessary for our customers to do their work. Frequently, this means reducing the primary GPU to PCIe 5.0 x8, which cuts the PCIe bandwidth in half. However, as we showed in this article, this major reduction in bandwidth often has a minimal impact on real-world performance. Outside of a few uncommon situations, this testing confirms that as long as you have a modern motherboard that supports PCIe 5.0, running the GPU at x8 speeds is not an issue. However, lower-end motherboards, which may require the GPU to run at 4.0 x4, may introduce performance penalties.