Hardware-Aware Coding: CPU Architecture Concepts Every Developer Should Know

Original link: https://blog.codingconfessions.com/p/hardware-aware-coding

To achieve outstanding software performance, "hardware-aware coding" is essential. Modern CPUs are complex systems with sophisticated mechanisms for executing instructions efficiently. Instruction pipelining lets the CPU work on multiple instructions at once, increasing throughput. Memory caches exploit temporal and spatial locality, reducing latency by keeping frequently and contiguously accessed data closer to the CPU. Speculative execution uses branch prediction to keep the instruction pipeline full, though mispredictions carry a performance penalty. Writing code with "mechanical sympathy" means aligning algorithms and data structures with these hardware realities. Loop unrolling increases instruction-level parallelism. Laying out data structures for spatial locality and separating read-only from read-write data improves cache utilization. Accessing multidimensional arrays in row-major order minimizes cache misses. Writing code that matches the hardware's expectations can transform software performance.


Original article

Even the most elegant algorithms can run painfully slow when they fight against your computer's underlying hardware. The difference between mediocre and exceptional performance often comes down to whether your code works with—or against—the CPU's architecture.

The processors powering our machines aren't simple calculators; they're sophisticated systems with intricate mechanisms for executing instructions efficiently. Understanding these mechanisms is what I call "hardware-aware coding"—a skill that can transform your software's performance without requiring you to become a hardware engineer.

In this article, I'll use a drive-through restaurant analogy to explain three CPU architecture concepts every developer should know:

  • instruction pipelining

  • memory caching

  • speculative execution.

You'll see how small adjustments to your code can dramatically improve performance by aligning with these hardware realities. Let's learn how to write code that your CPU actually wants to run.


Throughout the article we will use the analogy of a drive-through restaurant: customers drive up in their cars, stop at the order window, place their order, collect it, and move on. There is usually a queue of cars waiting to order, and the restaurant staff needs to optimize its order-taking and preparation pipeline to deliver as many orders as possible, as fast as possible.

This analogy applies closely to how a CPU executes instructions.

Now, let’s use this analogy to understand how the hardware optimises processing and execution of instructions.

In the beginning, the restaurant was running into losses. They had all the staff and equipment but everything was underutilised. The restaurant was not delivering enough orders to make profits. The owner hires an expert to figure out how to improve things.

The expert comes and observes what is happening. He notices that when a customer arrives in her car, all other customers have to wait until her order is delivered.

If it takes ten minutes to prepare one order, then the restaurant can only deliver one order in ten minutes, or six orders per hour. This long wait time was leading to very few orders being taken and delivered, while most of the staff and equipment was lying idle and unused.

After realizing the source of the inefficiency, the expert comes up with the idea of building a pipeline for processing orders. The pipeline breaks down order processing into a set of stages. As an order moves from one stage to the next, it makes space for another order to enter the pipeline. Such stages could be, for example: taking the order at the window, gathering the ingredients, cooking, assembling and packaging, and handing the order over at the pickup window.

There are five stages and assuming each stage takes two minutes in the ideal scenario, the restaurant can finish processing an order in 10 minutes.

It will take the restaurant 10 minutes to deliver the very first order but after that the pipeline will be full and the restaurant will start to finish one order every two minutes. We will refer to it as the steady state of the pipeline.

With this pipelined order processing, the restaurant can deliver 30 orders per hour as compared to the previous regime which was only able to process 6 orders per hour—a 5x higher order throughput, i.e. more revenue.

Even the cooking of an item can be broken down into individual stages. For instance, making a burger can consist of toasting the bun, adding the patty and vegetables, and finally adding the cheese. As a burger moves from one stage to the next, work on a new burger can begin.

The following diagram visualizes the stages of the pipeline and how the first five orders proceed through it.

Similar to how processing one order at a time was the bottleneck for the restaurant's efficiency, the same was once a bottleneck for processors. A single instruction, during its execution, doesn't use all of the processor's resources.

For example, if the instruction is accessing memory then the execution units (adders, multipliers) remain unused. This is similar to how the restaurant was inefficiently using its resources when it was processing only one order at a time.

To improve this situation, instruction processing is pipelined in processors to enable them to process multiple instructions simultaneously. The pipeline consists of several stages, and as one instruction moves from the first stage to the second, it makes space for a new instruction to enter the pipeline.

The number of stages in a CPU pipeline varies. Some simpler architectures use a five-stage pipeline, while high-performance architectures (such as x64) have much deeper pipelines consisting of 15-20 stages. Let's use a simple five-stage pipeline as an example to understand instruction pipelining better. These five stages are: instruction fetch, instruction decode, execute, memory access, and write-back.

This pipeline has five stages and in a steady state it can process five instructions at a time and retire (i.e. successfully finish) one instruction every CPU cycle.

This means that while an instruction is being executed, others might be getting fetched, and decoded, similar to how while an item is being cooked, a new order is being taken and another order’s ingredients are being prepared.

The restaurant owner liked how the restaurant was making profits. He asks the expert whether the profits can be scaled even further. The expert gets back to the job. The pipeline is doing so well, he thinks, what if he hires more staff and builds more parallel pipelines so that the restaurant can take multiple orders in parallel?

He installs three more order windows, so now four customers can place their orders simultaneously. He also hires more workers and cooks and buys more equipment (ovens, stoves, fryers) to work independently on the pipeline stages of those orders in parallel.

With a five-stage order pipeline and four such pipelines, the restaurant can now have 20 orders under process in a steady state, and it can deliver 4 completed orders every two minutes to the customers. This is 4x the previous throughput.

Similar to how the restaurant scaled its order processing capacity by installing multiple pipelines, modern processors have also been equipped with multiple execution resources. For example, they have multiple ALUs for arithmetic computations and multiple load/store units for memory operations. But utilizing these requires that the CPU can issue multiple instructions in parallel; for instance, if there are two add instructions in the program, the CPU can process them in parallel instead of in serial order. This capability is known as superscalar execution.

Modern X64 processors can issue up to 4 new instructions each cycle and with their deep pipelines, they can have hundreds of instructions in different stages of completion. Under a steady state, such a processor can deliver a throughput of up to 4 instructions/cycle (peak instruction throughput).

Taking advantage of this instruction level parallelism (ILP) requires that the processor is able to find enough independent units of work which can be executed in parallel in the same cycle.

Usually, a short sequence of instructions in a program is interdependent; for example, the very next instruction might depend on the current instruction's result. So the processor looks at a larger window of instructions to identify independent instructions that can be executed in parallel.

This naturally means that the processor executes instructions out of their natural program order. This is called out-of-order execution, and it enables processors to find and execute multiple instructions in parallel. Even though the processor executes instructions out of order, their results are committed to the program's state in their original order, which ensures that the program logic remains correct.

To recap: modern X64 processors can issue up to four instructions per cycle, and with their deep pipelines they can have hundreds of instructions in flight at once while retiring up to four instructions per cycle.

So how does all this knowledge about pipelining, superscalar execution, and out-of-order execution help us improve the performance of our code? We know that achieving peak instruction throughput requires the processor to find enough independent instructions to execute in parallel. So are there ways to help it do this? Let's see an example.

Let’s say every order pipeline in the restaurant has one fryer. A customer comes and places an order for four portions of fries. The order enters the pipeline, but because there is only one fryer on that pipeline, the next portion cannot be started until the first one is done. This means that particular order window is blocked and no more orders can be taken in the meantime.

If those four portions were distributed across the different pipelines, they could all be prepared in parallel and independently. The customer would not have to wait extra long, and the order window would not have been blocked.

This is the idea of mechanical sympathy. By organizing our code in a way that works in synergy with how the hardware works, we can get the maximum performance possible.

In the case of software, bottlenecks can prevent the processor from executing as many instructions in parallel as it could. To fully utilize the superscalar capabilities of the hardware, there have to be enough independent instructions in the code. But often, code is written as a sequence of instructions where each instruction depends on the previous one's result. For example:

int sum = 0;
for (int i = 0; i < size; i++) {
  // Each addition depends on the previous result
  sum += array[i];
}

The optimization that fixes such a bottleneck is called "loop unrolling". It basically means executing multiple steps of the loop in a single iteration. For example, we can accumulate four partial sums in each iteration. Each partial sum can be updated independently of the others; the CPU will notice that and execute those four add operations in parallel.

// Better superscalar utilization - independent operations
int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
int i;
for (i = 0; i + 4 <= size; i += 4) {
  sum1 += array[i];
  sum2 += array[i+1];
  sum3 += array[i+2];
  sum4 += array[i+3];
}
int sum = sum1 + sum2 + sum3 + sum4;
// Handle the leftover elements when size is not a multiple of 4
for (; i < size; i++) {
  sum += array[i];
}

Usually we don't have to do this kind of optimization ourselves; today's compilers can do it for us. However, when you notice below-par instruction throughput, you may want to check the compiler's output to see what is going on. Sometimes the compiler may not be able to apply the right optimization and you may have to roll up your sleeves and do it yourself.

While pipelining of orders was a great idea, it was still not as successful as it could be. Orders were taking a long time to be prepared, which was preventing the restaurant from accepting more orders and adding them to the pipeline.

On analysis, the expert finds that this was happening because the chefs would not start cooking an item until all of the required ingredients were brought to them. And instead of being stocked locally, the ingredients were delivered from the market on demand. It took 10 minutes to get something delivered from the market, the amount of time in which the restaurant could have delivered 5 orders had this stall not been there.

The expert optimizes this by installing local shelves and fridges to stock the ingredients which have been in-demand recently, he calls it the ingredients cache.

He designed this based on the observation that at certain hours of the day certain items are in demand, and keeping their ingredients at hand helps prepare those items faster. The technical name for this property is temporal locality: an item which is required now is very likely to be required again in the near future. Of course, when people are ordering random things, this optimization is not of much use.

Another thing about these shelves and fridges is that they have limited space. When they run out of space, the restaurant needs to throw away some ingredients to make room for the ones which are required right now but not present in the cache. The rule for deciding what to throw away is called the cache eviction policy.

Apart from the local shelves and fridges, the restaurant also builds a bigger cache in the basement, from which any ingredient can be fetched in 2 minutes. Whatever is not present even in the basement gets delivered from the market. This gives rise to a hierarchy of ingredient caches, each with larger capacity and longer access times.

Just like the items of an order need their ingredients immediately available for cooking, the CPU needs the operand values of an instruction to be available inside the CPU in order to execute it.

For example, consider a line of code a = b + c, where a, b, c are integers. To execute this code, the values of the variables b and c need to be brought into the CPU registers from memory.

All the program's data resides in main memory, and it takes 100-200 cycles to read anything from there. In the above example, if both values need to be fetched from main memory, the simple add operation will take 200-400 cycles, which is extremely slow. For comparison, the CPU can do the addition within a single cycle if the operands are in its registers, so this is at least 200x slower.

Just like the restaurant built a hierarchy of storage to retrieve ingredients faster, processors come with a hierarchy of cache memories. These days we usually have three levels of caches in a processor, called the L1, L2 and L3 caches, each larger, slower and farther from the CPU than the previous level. The access latency of the L1 cache is usually 3-4 cycles, which is significantly faster than main memory, but it has a very small capacity: 32-64 KB on modern X64 hardware.

The following diagram illustrates the sizes of all the memories in the hardware and their latencies:

Just like the restaurant keeps the currently in-demand ingredients in its cooking floor cache, processors also keep the most recently used data in the L1 cache, anticipating temporal locality.

As these caches are small, they also employ an eviction policy to make space for new data. Most processors implement some approximate variant of the least recently used (LRU) eviction policy, which evicts the cache lines that have not been accessed for a while.

Apart from temporal locality, the caches exploit another property of data, called spatial locality. Let’s understand through the restaurant example.

While caching ingredients improved speed, another inefficiency remained. When preparing a dish whose ingredients weren't on the cooking floor, the restaurant faced a problem with how their warehouse was organized.

Initially, ingredients were stored randomly throughout the warehouse - flour might be in aisle 1, tomato sauce in aisle 7, and cheese in aisle 12. The delivery person could only visit one location per trip, bringing whatever was at that location. This meant multiple trips were needed to gather all ingredients for a single recipe, causing significant delays.

The restaurant reorganized their warehouse so that ingredients commonly used together were stored at the same location. Pizza ingredients (flour, sauce, cheese, toppings) were all placed on the same shelf, as were burger ingredients in another section.

Now, when the delivery person made a single trip to get flour for a pizza, they automatically brought back all the other pizza ingredients stored at that same location. This "spatial locality optimization" meant one trip could supply everything needed for a recipe, dramatically reducing preparation time.

Just as the restaurant groups ingredients that are used together, computers benefit from similar spatial organization of data in memory. Let's see how this applies to our processors.

Similar to how the delivery person could only visit one location in the warehouse per trip, the memory bus can only fetch data from a single location in main memory per request. This means that accessing values at different addresses requires multiple trips, costing hundreds of cycles.

As a result, caches are optimized for spatial locality of data. They assume that related data is stored contiguously or nearby in memory, and instead of bringing just the value at the requested address into the cache, they bring in an entire block of data. The block size depends on the memory bus capacity and is usually 64 bytes on present-day hardware.

The CPU caches are also organized to cache these entire blocks rather than just the exact value that the application requested. Within the cache, these blocks are called cache lines, and they are also 64 bytes in size (on X64 hardware at least).

Consider this example: When an application requests an 8-byte long value stored at memory address 130, the hardware doesn't retrieve only those specific bytes. Instead, it fetches the entire 64-byte block starting at address 128 (the nearest 64-byte aligned address) and loads this block into one of the available cache lines. Subsequently, if the program attempts to access any value within the address range of 128-191, that data will already reside in the cache, eliminating the need for additional time-consuming trips to main memory.
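To make the arithmetic concrete, here is a minimal C sketch (not from the original article) that computes which 64-byte block a given address falls into by masking off the low six bits of the address:

#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

int main(void) {
  uintptr_t addr = 130;  /* address requested by the program */
  uintptr_t line_start = addr & ~(uintptr_t)(CACHE_LINE - 1);  /* 128 */
  uintptr_t line_end = line_start + CACHE_LINE - 1;            /* 191 */
  printf("An access to address %ju loads bytes %ju-%ju into one cache line\n",
         (uintmax_t)addr, (uintmax_t)line_start, (uintmax_t)line_end);
  return 0;
}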

Taking advantage of spatial and temporal locality results in improved system performance and better utilization of the memory bandwidth. But doing this requires writing code with mechanical sympathy. Let’s understand with some examples.

These caches in the CPU improve the system performance only when they are used effectively. They are designed anticipating temporal and spatial locality, but if things are more random, they become ineffective.

Taking advantage of the caches requires careful consideration of the data structure layout and memory access patterns. Let’s take a look at a few examples.

If the restaurant did not store the ingredients carefully to optimize for spatial locality, it would be bringing random sets of ingredients onto the kitchen floor. This would hamper performance in two ways: each delivery trip would bring back mostly unneeded ingredients, so more trips would be required, and the limited shelf space would fill up with items that are never used, pushing out ingredients that are still needed.

Similarly, in the case of data structures, we sometimes need to optimize their layout to take advantage of spatial locality. Fields which are usually accessed together should be kept together in the object definition, so that when one of them is accessed, the rest are cached as part of the same cache line.

Consider the following example of data structure layout with poor spatial locality. We have a GameEntity object which contains metadata about the entity such as its name and description, and then data about its current position and orientation.

Usually the player's coordinates and orientation are accessed together and frequently, while the metadata is accessed rarely. The definition shown below will result in a lot of cache misses because the position_x and position_y fields lie in two different cache lines, yet the code will usually access them together. That results in at least one extra cache miss and an extra memory access on every frame update.

/* Poor layout - frequently accessed fields scattered across cache lines */
struct GameEntity {
  int id; // 4 bytes
  char name[64]; // 64 bytes
  float position_x; // 4 bytes - accessed every frame
  char description[256]; // 256 bytes
  float position_y; // 4 bytes - accessed every frame
  char model_path[128]; // 128 bytes
  float position_z; // 4 bytes - accessed every frame
  float rotation; // 4 bytes - accessed every frame
};

When position_x, position_y, position_z, and rotation are accessed together during each physics update (which happens thousands of times per second), the CPU must load 4 different cache lines despite only needing 16 bytes of actual data. A more cache friendly layout for this object is shown below:

/* Cache-friendly layout - group fields by access frequency */
struct GameEntity {
  /* Hot path data (fits in one cache line) */
  float position_x; // 4 bytes
  float position_y; // 4 bytes
  float position_z; // 4 bytes
  float rotation; // 4 bytes
  int id; // 4 bytes

  /* Cold data - rarely accessed during main game loop */
  char name[64]; // 64 bytes
  char description[256]; // 256 bytes
  char model_path[128]; // 128 bytes
};

Now all frequently accessed fields are in a single 64-byte cache line, reducing memory traffic by ~75% during physics updates. This simple reorganization can yield 20-30% performance improvements in game engines and similar performance-critical applications.

A related data structure layout optimization is keeping read-only and read-write fields on separate cache lines. Whenever a field is modified, the entire cache line containing it, including the other fields and values on it, becomes dirty. If other processor cores have cached the same cache line to access the read-only fields, their copy of the line becomes invalid. The next time those cores access this cache line, they have to sync the latest value using the cache coherency protocol, which adds a delay to the cache access.
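As a minimal sketch of this idea (the struct and its field names are hypothetical, not from the article), C11's alignas can be used to push a frequently written field onto its own cache line so that updates to it do not invalidate the line holding the read-only fields:

#include <stdalign.h>

struct ConnectionStats {
  /* Read-write: updated constantly by one core. Giving it its own
     64-byte cache line keeps its updates from dirtying the line below. */
  alignas(64) long requests_served;

  /* Read-only after initialization: read by many cores, whose cached
     copies are no longer invalidated by the counter updates above. */
  alignas(64) int max_connections;
  int port;
};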

Another situation where you can apply your understanding of cache mechanics is in optimizing your data access patterns - the specific ways your code retrieves and manipulates data. Let's illustrate this concept using our restaurant analogy.

Consider a variety of orders coming in: pizzas, sandwiches, fries, and so on. If these orders are sent to the chef randomly without any strategic organization, efficiency suffers. The chef might prepare a pizza, then switch to a burger, followed by fries, and later return to making another pizza. This approach poorly utilizes the "ingredients cache." When the first pizza was prepared, all pizza ingredients were readily accessible, and making a second pizza immediately would have been much faster. Instead, after the lengthy gap, those pizza ingredients may have been put away, requiring the chef to retrieve them all over again.

A more efficient approach would be to batch similar orders together. This strategy ensures that once the necessary ingredients are gathered (the cache is "primed"), they can be fully utilized to complete multiple similar orders quickly, significantly reducing overall preparation time.

Software applications frequently encounter similar inefficiencies. Developers often overlook their data access patterns when designing code. A typical scenario involves reading a specific value, which triggers the loading of an entire 64-byte cache line. However, instead of immediately utilizing the neighboring data now readily available in cache, the program executes unrelated operations. By the time the code requires additional values from that same cache line, it may have already been evicted, forcing another costly memory fetch.

As a concrete example, consider the process of iterating through a multidimensional array. In memory, these arrays are stored in row-major order, which means all elements of a row are placed sequentially before beginning the next row. The diagram below illustrates how a 3×4 two-dimensional array is arranged in memory. Each array element occupies 16 bytes, and consequently, one complete row precisely fills a single cache line.

You can iterate through such an array in two ways: row-wise (visiting all the elements of a row before moving to the next row) or column-wise (visiting all the elements of a column before moving to the next column).

Column-wise iteration can significantly degrade performance due to inefficient cache utilization. In our example, each row of the array perfectly fills one cache line. When iterating column-wise, accessing a[0][0] brings its entire row (through a[0][3]) into the cache. However, the next operation accesses a[1][0], which resides in a different cache line, triggering a cache miss. This pattern continues, causing cache misses at virtually every step. Furthermore, with larger arrays, by the time the iteration returns to access a[0][1], the original cache line containing a[0][0] may have already been evicted from the cache.

Row-wise iteration avoids this problem entirely. After accessing a[0][0], the three subsequent values (a[0][1], a[0][2], and a[0][3]) are already present in the same cache line. This spatial locality dramatically reduces cache misses and improves overall performance.
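Here is a minimal C sketch of the two traversal orders (not from the original article; it uses a larger array than the 3×4 diagram, but keeps 16-byte elements so that four elements fill one 64-byte cache line). Only the loop nesting differs, yet the row-wise version walks memory sequentially while the column-wise version jumps one full row on every access:

#define ROWS 1024
#define COLS 4

struct Element { double a, b; };          /* 16 bytes: 4 elements fill a 64-byte cache line */
static struct Element grid[ROWS][COLS];   /* stored in row-major order */

/* Row-wise: consecutive accesses are adjacent in memory, so each cache
   line fetched from main memory serves four consecutive iterations. */
double sum_row_wise(void) {
  double sum = 0.0;
  for (int i = 0; i < ROWS; i++)
    for (int j = 0; j < COLS; j++)
      sum += grid[i][j].a;
  return sum;
}

/* Column-wise: consecutive accesses are one whole row (one cache line)
   apart, so nearly every access touches a different cache line. */
double sum_column_wise(void) {
  double sum = 0.0;
  for (int j = 0; j < COLS; j++)
    for (int i = 0; i < ROWS; i++)
      sum += grid[i][j].a;
  return sum;
}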

The following diagram illustrates the cache hits and misses for row-wise and column-wise iterations through the array:

I will also note that in some cases optimizing compilers can detect such poor data access patterns and generate cache-friendly code instead.

Also, modern processors have cache prefetchers: the hardware detects access patterns and starts prefetching cache lines before they are accessed to prevent cache misses. These are not covered in this article.

The final hardware optimization we are going to learn about is speculative execution. Let’s first understand what it is through the restaurant analogy.

At the drive-through, the restaurant notices a consistent pattern: customers quickly select their main item (burger, chicken sandwich, etc.) but then spend considerable time deliberating over sides and drinks from the many available options. While the customer decides, the order pipeline is stalled and the restaurant is losing potential order revenue.

To avoid this pipeline stall the restaurant implements a predictive system:

  • As soon as a customer orders a main item (e.g., "I'll have a cheeseburger..."), the kitchen begins preparing both the burger AND the most commonly ordered sides (fries)

  • This "speculative execution" happens while the customer is still deciding ("...and for my side, let me see...")

  • If the customer chooses fries (as predicted), their complete order is ready much faster

  • If they choose something else ("...I'll take onion rings instead"), the prepared fries are wasted and the correct side must be prepared from scratch. This misprediction causes extra delays to the delivery of the order, making the customer wait longer than usual.

The restaurant further refines this system by:

  • Tracking which sides are most commonly ordered with each main item

  • Noting regular customers' preferences to make better predictions

  • Analyzing time-of-day patterns (e.g., morning customers more likely to choose hash browns over fries)

However, this speculation comes with costs. Whenever the prediction is wrong, whatever they had on the order pipeline needs to be thrown away and they have to start from scratch to prepare the right item. This adds extra delay to the order.

The processor also does something similar in the form of speculative instruction execution. While the hardware can execute instructions very fast, fetching new instructions from memory takes time. To keep the execution resources in the processor busy, the instructions must be supplied at a fast rate.

If programs had a linear structure where one instruction followed another, this wouldn’t be a problem. The processor already prefetches sequential instructions and keeps them cached ahead of time to keep the pipeline fed.

However, program structure is not always linear. All non-trivial programs consist of branches in the form of if/else blocks, switch cases, loops, and function calls. For example, in the case of an if/else conditional block, the value of the condition decides which block of code needs to be executed next.

The branch condition itself involves one or more instructions which need to be executed before the processor knows which instructions to fetch next. If the processor waits for that, the pipeline will not fetch or decode any new instructions in the meantime, resulting in a significant drop in instruction throughput. And the penalty doesn't end there: after the branch direction is determined and the address of the next instruction is known, the instructions at that address still need to be fetched and decoded, which adds another level of delay (especially if those instructions are not in the cache).

To avoid this performance degradation, the CPU implements branch predictors to predict the target address of branches. This is similar to how the restaurant used a predictive system to predict the side orders to keep the order pipeline busy.

But whenever the branch predictor is wrong, there is a performance penalty. When the processor finally evaluates the branch condition and finds that the prediction was wrong, it has to flush the pipeline, because it has been processing the wrong set of instructions. Once the pipeline is flushed, the processor starts executing the right set of instructions. On modern X64 processors such a misprediction can cost ~20-30 cycles.

Modern processors have sophisticated branch predictors which can learn very complex branching patterns and offer accuracy of up to ~95%. But they still depend on predictable branching patterns in code. Let’s talk about how this knowledge about branch predictors helps us develop mechanical sympathy.

Just as the restaurant benefits when customer order patterns are predictable, our code performs best when branch patterns are predictable to the CPU.

Branch predictors are very sophisticated these days and explaining them in detail is out of the scope of this article, but it helps to have some intuition about how they work. Under the hood, the branch predictor tries to associate the branch outcome (taken/not taken) with the address of the branch instruction. It starts with a default guess (e.g. branch not taken) and then updates itself based on the actual outcome of the branch condition. So if the branch is usually taken, the branch predictor eventually learns to predict that.

However, if the branching condition is random, the predictor will have a hard time learning it. For example, consider the following code:

int abs(int x) {
  if (x >= 0) {
    return x;
  }
  return -x;
}

Here, the outcome of the if condition depends entirely on the value of x. If those values are random, the branch predictor will have a hard time predicting the branch. One solution is to convert such code into branch-free code. For example:

static int abs(int x) {
  // y is all 1-bits (i.e. -1) when x is negative, and 0 otherwise
  // (assumes the usual arithmetic right shift for signed integers)
  int y = x >> 31;
  // For negative x this computes (~x) + 1 = -x; for non-negative x it returns x unchanged
  return (x ^ y) - y;
}

Your compiler and runtime may already do these kinds of optimizations behind the scenes. But, being aware of how branch prediction works helps us structure algorithms to avoid frequent mispredictions in critical loops. Just as a smart restaurant manager considers customer decision patterns when designing their workflow, a performance-conscious developer considers branch patterns when optimizing critical code paths.

Another example is when you have to process objects of different types in a loop. At each iteration of the loop you need to check the type of the object to decide how to handle its processing. If the object types are randomly distributed, then the branch predictor will have a hard time predicting the right outcome and the performance will suffer. For example:

for (Order o : orders) {
    switch (o.type) {
        case DOMESTIC_BULK:
            handleBulkOrder(o);
            break;

        case DOMESTIC_RETAIL:
            handleRetailOrder(o);
            break;

        case INTERNATIONAL_BULK:
            handleInternationalBulk(o);
            break;

        case INTERNATIONAL_RETAIL:
            handleInternationalRetail(o);
            break;

        // Add other cases as needed
        default:
            handleUnknownOrder(o);
            break;
    }
}

In such a case, it helps to first group the objects by their type and then run a separate loop for each type, as sketched below. That way there is no per-iteration branching, since each loop handles objects of a single kind.
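A minimal C sketch of this grouping approach (the loop above is Java; the bucket layout and handler signatures below are illustrative assumptions, not from the article):

#include <stddef.h>

enum OrderType { DOMESTIC_BULK, DOMESTIC_RETAIL,
                 INTERNATIONAL_BULK, INTERNATIONAL_RETAIL, ORDER_TYPE_COUNT };

struct Order { enum OrderType type; /* ... other fields ... */ };

void handleBulkOrder(struct Order *o);
void handleRetailOrder(struct Order *o);
void handleInternationalBulk(struct Order *o);
void handleInternationalRetail(struct Order *o);

/* The orders have already been bucketed by type; each loop then processes
   a single type, so there is no per-iteration type check left for the
   branch predictor to guess. */
void process_orders(struct Order *buckets[ORDER_TYPE_COUNT],
                    size_t counts[ORDER_TYPE_COUNT]) {
  for (size_t i = 0; i < counts[DOMESTIC_BULK]; i++)
    handleBulkOrder(&buckets[DOMESTIC_BULK][i]);
  for (size_t i = 0; i < counts[DOMESTIC_RETAIL]; i++)
    handleRetailOrder(&buckets[DOMESTIC_RETAIL][i]);
  for (size_t i = 0; i < counts[INTERNATIONAL_BULK]; i++)
    handleInternationalBulk(&buckets[INTERNATIONAL_BULK][i]);
  for (size_t i = 0; i < counts[INTERNATIONAL_RETAIL]; i++)
    handleInternationalRetail(&buckets[INTERNATIONAL_RETAIL][i]);
}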

There are many such situations where branch misprediction can become a bottleneck, and various techniques exist to improve performance in those cases. While this article doesn't have space to cover everything, hopefully it has made clear the importance of mechanical sympathy and how it plays out in every line of code we write.

Understanding how processors work "under the hood" is essential for writing truly performant code. As we've seen through our restaurant analogy, modern CPUs are complex systems with many optimizations that work best when code aligns with their expectations:

  • Instruction pipelining and superscalar execution enable multiple instructions to be processed simultaneously, but only when we provide independent operations

  • Memory hierarchy and caching dramatically reduce latency, but only when our data structures and access patterns exhibit good locality

  • Speculative execution keeps the pipeline full through branches, but becomes a performance liability with unpredictable branching patterns

Mechanical sympathy doesn't mean you need to write assembly code or understand every transistor. It means being aware of these fundamental hardware behaviors and how they interact with your higher-level code.

The most effective optimization approach is to:

  1. Write clean, maintainable code first

  2. Profile to identify performance bottlenecks

  3. Apply mechanical sympathy principles specifically to those hot spots

As software engineers, we don't need to fight the hardware - we can work with it. When we align our algorithms and data structures with the grain of the processor rather than against it, we achieve that satisfying moment when a program that once took seconds now completes in milliseconds.

Huge thanks to Sanjeev Dwivedi for reviewing and providing feedback on this article. Sanjeev is a technology leader with vast experience in product management and leading high performance teams at companies such as VMWare and Microsoft. You can find him on LinkedIn here.

