Vectorizing for Fun and Performance

原始链接: https://www.ibm.com/support/pages/vectorizing-fun-and-performance

## IBM Power Processors and Vectorization: A Performance Boost

IBM Power processors include a powerful but often under-used feature called vector processing (AltiVec, VMX, VSX), which supports single-instruction, multiple-data (SIMD) operations.  A single instruction performs the same operation on multiple data points at once — for example, one instruction can execute four floating point additions.  Compilers attempt to "auto-vectorize," but explicit implementation often yields better results, using dedicated data types and functions (such as `vec_add`, together with the `altivec.h` header and the `-mvsx` compiler flag) to reach these capabilities.  The article demonstrates this by searching an array for its maximum value.  A naive iterative approach is simple but does not fully exploit the processor.  A vectorized approach, using `vec_max` to compare vectors, can improve performance significantly, *especially* on large data sets.  Initial tests on an 8-element array showed little difference, but scaling up to 32 floats produced a dramatic speedup, with the vectorized code far outpacing the conventional version.  Exploiting vectorization successfully requires careful code design, but for computationally intensive tasks the potential gains make it a worthwhile optimization effort.


Original article

By: Paul Clarke.

The IBM Power processor family has a somewhat hidden little gem which, when exploited, can yield significant performance improvements.  IBM Power processors have a vector processing facility (known as AltiVec, VMX, and VSX in its different incarnations) which can perform multiple computations with a single instruction, also known as SIMD (single instruction, multiple data).  POWER8 has 64 "vector-scalar registers" (VSRs), the first 32 of which share space with the 32 floating point registers.  Each VSR is 128 bits wide.  Thus, each VSR can hold 2 double-precision (64 bits each) or 4 single-precision (32 bits each) floating point quantities.  Vector instructions operate on all of these quantities simultaneously.  Among many other capabilities, there are instructions for:

  • vector add, subtract, multiply, divide
  • vector multiply-add
  • vector square root

With vector multiply-add, for example, it is thus possible to perform 4 single-precision floating point multiplications and 4 single-precision floating point additions all with a single instruction.
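As a sketch of what that buys, the four lane-wise multiply-adds below collapse into a single `vec_madd` on Power; the function name `madd4` and the scalar fallback branch are ours, not from the article, and the pointer casts assume 16-byte-aligned data:

```c
#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* r[i] = a[i] * b[i] + c[i] for four single-precision lanes.
 * On Power (compiled with -mvsx) the vector branch is one vec_madd
 * instruction; elsewhere a scalar loop models the same computation. */
static void madd4(const float *a, const float *b, const float *c, float *r)
{
#ifdef __ALTIVEC__
    *(__vector float *)r = vec_madd(*(const __vector float *)a,
                                    *(const __vector float *)b,
                                    *(const __vector float *)c);
#else
    for (int i = 0; i < 4; i++)
        r[i] = a[i] * b[i] + c[i];
#endif
}
```

Either branch produces identical results; only the instruction count differs.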

Compilers generally have some support for "auto-vectorization," but compilers also have to be conservative in exploiting the feature due to unknowns like alignment or aliasing which cannot be determined at compilation time.  Thus, it is often necessary to implement vectorization explicitly in order to exploit it.

Fortunately, the compilers provide data types and functions which allow higher level languages like C to exploit the vector capabilities of the processor without having to write in assembly language.  The "vector add" instruction can be invoked by calling the "vec_add" function (and including the "altivec.h" header file and using the "-mvsx" compiler flag).
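A minimal use of `vec_add` might look like the following; the wrapper name `add4` and the non-Power fallback are our additions for portability, and the casts assume 16-byte-aligned inputs:

```c
#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* Elementwise add of two 4-float arrays.  On Power, built with -mvsx,
 * the __ALTIVEC__ branch issues a single vector add via vec_add;
 * elsewhere a scalar loop stands in so the sketch stays portable. */
static void add4(const float *a, const float *b, float *r)
{
#ifdef __ALTIVEC__
    __vector float va = *(const __vector float *)a;
    __vector float vb = *(const __vector float *)b;
    *(__vector float *)r = vec_add(va, vb);
#else
    for (int i = 0; i < 4; i++)
        r[i] = a[i] + b[i];
#endif
}
```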

As an example, straightforward code to determine the maximum value in an array of 32 single-precision floating point numbers might look like this:

max = array[0];
for (i = 1; i < 32; i++) {
    if (max < array[i]) max = array[i];
}

This code is very compact and understandable.  However, it does not fully exploit the capabilities of a vector-enabled processor.

Note that there are no VSX instructions which take a vector and return the maximum value of the elements of the vector.  However, there is an instruction that will take two vectors and return a vector with the maximums of the corresponding elements.  While this will obviously not give the final answer, it sure can help to get the answer quicker.
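The semantics of that instruction, `vec_max`, can be modeled in plain C (a sketch; the function name `max4` is ours):

```c
/* scalar model of vec_max: r[i] = max(a[i], b[i]) for each of 4 lanes,
 * which the real instruction computes in a single operation */
static void max4(const float *a, const float *b, float *r)
{
    for (int i = 0; i < 4; i++)
        r[i] = a[i] > b[i] ? a[i] : b[i];
}
```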

We can successively load up pairs of vectors, and compute the maximums of the respective elements, thereby performing 4 compares and no conditionals in a single instruction.  After that first "vector maximum", we will have eliminated 4 elements from consideration.  If we do this carefully, re-using vectors whose contents have already been eliminated, we can avoid needing a larger set of registers.

Let's start with an array of 8 single-precision floating point numbers.  The code for a vector approach might look like this:

/* include "altivec.h" to get the data types and functions */
#include <altivec.h>

/* take an array of 8 floats, return the max */
float max8(const float *data) {
    __vector float v[2]; /* the "working" vectors */
    float m0, m1; /* some helpers for the final answer */

    /* load the vectors by simple unitary assignments,
     * which should translate to a single "vector load" instruction each
     */
    v[0] = *(__vector float *)(&data[0]);
    v[1] = *(__vector float *)(&data[4]);

    /* compare the two vectors
     * return a vector of the maximums of the individual elements
     * note that we re-use v[0]
     */
    v[0] = vec_max(v[0],v[1]);

    /* now dig into the resulting "max" vector
     * and find the maximum element explicitly
     */
    m0 = v[0][0] > v[0][1] ? v[0][0] : v[0][1];
    m1 = v[0][2] > v[0][3] ? v[0][2] : v[0][3];

    return m0 > m1 ? m0 : m1;
}

Note that the code is significantly more complex.
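One way to gain confidence in a rewrite like this is to cross-check it against the scalar loop.  Below is such a check as a sketch: `max8` is the article's function, the non-Power fallback branch and the reference function `max8_ref` are our additions, and the pointer casts assume 16-byte-aligned data:

```c
#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* max8 as above; on non-Power builds a plain loop stands in */
static float max8(const float *data)
{
#ifdef __ALTIVEC__
    __vector float v0 = *(const __vector float *)(&data[0]);
    __vector float v1 = *(const __vector float *)(&data[4]);
    float m0, m1;
    v0 = vec_max(v0, v1);
    m0 = v0[0] > v0[1] ? v0[0] : v0[1];
    m1 = v0[2] > v0[3] ? v0[2] : v0[3];
    return m0 > m1 ? m0 : m1;
#else
    float m = data[0];
    for (int i = 1; i < 8; i++)
        if (m < data[i]) m = data[i];
    return m;
#endif
}

/* scalar reference implementation for cross-checking */
static float max8_ref(const float *data)
{
    float m = data[0];
    for (int i = 1; i < 8; i++)
        if (m < data[i]) m = data[i];
    return m;
}
```

Running both over a handful of inputs and asserting equality catches lane-ordering mistakes, which are easy to make in the final scalar reduction step.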

If we add some iterations and time stamping around that function, we can look at its performance.  Avoiding specifics, since this is only an example, here are some performance results:

| mode           | time     |
| -------------- | -------- |
| non-vectorized | 3.693498 |
| vectorized     | 3.744294 |

There's not much difference.  A quick analysis shows that the "simple" code will do 7 floating point comparisons, and 7 conditionals.  The vectorized code will do 1 vector compare, 3 floating point comparisons, plus 4 vector to scalar floating point conversions for the final step.  Without deeper analysis, it could be that this roughly evens out.  Let's try to scale our example up.

For an array of 32 floating point numbers, the non-vectorized code will look exactly the same.  If we continue to avoid loops, the vectorized code will look like this (edited slightly for brevity):

float max32(const float *data) {
    __vector float v[4]; /* the "working" vectors */
    float m0, m1; /* some helpers for the final answer */

    v[0] = *(__vector float *)(&data[0]);
    v[1] = *(__vector float *)(&data[4]);

    v[0] = vec_max(v[0],v[1]);

    v[1] = *(__vector float *)(&data[8]);
    v[2] = *(__vector float *)(&data[12]);

    v[1] = vec_max(v[1],v[2]);

    v[0] = vec_max(v[0],v[1]);

    v[1] = *(__vector float *)(&data[16]);
    v[2] = *(__vector float *)(&data[20]);

    v[1] = vec_max(v[1],v[2]);

    v[2] = *(__vector float *)(&data[24]);
    v[3] = *(__vector float *)(&data[28]);

    v[2] = vec_max(v[2],v[3]);

    v[1] = vec_max(v[1],v[2]);

    v[0] = vec_max(v[0],v[1]);

    m0 = v[0][0] > v[0][1] ? v[0][0] : v[0][1];
    m1 = v[0][2] > v[0][3] ? v[0][2] : v[0][3];

    return m0 > m1 ? m0 : m1;
}

Note again that the code is fairly complex.

Here are some performance results for an array of 32 floats:

| mode           | time       |
| -------------- | ---------- |
| non-vectorized | 474.958120 |
| vectorized     | 3.733729   |

A very significant difference.

A quick analysis for a 32 float array has the "simple" code with 31 floating point comparisons and 31 conditionals.  The vectorized code has 7 vector comparisons, 3 float compares, 3 conditionals, and 4 vector to scalar conversions.

With some effort, converting computationally intensive code to incorporate vectorization can have significant performance benefits.
