(comments)

原始链接: https://news.ycombinator.com/item?id=40326563

A decade-old blog post discusses the difficulty of getting consistent math, especially floating-point math, across different computing systems. It covers problems with the x87 instruction set, the importance of setting rounding modes, and the lack of a consistent specification for approximate instructions. Newer standards such as IEEE-754 (2008) have clarified these issues, rendered x87 obsolete, and made FMA a standard feature. Differences remain across wider vector processing units, however, and these can lead to inconsistencies when developing cross-platform software. Mitigating them requires careful attention during development and testing, particularly around constants and compiler optimizations. Tools exist to help detect such discrepancies, but there is no foolproof solution, owing to variation among compiler implementations and the side effects of compiler optimizations. The blog post stresses that obtaining consistent results takes extra attention to detail throughout development.


Original article


While the author mentions this is mostly applicable to things like FPGAs, there's also an application in gamedev (or any distributed physics simulation). Floating point calculations are tricky to get deterministic across platforms[0]. One solution is to skip floats entirely and implement a fixed-point physics engine. You'd need something like CORDIC to implement all the trig functionality.

I started working on such a thing as a fun side project a few years ago but never finished it. One of these days I hope to get back to it.

[0]: https://randomascii.wordpress.com/2013/07/16/floating-point-...
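As a rough illustration of the fixed-point-plus-CORDIC idea above, here is a minimal Python sketch. The Q16.16 format, table size, and names are illustrative assumptions, not taken from any particular engine:

    import math

    # Q16.16 fixed point: 16 integer bits, 16 fractional bits, stored in a plain int.
    FRAC_BITS = 16
    ONE = 1 << FRAC_BITS

    # Precompute the CORDIC angle table atan(2^-i) and the gain correction,
    # both scaled to fixed point. Using math here is fine: it runs once,
    # offline, and the resulting integer tables are identical everywhere.
    N = 16
    ANGLES = [round(math.atan(2.0**-i) * ONE) for i in range(N)]
    GAIN = round(ONE / math.prod(math.sqrt(1 + 2.0**(-2 * i)) for i in range(N)))

    def cordic_sincos(theta_fixed):
        """Return (sin, cos) in Q16.16 for an angle in Q16.16 radians, |theta| <= pi/2."""
        x, y, z = GAIN, 0, theta_fixed
        for i in range(N):
            if z >= 0:
                x, y, z = x - (y >> i), y + (x >> i), z - ANGLES[i]
            else:
                x, y, z = x + (y >> i), y - (x >> i), z + ANGLES[i]
        return y, x  # y ~ sin(theta)*ONE, x ~ cos(theta)*ONE

    # Example: 30 degrees
    theta = round(math.radians(30) * ONE)
    s, c = cordic_sincos(theta)
    print(s / ONE, c / ONE)  # ~0.5, ~0.866, good to roughly 4 decimal places

Because the loop only uses integer adds, shifts, and compares against a fixed table, every platform that runs it produces bit-identical results.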



That blog post is now a decade old, but includes an important quote:

> The IEEE standard does guarantee some things. It guarantees more than the floating-point-math-is-mystical crowd realizes, but less than some programmers might think.

Summarizing the blog post, it highlights a few things though some less clearly than I would like:

  * x87 was wonky

  * You need to ensure rounding modes, flush-to-zero, etc. are consistently set

  * Some older processors don't have FMA

  * Approximate instructions (rsqrtps et al.) don't have a consistent spec

  * Compilers may reassociate expressions

For small routines and self-written libraries, it's straightforward, if painful, to ensure you avoid all of that.
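To make the reassociation point concrete, a tiny example in plain Python doubles; the same effect is what -ffast-math-style reordering can introduce in compiled code:

    a, b, c = 0.1, 0.2, 0.3
    left = (a + b) + c    # 0.6000000000000001
    right = a + (b + c)   # 0.6
    print(left == right)  # False: same inputs, different grouping, different bits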

Briefly mentioned in the blog post is that IEEE-754 (2008) made the spec more explicit and effectively assumed the death of x87. It's 2024 now, so you can definitely avoid x87. Similarly, FMA is part of the IEEE-754 2008 spec and has been built into all modern processors since (Haswell and later on Intel).
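And for the FMA point, a small demonstration of why a fused multiply-add can change results relative to a separate multiply and add (this assumes Python 3.13+, where math.fma is available):

    import math

    x = 1.0 + 2.0**-27
    y = 1.0 + 2.0**-26         # x*x correctly rounded to double
    print(x * x - y)           # 0.0: the separate multiply already rounded away the low bits
    print(math.fma(x, x, -y))  # ~5.55e-17 (2**-54): the fused form keeps them

So the same source compiled with FMA contraction on one target and without it on another can legitimately differ in the last bits.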

There are still cross-architecture differences like 8-wide AVX2 vs 4-wide NEON that can trip you up, but if you are writing assembly or intrinsics, or just C that you inspect with Compiler Explorer or objdump, you can look at the output and say "Yep, that'll be consistent".



> but if you are writing assembly or intrinsics, or just C that you inspect with Compiler Explorer or objdump, you can look at the output and say "Yep, that'll be consistent".

Surely people have written tooling for those checks for various CPUs?

Also, is it that ‘simple’? Reading https://github.com/llvm/llvm-project/issues/62479, calculations that the compiler does and that only end up in the executable as constants can make results different between architectures or compilers (possibly even compiler runs, if it runs multi-threaded and constant folding order depends on timing, but I find it hard to imagine how exactly that would happen).

So, you’d want to check constants in the code, too, but then, there’s no guarantee that compilers do the same constant folding. You can try to get more consistent by being really diligent in using constexpr, but that doesn’t guarantee that, either.



Years ago, I was programming in Ada and ran across a case where the value of a constant in a program differed from the same constant being converted at runtime. Took a while to track that one down.


The same reasoning applies though. The compiler is just another program. Outside of doing constant folding on things that are unspec'ed or not required (like rsqrtps and most transcendentals), you should get consistent results even between architectures.

Of course the specific line linked to in that GH issue is showing that LLVM will attempt constant folding of various trig functions:

https://github.com/llvm/llvm-project/blob/faa43a80f70c639f40...

but the IEEE-754 spec does also recommend correctly rounded results for those (https://en.wikipedia.org/wiki/IEEE_754#Recommended_operation...).

The majority of code I'm talking about though uses constants that are some long, explicit number, and doesn't do any math on them that would then be amenable to constant folding itself.

That said, lines like:

https://github.com/llvm/llvm-project/blob/faa43a80f70c639f40...

are more worrying, since that may differ from what people expect dynamically (though the underlying stuff supports different denormal rules).

Either way, thanks for highlighting this! Clearly the answer is to just use LLVM/clang regardless of backend :).



The author did mention that fixed point was very popular in gamedev before floating point took over thanks to increased hardware capability, and most likely CORDIC was used together with fixed point as well.

> In fact, before IEEE 754 became the popular standard that it is today, fixed point was used all the time (go and ask any gamedev who worked on stuff between 1980 and 2000ish and they'll tell you all about it).



This is a common misconception, but is not the case. For example, look at the Voodoo 1, 2, and 3, which also used fixed point numbers internally but did not suffer from this problem.

The real issue is that the PS1 has no subpixel precision. In other words, it will round a triangle's coordinates to the nearest integers.

Likely the reason they did this is that it lets you completely avoid division and multiplication hardware: with integer start and end coordinates, line rasterization can be done entirely with additions and comparisons.
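For a sense of what "additions and comparisons only" rasterization looks like, here is a Bresenham-style sketch in Python. This is not the PS1's actual rasterizer, just the general shape of the technique:

    def raster_line(x0, y0, x1, y1):
        """Integer-only line rasterization: no multiply, no divide, just add/compare."""
        points = []
        dx, dy = abs(x1 - x0), -abs(y1 - y0)
        sx = 1 if x0 < x1 else -1
        sy = 1 if y0 < y1 else -1
        err = dx + dy                # running error term
        while True:
            points.append((x0, y0))
            if x0 == x1 and y0 == y1:
                break
            e2 = err + err           # doubling via addition, still no multiply
            if e2 >= dy:
                err += dy
                x0 += sx
            if e2 <= dx:
                err += dx
                y0 += sy
        return points

    print(raster_line(0, 0, 7, 3))   # the integer pixels covered by the line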



Fun fact: in addition to sine and cosine calculation and generation, CORDIC can also be used for many operations including log, exponent, square root, vector magnitude, polar-cartesian conversion and vector rotation. Spoiler alert: the author provides a teaser about these prospects in the conclusion.

I've got the feeling that by using quaternions instead of conventional orthonormal matrices, CORDIC-based operations can be executed more efficiently (fewer compute cycles, less memory) and more effectively (fewer errors) [1].

[1] CORDIC Framework for Quaternion-based Joint Angle Computation to Classify Arm Movements:

https://core.ac.uk/works/8439118
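A small sketch of the cartesian-to-polar direction (CORDIC "vectoring" mode), which yields vector magnitude and angle from the same shift-add loop; floats are used here purely to keep the sketch short:

    import math

    N = 24
    ANGLES = [math.atan(2.0**-i) for i in range(N)]
    INV_GAIN = 1.0 / math.prod(math.sqrt(1.0 + 2.0**(-2 * i)) for i in range(N))

    def cordic_to_polar(x, y):
        """Cartesian -> (magnitude, angle) for x > 0, via vectoring-mode CORDIC."""
        z = 0.0
        for i in range(N):
            if y > 0:   # rotate the vector toward the x-axis, accumulating the angle
                x, y, z = x + y * 2.0**-i, y - x * 2.0**-i, z + ANGLES[i]
            else:
                x, y, z = x - y * 2.0**-i, y + x * 2.0**-i, z - ANGLES[i]
        return x * INV_GAIN, z   # x has been scaled up by the CORDIC gain

    print(cordic_to_polar(3.0, 4.0))   # ~ (5.0, 0.9273), i.e. atan2(4, 3)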



I remember in high school precalc, we learned about the Taylor series, and my teacher told us that it was how trig functions were actually implemented on calculators. Well I looked it up and found that it was actually CORDIC, and went and had some fun implementing it in TI Basic


I've done some work with fast approximate math libraries, and went down this path. Taylor was a dead end, but Chebyshev polynomials work great. They have the nice property of having (close to) the minimum maximum (minimax) error over the entire range you are approximating.
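A quick way to see the Taylor-vs-Chebyshev difference, assuming NumPy is available (Chebyshev.fit gives a near-minimax polynomial; a true minimax fit would come from the Remez algorithm):

    import numpy as np

    # Compare max error of a degree-7 Taylor polynomial for sin against a
    # degree-7 Chebyshev fit over the same range.
    xs = np.linspace(-np.pi / 2, np.pi / 2, 10001)
    truth = np.sin(xs)

    taylor = xs - xs**3 / 6 + xs**5 / 120 - xs**7 / 5040
    cheb = np.polynomial.Chebyshev.fit(xs, truth, deg=7)

    print("taylor max err:", np.max(np.abs(taylor - truth)))    # ~1.6e-4, all at the edges
    print("cheb   max err:", np.max(np.abs(cheb(xs) - truth)))  # substantially smaller, spread evenly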


Side note for 2023: Some modern MCUs are low cost, but have FPUs. A good example is the STM32G4. So, unlike on an M0 MCU or w/e, you can freely use f32, if you don't want fixed point. And you can get these for ~$1-2 / MCU.

That said... The G4 also has a hardware CORDIC peripheral that implements this algo for fixed-point uses. Is this mainly to avoid floating point precision losses? You program it using registers, but aren't implementing CORDIC yourself on the CPU; dedicated hardware inside the IC does it.



The second Parallax Propeller chip has a CORDIC engine in silicon. It's fast, and handles 64bit intermediate products, which makes the divide and trig stuff more than adequate precision for most things. One can always increase precision in software too.

I was late to the game learning about CORDIC. I had used fixed point a lot in 8 and 16 bit assembly land for performance, and determinism.

When I found out about it, I was stunned! It was fast, and only required basic math capability to be useful.



sin and cos are often used to rotate vectors. In these cases, a trick with a CORDIC is to avoid the traditional sin/cos/multiply calculation by using the vector to be rotated as the input to the CORDIC. The CORDIC will directly produce the rotated vector, without having to compute sin/cos or do a complex multiplication.

CORDIC really shines when latency doesn't matter so much. You can pipeline every stage of the computation and get a huge throughput. It's well suited for digital mixing in radio systems.



> Now, it's fairly obvious that rotating by say 22.75˚ is the same as rotating by 45˚ and then -22.5˚ - i.e. we can break up a rotation into smaller parts, with both positive and negative components.

Wouldn't that be rotating by 22.5°? Is this an error in the article or my understanding?



It is not, because the subdivisions in CORDIC do not possess the best approximation properties of Farey fractions. For that, you would have to partition the circle into major/minor arcs instead, in the sense of Hardy-Littlewood's circle method. But that would be computationally very expensive, whereas this binary subdivision is fast, and is made faster using the shift-add tricks.


That’s super cool

Thank you for that link, very interesting diagrams and structures

They look very similar to what a neural network might be in certain cases, and it seems like the structures are trying to map some sort of duality



Meagher's octree system notably used only integer arithmetic with no integer multiplication/division:

> Efficient (linear time) algorithms have been developed for the Boolean operations (union, intersection and difference), geometric operations (translation, scaling and rotation), N-dimensional interference detection, and display from any point in space with hidden surfaces removed. The algorithms require neither floating-point operations, integer multiplications, nor integer divisions.

https://doi.org/10.1016/0146-664X(82)90104-6

This facilitated creation of fast, custom VLSI graphics acceleration hardware for octree representation.



I got reminded of a rather cute snippet of code I had a part in.

One needed to find the coordinates of the bisector of an angle subtended by an arc of a unit circle. The (x,y) coordinates of the 'arms' were available. The existing implementation was a mass of trigonometry - going from the x,y coordinates to polar (r,θ), then making sure that the θ computed was in the correct quadrant, halving the θ and converting back to x,y coordinates. All in all, many calls to trigonometric functions and inverse functions.

This was in Python and thus we had first-class access to complex numbers. So all that was needed was to define two complex numbers, z1 from (x1,y1) and z2 from (x2,y2), and then take their geometric mean: √(z1*z2). Done.

No explicit trigonometry and no explicit conversion to polar and back in the new code.
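A toy reconstruction of that trick with made-up coordinates (one caveat: the principal square root can land on the opposite bisector when the arc straddles the negative real axis):

    import cmath

    # Two points on the unit circle; the bisector of the arc between them
    # is the geometric mean of the corresponding complex numbers.
    z1 = complex(1.0, 0.0)              # 0 degrees
    z2 = complex(0.0, 1.0)              # 90 degrees
    bisector = cmath.sqrt(z1 * z2)      # ~ 0.7071 + 0.7071j, i.e. 45 degrees
    print(bisector.real, bisector.imag)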



I have a few vintage programmable HP calculators that implement CORDIC on their 4-bit CPUs. You can also program them to calculate the Taylor expansion of sin() as another approximate method.

If you liked this, it's worth reading Donald Knuth's seminal "The Art of Computer Programming", which explains a number of mathematical algorithms by example.



From the neural network Wikipedia page:

> The signal each neuron outputs is calculated from this number, according to its activation function

The activation function is usually a sigmoid, which is also usually defined in terms of trigonometric functions

Which neural networks don’t use trigonometric functions or equivalent?



In current neural networks the activation function usually is not a sigmoid but something like ReLU (y = 0 if x < 0 else x), and in any case the computation of activations is not a meaningful part of total compute. For non-tiny networks, almost all the effort is spent on the large matrix multiplications of the large layers before the activation function, and as the network size grows they become even less relevant: the amount of activation computation grows linearly with layer size, while the core computation grows superlinearly (n^2.8 perhaps?).


Yes, some non-linearity is important - not for Turing completeness, but because without it the consecutive layers effectively implement a single linear transformation of the same size and you're just doing useless computation.

However, the "decision point" of the ReLU (and it's everywhere-differentiable friends like leaky ReLU or ELU) provides a sufficient non-linearity - in essence, just as a sigmoid effectively results in a yes/no chooser with some stuff in the middle for training purposes, so does the ReLU "elbow point".

Sigmoid has a problem of 'vanishing gradients' in deep networks, as the sigmoid gradients of 0 - 0.25 in standard backpropagation means that a 'far away' layer will have tiny, useless gradients if there's a hundred sigmoid layers in between.
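A back-of-the-envelope illustration of the vanishing-gradient point:

    import math

    def dsigmoid(x):
        s = 1.0 / (1.0 + math.exp(-x))
        return s * (1.0 - s)        # sigmoid derivative; peaks at 0.25 when x == 0

    # Backprop multiplies one such factor per sigmoid layer. Even taking the
    # best case of 0.25 every time, after 100 layers the surviving gradient is:
    print(dsigmoid(0.0) ** 100)     # 0.25**100 ~ 6e-61, effectively zero
    # An active ReLU passes the gradient through with factor 1.0 instead:
    print(1.0 ** 100)               # 1.0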



Most activation functions are based on exponential functions.

I don't immediately see how cordic would help you calculate e^(x) unless x is complex valued (which it is usually not).



“Lives rent-free in my head” is a horrible cliche. By now, it has the same effect as saying “it’s nice” but uses lots more letters.


And what does this pedantry solve, exactly? It clearly is something the author thinks often enough about and enjoys talking about.

Let people enjoy things.



the "rent" in the saying is the effort to learn and upkeep a certain piece of knowledge in your head, to the point where you can recall it as needed

living rent free generally means that you are able to easily recall a piece of knowledge (and often do, even if you don't want to).

something that pays rent might be like an important concept for an upcoming school test, for a class you're not interested in.



I always thought rent was utility, and a piece of knowledge or an opinion that didn't pay rent was one that provided little or no practical benefit to you.

The term in its original use described an obsession with a particular group or thing and called it out as pointless. I don't think it would make sense for rent to mean effort in this original sense, nor in the broader way it's used here.



I have no idea what pseudo-intellectual discourse this is trying to be but the phrase "X lives rent free in my head" means only that "I think of X a lot because I enjoy thinking about it". Nothing more.


> “Lives rent-free in my head” is a horrible cliche.

Yes it is, much like "dumpster fire" or "toxic" or "gaslighting" or any other contemporary Reddit-isms that are associated with hyper-reductionist cultural hot takes. My personal distaste for them, however, has no bearing on the fact that they have very real meaning and people enjoy using them.

> By now, it has the same effect as saying “it’s nice” but uses lots more letters.

The commonly-held meaning of this is that it describes a pervasive, intrusive thought, not a nice one. This particular turn of phrase is often used to describe how politicians devote energy to talking about other politicians that they don't like, for instance.
