(Comments)

Original link: https://news.ycombinator.com/item?id=40081314

In this article, the author stresses the importance of measuring performance when optimizing a calling convention. Code that looks suboptimal may actually run faster for non-obvious reasons. Conserving argument registers and aligning with CPU optimizations are key ingredients of an efficient calling convention. Modern CPUs are tuned on instruction traces produced by C compilers, which means that generating code similar to a C compiler's output can improve performance. However, designers should balance short-term gains against long-term adaptability and potential limitations. The author shares experience from working on JavaScriptCore and mentions frequent surprises about which calling convention is actually best. They also discuss performance measurement, noting that it should not be the only thing guiding decisions; other considerations, such as compiler evolution and architectural compatibility, should also feed into calling-convention design.

Related Articles

Original Article


The main thing you want to do when optimizing the calling convention is measure its perf, not ruminate about what you think is good. Code performs well if it runs fast, not if it looks like it will.

Sometimes, what the author calls bad code is actually the fastest thing you can do for totally not obvious reasons. The only way to find out is to measure the performance on some large benchmark.

One reason why sometimes bad looking calling conventions perform well is just that they conserve argument registers, which makes the register allocator’s life a tad easier.

Another reason is that the CPUs of today are optimized on traces of instructions generated by C compilers. If you generate code that looks like what the C compiler would do - which passes on the stack surprisingly often, especially if you’re MSVC - then you hit the CPU’s sweet spot somehow.

Another reason is that inlining is so successful, so calls are a kind of unusual boundary on the hot path. It’s fine to have some jank on that boundary if it makes other things simpler.

Not saying that the changes done here are bad, but I am saying that it’s weird to just talk about what looks like weird code without measuring.

(Source: I optimized calling conventions for a living when I worked on JavaScriptCore. I also optimized other things too but calling conventions are quite dear to my heart. It was surprising how often bad-looking pass-on-the-stack code won on big, real code. Weird but true.)



I very much agree with that, especially since - like you said - code that looks like it will perform well does not always do so.

That being said I'd like to add that in my opinion performance measurement results should not be the only guiding principle.

You said it yourself: "Another reason is that the CPUs of today are optimized [..]"

The important word is "today". CPUs evolved and still do and a calling convention should be designed for the long term.

Sadly, it means that it is beneficial to not deviate too much from what C++ does [1], because it is likely that future processor optimizations will be targeted in that direction.

Apart from that it might be worthwhile to consider general principles that are not likely to change (e.g. conserve argument registers, as you mentioned), to make the calling convention robust and future proof.

[1] It feels a bit strange, when I say that because I think Rust has become a bit too conservative in recent years, when it comes to its weirdness budget (https://steveklabnik.com/writing/the-language-strangeness-bu...). You cannot be better without being different, after all.



The Rust calling convention is actually defined as unstable, so 1.79 is allowed to have a different calling convention than 1.80 and so on. I don't think designing one for the long term is a real concern right now.


I know, but from what I understand there are initiatives to stabilize the ABI, which would also mean stabilizing calling conventions. I read the article in that broader context, even if it does not talk about that directly.


There's no proposal to stabilize the Rust ABI. There are proposals to define a separate stable ABI, which would not be the default ABI. (Such a separate stable ABI would want to plan for long-term performance, but the default ABI could continue to improve.)


There is already a separate stable ABI, it's just the C ABI. There are also multiple crates that address the issue of stable ABIs for Rust code. It's not very clear why compiler involvement would be required for this.


If I remember correctly there is a bit of difference between explicit `extern "rust"` and no explicit calling convention but I'm not so sure.

Anyway, at least when not using an explicit Rust representation, Rust doesn't even guarantee that the layout of a struct is the same for two repeated builds _with the same compiler and code_. That is very intentional, and I think there is no intent to change that "in general" (though various subsets might be standardized; e.g. `Option<&T> where T: Sized` mapping `None` to a null pointer, allowing you to use it in C FFI, is already a de facto standard). Which, as far as I remember, is where explicit extern Rust comes in, to make sure that we can have a prebuilt libstd; it still can change with _any_ compiler version, including patch versions. E.g. a hypothetical 1.100 and 1.100.1 might not have the same unstable Rust calling convention.
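
To make that concrete, a quick sketch of the pointer niche versus the unspecified general case (sizes shown are for a typical 64-bit target):

    use std::mem::size_of;

    fn main() {
        // The de facto standard niche: Option of a reference (or Box) is
        // pointer-sized, because None is represented as the null pointer.
        assert_eq!(size_of::<Option<&u64>>(), size_of::<&u64>());
        assert_eq!(size_of::<Option<Box<u64>>>(), size_of::<Box<u64>>());

        // No such rule for arbitrary payloads: Option<u64> needs a separate
        // discriminant, and its exact layout is left to the compiler.
        println!("Option<u64> is {} bytes here", size_of::<Option<u64>>());
    }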



> and a calling convention should be designed for the long term

...isn't the article just about Rust code calling Rust code? That's a much more flexible situation than calling into operating system functions or into other languages. For calling within the same language a stable ABI is by far not as important as on the 'ecosystem boundaries', and might actually be harmful (see the related drama in the C++ world).



> means that it is beneficial to not deviate too much from what C++ does

Or just C.

Reminds me of when I looked up SIMD instructions for searching string views. It was more performant to slap a '\0' on the end and use null-terminated string instructions than to use string-view search functions.



Huh, I thought they fixed that (the PCMPISTR? string instructions from SSE4.2 being significantly faster than PCMPESTR?), but looks like the explicit-length version still takes twice as many uops on recent Intel and AMD CPUs. They don’t seem to get much use nowadays anyway, though, what with being stuck in the 128-bit realm (there’s a VEX encoding but that’s basically it).


Yep. Also, whether passing in registers is faster or not depends on the function body. It doesn't make much sense if the first thing the function does is take the address of the parameter and pass it to some opaque function. Then it needs to be spilled onto the stack anyway.

It would be interesting to see calling convention optimizations based on function body. I think that would be safe for static functions in C, as long as their address is not taken.
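
A tiny sketch of that situation (function names hypothetical): once a parameter's address escapes, it has to live in memory no matter how it was passed.

    #[inline(never)]
    fn opaque(p: *const u64) -> u64 {
        // Reads through the pointer, so the callee genuinely needs an address.
        unsafe { *p }
    }

    // Even if `x` arrives in a register, taking its address forces the compiler
    // to give it a stack slot before the call (unless `opaque` gets inlined).
    fn takes_address(x: u64) -> u64 {
        opaque(&x)
    }

    fn main() {
        assert_eq!(takes_address(7), 7);
    }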



Your experience is not perfectly transferable. JITs have it easy on this because they've already gathered a wealth of information about the actually-executing-on CPU by the time they generate a single line of assembly. Calls appear on the hot path more often in purely statically compiled code because things like the runtime architectural feature set are not known, so you often reach inlining barriers precisely in the code that you would most like to optimize.


LLVM inlines even more than my JIT does.

The JIT has both more and less information.

It has more information about the program globally. There is no linking or “extern” boundary.

But whereas the AOT compiler can often prove that it knows about all of the calls to a function that could ever happen, the JIT only knows about those that happened in the past. This makes it hard (and sometimes impossible) for the JIT to do the call graph analysis style of inlining that llvm does.

One great example of something I wish my jit had but might never be able to practically do, but that llvm eats for breakfast: “if A calls B in one place and nothing else calls B, then inline B no matter how large it is”.

(I also work on ahead of time compilers, though my impact there hasn’t been so big that I brag about it as much.)



And remember that performance can include binary size, not just runtime speed. Current Rust seems to suffer in that regard for small platforms; the calling convention could possibly help there wrt Result returns.


The current calling convention is terrible for small platforms, especially when using Result<> in return position. For large enums, the compiler should put the discriminant in a register and the large variants on the stack. As is, you pay a significant code size penalty for idiomatic rust error handling.
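
A small sketch of the size issue (type names hypothetical; exact sizes depend on target and compiler):

    use std::mem::size_of;

    // A chunky error type drags the whole Result up to its size plus a
    // discriminant, and a value this large is returned through memory
    // (a hidden out-pointer) rather than in registers.
    struct BigError {
        context: [u8; 64],
    }

    fn fallible() -> Result<u32, BigError> {
        Ok(42)
    }

    fn main() {
        println!("Result<u32, BigError> is {} bytes",
                 size_of::<Result<u32, BigError>>());
        assert!(fallible().is_ok());
    }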


There were proposals for optimizing this kind of stuff for C++ in particular for error handling, like:

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p07...

> Throwing such values behaves as-if the function returned union{R;E;}+bool where on success the function returns the normal return value R and on error the function returns the error value type E, both in the same return channel including using the same registers. The discriminant can use an unused CPU flag or a register



IBM BIOS and MS DOS calls used the carry flag as a boolean return or error indicator (the 8086 has instructions for manually setting and resetting it). I don’t think people do that these days except in manually-coded assembly, unfortunately (which the relevant parts of both BIOS and DOS also were of course).


Also a thing you gotta measure.

Passing a lot of stuff in registers causes a register shuffle at call sites and prologues. Hard to predict if that’s better or worse than stack spills without measuring.



"If you generate code that looks like what the C compiler would do - which passes on the stack surprisingly often, especially if you’re MSVC - then you hit the CPU’s sweet spot somehow."

The FA is mostly about x86 and Intel indeed did an amazing amount of clever engineering over decades to allow your ugly x86 code to run fast on their silicon that you buy.

Still, does your point about the empirical benefit of passing on the stack continue to apply with a transition to register-rich ARMv8 CPUs or RISC-V?



Reasonable sketch. This is missing the caller/callee save distinction and makes the usual error of assigning a subset of the input registers to output.

It's optimistic about debuggers understanding non-C-like calling conventions which I'd expect to be an abject failure, regardless of what dwarf might be able to encode.

Changing ABI with optimization setting interacts really badly with separate compilation.

Shuffling arguments around in bin packing fashion does work but introduces a lot of complexity in the compiler, not sure it's worth it relative to left to right first fit. It also makes it difficult for the developer to predict where arguments will end up.

The general plan of having different calling conventions for addresses that escape than for those that don't is sound. Peeling off a prologue that does the impedance matching works well.

Rust probably should be willing to have a different calling convention to C, though I'm not sure it should be a hardcoded one that every function uses. Seems an obvious thing to embed in the type system to me and allowing developer control over calling convention removes one of the performance advantages of assembly.



> This is missing the caller/callee save distinction and makes the usual error of assigning a subset of the input registers to output.

Out of curiosity, what's so problematic about using some input registers as output registers? On the caller's side, you'd want to vacate the output registers between any two function calls regardless. And it occurs pretty widely in syscall conventions, to my binary-golfing detriment.

Is it for the ease of the callee, so that it can set up the output values while keeping the input values in place? That would suggest trying to avoid overlap (by placing the output registers at the end of the input sequence), but I don't see how it would totally contraindicate any overlap.



You should use all the input registers as output registers, unless your arch is doing some sliding window thing. The x64 proposal linked uses six to pass arguments in and three to return results. So returning six integers means three in registers, three on the stack, with three registers that were free to use containing nothing in particular.


The LLVM calling conventions for x86 only allow returning 3 integer registers, 4 vector registers, and 2 x87 floating point registers (er, stack slots technically because x87 is weird).


Sure. That would be an instance of the "usual error". The argument registers are usually caller save, where any unused ones get treated as scratch in the callee, in which case making them all available for returning data as well is zero cost.

There's no reason not to, other than C makes returning multiple things awkward and splitting a struct across multiple registers is slightly annoying for the compiler.



Limiting a newly designed Rust ABI to whatever LLVM happens to support at the moment seems unnecessarily limiting. Yeah, you'd need to write some C++ to implement it, but that's not the end of the world, especially compared to getting stuck with arbitrary limits in your ABI for the next decade or two.


I stared at this really hard, and I eventually couldn't figure out what you mean here.

Obviously naively just dividing integers by zero in Rust will panic, because that's what is defined to happen.

So you have to be thinking about a specific case where it's defined not to panic. But, what case? There isn't an unchecked_div defined on the integers. The wrapping and saturating variants panic for division by zero, as do the various edge cases like div_floor

What case are you thinking of where "integer division by 0 is UB in Rust" ?



The poster is both correct and incorrect. It definitely is true that LLVM only has two instructions to deal with division, udiv and sdiv specifically, and it used to be the case that Rust, as a consequence, had UB when encountering division by zero, since those two instructions consider that operation UB.

But Rust has solved this problem by inserting, before every division that could reasonably be a division by zero (might even be all of them, I don't know the specifics), a check for zero that defines the consequences.

So as a result divisions aren't just divisions in Rust, they come with an additional check as overhead, but they aren't UB either.
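
Conceptually, the inserted checks amount to something like this (a sketch, not the actual codegen):

    // Roughly what `a / b` for signed integers behaves like; the real checks are
    // inserted by the compiler, not visible in source.
    fn div_like_rust(a: i32, b: i32) -> i32 {
        if b == 0 {
            panic!("attempt to divide by zero");
        }
        if a == i32::MIN && b == -1 {
            // The one other case where the hardware/LLVM operation is not defined.
            panic!("attempt to divide with overflow");
        }
        a / b // now known to be a well-defined division
    }

    fn main() {
        assert_eq!(div_like_rust(7, 2), 3);
    }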



Sure, and if you actually want a branchless integer division for an arbitrary input, which is defined for the entire input domain on x64, then to get it you'll have to pull some trick like reinterpreting a zeroable type as a nonzero one, heading straight through LLVM IR UB on your way to the defined behavior on x64.
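
A sketch of that trick using the standard `NonZero` types (assuming the `u32 / NonZeroU32` operator impl from the standard library):

    use std::num::NonZeroU32;

    // Dividing by a NonZeroU32 needs no zero check: the type already rules out
    // the UB case, so no branch has to be emitted for it.
    fn div_no_check(a: u32, b: NonZeroU32) -> u32 {
        a / b
    }

    fn main() {
        assert_eq!(div_no_check(10, NonZeroU32::new(3).unwrap()), 3);

        // The escape hatch alluded to above: vouch for the non-zero-ness
        // yourself and skip the runtime check. Passing 0 here would be UB.
        let d = unsafe { NonZeroU32::new_unchecked(3) };
        assert_eq!(10 / d, 3);
    }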


Allowing developer control over calling conventions also means disallowing optimization in the case where Function A calls Function B calls Function C calls Function D etc., but along the way one or more of those functions could have had their arguments swapped around to a different convention to reduce overhead. What semantics would preserve such an optimization but allow control? Would it just be illusory?

And in practice assembly has the performance disadvantage of not being subject to most compiler optimizations, often including "introspecting on its operation, determining it is fully redundant, and eliminating it entirely". It's not the 1990s anymore.

In the cases where that kind of optimization is not even possible to consider, though, the only place I'd expect inline assembly to be decisively beaten is using profile-guided optimization. That's the only way to extract more information than "perfect awareness of how the application code works", which the app dev has and the compiler dev does not. The call overhead can be eliminated by simply writing more assembly until you've covered the relevant hot boundaries.



If those functions are external you've lost that optimisation anyway. If they're not, the compiler chooses whether to ignore your annotation or not as usual. As is always the answer, the compiler doesn't get to make observable changes (unless you ask it to, fwrong-math style).

I'd like to specify things like extra live out registers, reduced clobber lists, pass everything on the stack - but on the function declaration or implementation, not having to special case it in the compiler itself.

Sufficiently smart programmers beat ahead of time compilers. Sufficiently smart ahead of time compilers beat programmers. If they're both sufficiently smart you get a common fix point. I claim that holds for a jit too, but note that it's just far more common for a compiler to rewrite the code at runtime than for a programmer to do so.

I'd say that assembly programmers are rather likely to cut out parts of the program that are redundant, and they do so with domain knowledge and guesswork that is difficult to encode in the compiler. Both sides are prone to error, with the classes of error somewhat overlapping.

I think compilers could be a lot better at codegen than they presently are, but the whole "programmers can't beat gcc anymore" idea isn't desperately true even with the current state of the art.

Mostly though I want control over calling conventions in the language instead of in compiler magic because it scales much better than teaching the compiler about properties of known functions. E.g. if I've written memcpy in asm, it shouldn't be stuck with the C caller save list, and avoiding that shouldn't involve a special case branch in the compiler backend.



> It's optimistic about debuggers understanding non-C-like calling conventions which I'd expect to be an abject failure, regardless of what dwarf might be able to encode.

DWARF doesn't encode bespoke calling conventions at all today.



The bin packing will probably make it slower though, especially in the bool case since it will create dependency chains. For bools on x64, I don't think there's a better way than first having to get them in a register, shift them and then OR them into the result. The simple way creates a dependency chain of length 64 (which should also incur a 64 cycle penalty) but you might be able to do 6 (more like 12 realistically) cycles. But then again, where do these 64 bools come from? There aren't that many registers so you will have to reload them from the stack. Maybe the rust ABI already packs bools in structs this tightly so it's work that has to be done anyway but I don't know too much about it.

And then the caller will have to unpack everything again. It might be easier to just teach the compiler to spill values into the result space on the stack (in cases the IR doesn't already store the result after the computation) which will likely also perform better.



Unpacking bools is cheap - to move any bit into a flag is just a single 'test' instruction, which is as good as it gets if you have multiple bools (other than passing each in a separate flag, which is quite undesirable).

Doing the packing in a tree fashion to reduce latency is trivial, and store→load latency isn't free either depending on the microarchitecture (and at the counts where log2(n) latency becomes significant you'll be at IPC limit anyway). Packing vs store should end up at roughly the same instruction counts too - a store vs an 'or', and the exact same amount of moving between flags and GPRs.

Reaching 64 bools might be a bit crazy, but 4-8 seems reasonably attainable from each of many arguments being an Option, where the packing would reduce needed register/stack slot count by ~2.

Where possible it would of course make sense to pass values in separate registers instead of in one, but when the alternative is spilling to stack, packing is still worthy of consideration.
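
A minimal sketch of the tree-shaped packing and the single-instruction unpack being described (plain Rust, just to show the shape of the dependency chain):

    // Pairs are combined first, so the OR chain is log2(8) = 3 deep instead of 7,
    // at the same total instruction count.
    fn pack8(b: [bool; 8]) -> u8 {
        let bit = |i: usize| (b[i] as u8) << i;
        let p0 = bit(0) | bit(1);
        let p1 = bit(2) | bit(3);
        let p2 = bit(4) | bit(5);
        let p3 = bit(6) | bit(7);
        (p0 | p1) | (p2 | p3)
    }

    // Unpacking one flag is a single AND (a `test` at the machine level).
    fn unpack(packed: u8, i: usize) -> bool {
        packed & (1 << i) != 0
    }

    fn main() {
        let packed = pack8([true, false, true, false, false, false, false, true]);
        assert_eq!(packed, 0b1000_0101);
        assert!(unpack(packed, 7) && !unpack(packed, 1));
    }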



> Reaching 64 bools might be a bit crazy, but 4-8 seems reasonably attainable from each of many arguments being an Option, where the packing would reduce needed register/stack slot count by ~2

I don't have a strong sense of how much more common owned `Option` types are than references, but it's worth noting that if `T` is a reference, `Option` will just use a pointer and treat the null value as `None` under the hood to avoid needing any tag. There are probably other types where this is done as well (maybe `NonZero` integer types?)



Yeah, `NonZero*`, but also a type like `#[repr(u8)] enum Foo { X }`. According to `assert_eq!(std::mem::size_of::<Option<Foo>>(), std::mem::size_of::<Foo>())`, you need an enum which fully saturates the repr, e.g. `#[repr(u8)] Bar { X0, ... X255 }` (pseudo code), before niche optimization fails to kick in.


Also, most modern processors will easily forward the store to the subsequent read and have a bunch of tricks for tracking the stack state. So how much does putting things in registers help anyway?


More broadly: processor design has been optimised around C-style antics for a long time; trying to optimise the generated code away from that could well inhibit processor tricks in such a way that the result is _slower_ than if you stuck with the "looks terrible but is expected & optimised" status quo.


Reminds me of Fortran compilers recognising the naive three-nested-loops matrix multiplication and optimising it to something sensible.


Register allocation decisions routinely result in multi-percent performance changes, so yes, it does.

Also, registers help the MachineInstr-level optimization passes in LLVM, of which there are quite a few.



Forwarding isn't unlimited, though, as I understand it. The CPU has limited-size queues and buffers through which reordering, forwarding, etc. can happen. So I wouldn't be surprised if using registers well takes pressure off of that machinery and ensures that it works as you expect for the data that isn't in registers.

(Looked around randomly to find example data for this) https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memo... claims that Zen 4's store queue only holds 64 entries, for example, and a 512-bit register store eats up two. I can imagine how an algorithm could fill that queue up by juggling enough data.



It’s limited, but in the argument passing context you’re storing to a location that’s almost certainly in L1, and then probably loading it immediately within the called function. So the store will likely take up a store queue slot for just a few cycles before the store retires.


Due to speculative out-of-order execution, it's not just "a few cycles". The LSU has a hard, small, limit on the number of outstanding loads and stores (usually separate limits, on the order of 8-32) and once you fill that, you have to stop issuing until commit has drained them.

This discussion is yet another instance of the fallacy of "Intel has optimized for the current code so let's not improve it!". Other examples include branch prediction (a correctly predicted branch has a small but not zero cost) and indirect jump prediction. And this doesn't even begin to address implementations that might be less aggressive about making up for bad code (like most RISCs and RISC-likes).



Tangentially related, there's another "unfortunate" detail of Rust that makes some structs bigger than you want them to be. Imagine a struct Foo that contains eight `Option<u8>` fields, ie each field is either `None` or `Some(u8)`. In C, you could represent this as a struct with eight 1-bit `bool`s and eight `uint8_t`s, for a total size of 9 bytes. In Rust however, the struct will be 16 bytes, ie eight sequences of a 1-byte discriminant followed by a `u8`.

Why? The reason is that structs must be able to present borrows of their fields, so given a `&Foo` the compiler must allow the construction of a `&Foo::some_field`, which in this case is an `&Option<u8>`. This `&Option<u8>` must obviously look identical to any other `&Option<u8>` in the program. Thus the underlying `Option<u8>` is forced to have the same layout as any other `Option<u8>` in the program, ie its own personal discriminant bit rounded up to a byte followed by its `u8`. The struct pays this price even if the program never actually constructs a `&Foo::some_field`.

This becomes even worse if you consider Options of larger types, like a struct with eight `Option<u16>` fields. Then each personal discriminant will be rounded up to two bytes, for a total size of 32 bytes with a quarter (or almost half, if you include the unused bits of the discriminants) being wasted interstitial padding. The C equivalent would only be 18 bytes. With `Option<u64>`, the Rust struct would be 128 bytes while the C struct would be 72 bytes.

You *can* implement the C equivalent manually of course, with a `u8` for the packed discriminants and eight `MaybeUninit<u8>`s, and functions that map from `&Foo` to `Option<&T>`, `&mut Foo` to `Option<&mut T>`, etc, but not to `&Option<T>` or `&mut Option<T>`.

https://play.rust-lang.org/?version=stable&mode=debug&editio...
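
For the record, a rough sketch of what such a hand-packed layout can look like (not the code behind the playground link above; type and method names here are just illustrative):

    use std::mem::MaybeUninit;

    // Hand-packed equivalent of eight Option<u8> fields: one byte of
    // discriminant bits plus 8 payload bytes = 9 bytes total, not 16.
    struct PackedOptions {
        present: u8,                  // bit i set => slot i holds Some
        values: [MaybeUninit<u8>; 8],
    }

    impl PackedOptions {
        fn new() -> Self {
            PackedOptions { present: 0, values: [MaybeUninit::uninit(); 8] }
        }

        fn set(&mut self, i: usize, v: u8) {
            self.values[i] = MaybeUninit::new(v);
            self.present |= 1 << i;
        }

        // Note the signature: Option<&u8>, not &Option<u8> - the packed layout
        // cannot hand out a reference to a real Option.
        fn get(&self, i: usize) -> Option<&u8> {
            if self.present & (1 << i) != 0 {
                // SAFETY: the bit is only set after the slot was initialized.
                Some(unsafe { self.values[i].assume_init_ref() })
            } else {
                None
            }
        }
    }

    fn main() {
        assert_eq!(std::mem::size_of::<PackedOptions>(), 9);
        let mut p = PackedOptions::new();
        p.set(3, 42);
        assert_eq!(p.get(3), Some(&42));
        assert_eq!(p.get(0), None);
    }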



You have to implement the C version manually, so it's not that odd you'd need to do the same for Rust?

You've described, basically, a custom type that is 8 Option<u8>s. If you start caring about performance you'll need to roll your own internal Option handling.



>You have to implement the C version manually

There's no "manually" about it. There's only one way to implement it in C, ie eight booleans and eight uint8_ts as I described. Going from there to the further optimization of adding a `:1` to every `bool` field is a simple optimization. Reimplementing `Option` and the bitpacking of the discriminants is much more effort compared to the baseline implementation of using `Option`.



The alternative is `std::optional` which works exactly the same as Rust's `Option` (without the niche optimisation).

I'm not a C programmer but I imagine you could make something like `std::optional` in C using structs and macros and whatnot.



> You can implement the C equivalent manually of course

But you have to implement the C version manually as well.

It's not really a downside to Rust if it provides a convenient feature that you can choose to use if it fits your goals.

The use case you're describing is relatively rare. If it's an actual performance bottleneck then spending a little extra time to implement it in Rust doesn't seem like a big deal. I have a hard time considering this an "unfortunate detail" to Rust when the presence of the Option<_> type provides so much benefit in typical use cases.



> If a non-polymorphic, non-inline function may have its address taken (as a function pointer), either because it is exported out of the crate or the crate takes a function pointer to it, generate a shim that uses -Zcallconv=legacy and immediately tail-calls the real implementation. This is necessary to preserve function pointer equality.

If the legacy shim tail calls the Rust-calling-convention function, won't that prevent it from fixing any return value differences in the calling convention?



I just spent a bunch of time on inspect element trying to figure out how the section headings are set at an angle and (at least with Safari tools), I’m stumped. So how did he do this?


    h1, h2, h3, h4, h5, h6 {
        transform: skewY(-2deg) translate(-1rem, 0rem);
        transform-origin: top;
        font-style: italic;
        text-decoration-line: underline;
        text-decoration-color: goldenrod;
        text-underline-offset: 4%;
        text-decoration-thickness: .25ex;
    }


In contrast: "How Swift Achieved Dynamic Linking Where Rust Couldn't " (2019) [1]

On the one hand I'm disappointed that Rust still doesn't have a calling convention for Rust-level semantics. On the other hand the above article demonstrates the tremendous amount of work that's required to get there. Apple was deeply motivated to build this as a requirement to make Swift a viable system language that applications could rely on, but Rust does not have that kind of backing.

[1] https://faultlore.com/blah/swift-abi/

HN discussion: https://news.ycombinator.com/item?id=21488415



Notably these runtime costs only occur if you’re calling into another library. For calls within a given swift library, you don’t incur the runtime costs: size checks are elided (since size is known), calls can be inlined, generics are monomorphized… the costs only happen when you’re calling into code that the compiler can’t see.


Tangentially related: Is it currently possible to have interop between Go and Rust? I remember seeing someone achieving it with Zig in the middle but can't for the life of me find it. I have some legacy Rust code (what??) that I'm hoping to slowly port to Go piece by piece.


Yes, you can use CGO to call Rust functions using extern "C" FFI. I gave a talk about how we use it for GitHub code search at RustConf 2023 (https://www.youtube.com/watch?v=KYdlqhb267c) and afterwards I talked to some other folks (like 1Password) who are doing similar things.

It's not a lot of fun because moving types across the C interop boundary is tedious, but it is possible and allows code reuse.



If you want to call from Go into Rust, you can declare any Rust function as `extern "C"` and then call it the same way you would call C from Go. Not sure about going the other way.
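
Roughly what the Rust side looks like (function name hypothetical); the Go side then declares the same signature through cgo and links against the built library:

    // A C-ABI symbol that cgo (or any C caller) can use. Build the crate as a
    // staticlib or cdylib and link it into the Go binary.
    // (On edition 2024 the attribute is spelled #[unsafe(no_mangle)].)
    #[no_mangle]
    pub extern "C" fn add_i64(a: i64, b: i64) -> i64 {
        a + b
    }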


It's usually unwise to mix managed and unmanaged memory since the managed code needs to be able to own the memory it's freeing and moving, whereas the unmanaged code needs to reason about when memory is freed or moved. cgo (and other variants) let you mix FFI calls into unmanaged memory from managed code in Go, but you pay a penalty for it.

In language implementations where GC isn't shared by the different languages calling each other you're always going to have this problem. Mixing managed/unmanaged code is both an old idea and actively researched.

It's almost always a terrible idea to call into managed code from unmanaged code unless you're working directly with an embedded runtime that's been designed for it. And when you do, there's usually a serialization layer in between.



> It's usually unwise to mix managed and unmanaged memory

Broadly stated, you can achieve this by marking a managed object as a GC root whenever it's to be referenced by unmanaged code (so that it won't be freed or moved in that case) and adding finalizers whenever managed objects own or hold refcounted references over unmanaged memory (so that the unmanaged code can reason about these objects being freed). But yes, it's a bit fiddly.



Mixing managed and unmanaged code being an issue is simply not true in programming in general.

It may be an issue in Go or Java, but it just isn't in C# or Swift.

Calling `write` in C# on Unix is as easy as the following snippet and has almost no overhead:

    var text = "Hello, World!\n"u8;
    Interop.Write(1, text, text.Length);

    static unsafe partial class Interop
    {
        [LibraryImport("libc", EntryPoint = "write")]
        public static partial void Write(
            nint fd, ReadOnlySpan<byte> buffer, nint length);
    }

In addition, unmanaged->managed calls are also rarely an issue, both via function pointers and plain C exports if you build a binary with NativeAOT:
    public static class Exports
    {
        [UnmanagedCallersOnly(EntryPoint = "sum")]
        public static nint Sum(nint a, nint b) => a + b;
    }

It is indeed true that more complex scenarios may require some form of bespoke embedding/hosting of the runtime, but that is more of a peculiarity of Go and Java, not an actual technical limitation.


That's not the direction being talked about here. Try calling the C# method from C or C++ or Rust.

(I somewhat recently did try setting up mono to be able to do this... it wasn't fun.)



I haven't been looking for those because I don't work with .NET. Regardless, what you're linking still needs callers and callees to agree on calling convention and special binding annotations across FFI boundaries which isn't particularly interesting from the perspective of language implementation like the promises of Graal or WASM + GC + component model.


There is no free lunch. WASM just means another lowest common denominator abstraction for FFI. I'm also looking forward to WASM getting actually good so .NET could properly target it (because shipping WASM-compiled GC is really, really painful, it works acceptably today, but could be better). Current WasmGC spec is pretty much unusable by any language that has non-primitive GC implementation.

Just please don't run WASM on the server, we're already getting diminishing generational performance gains in hardware, no need to reduce them further.

The exports in the examples follow C ABI with respective OS/ISA-specific calling convention.



There are more managed langauges than Go, Java, and C#. Swift (and Objective C with ARC) are a bit different in that they don't use mark and sweep/generational GCs for automatic memory management so it's significantly less of an issue. Compare with Lua, Python, JS, etc where there's a serialization boundary between the two.

But I stand by what I said. It's generally unwise to mix the two, particularly calling unmanaged code from managed code.

I wouldn't say it's "not a problem" because there are very few environments where you don't pay some cost for mixing and matching between managed/unmanaged code, and the environments designed around it are built from first principles to support it, like .NET. More interesting to me are Graal and WASM (once GC support lands) which should make it much easier to deal with.



Except that is only true since those attributes were introduced in recent .NET versions, and it doesn't account for COM marshaling issues.

Plenty of .NET code is still using the old ways and isn't going to be rewritten, either for these attributes, or the new C#/WinRT, or the new Core COM interop, which doesn't support all COM use cases anyway.



Code written for .NET Framework is completely irrelevant to the conversation, since the conversation is not evaluating it.

You should treat it as dead and move on because it does not impact what .NET can or can’t do.

There is no point to bring up “No, but 10 years ago it was different”. So what? It’s not 2014 anymore.



My remarks also apply to modern .NET, as those improvements were introduced in .NET 6 and .NET 8 and require a code rewrite to adopt, unlike the old ways, which are also still available - something your blind advocacy happened to miss.

Very little code gets written from scratch unless we are talking about startups.



I was expecting this pedantic comment... If refcounting makes a language "managed", then C++ with shared_ptr is also "managed".

_______

The charitable interpretation is that OP was likely referring to the issues when calling into a language with a relocating GC (because you need to tell the GC not to move objects while you're working with them), which Swift is not.



Nope, because that is a library class without any language support.

The pedantic comment is synonymous with proper education instead of street urban myths.



It is a library class, because C++ is a rich enough language to implement automatic refcounting as a library class, by hooking into the appropriate lifecycle methods (copy ctor, dtor).


Swift has just as many concerns for its structs and classes passing across FFI, in terms of marshalling/unmarshalling and ensuring that ARC-unaware code either performs manual retain/release calls or adapts them to whatever memory-management mechanism the callee uses.

One of the comments here mentions that Swift has its own stable ABI, which exposes richer type system, so it does stand out in terms of interop (.NET 9 will add support for it natively (library evolution ABI) without having to go through C calls or C "glue" code on iOS and macOS, maybe the first non-Swift/ObjC platform to do so?).

Object pinning in .NET is only a part of equation and at this point far from the biggest concern (it just works, like it did 15 years ago, maybe it's a matter of fuss elsewhere?).



I have to use Rust and Swift quite a bit. I basically just landed on sending a byte array of serialized protobufs back and forth with cookie cutter function calls. If this is your full time job I can see how you might think that is lame, but I really got tired of coming back to the code every few weeks and not remembering how to do anything.


You have to go through C bindings, but FFI is very far from being Go's strongest suit (if we don't count Cgo), so if that's what interests you, it might be better to explore a different language.


Given that the current Rust compiler does aggressive inlining and then optimizes, is this worth the trouble? If the function being called is tiny, it should be inlined. If it's big, you're probably going to spend some time in it and the call overhead is minor.


Runtime functions (eg dyn Trait) can’t be inlined for one, so this would help there. But also if you can make calls cheaper then you don’t have to be so aggressive with inlining, which can help with code size and compile times.


Probably? A complex function that’s not a good fit for inlining will probably access memory a few times and those accesses are likely to be the bottlenecks for the function. Passing on the stack squeezes that bottleneck tighter — more cache pressure, load/stores, etc. If Rust can pass arguments optimally in a decent ratio of function calls, not only is it avoiding the several clocks of L1 access, it’s hopefully letting the CPU get to those essential memory bottlenecks faster. There are probably several percentage points of win here…? But I am drinking wine and not doing the math, so…


> Debuggers

Simply throw it in as a Cargo.toml flag and sidestep the worry. Yes, you do sometimes have to debug release code - but there you can use the not-quite-perfect debugging that the author mentions.

Also, why aren't we size-sorting fields already? That seems like an easy optimization, and can be turned off with a repr.



If I can suggest, the next big breakthrough in this space would be generalizing niche filling optimization. Every thread about this seems to fizzle out, to the point that I couldn't even find which one is the latest any more.

Today most data-carrying enums decay into the lowest common denominator of a 1-byte discriminant padded by 7 more bytes before any variant's data payload can begin. This really adds up when enums are nested, not just blowing out the calling register set but also making each data structure a swiss cheese of padding bytes.

Even a few more improvements in that space would have enormous impact, compounding with other optimizations like better calling conventions.



Did you want alignment sorting? In general the problem with things like that is the ideal layout is usually architecture and application specific - if my struct has padding in it to push elements onto different cache lines, I don't want the struct reordered.


> Did you want alignment sorting?

Yep. It will probably improve (to be measured) the 80%. Less memory means less bandwidth usage etc.

> if my struct has padding in it to push elements onto different cache lines, I don't want the struct reordered.

I did suggest having a repr for situations like yours. Something like #[repr(yeet)]. Optimizing for false sharing etc. is probably well within 5% of code that exists today, and is usually wrapped up in a library that presents a specific data structure.



There was an interesting approach to this in an experimental language some time ago:
   fn f1 (x, y) #-> // Use C calling conventions

   fn f2 (x, y) -> // use fast calling conventions
The first one was mostly for interacting with C code, and the compiler knew how to call each function.
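
For comparison, Rust's closest analogue today is the ABI string on a function item (a sketch; the unstable default convention is what the article is about):

    // The default ("Rust") convention is unstable and compiler-internal;
    // "C" is the stable one used for FFI.
    extern "C" fn callable_from_c(x: u32) -> u32 {
        x + 1
    }

    fn plain_rust(x: u32) -> u32 {
        x + 1
    }

    fn main() {
        assert_eq!(callable_from_c(1), 2);
        assert_eq!(plain_rust(1), 2);
    }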


Delphi, and I'm sure others, have had[1] this for ages:

When you declare a procedure or function, you can specify a calling convention using one of the directives register, pascal, cdecl, stdcall, safecall, and winapi.

As in your example, cdecl is for calling C code, while stdcall/winapi on Windows for calling Windows APIs.

[1]: https://docwiki.embarcadero.com/RADStudio/Sydney/en/Procedur...



Guess so. Unfamiliar with Zig. The point is that it's not an "all or nothing" strategy for a compilation unit.

Debugger writers may not be happy, but maybe lldb supports all conventions supported by llvm.



Very interesting but pretty quickly went over my head. I have a question that is slightly related to SIMD and LLVM.

Can someone explain simply where MLIR fits into all of this? Does it standardize more advanced operations across programming languages - such as linear algebra and convolutions?

Side-note: Mojo has been designed by the creator of LLVM and MLIR to prioritize and optimize vector hardware use, as a language that is similar to Python (and somewhat syntax compatible).



> Side-note: Mojo has been designed by the creator of LLVM and MLIR to prioritize and optimize vector hardware use, as a language that is similar to Python (and somewhat syntax compatible).

Are people getting paid to repeat this ad nauseum?



> Can someone explain simply where does MLIR fit into all of this?

It doesn't.

MLIR is a design for a family of intermediate languages (called 'dialects') that allow you to progressively lower high-level languages into low-level code.



interesting website - the title text is slanted.

Sometimes people who dig deep into the technical details end up being creative with those details.



True, creative, but usually in a quality-degrading way, like here (slanted text is harder to read, partly because the underline is too thick, and it takes more space) or like those poorly legible bg/fg color combinations.


The C calling convention kind of sucks. True, can't change the C calling convention, but that doesn't make it any less unfortunate.

We should use every available caller-saved register for arguments and return values, but in the traditional SysV ABI, we use only one register (sometimes two) for return values. If you return a struct Point3D { long x, y, z }, you spill to the stack even though we could damned well put Point3D in rax, rdi, and rsi.

There are other tricks other systems use. For example, if I recall correctly, in SBCL, functions set the carry flag on exit if they're returning multiple values. Wouldn't it be nice if we used the carry flag to indicate, e.g., whether a Result contains an error.



"sucks" is a strong word but with respect to return values, you're right. The C calling conventions, everywhere really, support what C supports - returning one argument. Well, not even that (struct returns ... nope). Kind of "who'd have thought" in C I guess. And then there's the C++ argument "just make it inline then".

On the other hand, memory spills happen. For SPARC, for example, the generous register space (windows) ended up with lots of unused regs in simple functions and a cache-busting huge stack size footprint, definitely so if you ever spilled the register ring. Even with all the mov in x86 (and there is always lots of it, at least in compiled C code) to rearrange data to "where it needed to be", it often ended up faster.

When you only look at the callee code (code generated for a given function signature), it's tempting to say "oh it'll definitely be fastest if this arg is here and that return there". You don't know the callers though. There's no guarantee the argument marshalling will end up "pass through" or the returns are "hot" consumed. Say, a struct Point { x: i32, y: i32, z: i32 } as arg/return; if the caller does something like mystruct.deepinside.point[i] = func(mystruct.deepinside.point[i]) in a loop then moving it in/out of regs may be overhead or even prevent vectorisation. But the callee cannot know. Unless... the compiler can see both and inline (back to the C++ excuse). Yes, for function call chaining javascript/rust style it might be nice/useful "in principle". But in practice only if the compiler has enough caller/callee insight to keep the hot working set "passthrough" (no spills).

The lowest hanging fruit on calling is probably to remove the "functions return one primitive thing" assumption that's ingrained in the C ABIs almost everywhere. For the rest? A lot of benchmarking and code generation statistics. I'd love to see more of that. Even if it's dry stuff.



> Well, not even that (struct returns ... nope).

C compilers actually pack small struct return values into registers:

https://godbolt.org/z/51q5se86s

It's just limited that on x86-64, GCC and Clang use up to two registers while MSVC only uses one.

Also, IMHO there is no such thing as a "C calling convention", there are many different calling conventions that are defined by the various runtime environments (usually the combination of CPU architecture and operating system). C compilers just must adhere to those CPU+OS calling conventions like any other language that wants to interact directly with the operating system.

IMHO the whole performance angle is a bit overblown though, for 'high frequency functions' the compiler should inline the function body anyway. And for situations where that's not possible (e.g. calling into DLLs), the DLL should expose an API that doesn't require such 'high frequency functions' in the first place.



> Also, IMHO there is no such thing as a "C calling convention", there are many different calling conventions [ ... ]

I did not say that. I said "C calling conventions" (plural). Rather aware of the fact that the devil is in the detail here ... heck, if you want it all, back in the bad old days, even the same compiler supported/used multiple ("fastcall" & Co, or on Win 3.x "pascal" for system interfaces, or the various ARM ABIs, ...).



Clang still has some alternative calling conventions via __attribute__((X)) for individual functions with a bunch of options[0], though none just extend the set of arguments passed via GPRs (closest seems to be preserve_none with 12 arguments passed by register, but it also unconditionally gets rid of all callee-saved registers; preserve_most is nice for rarely-taken paths, though until clang-17 it was broken on functions which returned things).

[0]: https://clang.llvm.org/docs/AttributeReference.html#calling-...
