（评论）

（评论）
(comments)

原始链接: https://news.ycombinator.com/item?id=41230344

提供的文本讨论了 Spice 和 Rayon 等并行处理库的使用，重点关注“开销”的概念。开销是指并行化所需的额外时间，这会导致计算时间增加而不是减少。作者认为，在处理较小的工作负载单元时，并行化可能会增加开销，导致挂钟时间增加而不是减少。它建议读者在解释使用这些库带来的性能改进时要小心，特别是在评估延迟（响应时间）与吞吐量（每单位时间完成的任务数）时。对于小型、短持续时间的任务，最小化延迟是关键，而对于较大的任务，提高吞吐量至关重要。最后，文本强调了在项目中实现特定库之前充分理解它们的重要性，特别是考虑测试覆盖率、回归测试和并发管理等因素，以确保可靠性并避免因实施选择不当而产生的潜在问题。

Oh this is super interesting. I was only aware of the two first while writing Spice. I’ll definitely look into the two last as well. Thanks for sharing!

I haven’t read through the code in detail but I can tell you “sub-nanosecond overhead” is misleading and marketing fluff. On first look, the measure seems to be some convoluted “time per thing” where the number of threads is far far smaller than the number of “thing”s

Author here.

I knew that some people would react negatively to the term, but I can assure the intention is for you to have a better understanding of exactly how and when you should use Spice and Rayon. I would recommend reading the benchmark document: https://github.com/judofyr/spice/blob/main/bench/README.md.

What people typically do when comparing parallel code is to only compare the sequential/baseline with a parallel version running at all threads (16). Let's use the numbers for Rayon that I got for the 100M case:

- Sequential version: 7.48 ns.

- Rayon: 1.64 ns.

Then they go "For this problem Rayon showed a 4.5x speed-up, but uses 16 threads. Oh no, this is a bad fit." That's very true, but you don't learn anything from that. How can I use apply this knowledge to other types of problems?

However, if you run the same benchmark on varying number of threads you learn something more interesting: The scheduler in Rayon is actually pretty good at giving work to separate threads, but the overall work execution mechanism has a ~15 ns overhead. Despite this being an utterly useless program we've learnt something that we can apply later on: Our smallest unit of work should probably be a bit bigger than ~7 ns before we reach for Rayon. (Unless it's more important for use to reduce overall latency at the cost of the throughput of the whole system.)

In comparison, if you read the Rayon documentation they will not attempt to give you any number. They just say "Conceptually, calling join() is similar to spawning two threads, one executing each of the two closures. However, the implementation is quite different and incurs very low overhead": https://docs.rs/rayon/latest/rayon/fn.join.html.

(Also: If I wanted to be misleading I would say "Spice is twice as fast as Rayon since it gets 10x speed-up compared to 4.5x speed-up")

Thanks for the answer, this part is particularly interesting indeed:

> Despite this being an utterly useless program we've learnt something that we can apply later on: Our smallest unit of work should probably be a bit bigger than ~7 ns before we reach for Rayon.

That's a very interesting project.

The big limitation I see with the current approach is that the usability of the library is much worth than what Rayon offers.

The true magic of Rayon is that you just replace `iter()` with `par_iter()` in you code and voilà! now you have a parallel execution. But yes it has some overhead, so maybe Rayon could try and implement this kind of scheduling as an alternative so that people pick what works best for their use-case.

Too late to edit so I'll put it here:

> the usability of the library is much worse than what Rayon

I'm a little bit ashamed to see that this fairly upvoted comment of mine has such an stupid English mistake in it…

You can just divide the speed-up by the number of cores, and that gives you the parallelization efficiency.

I've seen systems that can achieve 99% efficiency on thousands on nodes for real useful applications that involve non-trivial synchronization. Now that is an impressive feat.

Sure, there is probably some extra latency to get everything running, but for a sufficiently long program run, that is all irrelevant.

> You can just divide the speed-up by the number of cores, and that gives you the parallelization efficiency.

The linked benchmark document does this. 82% for 4 cores on Spice with a small workload, and 69% for 16 cores on Spice with a large workload. Compared to about 25% for Rayon on 4 cores with a small workload and 88% for Rayon on 16 cores with a large workload.

> Sure, there is probably some extra latency to get everything running, but for a sufficiently long program run, that is all irrelevant.

The entire point of the linked benchmark README.md is to deal with insufficiently long program runs. Spice is an attempt to allow parallelization of very small amounts of work by decreasing fixed overhead. Perhaps such a thing is not useful but that doesn't prevent it from being interesting.

Adding more cores doesn't change the time per operation. Your graphs are grossly wrong. What you should have done is drop the nanoseconds and just take the total execution time. Whenever you're writing 1.64ns, you should have written 164ms.

The overhead should be measured as a percentage versus a theoretical base line such as perfect linear speedup. You haven't shown the ideal scenario for each core count, so how are we supposed to know how much overhead there really is?

The single core scenario is 363ms and linear speedup for 32 cores gives us 11.3ms. Your benchmark says you needed 38ms. This means you achieved 31% of the theoretical performance of the CPU, which mind you is pretty good, especially since nobody has benchmarked what is actually possible while loading all cores by running 32 single threaded copies of the original program, but you're advertising a meaningless "sub nanosecond" measure here.

At your link it also says:

"Spice shows subpar scalability: The speed-up of using 16 threads was merely ~11x"

If that is true, then "Spice" is suitable only for small tasks, which can be completed at most in milliseconds, which can benefit from its low overhead, while for any bigger tasks something better must be used.

Author here.

I’d maybe phrase it as “Spice is not optimal” instead of “Spice is not suited for”, but yes, that’s the conclusion for this benchmark.

I’m hoping/assuming that for a more typical case (more CPU work being done) Spice will scale better, but I haven’t done the benchmark yet.

IMO this is the least convincing part of the benchmark though, since it's uninterpretable without an optimal baseline. You don't know how much of this is because of Spice and how much is because of how the task scales. (This is acknowledged as future work.)

That is the ecological niche of Rayon (cited) as well, isn’t it? You need to process a lot of things (thousands to millions), you want to parallelize that processing as much as possible (couple dozen of cores tops), you want to not get killed by scheduling overhead. So you account for the per-thing overhead.

Most engineers aren't precise with throughput vs latency. Ideally you should report both figures (and anything else salient in performance-sensitive spaces), but it's less a lie and more an extremely commonplace mode of thinking and speaking.

Moreover, I think that mode of thought comes from the fact that most programming problems don't have hard latency bounds, so throughput dominates the conversation. If I'm spending 10us on average while handling a 10ms soft deadline, every single component can easily be occasionally 100x more expensive (latency) without me caring, and if it buys me another 1us on average (throughput) then I'll save gobs of money in compute.

Kind of, but not in a way that matters for this scheduler (at least, I posit, not usually). If you have a single task with a wall-clock time in the 10-1000ms range and want to make it "faster," yes, you probably want to improve the latency. However, the latency introduced by this experimental library is negligible on the timescales being considered. With that in mind, a throughput improvement is a better description of the sorts of engineering efforts you need to accomplish your goals.

From a slightly different perspective, one thing the library does on top of handling small workloads well is handling _finely grained_ workloads. That's super important from a usability point of view, since you can express the parallelism in the most natural way and let the library handle the fact that it's hard to schedule that task (e.g., most schedulers suck as soon as you're talking about work items on the order of a single cache line). Just being able to have a convenient abstraction when writing a long-running job is also a fantastic feature. In that case, we would still care about throughput, not latency.

Separately, though definitely more rarely, there are jobs which are sub-second but where it's hard to batch many of them at once. If you're forced to execute many and care about the time till completion (latency, again at a much longer timescale), being able to make each one much faster is a big deal.

I really think that first paragraph is a commonplace occurrence though. Something important is slow (measured in "human" blinking timescales), so you slap a better scheduler and some parallelism at it and start work on the next ticket. Ripgrep, at some level (it has lots of other technical accomplishments; I don't want to let this comment give the mistaken impression that I'm demeaning the project) is useful precisely because it throws parallelism at sub-second problems.

Most engineers aren't precise with throughput vs latency.

Anyone making claims about performance should know the difference.

This isn't even about either, it's lying about overhead by not counting it correctly.

If someone asks what your cable bill is and you say it's only $2.50 a month because you have 32 TVs, no one is going to say that makes sense.

It's not "lying" to use the words most likely to correctly get your point across to your target audience. Maybe they could have communicated better (for a seemingly dead project, IMO they put enough time in regardless), but it's not lying.

> Anyone making claims about performance should know the difference.

They probably do know the difference. Knowing the difference isn't the thing you're quibbling with.

> If someone asks what your cable bill is and you say it's only $2.50 a month because you have 32 TVs, no one is going to say that makes sense.

Sure...because when asking about your cable bill it's unambiguous that you want the total number of dollars. The whole reason there's any issue at all here is that some of their language is ambiguous if you don't consider the target audience and don't analyze their charts or read the descriptions of those charts. Picking an analogy which resonates precisely because of the lack of ambiguity doesn't say a whole lot about the actual problem at hand.

The whole point of overhead is that it is a base constant on top of whatever you do. The whole point of overhead is to eventually be amortized in some way. When someone asks what the overhead is, it is a lie to try to factor amortization in because it's implied that it will be done somewhere anyway.

I think we're maybe talking past each other with definitions.

I think something is a lie when somebody intentionally tries to mislead (even when communicating "facts" if the intent is to maliciously shape the recipient mind). I think when they intentionally try to convey the "right" ideas (as I think this author is doing -- again, using words the way you think most of your peers use them is usually a good thing), that's not a lie. There are gray areas in between. I don't think this is a gray area, except insofar as I don't know all the facts perfectly.

You still seem to disagree with me though. I'd really like to understand why. I have a couple theories, which I'll list, but I'm interested in any insights you have:

1. You disagree with my description of what "lying" is.

2. You think the author is less deserving of trust than I do (since they seem to be profiting almost nothing other than a tiny bit of publicity, since they did outline in their README exactly what they meant, with photos, and since the title didn't stoop to clickbait, my default position is that they likely weren't too malicious, but you might have different priors and might also disagree with my opinions of those conflicts of interest).

> the whole point of overhead ...

That's perhaps the last pertinent thing we disagree about. I'll throw out my opinion, and I sincerely want to know your thoughts on the matter:

An extremely commonplace occurrence is that something is slow and you want to speed it up (measuring perceived latency). If you can magically do that by adding parallelism then it's a good day at work. Unfortunately, parallelism can easily add "overhead" -- parallelizing small or finely grained problems usually increases wall-clock timing, usually substantially.

In that mode of operation, the whole problem you're trying to solve is reducing a certain latency measurement, making it clear from context what "overhead" means. Other definitions don't even enter into the equation.

Zooming out a bit, I've heard that phrasing a ton of times in many different companies. Sure, it's better if people know to speak more precisely (and I think we're in agreement that a few more performance characteristics would be nice to know here), but it's hard for me to imagine a world were my default perception of somebody is that they're lying when they're using the exact same words and meanings as scores of other good engineers. _Maybe_ they are, but that's by no means my default assumption.

This seems like a lot of writing that has nothing to do with the topic. If someone wants to know the overhead of something and you tell them the overhead divided by an arbitrary number already, that's lying.

Yesterday, he posted on Reddit and I expressed some concern with the benchmarks. The benchmarks are claiming 0.36 ns of overhead per call, but only the computing function. There is a second thread running doing the schedule that the overhead numbers don't include. It seems pretty clear he's running on a hyperthreaded 8 core machine (so 16 threads), I was guessing 3 Ghz, so that literally a single cycle of overhead.

Each extra thread adds more overhead from lock contention. At 16 threads overhead is up to 3.6 ns (so 10 times more). I'm guessing, but that would mean the 0.36 ns of overhead included an uncontested lock? That's not possible. There's some other weirdness going on in the benchmark data too. So either I'm not understanding what he's actually timing or maybe there is a bug in the benchmark code.

Also, if you multiply all the values out, I think he's timing in milliseconds (when runtime is calculated and converted to millis, they come out as whole numbers). Don't most benchmarkers have better precision than that? Maybe he's just using `time prog` and the data is just really dirty. Or maybe he's just choosing really, really bad metrics that totally useless for this (this is probably correct, just not sure if there are more issues).

I don't understand why it's unbelievable. They walk through the required logic to perform scheduling and dispatch; it looks incredibly simple and extremely suited to hiding on an OoO. The benchmark is summing down a binary tree which is going to give you a lot of space to hide what's at most a handful of instructions. There's obviously no lock contention because, like, look at the algorithm, what locks contending on what?

AFAICT this is just a really cool and really practical algorithm for microthreads with near zero cost per semantic thread.

> There's obviously no lock

What? The threadpool has a shared mutex accessed by each worker thread, which is used for pushing work on heartbeat and for dequeuing work. https://github.com/judofyr/spice/blob/167deba6e4d319f96d9d67...

"Adding more threads to the system will not make your program any slower" is an insane claim to be making for any lightweight task scheduling system, and trivially not true looking at this: if you have 1000 threads as workers and degenerate tree program than each worker will take turns serializing themselves through the mutex. Things like these are massive red flags to anyone reading the README.

The README covers this. Its argument is persuasive. If your point is that the constant is badly tuned for theoretical 1000 core machines that don't exist, I'm not sure I care. A 100ns stall at most every 100us becoming more likely when you approach multiple hundreds of cores is hardly a disaster. In the context of the comment I replied to, the difference between 8 and 16 workers is literally zero, as the wakeups are spaced so the locks will never conflict.

Actually, if you did have a 32k core machine somehow with magical sufficiently-uniform memory for microthreading to be sensible for it, I think it's not even hard to extend the algorithm to work with that. Just put the workers on a 3D torus and only share orthogonally. It means you don't have perfect work sharing, but I'm also pretty sure it doesn't matter.

Did you read the README at all? I thought it was extremely precise about what exactly is meant by the claim in the title. Not much room for misunderstanding.

The title is totally fine. There is no title with zero room for misinterpretation. All I took from it is that it's a library with extreme low latency by some kind of measure.. and then I went to the readme to see exactly what that measure was. Very straightforward.

I'm not terribly familiar with this space, but I do like the concurrency model presented here.

I think the README here is very well written, and I have a good idea of what's going on just from reading it, but there are a few areas where I'm left scratching my head. Thankfully the code is fairly easy to read.

One thing I noticed was that when I build in Debug mode `zig build`, I got:

    Baseline,3.92809172
    Spice 1 thread,19.1012624

When I built with ReleaseSafe mode `zig build --release=safe`, I got:

    Baseline,3.264224280000001
    Spice 1 thread,3.78043278

So Spice does have a notable degradation of performance if not built for release. The baseline implementation does not share the same degree of degradation when built with the Debug zig build mode.

(zig 0.13.0)

Want to state off the bat this this project is awesome and huge kudos to the author for spending their time, attention, and energy 1) working diligently to get this working at all and 2) sharing it with the broader HN community, who are generally known to by hyper-critical to a pedantic degree and/or overly pessimistic (cough the initial Docker project Show HN thread cough)

I also really appreciate that the author recognizes the limits of their own project, which preemptively addresses most of the usual snark.

> Lack of tests: Spice contains a lot of gnarly concurrent code, but has zero testing coverage. This would have be improved before Spice can be responsibly used for critical tasks.

Testing correctness of execution for critical tasks is one thing, but I would expect a library which implements "gnarly concurrent code" to at least have regression tests — what guarantee is there to an end-user that functionality which exists in a working state today might not break tomorrow due to a subtle yet nefarious regression?

sqlite has 590 times as much test code and test scripts as it does raw c source code [0]; this fact, along with its stability and portability, is one of the numerous reasons why it has proliferated to become the defacto embedded database used across the planet. While we're comparing apples to oranges in this contrived example, the general point still stands — regression tests beget stability and confidence in a project.

In epics where I work, if we _must_ defer baseline regression tests, we usually create a follow-up ticket inside of the same epic to at least write them before feature/epic launch, usually.

[0]: https://www.sqlite.org/testing.html

> Spice is primarily a research project. Read along to learn more about it, but if you're considering using it in production you should be aware of its many limitations.

Ah, I missed that upon first read. In that case, that caveat/limitation is definitely justified.

> (cough the initial Docker project Show HN thread cough)

Docker was largely met with enthusiasm here when it was launched. I believe you must refer to how Dropbox was received — famously negatively, initially.

Per the description this uses busy waiting in the workers to get to nanosecond level latencies. I wonder if anyone has a perspective on how realistic busy waiting is in large applications with tens of thousands of tasks? Maybe it works if the tasks are async (i.e. not thread based) so that you only have N waiters where N is the size of the executor’s thread pool? In any case energy consumption of such an architecture would be higher.

Related, I’ve been interested a while whether there’s a faster way for a producer of work to have a consumer wake up without resorting to busy waiting, possibly by running the consumer in the producer time slice.

Also related, I’ve wondered if it’s possible to have a user space FUTEX_WAKE operation that would halve the typical penalty of waking up a consumer (to just the consumer).

But it's not very cooperative, as in tasks yielding to each other. They mostly cooperate by letting some tasks to be given to other threads, and not all the time, but once in a heartbeat. The scheduling happens rarely, so its amortized cost is low.

sure, but that's all details, and there are other cooperative scheduling models which aren't excessively eager. this isn't to be a downer on this implementation, or other implementations, but just generally pointing at cooperative advantages, of which there are many, with one particular implicit and major downside, which is starvation when the cooperation contract is not adhered to.

（评论） (comments)

（评论）
(comments)