What's wrong with doing it in two passes? N iterations each doing 2 operations is exactly the same cost as 2*N iterations each doing 1 operation. Because multiplication is commutative.
It’s more like 2NM where M is loading the data from disk/memory. One pass is 2N+M.

Why go to the store and back twice to buy two things instead of buying two things in one trip? ;p
I think you meant 2N+2M vs 2N+M, but yes, that’s the point: not reading twice is cheaper because, unlike in traditional big-O analysis, compute is cheap but data access is very expensive.
> this does not mean that the median can be found accurately by your method.

You can do dynamic sampling: e.g. take double the samples and see which decimal place in your result budges. Adjust.
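One way to read that idea as code (a rough sketch; the function name, starting sample size, and absolute-tolerance stopping test are mine, not from the comment):

```python
import random

def sampled_median(data, tol=0.01, k0=1_000, seed=0):
    """Estimate the median from random samples, doubling the sample size
    until the estimate moves by less than `tol` between rounds."""
    rng = random.Random(seed)
    k, prev = k0, None
    while True:
        sample = sorted(rng.choices(data, k=min(k, len(data))))
        est = sample[len(sample) // 2]
        if prev is not None and abs(est - prev) <= tol:
            return est                              # the result stopped budging
        if k >= len(data):
            return sorted(data)[len(data) // 2]     # fall back to the exact median
        prev, k = est, k * 2
```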
The metrics were about speed. And I was decades past my last internship at the time in question. But, as is so often the case, more than one of us may have been reinventing pretty similar wheels. :)
That's wild, a bit of a polymath by computer science standards. I know him from his template metaprogramming fame, and here he is shifting from programming languages to algorithms.
I like the idea of using superadditivity, but in a proof you cannot creatively extend T to the reals; this should be fixed.

Here is the slightly cleaned-up proof I had in mind when I posted my hints below:
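(For reference, the standard way to keep the argument entirely over the integers is the textbook recurrence plus induction; the sketch below may differ from the superadditivity route intended here.)

```latex
% Median-of-medians, CLRS-style, with T defined on integers only.
% T(n) = worst-case cost of selection on n elements, a > 0 a constant.
\[
  T(n) \;\le\; T\!\big(\lceil n/5 \rceil\big) + T\!\big(\lfloor 7n/10 \rfloor + 6\big) + a n
  \qquad (n > 140),
\]
% with T(n) = O(1) for n <= 140.  Induction: assume T(m) <= d m for all integers m < n; then
\[
  T(n) \;\le\; d\lceil n/5 \rceil + d\big(\lfloor 7n/10 \rfloor + 6\big) + a n
       \;\le\; \tfrac{9}{10}\, d n + 7 d + a n \;\le\; d n ,
\]
% which holds whenever d >= 20a and n >= 140, so T(n) = O(n) without extending T to the reals.
```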
Well spotted! In Python 2, / on two integers already did floor division, but in Python 3 the / and // operators are distinct. Indexing a list with a float raises an exception, I believe.
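A quick illustration of the difference (`xs` is a made-up example list):

```python
xs = [3, 1, 4, 1, 5]

mid = len(xs) / 2        # Python 3: / always yields a float -> 2.5
# sorted(xs)[mid]        # TypeError: list indices must be integers or slices, not float

mid = len(xs) // 2       # // is floor division and yields an int -> 2
print(sorted(xs)[mid])   # 3, the median
```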
You could also use one of the streaming algorithms which allow you to compute approximations for arbitrary quantiles without ever needing to store the whole data in memory.
Most quantile sketches (and t-digest in particular) do not assume stationarity.

Note also that there are other bounds of importance, and each has trade-offs.

t-digest gives you a strict bound on memory use and no dynamic allocation, but it does not have guaranteed accuracy bounds. It gives very good accuracy in practice and is very good at relative errors (i.e. the 99.999th-percentile estimate is between the 99.9985%-ile and the 99.9995%-ile).

The KLL sketch gives you a strict bound on memory use, but is limited to absolute quantile error (i.e. the 99.99%-ile is between the 99.9%-ile and the 100%-ile). This is useless for extrema, but fine for medians. Cormode's extension to the KLL sketch gives you a strict bound on relative accuracy, but n log n memory use.

Exponential histograms give you strict bounds on memory use, no allocation, and strict bounds on relative error in measurement space (i.e. 99.99%-ile ± some % error). See the log-histogram [1] for some simple code and HdrHistogram [2] for a widely used version. Variants of this are used in Prometheus.

The exponential histogram is, by far, the best choice in most practical situations, since an answer that says 3.1 ± 0.2 seconds is just so much more understandable for humans than a bound on quantile error. I say this as the author of the t-digest.

[1] https://github.com/tdunning/t-digest/blob/main/core/src/main...
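A toy version of the exponential-histogram idea (class name, bucketing rule, and the 5% default are mine, not taken from log-histogram or HdrHistogram): bucket each value by its logarithm so every bucket carries a fixed relative error, then read quantiles off the counts.

```python
import math

class LogHistogram:
    """Toy exponential histogram: log-spaced buckets give every reported
    quantile a bounded *relative* error in measurement space."""

    def __init__(self, rel_err=0.05, min_value=1e-9):
        self.base = (1 + rel_err) / (1 - rel_err)   # bucket width in log space
        self.min_value = min_value
        self.counts = {}                            # bucket index -> count
        self.total = 0

    def add(self, x):
        idx = math.floor(math.log(max(x, self.min_value), self.base))
        self.counts[idx] = self.counts.get(idx, 0) + 1
        self.total += 1

    def quantile(self, q):
        target = q * self.total
        seen = 0
        for idx in sorted(self.counts):
            seen += self.counts[idx]
            if seen >= target:
                # report the bucket midpoint, so the answer reads like "3.1 +/- 5%"
                return (self.base ** idx + self.base ** (idx + 1)) / 2
        return float("nan")
```

Memory stays bounded because the number of occupied buckets grows only with the logarithm of the data's dynamic range, not with the number of samples.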
Here's a simple one I've used before. It's a variation on FAME (Fast Algorithm for Median Estimation) [1].

You keep an estimate of the current quantile value, and then for each element in your stream, you either increment (if the element is greater than your estimate) or decrement (if the element is less than your estimate) by fixed "up-step" and "down-step" amounts. If your increment and decrement steps are equal, you should converge to the median. If you shift the ratio of increment and decrement steps, you can estimate any quantile.

For example, say that your increment step is 0.95 and your decrement step is 0.05. When your estimate converges to a steady state, you must be hitting greater values 5% of the time and lesser values 95% of the time, hence you've estimated the 95th percentile.

The tricky bit is choosing the overall scale of your steps. If your steps are very small relative to the scale of your values, it will converge very slowly and not track changes in the stream. You don't want your steps to be too large either, because they determine the precision. The FAME algorithm has a step where you shrink your step size when a data value is near your estimate (causing the step size to auto-scale down).

[1]: http://www.eng.tau.ac.il/~shavitt/courses/LargeG/streaming-m...
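A rough sketch of that update rule (the step size, shrink factor, and the exact shrink test are illustrative choices of mine, not the constants from the FAME paper):

```python
def stream_quantile(stream, q=0.5, step=1.0, shrink=0.999):
    """Running quantile estimate: move up by q*step when a sample is above
    the estimate, down by (1-q)*step when it is below; shrink the step when
    a sample lands near the estimate, roughly as FAME does."""
    it = iter(stream)
    est = next(it)                    # seed with the first element
    for x in it:
        if x > est:
            est += step * q           # q=0.95: big moves up, taken rarely
        elif x < est:
            est -= step * (1 - q)     # q=0.95: small moves down, taken often
        if abs(x - est) < step:       # sample near the estimate: refine the scale
            step *= shrink
    return est

import random
data = (random.gauss(0.0, 1.0) for _ in range(200_000))
print(stream_quantile(data, q=0.95, step=0.5))   # typically somewhere near 1.64,
                                                 # the 95th percentile of N(0, 1)
```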
No, it's O(N+M), where N is the number of strings and M is the sum of the lengths of the strings. Maybe your radix sort has some problems?

I evaluated various sorts for strings as part of my winning submission to https://easyperf.net/blog/2022/05/28/Performance-analysis-an... and found https://github.com/bingmann/parallel-string-sorting to be helpful. For a single core, the fastest implementation among those in parallel-string-sorting was a radix sort, so my submission included a radix sort based on that one. The only other contender was multi-key quicksort, which is notably not a comparison sort (i.e. a general-purpose string comparison function is not used as a subroutine of multi-key quicksort).

In either case, you end up operating on something like an array of structs containing a pointer to the string, an integer offset into the string, and a few cached bytes from the string, and in either case I don't really know what problems you expect to have if you're dealing with null-terminated strings.

A very similar radix sort is included in https://github.com/alichraghi/zort which includes some benchmarks, but I haven't done the work to have it work on strings or arbitrary structs.
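To make the O(N+M) claim concrete, here is a toy MSD radix sort over byte strings (nothing like the tuned implementations linked above, but the same counting idea): each byte of each input is inspected at most once.

```python
def msd_radix_sort(strings, depth=0):
    """Toy MSD radix sort for byte strings.  Each byte is looked at no more
    than once overall, giving O(N + M) where N is the number of strings and
    M is the total number of bytes.  Recursion depth equals the longest
    common prefix, which is where the stack-depth discussion below comes in."""
    if len(strings) <= 1:
        return list(strings)
    buckets = [[] for _ in range(257)]        # bucket 0: strings exhausted at this depth
    for s in strings:
        key = 0 if len(s) <= depth else s[depth] + 1
        buckets[key].append(s)
    out = buckets[0]                          # shorter strings sort first
    for bucket in buckets[1:]:
        if bucket:
            out.extend(msd_radix_sort(bucket, depth + 1))
    return out

print(msd_radix_sort([b"banana", b"band", b"apple", b"ape"]))
# [b'ape', b'apple', b'banana', b'band']
```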
> For MSD, do you get a performance hit from putting your stack on the heap (to avoid overflow on recursion)?

I'm not sure. My original C++ implementation, which is for non-adversarial inputs, puts the array containing histograms and indices on the heap but uses the stack for a handful of locals, so it would explode if you passed the wrong thing. The sort it's based on in the parallel-string-sorting repo works the same way. Oops!

> cache misses & memory pressure due to potentially long input elements (and thus correspondingly large stack)

I think this should be okay? The best of these sorts try to visit the actual strings infrequently. You'd achieve worst-case cache performance by passing a pretty large number of strings with a long, equal prefix. Ideally these strings would share the bottom ~16 bits of their address, so that they could evict each other from cache when you access them. See https://danluu.com/3c-conflict/

> Where do you find these? :-)

There's a list of cool sorts here: https://github.com/scandum/quadsort?tab=readme-ov-file#varia... and they are often submitted to HN.
I'm wondering what heap approach can solve that problem, as I can't think of any. Hopefully OP got a link to the thesis.

The n log n approach definitely works though.
It's quicksort with a modification to select the median during the process. I feel like this is a good way to approach lots of "find $THING in list" questions.
It's quicksort, but neglecting a load of the work that quicksort would normally have to do. Instead of recursing twice, leading to O(n log n) behaviour, it's only recursing once.
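A bare-bones version of that idea, i.e. quickselect with random pivots (a sketch, not the article's code):

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element of xs (k is 0-based).  Like quicksort,
    but only the side that contains index k is recursed into, which is why the
    expected cost is O(n) rather than O(n log n)."""
    pivot = random.choice(xs)
    lo = [x for x in xs if x < pivot]
    eq = [x for x in xs if x == pivot]
    hi = [x for x in xs if x > pivot]
    if k < len(lo):
        return quickselect(lo, k)
    if k < len(lo) + len(eq):
        return pivot
    return quickselect(hi, k - len(lo) - len(eq))

xs = [7, 1, 5, 3, 9, 2, 8]
print(quickselect(xs, len(xs) // 2))   # 5, the median of xs
```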
I found this part of the code quite funny:

Hell, why not just use 2^140 instead of 5 as the cut-off point, then? This way you'd have constant time median finding for all arrays that can be represented in any real-world computer! :) [1]

[1] According to https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/
> The whole point of big-O notation is to abstract the algorithm out of real-world limitations so we can talk about arbitrarily large input.

Except that there is no such thing as "arbitrarily large storage", as my link in the parent comment explained: https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/

So why would you want to talk about arbitrarily large input (where the input is an array stored in memory)? As I understood it, this big-O notation is intended to have some real-world usefulness, is it not? Care to elaborate on what that usefulness is, exactly? Or is it just a purely fictional notion in the realm of ideas with no real-world application? And if so, why bother studying it at all, except as a mathematical curiosity written in some mathematical pseudo-code rather than a programming or engineering challenge written in a real-world programming language?

Edit: s/pretending/intended/
| The "Split the array into subarrays of length 5, now sorting all of the arrays is O(n) instead of O(n log n)" feels like cheating to me |
They're not sorting all the arrays?

Later (I was going to delete this comment, but for posterity: I misread; sorting the lists, not the contents of the lists, sure).
It’s unambiguously O(n); there’s no lg n anywhere to be seen. It may be O(n) with a somewhat larger constant factor, but the whole point of big-O analysis is that those constants don’t matter.
Actually, lots of algorithms "feel" like cheating until you understand what you were not looking at (fast matrix multiplication, fast Fourier transforms...).
> If I'm understanding correctly, the median is actually guaranteed to be in the larger of the two pieces of the array after partitioning.

Only in the first iteration. There’s a good chance it will be in the smaller one in the second iteration, for example. So your analysis is a bit too harsh, but probably good enough for a proof that it’s O(n) on average.

> Heavily skewed distributions would perform pretty badly

That’s why I used the weasel words “real world data” ;-)

I also thought about mentioning that skew can be computed streaming (see for example https://www.boost.org/doc/libs/1_53_0/doc/html/accumulators/...), but even if you have that, there still are distributions that will perform badly.
The median-of-medians comes at a cost in execution time. Chances are, sorting each five-element chunk is a lot slower than even running a sophisticated random number generator.
Groups of 3 and 4 elements will fail to prove the complexity is linear.

You can still use 3 or 4, but with slight modifications: https://arxiv.org/abs/1409.3600. For example, for 4 elements, it's advised to take the lower median for the first half and the upper median for the second half. Then the complexity will be linear.
Sure, naming is hard, but avoid "l", "I", "O", "o".

Very short variable names (including "ns" and "n") are always something of a disturbance when reading code, especially when the variable lives longer than one screen of code – one has to memorize the meaning. They sometimes have a point, e.g. in mathematical code like this one. But variables like "l" and "O" are bad for a further reason: they cannot easily be distinguished from the digits "1" and "0". See also the Python style guide: https://peps.python.org/pep-0008/#names-to-avoid
It was a struggle until I figured out that knowledge of the precision and range of our data helped. These were timings, expressed in integer milliseconds. So they were non-negative, and I knew the 90th percentile was well under a second.
As the article mentions, finding a median typically involves something akin to sorting. With the above knowledge, bucket sort becomes available, with a slight tweak in my case. Even if the samples were floating point, the same approach could be used as long as an integer (or even fixed point) approximation that is very close to the true median is good enough, again assuming a known, relatively small range.
The idea is to build a dictionary where the keys are the timings in integer milliseconds and the values are a count of the keys' appearance in the data, i.e., a histogram of timings. The maximum timing isn't known, so to ensure the size of the dictionary doesn't get out of control, use the knowledge that the 90th percentile is well under a second and count everything over, say, 999ms in the 999ms bin. Then the dictionary will be limited to 2000 integers (keys in the range 0-999 and corresponding values) - this is the part that is different from an ordinary bucket sort. All of that is trivial to do in a single pass, even when distributed with MapReduce. Then it's easy to get the median from that dictionary / histogram.
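A sketch of that capped histogram (the helper names and the 999 ms cap parameter are mine):

```python
from collections import Counter

def add_timings(hist, timings_ms, cap=999):
    """Accumulate a histogram of integer-millisecond timings, clamping
    everything above `cap` into the top bin (the per-shard map step)."""
    hist.update(min(int(t), cap) for t in timings_ms)
    return hist

def quantile_from_hist(hist, q=0.5):
    """Walk the (at most cap+1) bins in order until q of the mass is seen."""
    target = q * sum(hist.values())
    seen = 0
    for ms in sorted(hist):
        seen += hist[ms]
        if seen >= target:
            return ms
    return None

shard_a = add_timings(Counter(), [12, 45, 45, 130, 7, 2500])
shard_b = add_timings(Counter(), [9, 45, 88])
print(quantile_from_hist(shard_a + shard_b, 0.5))   # 45, the median of the merged timings
```

Per-shard Counters merge with a plain +, which is what makes the approach MapReduce-friendly.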