| Good to know.
(Also sibling is right, I spaced on X-CF- being a header sent to CF customers’ servers. I don’t use Cloudflare, but CloudFront does the exact same thing.) |
| CRC32 is the same speed as an integer multiply, going all the way back to Nehalem (2008). 3 cycles latency, and you can start a new one each cycle (or more than one, on Zen 5). |
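For context, a minimal sketch of issuing the hardware CRC32 instruction from Rust via the SSE4.2 intrinsic (x86_64 only; hashing a single u64 word is an illustrative assumption, not anything from the article):

```rust
#[cfg(target_arch = "x86_64")]
fn crc32_hash_u64(seed: u32, word: u64) -> u32 {
    // _mm_crc32_u64 is the SSE4.2 CRC32 instruction discussed above,
    // available since Nehalem. Guard with a runtime feature check.
    if std::arch::is_x86_feature_detected!("sse4.2") {
        // Safety: we just verified SSE4.2 is available on this CPU.
        unsafe { core::arch::x86_64::_mm_crc32_u64(seed as u64, word) as u32 }
    } else {
        // A software CRC32 fallback would go here; omitted in this sketch.
        unimplemented!("software CRC32 fallback omitted")
    }
}
```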
| Tries are huge for Scrabble move generators and playing programs, of which I've written a few. The fastest structures are compressed tries with rotated representations of words. |
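For readers who haven't seen the "rotated representations" trick (as in Gordon's GADDAG), a hedged sketch of generating the rotated keys for one word; the separator character and details are my own assumptions, not the commenter's code:

```rust
/// For each split of a word into prefix+suffix (non-empty prefix), store the
/// reversed prefix, a separator, then the suffix. A move generator can then
/// extend leftwards from any anchor square by walking the trie of these keys.
fn rotated_forms(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    (1..=chars.len())
        .map(|i| {
            let mut s: String = chars[..i].iter().rev().collect();
            s.push('>'); // separator between reversed prefix and suffix
            s.extend(&chars[i..]);
            s
        })
        .collect()
}

// rotated_forms("care") -> ["c>are", "ac>re", "rac>e", "erac>"]
```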
| I wonder how much faster it gets when you shorten the maximum internal key length. If you re-map everything to like "cf-1", "cf-2", .. "cf-;" then you'd only need to hash like 4 characters. |
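As a rough illustration of that idea (the "cf-1"-style codes are the commenter's hypothetical): if every internal header name were remapped to a fixed 4-byte code, the whole key fits in a machine word and "hashing" is essentially a load.

```rust
/// Sketch only: pack a remapped 4-byte internal header name into a u32,
/// which can then be compared or used as a hash key directly.
fn short_key(name: &str) -> Option<u32> {
    let bytes = name.as_bytes();
    if bytes.len() != 4 {
        return None;
    }
    Some(u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}
```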
| I don't understand your point about the ROI. Let's say it's 40k a year for 5 years; that's 200k. That's a multi-year salary in Hungary/Poland (I have no idea if CF has offices there). |
| regex crate author here. From a quick look at trie-hard, my guess is that trie-hard can do lookups on short strings much more quickly than the regex crate can do matches on short strings. The regex crate has a fair bit of overhead for selecting the right matching engine and applying other optimizations appropriate to general purpose matching.
But the aho-corasick crate, and especially the specific automatons, would be interesting to try. There's much less overhead there. But, there are some differences:

* The contiguous NFA in aho-corasick has a lot more going on inside its "next transition" routine: https://github.com/BurntSushi/aho-corasick/blob/cd400ad792d6... This is primarily because an Aho-Corasick automaton isn't just a trie. It also has failure transitions to support unanchored search. You can search without respecting failure transitions (and thus treat it as a trie), but the code is still there to deal with failure transitions. So that may have deleterious effects.

* The contiguous NFA also specializes its transition function depending on the size of the node. It would be interesting to see whether these techniques would be useful for trie-hard to play with. But there is definitely a benefit to keeping one simple and fast code path for all node types. (Less branching.)

* The aho-corasick crate does also expose a DFA directly: https://docs.rs/aho-corasick/latest/aho_corasick/dfa/struct.... In theory this should do less work per transition than trie-hard, and so it would be very interesting to measure its perf when compared to trie-hard. The main downside is that it's a DFA and can use up a lot more memory and take longer to build. It seems like Cloudflare cares less about that in favor of optimizing read perf though, so it's not clear why they didn't go this route.

* The aho-corasick crate doesn't let you store arbitrary values. You only get back a PatternID when a match occurs. So to get equivalent functionality to trie-hard, you'd need a separate map of PatternID->YourValue. Since that isn't stored with the data in the trie, this alone might penalize you quite a bit with poorer CPU cache performance.

* trie-hard seems to provide some customization options for its representation, i.e., it lets you use a `u8` to store its transitions versus a `u128` or `u256`. This might also help with cache usage for a small enough trie. |
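For anyone wanting to try the comparison described above, a minimal sketch of an anchored aho-corasick DFA plus a separate PatternID -> value side table. The builder options reflect the aho-corasick 1.x API as I understand it, and the header names and values are made up:

```rust
use aho_corasick::{AhoCorasick, AhoCorasickKind, Anchored, Input, MatchKind, StartKind};

fn main() {
    // Hypothetical internal header names and their values (not Cloudflare's real list).
    let names = ["cf-internal-a", "cf-internal-b", "cf-internal-common"];
    let values = [1u32, 2, 3]; // side table indexed by PatternID

    let ac = AhoCorasick::builder()
        .kind(Some(AhoCorasickKind::DFA))  // force the DFA backend
        .start_kind(StartKind::Anchored)   // trie-like: match only from position 0
        .match_kind(MatchKind::LeftmostLongest)
        .build(names)
        .unwrap();

    let haystack = "cf-internal-b";
    let input = Input::new(haystack).anchored(Anchored::Yes);
    if let Some(m) = ac.find(input) {
        // Require a full-string match: an anchored search still accepts prefixes.
        if m.end() == haystack.len() {
            let id = m.pattern().as_usize();
            println!("matched pattern {} -> value {}", id, values[id]);
        }
    }
}
```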
| This is a gratifying read (along with the article) because I've been working on closely related things, and had come to the tentative conclusion that taking Aho-Corasick, anchoring it (so no failure transitions), and laying it out in contiguous memory, basically results in Liang's packed trie from TeX's hyphenation algorithm.
trie-hard seems to be effectively that, but with some clever masking to get mileage out of having a limited byte vocabulary to deal with. The DFA might pull ahead, but I wouldn't just bet on that. Branch predictors don't like lookup tables much, and the memory locality hit is small but real: more of a packed trie fits in a cache line than would a DFA. You seem to have landed on similar conclusions here: beating a packed trie for this specific use case is going to be tough. I also like that they hit on the same popcount-and-shift technique I've used in a data structure for UTF-8 character sets: https://github.com/mnemnion/runeset. I admittedly haven't benchmarked this much, correctness first and all, but seeing that Cloudflare has seen good numbers for something closely related reinforces the few tests I have run. As a text-matching connoisseur, you may find the runeset interesting. It turns out to be hard to search for papers on "UTF-8 character set", so if you're aware of any prior art here I'd love a heads-up. It's fundamentally a generalization of using u64 bitmasks for sets of ASCII, which is as old as the hills at this point. |
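For readers unfamiliar with the popcount-and-shift trick mentioned above, a minimal sketch of a sparse node using it; this is illustrative only, not trie-hard's or runeset's actual layout:

```rust
/// A sparse node: a 64-bit membership mask plus a dense array of children.
/// The child index is the rank (popcount) of the mask bits below the queried
/// position, so no per-node 64-entry table is needed.
struct SparseNode<T> {
    mask: u64,
    children: Vec<T>, // children.len() == mask.count_ones()
}

impl<T> SparseNode<T> {
    fn get(&self, byte: u8) -> Option<&T> {
        debug_assert!(byte < 64); // a remapped/ASCII-subset alphabet is assumed
        let bit = 1u64 << byte;
        if self.mask & bit == 0 {
            return None;
        }
        // Rank: number of set bits strictly below `byte` gives the child index.
        let idx = (self.mask & (bit - 1)).count_ones() as usize;
        self.children.get(idx)
    }
}
```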
| Shouldn't this line: > "using regular expressions clocks in as taking about twice as long as the hashmap-based solution." be written as the opposite? That the RE takes half as long/is twice as fast? |
| Looking at the source code [1], they have a discriminated union with 3 variants representing the possible combinations of marking a valid word and leading to more nodes (excluding the case where it is neither, which is impossible). So the node for the 'o' in "do" should be a `SearchOrLeaf` storing "do", its corresponding value (the `T`; in a set this would be empty) and the `SearchNode` containing the successors, which should only contain a 't' leading to "dot".
[1]: https://github.com/cloudflare/trie-hard/blob/3d8163ca2d9208c... |
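A hedged sketch of the three variants described above; the names are illustrative, and the real crate packs the successor set into bitmask-indexed storage rather than a map:

```rust
use std::collections::BTreeMap;

enum Node<T> {
    // Ends a word and has no successors, e.g. the 't' in "dot".
    Leaf(&'static str, T),
    // Has successors but does not itself end a word, e.g. the 'd' in "dot".
    Search(SearchNode),
    // Ends a word *and* has successors, e.g. the 'o' in "do" when "dot" is
    // also present: it stores "do", its value, and a SearchNode with 't'.
    SearchOrLeaf(&'static str, T, SearchNode),
}

struct SearchNode {
    // byte of the next character -> index of the child node in a flat arena
    children: BTreeMap<u8, usize>,
}
```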
| Neat optimization! Would it have been feasible to spend a bit of cpu/memory and tag headers as internal upon the request construction? That way filtering on output is trivial. |
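A sketch of that suggestion (field names are hypothetical): tag each header with an `internal` flag where it is constructed, and stripping on output becomes a single retain pass with no name lookups.

```rust
struct Header {
    name: String,
    value: String,
    internal: bool, // set once, at the point where the header is added
}

fn strip_internal(headers: &mut Vec<Header>) {
    // Filtering on output is trivial: no per-request name matching needed.
    headers.retain(|h| !h.internal);
}
```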
| They only measure the CPU utilization of clear_internal_headers; they don't show the impact of the change on overall runtime performance. The latter is what I was referring to. |
| It’s a bit wild to me that one needs to add so many http headers internally that they need to get stripped off.
The “trie-hard” crate is a solution to the layers of cruft at CF |
| This solution reminds me of an old blog post, where two members of the F# team at Microsoft battled back and forth on making the best solver for some word game. Cloudflare's solution here looks a lot like the second-to-last solution; the final post involved collapsing the tree nodes into a 2D array of u32s for maximum cache-hits. It was a really neat solution!
...and amazingly, it's still up from 2010: https://lorgonblog.wordpress.com/2010/02/21/dawg-gone/ |
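For illustration, a hedged sketch of that kind of layout: the trie collapsed into a flattened 2D transition table of u32s, one row per node and one column per symbol. This is not the blog post's actual code, just the general shape of the technique:

```rust
/// Table-driven trie/DFA: transitions[node * alphabet_size + symbol] holds the
/// next node id, with 0 reserved to mean "no transition".
struct TableTrie {
    alphabet_size: usize,
    transitions: Vec<u32>,
    terminal: Vec<bool>, // terminal[node] == true if the node ends a word
}

impl TableTrie {
    fn step(&self, node: u32, symbol: usize) -> Option<u32> {
        let next = self.transitions[node as usize * self.alphabet_size + symbol];
        (next != 0).then_some(next)
    }
}
```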
- An entire separate dictionary or other data structure.
- One single header containing all internal metadata.
- All headers have a prefix, and the internal ones start with I and the external ones start with E.
- All internal headers start with “CFInt”.
I would not have come up with a scheme in which headers in a particular list are internal. (What if someone else uses this name? What if something forgets to sanitize? What if different simultaneously running programs disagree in the list? What if the Connection header names a Cloudflare-internal header? What if the set-difference algorithm is annoyingly slow?)
The web is already full of obnoxiously ambiguous in-band signaling and header naming, and I find it bizarre that a company with Cloudflare’s scale uses such a tedious and error-prone mechanism internally.
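For illustration, a fixed-prefix scheme like the "CFInt" one suggested above reduces stripping to a single predicate per header. The exact prefix spelling here is an assumption:

```rust
const INTERNAL_PREFIX: &str = "cfint-";

/// Header names are ASCII and case-insensitive, so compare the leading bytes
/// ignoring ASCII case.
fn is_internal(header_name: &str) -> bool {
    let name = header_name.as_bytes();
    let prefix = INTERNAL_PREFIX.as_bytes();
    name.len() >= prefix.len() && name[..prefix.len()].eq_ignore_ascii_case(prefix)
}
```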