| Good to know.
(Also sibling is right, I spaced on X-CF- being a header sent to CF customers’ servers. I don’t use Cloudflare, but CloudFront does the exact same thing.) |
| CRC32 is the same speed as an integer multiply, going all the way back to Nehalem (2008). 3 cycles latency, and you can start a new one each cycle (or more than one, on Zen 5). |
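For context, a minimal sketch of issuing the hardware CRC32 instruction from Rust via the SSE4.2 intrinsic (x86_64 only; hashing a single u64 word is an illustrative assumption, not anything from the article):

```rust
#[cfg(target_arch = "x86_64")]
fn crc32_hash_u64(seed: u32, word: u64) -> u32 {
    // _mm_crc32_u64 is the SSE4.2 CRC32 instruction discussed above,
    // available since Nehalem. Guard with a runtime feature check.
    if std::arch::is_x86_feature_detected!("sse4.2") {
        // Safety: we just verified SSE4.2 is available on this CPU.
        unsafe { core::arch::x86_64::_mm_crc32_u64(seed as u64, word) as u32 }
    } else {
        // A software CRC32 fallback would go here; omitted in this sketch.
        unimplemented!("software CRC32 fallback omitted")
    }
}
```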
| Tries are huge for Scrabble move generators and playing programs, of which I've written a few. The fastest structures are compressed tries with rotated representations of words. |
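For readers who haven't seen the "rotated representations" trick (as in Gordon's GADDAG), a hedged sketch of generating the rotated keys for one word; the separator character and details are my own assumptions, not the commenter's code:

```rust
/// For each split of a word into prefix+suffix (non-empty prefix), store the
/// reversed prefix, a separator, then the suffix. A move generator can then
/// extend leftwards from any anchor square by walking the trie of these keys.
fn rotated_forms(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    (1..=chars.len())
        .map(|i| {
            let mut s: String = chars[..i].iter().rev().collect();
            s.push('>'); // separator between reversed prefix and suffix
            s.extend(&chars[i..]);
            s
        })
        .collect()
}

// rotated_forms("care") -> ["c>are", "ac>re", "rac>e", "erac>"]
```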
| I wonder how much faster it gets when you shorten the maximum internal key length. If you re-map everything to like "cf-1", "cf-2", .. "cf-;" then you'd only need to hash like 4 characters. |
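As a rough illustration of that idea (the "cf-1"-style codes are the commenter's hypothetical): if every internal header name were remapped to a fixed 4-byte code, the whole key fits in a machine word and "hashing" is essentially a load.

```rust
/// Sketch only: pack a remapped 4-byte internal header name into a u32,
/// which can then be compared or used as a hash key directly.
fn short_key(name: &str) -> Option<u32> {
    let bytes = name.as_bytes();
    if bytes.len() != 4 {
        return None;
    }
    Some(u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}
```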
| I don't understand your point about the ROI. Let's say it's 40k a year for 5 years; that's 200k. That's a multi-year salary in Hungary/Poland (I have no idea if CF has offices there). |
| regex crate author here. From a quick look at trie-hard, my guess is that trie-hard can do lookups on short strings much more quickly than the regex crate can do matches on short strings. The regex crate has a fair bit of overhead for selecting the right matching engine and applying other optimizations appropriate to general purpose matching.
But the aho-corasick crate, and especially the specific automatons, would be interesting to try. There's much less overhead there. But, there are some differences:

* The contiguous NFA in aho-corasick has a lot more going on inside its "next transition" routine: https://github.com/BurntSushi/aho-corasick/blob/cd400ad792d6... This is primarily because an Aho-Corasick automaton isn't just a trie. It also has failure transitions to support unanchored search. You can search without respecting failure transitions (and thus treat it as a trie), but the code is still there to deal with failure transitions. So that may have deleterious effects.

* The contiguous NFA also specializes its transition function depending on the size of the node. It would be interesting to see whether these techniques would be useful for trie-hard to play with. But there is definitely a benefit to keeping one simple and fast code path for all node types. (Less branching.)

* The aho-corasick crate does also expose a DFA directly: https://docs.rs/aho-corasick/latest/aho_corasick/dfa/struct.... In theory this should do less work per transition than trie-hard, and so it would be very interesting to measure its perf when compared to trie-hard. The main downside is that it's a DFA and can use up a lot more memory and take longer to build. It seems like Cloudflare cares less about that in favor of optimizing read perf though, so it's not clear why they didn't go this route.

* The aho-corasick crate doesn't let you store arbitrary values. You only get back a PatternID when a match occurs. So to get equivalent functionality to trie-hard, you'd need a separate map of PatternID->YourValue. Since that isn't stored with the data in the trie, this alone might penalize you quite a bit with poorer CPU cache performance.

* trie-hard seems to provide some customization options for its representation, i.e., it lets you use a `u8` to store its transitions versus a `u128` or `u256`. This might also help with cache usage for a small enough trie. |
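For anyone wanting to try the comparison described above, a minimal sketch of an anchored aho-corasick DFA plus a separate PatternID -> value side table. The builder options reflect the aho-corasick 1.x API as I understand it, and the header names and values are made up:

```rust
use aho_corasick::{AhoCorasick, AhoCorasickKind, Anchored, Input, MatchKind, StartKind};

fn main() {
    // Hypothetical internal header names and their values (not Cloudflare's real list).
    let names = ["cf-internal-a", "cf-internal-b", "cf-internal-common"];
    let values = [1u32, 2, 3]; // side table indexed by PatternID

    let ac = AhoCorasick::builder()
        .kind(Some(AhoCorasickKind::DFA))  // force the DFA backend
        .start_kind(StartKind::Anchored)   // trie-like: match only from position 0
        .match_kind(MatchKind::LeftmostLongest)
        .build(names)
        .unwrap();

    let haystack = "cf-internal-b";
    let input = Input::new(haystack).anchored(Anchored::Yes);
    if let Some(m) = ac.find(input) {
        // Require a full-string match: an anchored search still accepts prefixes.
        if m.end() == haystack.len() {
            let id = m.pattern().as_usize();
            println!("matched pattern {} -> value {}", id, values[id]);
        }
    }
}
```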
| This is a gratifying read (along with the article) because I've been working on closely related things, and had come to the tentative conclusion that taking Aho-Corasick, anchoring it (so no failure transitions), and laying it out in contiguous memory, basically results in Liang's packed trie from TeX's hyphenation algorithm.
trie-hard seems to be effectively that, but with some clever masking to get mileage out of having a limited byte vocabulary to deal with. The DFA might pull ahead, but I wouldn't just bet on that. Branch predictors don't like lookup tables much, and the memory locality hit is small but real: more of a packed trie fits in a cache line than would a DFA. You seem to have landed on similar conclusions here: beating a packed trie for this specific use case is going to be tough. I also like that they hit on the same popcount-and-shift technique I've used in a data structure for UTF-8 character sets: https://github.com/mnemnion/runeset. I admittedly haven't benchmarked this much, correctness first and all, but seeing that Cloudflare has seen good numbers for something closely related reinforces the few tests I have run. As a text-matching connoisseur, you may find the runeset interesting. It turns out to be hard to search for papers on "UTF-8 character set", so if you're aware of any prior art here I'd love a heads-up. It's fundamentally a generalization of using u64 bitmasks for sets of ASCII, which is as old as the hills at this point. |
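For readers unfamiliar with the popcount-and-shift trick mentioned above, a minimal sketch of a sparse node using it; this is illustrative only, not trie-hard's or runeset's actual layout:

```rust
/// A sparse node: a 64-bit membership mask plus a dense array of children.
/// The child index is the rank (popcount) of the mask bits below the queried
/// position, so no per-node 64-entry table is needed.
struct SparseNode<T> {
    mask: u64,
    children: Vec<T>, // children.len() == mask.count_ones()
}

impl<T> SparseNode<T> {
    fn get(&self, byte: u8) -> Option<&T> {
        debug_assert!(byte < 64); // a remapped/ASCII-subset alphabet is assumed
        let bit = 1u64 << byte;
        if self.mask & bit == 0 {
            return None;
        }
        // Rank: number of set bits strictly below `byte` gives the child index.
        let idx = (self.mask & (bit - 1)).count_ones() as usize;
        self.children.get(idx)
    }
}
```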
| Shouldn't this line: > "using regular expressions clocks in as taking about twice as long as the hashmap-based solution." be written as the opposite? That the RE takes half as long/is twice as fast? |
| Looking at the source code [1], they have a discriminated union with 3 variants representing the possible combinations of marking a valid word and leading to more nodes (excluding the case where it is neither, which is impossible). So the node for the 'o' in "do" should be a `SearchOrLeaf` storing "do", its corresponding value (the `T`; in a set this would be empty) and the `SearchNode` containing the successors, which should only contain a 't' leading to "dot".
[1]: https://github.com/cloudflare/trie-hard/blob/3d8163ca2d9208c... |
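A hedged sketch of the three variants described above; the names are illustrative, and the real crate packs the successor set into bitmask-indexed storage rather than a map:

```rust
use std::collections::BTreeMap;

enum Node<T> {
    // Ends a word and has no successors, e.g. the 't' in "dot".
    Leaf(&'static str, T),
    // Has successors but does not itself end a word, e.g. the 'd' in "dot".
    Search(SearchNode),
    // Ends a word *and* has successors, e.g. the 'o' in "do" when "dot" is
    // also present: it stores "do", its value, and a SearchNode with 't'.
    SearchOrLeaf(&'static str, T, SearchNode),
}

struct SearchNode {
    // byte of the next character -> index of the child node in a flat arena
    children: BTreeMap<u8, usize>,
}
```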
| Neat optimization! Would it have been feasible to spend a bit of cpu/memory and tag headers as internal upon the request construction? That way filtering on output is trivial. |
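A sketch of that suggestion (field names are hypothetical): tag each header with an `internal` flag where it is constructed, and stripping on output becomes a single retain pass with no name lookups.

```rust
struct Header {
    name: String,
    value: String,
    internal: bool, // set once, at the point where the header is added
}

fn strip_internal(headers: &mut Vec<Header>) {
    // Filtering on output is trivial: no per-request name matching needed.
    headers.retain(|h| !h.internal);
}
```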
| They only measure the CPU utilization of clear_internal_headers; they don't show the impact of the change on overall runtime performance. The latter is what I was referring to. |
| It’s a bit wild to me that one needs to add so many http headers internally that they need to get stripped off.
The “trie-hard” crate is a solution to the layers of cruft at CF |
| This solution reminds me of an old blog post, where two members of the F# team at Microsoft battled back and forth on making the best solver for some word game. Cloudflare's solution here looks a lot like the second-to-last solution; the final post involved collapsing the tree nodes into a 2D array of u32s for maximum cache-hits. It was a really neat solution!
...and amazingly, it's still up from 2010: https://lorgonblog.wordpress.com/2010/02/21/dawg-gone/ |
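For illustration, a hedged sketch of that kind of layout: the trie collapsed into a flattened 2D transition table of u32s, one row per node and one column per symbol. This is not the blog post's actual code, just the general shape of the technique:

```rust
/// Table-driven trie/DFA: transitions[node * alphabet_size + symbol] holds the
/// next node id, with 0 reserved to mean "no transition".
struct TableTrie {
    alphabet_size: usize,
    transitions: Vec<u32>,
    terminal: Vec<bool>, // terminal[node] == true if the node ends a word
}

impl TableTrie {
    fn step(&self, node: u32, symbol: usize) -> Option<u32> {
        let next = self.transitions[node as usize * self.alphabet_size + symbol];
        (next != 0).then_some(next)
    }
}
```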
- An entire separate dictionary or other data structure.
- One single header containing all internal metadata.
- All headers have a prefix, and the internal ones start with I and the external ones start with E.
- All internal headers start with “CFInt”.
I would not have come up with a scheme in which headers in a particular list are internal. (What if someone else uses this name? What if something forgets to sanitize? What if different simultaneously running programs disagree in the list? What if the Connection header names a Cloudflare-internal header? What if the set-difference algorithm is annoyingly slow?)
The web is already full of obnoxiously ambiguous in-band signaling and header naming, and I find it bizarre that a company with Cloudflare’s scale uses such a tedious and error-prone mechanism internally.
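For illustration, a fixed-prefix scheme like the "CFInt" one suggested above reduces stripping to a single predicate per header. The exact prefix spelling here is an assumption:

```rust
const INTERNAL_PREFIX: &str = "cfint-";

/// Header names are ASCII and case-insensitive, so compare the leading bytes
/// ignoring ASCII case.
fn is_internal(header_name: &str) -> bool {
    let name = header_name.as_bytes();
    let prefix = INTERNAL_PREFIX.as_bytes();
    name.len() >= prefix.len() && name[..prefix.len()].eq_ignore_ascii_case(prefix)
}
```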