
Original link: https://news.ycombinator.com/item?id=39742916

This thread discusses the challenges of making the LLVM just-in-time (JIT) compiler more usable in PostgreSQL. Currently, generating custom code for every query causes performance problems through repeated compilation and limits the ability to share resources between connections. While LLVM supports caching and reusing compiled queries, the main issue lies with PostgreSQL itself: its JIT implementation has to inject direct memory addresses, making the generated code specific to each unique query and process. As a possible direction, it is suggested to investigate making PostgreSQL behave more like Microsoft SQL Server (MSSQL) or Oracle, whose shared-memory and multithreaded architectures allow efficient use of an LLVM JIT compiler. Commenters also share personal experience with query performance in earlier systems, including Oracle 9i and the limitations of its shared query cache. The discussion ends with a call to continue the conversation at an upcoming developer conference.


Original text
Hacker News
I wrote a new JIT compiler for PostgreSQL (pinaraf.info)
278 points by mattashii 1 day ago | 35 comments

> There is one way to make the LLVM JIT compiler more usable, but I fear it’s going to take years to be implemented: being able to cache and reuse compiled queries.

Actually, it's been implemented in LLVM for years :) https://github.com/llvm/llvm-project/commit/a98546ebcd2a692e...



Yeah, well, sorry, I should have been more explicit here: the issue is with PostgreSQL, not LLVM. The JIT compiler has to inject direct memory addresses, making the generated code specific to your query and process.


Interesting, because we store relocatable objects. And process symbols can be resolved by name if you really want. It might be yet another performance trade-off though.
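The name-based resolution mentioned here can be illustrated with a small sketch (Python's ctypes is used purely for illustration; the real mechanism in LLVM's JIT is its runtime linker). The idea: cached code carries a symbol *name* rather than a baked-in address, and each process binds the name to its own address at load time. `strlen` stands in for any process symbol.

```python
import ctypes

# Resolve a symbol from the current process by name at runtime instead
# of baking a fixed address into generated code. Relocatable JIT objects
# work on the same principle: the cached code stores the symbol name,
# and each process binds it to its own address.
# Note: CDLL(None) works on POSIX systems; Windows has no equivalent handle.
current_process = ctypes.CDLL(None)
strlen = current_process.strlen          # looked up by name, per process
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t

print(strlen(b"postgres"))               # 8
```

The performance trade-off the parent mentions is visible even here: the lookup costs a hash-table probe at bind time, instead of a direct jump to a hard-coded address.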


Since PG uses one process per connection and the LLVM JIT code is process-specific, the code can't be shared amongst all connections to the DB.

Plans themselves suffer from this since they are in memory data structures not designed to be shared amongst different processes.

DBs like MSSQL don't have this issue since they use a single process with threads, which is also why MSSQL can handle more concurrent connections without an external pooler. MSSQL can also serialize plans to a non-process-specific representation and store them in the DB for things like plan forcing.



> Plans themselves suffer from this since they are in memory data structures not designed to be shared amongst different processes.

Oracle uses a process-per-connection model as well (at least on Linux), and they are able to share execution plans across connections. They put all the plans into the "global" shared memory.
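A minimal sketch of that publish/attach pattern, using Python's `multiprocessing.shared_memory` as a stand-in for a database's global shared segment (the segment name and the "plan" payload are invented for illustration; this only shows the mechanism, not anything Oracle or PostgreSQL actually does):

```python
import os
from multiprocessing import Process, Queue, shared_memory

# One process publishes a serialized plan into a named shared-memory
# segment; a second process attaches to the same segment by name and
# reads it back, mimicking plan sharing across per-connection processes.

def read_plan(name, size, out):
    seg = shared_memory.SharedMemory(name=name)   # attach by name
    out.put(bytes(seg.buf[:size]))
    seg.close()

if __name__ == "__main__":
    payload = b"SeqScan(t) -> Filter(x=1)"
    name = f"pg_plan_demo_{os.getpid()}"          # unique segment name
    seg = shared_memory.SharedMemory(name=name, create=True,
                                     size=len(payload))
    seg.buf[:len(payload)] = payload              # "publish" the plan

    out = Queue()
    child = Process(target=read_plan, args=(name, len(payload), out))
    child.start()
    plan_seen_by_child = out.get()
    child.join()
    print(plan_seen_by_child == payload)          # True

    seg.close()
    seg.unlink()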



Looks like you can change that with THREADED_EXECUTION to make it act like it does on Windows with a single process and threads:

>On UNIX, starting with Oracle Database 12c Release 2 (12.2), Oracle Database can use an operating system process or an operating system thread to implement each background task such as database writer (DBW0), log writer (LGWR), shared server process dispatchers, and shared servers.

>The use of operating system threads instead of processes allow resource sharing and reduce resource consumption.

>On Windows, each background process is implemented as a thread inside a single, large process.

https://docs.oracle.com/en/database/oracle/oracle-database/1...

Processes on Windows are typically much more expensive than on Unix, so threads have always been preferred over multiple processes; perhaps that's why MSSQL only has that option, with an almost fully recreated internal process model that you can list, kill, etc.

Even Oracle says it helps with resource usage, even on Unix/Linux. It also looks like Oracle has long had some kind of shared-server mode, where it basically has a built-in pooler to keep the actual OS process count down rather than 1:1 like PG.

Sharing plans can obviously be done using shared memory, but it's not as simple as just creating some C++ object model (which I believe is what PG has internally) for the plan: it must have a process-agnostic data format that is then executed, probably by deserializing it into an executable model from shared memory. Fully JITted code is even trickier compared to just a set of logical plan operations. With threads you just share executable code.
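One way to get such a process-agnostic format is to replace raw pointers with offsets into a flat buffer, so the same bytes are valid no matter what address they are mapped at. A toy sketch (the node layout here is invented for illustration):

```python
import struct

# Plan nodes reference each other by *offset into the buffer* rather
# than by raw pointer, so any process can map the same bytes at any
# address and still walk the plan.
NODE = struct.Struct("<iI")   # tag (int32), child offset (uint32)
NO_CHILD = 0xFFFFFFFF         # sentinel marking a leaf node

def write_plan(nodes):
    """nodes: list of (tag, child_index or None); returns a flat buffer."""
    buf = bytearray()
    offsets = [i * NODE.size for i in range(len(nodes))]
    for tag, child in nodes:
        child_off = offsets[child] if child is not None else NO_CHILD
        buf += NODE.pack(tag, child_off)
    return bytes(buf)

def walk(buf, off=0):
    """Follow child offsets from the node at `off`, yielding tags."""
    while off != NO_CHILD:
        tag, off = NODE.unpack_from(buf, off)
        yield tag

# A tiny three-node chain: SeqScan(3) -> Filter(2) -> Result(1).
plan = write_plan([(1, None), (2, 0), (3, 1)])
print(list(walk(plan, off=2 * NODE.size)))   # [3, 2, 1]
```

The point is that `walk` never dereferences an absolute pointer, so the buffer could live in shared memory mapped at a different address in each backend.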



Way back on Oracle 9i, we had a mystery stall problem. We couldn’t saturate the network, the CPU, or the fiber channel links. We were stuck at ~50% and stumped. Some fuckery was going on and we had to call in a professional.

Turned out 9i could only run queries that currently resided in the query cache, and some idiot (who was now my boss) had fucked up our query builder code so that we were getting too many unique queries. Not enough bind variables.

So it’s clear Oracle was using a shared cache back then, but like other people here, I’m scratching my head over how this would work with Postgres’s flavor of MVCC. Maybe share query plans when the transaction completes?

I feel like that would get you 90% of the way but with some head of queue nastiness.



naive q: could one process perform the compilation, store the output in postgresql shmem, then each pg worker copy that code locally ?

can plans be shared ? I might be mistaken but I thought each worker backend can (e.g.) be assigned to a different partition in a partitioned table, with very different table indexes, statistics, etc.



>naive q: could one process perform the compilation, store the output in postgresql shmem, then each pg worker copy that code locally ?

It would probably be easiest to just compile the code to a shared library on disk; it would then get memory-mapped in all processes using it, like any normal shared library, with each process rebinding it to actual memory addresses through the usual dynamic-linking process.

>can plans be shared ? I might be mistaken but I thought each worker backend can (e.g.) be assigned to a different partition in a partitioned table, with very different table indexes, statistics, etc.

I don't know the details but the PG docs say this about parallel plans, which seems to say each process has its own individual part of the plan:

>Because each worker executes the parallel portion of the plan to completion, it is not possible to simply take an ordinary query plan and run it using multiple workers. Each worker would produce a full copy of the output result set, so the query would not run any faster than normal but would produce incorrect results. Instead, the parallel portion of the plan must be what is known internally to the query optimizer as a partial plan; that is, it must be constructed so that each process that executes the plan will generate only a subset of the output rows in such a way that each required output row is guaranteed to be generated by exactly one of the cooperating processes.

https://www.postgresql.org/docs/current/parallel-plans.html



Author here. Thanks for submitting my article on hackernews. I'll do my best to answer any question.


Thanks for the blog post, it's always nice to see performance improvements in Postgres! I'm curious: how much time is spent for LLVM on some real queries and how is LLVM configured (i.e., which passes, which back-end optimization, etc.)? In our experience [1], LLVM can be reasonably fast when optimized for compile time without optimizations and the -O0 back-end pipeline, but obviously still 10-20x slower compared to other approaches.

Also, in our experience, copy-and-patch-generated code tends to be quite slow and hard to optimize (we tried some things [2; Sec. 5], but there's still quite a gap (see Fig. 3 for a database evaluation)). Do you have some numbers on the run-time slowdown compared to LLVM? Any future plans for implementing multi-tiering, so dynamically switching from the quickly compiled code to LLVM-optimized code?

[1]: https://home.in.tum.de/~engelke/pubs/2403-cgo.pdf
[2]: https://home.in.tum.de/~engelke/pubs/2403-cc.pdf



Is there a fundamental difference between copy and patch with C and what compilers do when they target intermediate representations? It seems to me that traditional compilation methods are also "copy and patch" but with another intermediate language than C.


I think conceptually, there is no real difference. In the end, a compiler outputting machine code uses very small stencils, like "mov _ _", which are rather simple to patch.

Practically though, it's an enormous difference, as the copy and patch approach re-uses the years of work going into clang / gcc supporting platforms, optimizations for different platforms and so on. The approach enables a much larger pool of people ("People capable of writing C" vs "People capable of writing assembly / machine code") to implement very decent JIT compilers.
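To make the stencil idea concrete, here is a toy copy-and-patch sketch: a pre-built machine-code template with a "hole" that gets patched with a constant before execution. The template bytes are hand-written x86-64 with a hand-picked hole offset (real copy-and-patch derives its stencils from clang-compiled C and finds the holes via relocations), so the execution step is guarded to x86-64 and may still fail on systems that forbid writable+executable mappings:

```python
import ctypes, mmap, platform

# A pre-compiled machine-code template with a hole for a runtime
# constant: "mov eax, imm32; ret" on x86-64. JIT compilation is then
# just memcpy plus patching the hole, exactly the copy-and-patch idea.
STENCIL = b"\xb8\x00\x00\x00\x00\xc3"    # mov eax, <hole>; ret
HOLE_OFFSET = 1                          # the imm32 starts at byte 1

def jit_constant(value):
    """Copy the stencil, patch the hole, return (callable, mapping)."""
    code = bytearray(STENCIL)
    code[HOLE_OFFSET:HOLE_OFFSET + 4] = value.to_bytes(4, "little")
    buf = mmap.mmap(-1, len(code),
                    prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
    buf.write(bytes(code))
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    fn = ctypes.CFUNCTYPE(ctypes.c_int)(addr)
    return fn, buf                       # caller must keep buf alive

if platform.machine() == "x86_64":
    fn, _buf = jit_constant(42)
    print(fn())                          # 42
```

The patching step is where the "larger pool of people" argument bites: nobody had to hand-assemble anything beyond this six-byte demo; with compiler-generated stencils even these bytes come for free.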



The real difference is in the possible optimizations. If you consider the full scope of JIT compilation in for instance a web browser or the JVM, you could use copy and patch as a tier 0 compiler, and once really hot paths are identified, trigger a complete compiler with all the optimizer steps. Some optimizations are more complicated to implement with copy-patch, esp. if you can't use all the tricks described in the paper (for instance they use the ghccc calling convention to get a much finer register allocation, but from the documentation I don't think it's going to make it for PostgreSQL).

But as you say, yes, this enables people capable of writing C and reading assembly (or you have to be perfect and never have to go into gdb on your compiled code), and it makes the job so much faster and easier... Writing several machine code emitters is painful, and having the required optimization strategies for each ISA is quickly out of reach.



Is copy-and-patch really a new idea, or just a new name for an old idea?

When I learned programming (and interpreters particularly) around 2010, I thought it was well-known that you could memcpy chunks of executable code that your compiler produced if you were careful ... the major gotcha was that the NX bit was just starting to take off at the time (Even on Linux, most people still assumed 32-bit distros and might be surprised that their CPUs even supported 64-bit. At some point I ended up with a netbook that didn't support 64-bit code at all ...).

Unfortunately I ended up spending too much time on the rest of the code to actually look deeply enough into it to build something useful.



It is an old idea with a new name. For example, QEMU originally worked like this [1], and they already used relocations to patch in constants before they later moved to TCG for higher-quality code.

[1]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/fu...



Would be a great topic for pgconf.eu in June (pgcon moved to Vancouver). Too bad the CfP is over, but there's the "unconference" part (but the topics are decided at the event, no guarantees).


Did you mean pgconf.dev in May (which has the unconference), or pgconf.eu in October (which doesn't have an unconference, but the CfP will open sometime in the - hopefully near - future)?


Yeah, I meant May. Sorry :-( Too many conferences around that time, I got confused.

That being said, submitting this into the pgconf.eu CfP is a good idea too. It's just that it seems like a nice development topic, and the pgcon unconference was always a great place to discuss this sort of stuff. There are a couple more topics in the JIT area, so having a session or two to talk about those and how to move that forward would be beneficial.



Not a question, but I love this. I’m eager to see its evolution.


Nice post, thanks! Do I read it right that using jit results in the worst max times? What could be a reason in your opinion?


Two parts: I did the benchmark on a laptop and didn't spend enough time forcing its runtime power management into a fixed state; I'll run a real pgbench on my desktop once I implement all the required opcodes. And since JIT requires a minimum amount of time (about 300µs in my tests), on such small runtimes this can quickly overcome the benefits.
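That fixed cost implies a simple break-even: JIT only pays off once the per-row savings exceed the compilation overhead. A back-of-envelope sketch using the ~300µs figure above (the per-row savings are made-up numbers for illustration):

```python
# JIT compilation costs a fixed amount up front; it pays off only when
# (rows processed) x (time saved per row) exceeds that fixed cost.
JIT_OVERHEAD_US = 300.0   # approximate fixed cost quoted above, in µs

def jit_pays_off(rows, saving_per_row_us):
    return rows * saving_per_row_us > JIT_OVERHEAD_US

print(jit_pays_off(100, 0.5))      # False: 50 µs saved < 300 µs spent
print(jit_pays_off(10_000, 0.5))   # True: 5000 µs saved
```

This is why PostgreSQL's existing LLVM JIT is gated behind cost thresholds (`jit_above_cost` and friends), and why a compiler with a much smaller fixed cost can afford to kick in on far smaller queries.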


I'm still surprised there isn't a query/plan cache for PostgreSQL. I could easily see these two approaches working in harmony once it does, as frequent queries could end up being cached and more aggressively optimized with a cache to offset the compilation cost. Of course that adds a whole new layer of complexity and trouble.


(The article goes a bit above my head so my excuses if I am a bit off-topic)

There is a form of query plan caching in PG: for prepared statements, if PG determines that the actual value of parameters won't affect the query plan much, it uses a "generic plan" so that it reuses the same query plan for every execution of the prepared statement (https://www.postgresql.org/docs/current/sql-prepare.html, see "notes")



Indeed, and right now it's the only possible way since it remains in a single session, doing otherwise would be very hard.


Unless you count stored procs...


Yes, it's manual and per session. DBs like MSSQL have that as well, but it's very rarely used anymore because MSSQL got automatic plan caching about 20 years ago, which basically eliminates any advantage of manually preparing. It's actually better, since the cache can be shared across all sessions.


Honestly I thought the same as you; then I wrote this, and I now understand it's going to be really hard to do. To make it very simple: there are pointers to query parts "leaking" everywhere across the execution engine. Removing them will require a significant overhaul of the execution engine, the planner, and who knows what else. Even in a single session, two compiled queries will have different compiled code because of that (both LLVM and my copyjit have to inject the addresses of various structs into the asm code).


Just going to say, I'm blown away by how simple this JIT is though. Really quite a beautiful JIT approach.


Same for me; that's why I did this after finding that research paper. With the proper compiler settings and small tricks you can remove some parts and already end up faster than the interpreter (because you remove some branches and a few memory accesses), and it's even possible to create "super-stencils" covering typical opcode sequences and optimizing them further. Or the opposite, "sub-stencils", in order to do some loop unrolling, for instance.


The plan cache on Oracle in combination with prepared statements where the optimizer can't peek into the parameters has been really problematic for me in the past. I usually had to go in and either add hints or force a plan.

Even simple queries like SELECT * FROM t WHERE x = TRUE; could turn into a nightmare depending on the distribution of the x values in the table.

With Postgres I rarely encountered such problems but I must admit that I haven't used Postgres with prepared statements.

I have seen some queries with slow planning time (>100ms) where a cache could have been useful but I don't remember ever really needing to optimize one.



Is x a parameter? Because it doesn't look like it. MSSQL has parameter sniffing and will make multiple plans based on incoming parameters; I would be surprised if Oracle does not do the same. It can actually be problematic to sniff parameters sometimes, and it can be disabled with a hint!


I just made up something simple but yes, I had these problems with bind variables in prepared statements.

> SELECT * FROM t WHERE x = :var;

But I haven't used Oracle in years and back in the day, there was no bind variable peeking.



Back in the 386 era, there was the concept of self-modifying code (in assembly). Similar to the stencils presented here, but because the code was a singleton, a copy was rarely made.

E.g. Doom on DOS used this optimisation technique, because otherwise you could not squeeze enough performance out of the tight rendering loops on old CPUs.



Interesting work, please continue it. Great times ahead. Thanks!





