My Database Is My Application: Rethinking Webhook Logic with DuckDB and SQL

Original link: https://www.hey.earth/posts/duckdb-webhook-gateway

This project explores using DuckDB as a runtime engine for processing webhooks, shifting from traditional code-centric solutions to a data-driven approach. Instead of relying on Python handlers, the system stores webhook transformation logic as SQL queries within DuckDB. This enables self-service: teams can define and update their transformations via an API or UI without redeploying code. Key advantages include declarative self-service, queryable runtime state for debugging, composable logic through SQL JOINs and Python UDFs, a portable artifact (a single DuckDB file), and "infra as data." The system provides API endpoints for managing webhooks, reference tables, and UDFs. It handles concurrency safely using async Python and transactions. While not optimized for simple "hello world" webhooks, it addresses challenges like code bottlenecks, ownership tangles, repetitive logic, and observability. The architecture allows for rich audit trails and data-driven control planes, paving the way for self-service ETLs, extensible audit layers, and dynamic caching, ultimately highlighting the potential of databases as runtime logic engines.

This Hacker News thread discusses an article advocating for using DuckDB and SQL to manage webhook logic, essentially treating the database as the application. A user named Beestie shares their experience of building a similar system five years ago, storing API payloads directly and processing them with SQL for stability and ease of modification. Other commenters echo this sentiment, with one noting how fast PostgreSQL can be as an authentication layer. Another highlights DuckDB's potential when combined with type-safe Java libraries like Manifold. One commenter questions apparent discrepancies between the source path and filter query parameters in the article's example steps. Overall, the thread shows support for the idea of leveraging SQL and database technology for more than just data storage, using it as a central processing engine for web application logic.

Original article

Imagine you need to build a system for processing incoming webhooks. You're probably picturing a familiar setup: a lightweight web server (FastAPI, Flask, Express.js, etc.), some Python (or Node.js, or Go) handlers to parse JSON, a sprinkle of business logic, and then maybe persisting data to a traditional database like PostgreSQL or MySQL. Perhaps you'd toss events onto a message queue like Kafka or RabbitMQ for downstream processing. Standard stuff, right?

Well, I’ve been experimenting with a different approach. What if I told you I let SQL handle almost all of it?

Sounds a bit... unconventional? Maybe. But it’s an attempt to solve some persistent challenges I've encountered, and it opens up a fascinating way to think about data, logic, and infrastructure. This isn't just a backend API; the project also includes a simple web UI to manage these configurations visually, making the whole system tangible. The code is available here.

The Familiar Friction: Why I Started Questioning "Normal"

I've built and maintained my fair share of webhook gateways and integration layers. A few common pain points kept cropping up:

  1. The Code Bottleneck: Every time a new webhook source or a slight variation in transformation logic was needed, it meant code changes. A new handler, a modified Pydantic model, a redeploy. I, or my team, became the bottleneck.
  2. Ownership Tangles: Giving multiple teams or users the ability to define their own webhook transformations often meant granting them broader application deployment privileges, or setting up complex, isolated microservices for each. Neither felt quite right.
  3. Repetitive Logic: So many webhook handlers do similar things: pick a few fields, rename some keys, maybe enrich with a static lookup. Writing Python for output_payload['userName'] = input_payload['user']['login'] over and over felt like I was missing a more declarative way.
  4. Observability Challenges: Understanding why a specific webhook failed or was transformed in a certain way often involved digging through application logs, which could be scattered or inconsistently formatted.

These issues led me to wonder: could there be a more data-centric, self-service approach?

Shifting the Lens: DuckDB as a Runtime Engine

This is where my mental model for DuckDB started to shift.

Think about it:

  • We trust SQL to define data contracts and schemas (hello, dbt users!).
  • We trust SQL to express complex aggregations, filters, and joins for analytics.
  • SQL is declarative, its behavior is (usually) inspectable, and its syntax is widely understood.

So, why not trust SQL to express the live transformation and routing logic for something like webhooks? The key insight for me was this: if the logic is defined as data (SQL queries stored in a table), and the engine to execute that logic is also data-aware (DuckDB), then you unlock a powerful new paradigm.
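To make that concrete, here is a minimal sketch of what "logic as data" can look like. The table name, the columns, and the incoming_event staging view below are illustrative assumptions of mine, not the project's actual schema; the point is only that the transformation logic lives in rows, not in deployed code.

```python
import duckdb

con = duckdb.connect("gateway.duckdb")  # a single-file database; the filename is illustrative

# Hypothetical config table: each webhook's behavior is a row of SQL text.
con.execute("""
    CREATE TABLE IF NOT EXISTS webhook_configs (
        webhook_id      VARCHAR PRIMARY KEY,
        source_path     VARCHAR,   -- e.g. '/github-events'
        filter_query    VARCHAR,   -- SQL deciding whether an event gets processed
        transform_query VARCHAR    -- SQL shaping the outgoing payload
    )
""")

# "Deploying" new logic is just inserting (or updating) a row.
transform_sql = """
    SELECT json_extract_string(payload, '$.user.login') AS userName
    FROM incoming_event  -- hypothetical staging view holding the event being processed
"""
con.execute(
    "INSERT INTO webhook_configs VALUES (?, ?, ?, ?)",
    ["webhook123", "/github-events", "SELECT true FROM incoming_event", transform_sql],
)
```

Changing that behavior later is an update to the row (via the API or UI), not a redeploy.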

The Core Idea: The Database is the Application Logic

In the system I've built, the database isn't just a passive recipient of data or a data sink; it's the active processing pipeline itself. It’s almost like turning your middleware inside out.
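As a rough sketch of that inversion (the table names and the way the stored SQL gets bound to an event are assumptions of mine, not the project's internals), handling one event might look like this:

```python
import duckdb, json

con = duckdb.connect("gateway.duckdb")

# Illustrative audit tables; webhook_configs is the config table from the sketch above.
con.execute("""
    CREATE TABLE IF NOT EXISTS webhook_events (
        source_path VARCHAR, payload JSON, status VARCHAR, error VARCHAR, received_at TIMESTAMP
    )
""")
con.execute("CREATE TABLE IF NOT EXISTS webhook_results (webhook_id VARCHAR, result VARCHAR)")

def process_event(source_path: str, raw_body: str) -> None:
    """Hypothetical processing loop: every step is a query against the database."""
    # 1. Persist the raw event first, so there is always an auditable record.
    con.execute(
        "INSERT INTO webhook_events VALUES (?, ?, 'received', NULL, now())",
        [source_path, raw_body],
    )

    # 2. Look up the logic for this source: the "handler" is a row of SQL, not code.
    cfg = con.execute(
        "SELECT webhook_id, filter_query, transform_query FROM webhook_configs WHERE source_path = ?",
        [source_path],
    ).fetchone()
    if cfg is None:
        return
    webhook_id, filter_query, transform_query = cfg

    # 3. Stage the event where the stored SQL can see it, then run the
    #    user-supplied filter and transform queries against it.
    con.execute("CREATE OR REPLACE TEMP TABLE incoming_event (payload JSON)")
    con.execute("INSERT INTO incoming_event VALUES (?)", [raw_body])

    keep = con.execute(filter_query).fetchone()
    if keep and keep[0]:
        rows = con.execute(transform_query).fetchall()
        con.execute(
            "INSERT INTO webhook_results VALUES (?, ?)",
            [webhook_id, json.dumps(rows, default=str)],
        )
```

In the actual project this runs with locking around the shared connection and inside transactions (see "Robust Implementation" below), but the shape is the same: the handler is looked up and executed as data at request time.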

Why This "Weird" Approach is Powerful

This might seem like an academic exercise, but it unlocks some genuinely compelling properties.

Declarative Self-Service: Teams can define and update their webhook transformations and filters by submitting SQL queries via an API (or the provided UI!). A full set of API endpoints allows managing webhooks (register, update, list, activate/deactivate, delete), reference tables, and UDFs.

No application redeploys are needed for logic changes. They own their logic.
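For instance, the team owning a GitHub integration might submit the transform shown earlier through the management API. The endpoint path and JSON field names here are guesses on my part; the project exposes register/update endpoints, but I'm not reproducing their exact shape:

```python
import requests

# Hypothetical registration call: new or changed logic ships as data over HTTP,
# with no application redeploy. The URL and field names are illustrative only.
resp = requests.post(
    "http://localhost:8000/webhooks/register",
    json={
        "source_path": "/github-events",
        "filter_query": "SELECT true FROM incoming_event",
        "transform_query": (
            "SELECT json_extract_string(payload, '$.user.login') AS userName "
            "FROM incoming_event"
        ),
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```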

Queryable Runtime State: Every raw event, every transformation attempt, its success or failure, and the final payload are logged within DuckDB.

Need to debug why a webhook for /github-events isn't working? The /query endpoint or direct DB access lets you run SQL like:
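Something along these lines, for example. The table and column names are my guesses at the log schema; the same SQL could be submitted to the /query endpoint or run against the file directly:

```python
import duckdb

con = duckdb.connect("gateway.duckdb", read_only=True)

# The twenty most recent events for this source, including failures and their errors.
rows = con.execute("""
    SELECT received_at,
           status,          -- e.g. received / filtered_out / transformed / failed
           error,
           payload
    FROM   webhook_events
    WHERE  source_path = '/github-events'
    ORDER  BY received_at DESC
    LIMIT  20
""").fetchall()

for row in rows:
    print(row)
```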

This level of direct, SQL-based introspection into the runtime behavior is incredibly powerful. Specific API endpoints also exist to view recent events and drill down into individual event details.

Composable & Extensible Logic:

Need to enrich webhook data with user details?

Upload a users.csv or JSON file as a reference table (/upload_table) and JOIN it in your transform query using its namespaced name (e.g., ref_webhook123_users).

Need a complex string manipulation or a call to an external validation service that SQL can't easily do?

Register a Python UDF (/register_udf) and call it directly from your SQL using its namespaced name (e.g., udf_webhook123_extract_jira_key).
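In the project those two pieces arrive via /upload_table and /register_udf; the sketch below reproduces the effect directly against DuckDB so the resulting query is visible. The reference rows, the UDF body, and the sample event are all made up for illustration:

```python
import re

import duckdb

con = duckdb.connect()  # in-memory is enough for this sketch

# A reference table as it might exist after an /upload_table call, under its namespaced name.
con.execute("CREATE TABLE ref_webhook123_users (login VARCHAR, full_name VARCHAR, team VARCHAR)")
con.execute("INSERT INTO ref_webhook123_users VALUES ('octocat', 'Mona Lisa', 'platform')")

# A Python UDF as it might exist after a /register_udf call, under its namespaced name.
def extract_jira_key(text: str) -> str:
    """Pull a JIRA-style ticket key (e.g. 'ABC-123') out of a commit message."""
    match = re.search(r"[A-Z][A-Z0-9]+-\d+", text or "")
    return match.group(0) if match else ""

con.create_function("udf_webhook123_extract_jira_key", extract_jira_key)

# A transform query can now JOIN the reference data and call the UDF in ordinary SQL.
print(con.execute("""
    SELECT u.full_name,
           u.team,
           udf_webhook123_extract_jira_key(e.message) AS jira_key
    FROM (SELECT 'octocat' AS login, 'ABC-123: fix login flow' AS message) AS e
    JOIN ref_webhook123_users AS u USING (login)
""").fetchall())
```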

Portable Artifact & Simplified Operations: The entire state of the gateway - configurations, reference data, UDF definitions, and logs - can live inside a single DuckDB file (though the path is configurable via the DUCKDB_PATH environment variable). Backups are simple file copies.

"Infra as Data": The behavior of the system is defined by data (SQL queries, reference tables) stored within the database, rather than by imperative code deployed in an application layer.

Robust Implementation: Under the hood, it uses standard Python async capabilities (asyncio, FastAPI) along with thread pools and locking (ThreadPoolExecutor, asyncio.Lock) to handle requests concurrently while ensuring safe access to the DuckDB database. Database operations are wrapped in transactions for atomicity.
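A minimal sketch of that pattern (not the project's actual code): one lock serializes access to the shared connection, the blocking DuckDB work runs in a thread pool so the event loop stays responsive, and the multi-statement write is wrapped in a transaction.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import duckdb

con = duckdb.connect("gateway.duckdb")
executor = ThreadPoolExecutor(max_workers=1)   # DuckDB work happens off the event loop
db_lock = asyncio.Lock()                       # serialize access to the shared connection

def _write_event(source_path: str, raw_body: str) -> None:
    # Runs in the worker thread; a transaction keeps the multi-statement write atomic.
    con.execute("BEGIN TRANSACTION")
    try:
        con.execute(
            "INSERT INTO webhook_events VALUES (?, ?, 'received', NULL, now())",
            [source_path, raw_body],
        )
        # ... filter / transform / result logging would go here ...
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")
        raise

async def handle_webhook(source_path: str, raw_body: str) -> None:
    # Called from an async FastAPI route; awaiting here never blocks the event loop.
    async with db_lock:
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(executor, _write_event, source_path, raw_body)
```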

Addressing the Skepticism: "But Why Not Just... Python?"

If your primary goal is just to stand up a simple webhook handler quickly, then a small Python function is the obvious, pragmatic choice. But this approach isn't trying to optimize for the simplest possible "hello world" webhook. It's aiming to solve a deeper set of problems: making runtime behavior highly configurable via data, and shifting who owns the transformation logic. It's about enabling a more federated, self-service model.

A Concrete Example: Let's Walk Through It

Imagine we want to process GitHub push events for the main branch only. Here's how it works:
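As a sketch of the configuration involved (my own phrasing of the SQL; the JSON paths come from GitHub's push event payload, and incoming_event is the hypothetical staging view from the earlier sketches):

```python
# Filter: only pushes to the main branch pass; everything else is logged and dropped.
FILTER_QUERY = """
    SELECT json_extract_string(payload, '$.ref') = 'refs/heads/main'
    FROM incoming_event
"""

# Transform: flatten the large GitHub payload into the few fields downstream systems need.
TRANSFORM_QUERY = """
    SELECT json_extract_string(payload, '$.repository.full_name') AS repo,
           json_extract_string(payload, '$.head_commit.id')       AS commit_sha,
           json_extract_string(payload, '$.pusher.name')          AS pushed_by
    FROM incoming_event
"""
```

A push to a feature branch fails the filter and is simply recorded in the event log; a push to main produces a small, flat payload, and both outcomes remain queryable afterwards.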

This simple flow, driven entirely by SQL configurations stored in the database, demonstrates the core power.

Where Could This Go? Open Possibilities

It's primarily an exploration. But it’s an exciting one because it forces us to invert the typical application stack.

I see this pattern as a prototype for thinking differently about:

  • Data-driven control planes: Where system behavior is dynamically configured through data.
  • Self-service data pipelines: Empowering users to build their own simple ETLs with SQL, reference data, and custom Python logic.
  • Extensible audit layers: Creating rich, queryable logs of system activity by default.
  • Dynamic caching layers: Imagine SQL defining cache invalidation or transformation logic.
  • Per-tenant customization: Using SQL to define tenant-specific routing or data shaping.

It’s not just about this specific webhook gateway. It’s about advocating for a way of thinking: that databases, especially modern ones like DuckDB, can be potent runtime logic engines, not just passive storage layers. And that sometimes, "weird" architectures unlock remarkably useful properties once you reframe the problem you're trying to solve.

I'm curious to hear what others think. What are the pitfalls I haven't considered? What other use cases could this "database as runtime" pattern unlock? Let me know!

not made by a 🤖