Materialized views are obviously useful

Original link: https://sophiebits.com/2025/08/22/materialized-views-are-obviously-useful

## The derived-data trap

Programmers routinely face the challenge of efficiently managing and synchronizing derived data, that is, information computed from a core data source. A simple example: showing the number of tasks in each project of an application. At first a simple SQL query is enough, but as the data set grows, performance problems quickly appear, prompting a Redis cache.

The cache improves speed but introduces consistency problems: updates to tasks are not immediately reflected in the cached counts, so incremental-update logic is needed whenever tasks are created or deleted. This quickly escalates into handling edge cases such as tasks being moved between projects and potential data loss when servers crash. Keeping the counts accurate demands increasingly elaborate (and error-prone) distributed-systems machinery, such as Kafka for retries or database triggers.

In the end, the original simplicity disappears under a web of code devoted purely to data synchronization, which obscures the core application logic and introduces subtle bugs that accumulate over time.

Fortunately, emerging technologies such as "incremental view maintenance" and "differential dataflow" offer a promising way out. These systems let developers define the derived data they want in ordinary SQL, then automatically handle the complexity of keeping it consistently up to date: in essence, a materialized view that is *kept* materialized. That shifts the burden of synchronization onto the database and lets developers focus on application logic.

## Hacker News discussion: materialized views and incremental maintenance

A Hacker News discussion, prompted by the related blog post, revolved around the usefulness and implementation of materialized views in databases. The core issue: while materialized views are conceptually valuable for speeding up queries, many databases, PostgreSQL in particular, lack automatic, incremental maintenance of these views. They must instead be refreshed manually, which can be very resource-intensive.

Users complained that PostgreSQL still lacks this capability despite proposals over the years, and discussed alternatives such as extensions (with reservations about adopting them) or manually managed refresh tables. The conversation noted that other database systems, such as Oracle, and newer offerings such as Materialize and Snowflake, *do* provide automatic incremental updates.

Some commenters pointed out the trade-offs: materialized views introduce cache-invalidation problems. Others suggested alternative approaches such as triggers or table partitioning. A key takeaway was the emergence of startups focused on "incremental view maintenance" or "differential dataflow" as a solution, with some users advocating these newer technologies. The discussion underscored that materialized views are not "magic" and require careful attention to implementation details and potential performance impact.

## Original article

As programmers we spend a lot of time shuttling data back and forth between different systems and transforming it from one format to another. Sometimes it gets pretty miserable!

Let’s say you’re making a fancy task tracking app. Tasks belong to projects, and on the projects page, you want to show how many tasks are in each project. Maybe you start with a little SQL that you call from your “get info about project” view model code:

async function getTaskCountForProject(projectId) {
  return await db.query('select count(1) from tasks where project_id = $1', [projectId]);
}

Wow! So easy.

Uh oh, someone is tapping you on the shoulder and saying this is too slow because it has to do a complete index scan of the tasks for the project, every time you load the page. That’s fine, we’ll just put a… Redis cache on it?

Putting a cache on it

async function getTaskCountForProject(projectId) {
  const key = `project:${projectId}:task-count`;
  const redisCount = await redis.get(key);
  if (redisCount !== null) {
    return +redisCount;
  }

  const count = await db.query(
    'select count(1) from tasks where project_id = $1',
    [projectId],
  );
  await redis.set(key, count, { ex: 3600 }); 
  return count;
}

Works great. It’s fast enough. (Technically speaking, if 100 people load the same page at the same time and the cache isn’t populated yet, then we’ll end up sending 100 queries to the database which isn’t amazing, but let’s just pretend we didn’t hear that.)

Unfortunately our users are complaining that the count is wrong a lot of the time now? Like they add a new task or delete a task and the count doesn’t change and they’re confused. Reasonably so.

I guess we could clear the cache entry whenever we create or delete tasks. But really it would be better to not scan the whole list of tasks whenever we need a count. Some projects have many thousands of tasks in them!
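For completeness, here’s roughly what the cache-clearing version would look like (a sketch reusing the key format from above; the next page load repopulates the count):

async function createTask(task) {
  await db.query('insert into tasks ...', ...);
  await redis.del(`project:${task.projectId}:task-count`); // next read recomputes it
}

async function deleteTask(task) {
  await db.query('delete from tasks ...', ...);
  await redis.del(`project:${task.projectId}:task-count`);
}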

Incremental updates

So let’s do something smarter. It’s not necessary to recompute the count from scratch every time. When we’re creating a task, we’ll just increment the count; upon delete, decrement.

function getProjectTaskCountKey(projectId) {
  return `project:${projectId}:task-count`;
}

async function getTaskCountForProject(projectId) {
  // ...same Redis lookup and SQL fallback as before, minus the expiration time...
  await redis.set(getProjectTaskCountKey(projectId), count);
}

async function createTask(task) {
  await db.query('insert into tasks ...', ...);
  await redis.incr(getProjectTaskCountKey(task.projectId));
}

async function deleteTask(task) {
  await db.query('delete from tasks ...', ...);
  await redis.decr(getProjectTaskCountKey(task.projectId));
}

Well actually that’s not quite right because if the count is somehow missing from Redis then we don’t want to set it to 1 when incrementing. So it’s actually more like

async function incrByIfExists(redis, key, by) {
  // Only adjust the count if it's already cached; if the key is missing,
  // leave it missing so the next read recomputes it from the database.
  return await redis.eval(`
    if redis.call('EXISTS', KEYS[1]) == 1 then
      return redis.call('INCRBY', KEYS[1], ARGV[1])
    end
    return nil
  `, [key], [by]);
}

async function createTask(task) {
  await db.query('insert into tasks ...', ...);
  await incrByIfExists(redis, getProjectTaskCountKey(task.projectId), 1);
}

async function deleteTask(task) {
  await db.query('delete from tasks ...', ...);
  await incrByIfExists(redis, getProjectTaskCountKey(task.projectId), -1);
}

This is fine. I mean, it’s super annoying that we have to do this, but this is what we’re getting paid to do.

Oops. There’s an incoming bug report that the counts are wrong when our users move tasks between projects. Probably should’ve foreseen that. Kind of a pain that we need to worry about this on every update path now. Oh well, when a task moves between projects we’ll just increment the new project’s count and decrement the old one.
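Concretely, the move path becomes yet another handler to keep in sync; something like this hypothetical moveTask, built on the same incrByIfExists helper (assuming task.projectId still holds the old project):

async function moveTask(task, newProjectId) {
  await db.query('update tasks set project_id = $1 where id = $2', [newProjectId, task.id]);
  await incrByIfExists(redis, getProjectTaskCountKey(newProjectId), 1);    // new project gains one
  await incrByIfExists(redis, getProjectTaskCountKey(task.projectId), -1); // old project loses one
}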

And rebuild all the stored counts to correct for our past mistakes.
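The rebuild itself can be a one-off backfill, sketched here assuming db.query hands back result rows:

async function rebuildAllTaskCounts() {
  const rows = await db.query(
    'select project_id, count(1) as count from tasks group by project_id',
  );
  for (const row of rows) {
    // Overwrite whatever is cached. Projects with zero tasks simply get no key,
    // so reads for them fall back to the database.
    await redis.set(getProjectTaskCountKey(row.project_id), row.count);
  }
}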

Now it works well.

Right?

You deploy this to prod and it works great for several months. But then there was that incident last week where the servers were crashing because they were running out of memory, and ever since then it seems like there are a bunch of projects whose counts are just off by 1 or 2. After some sleuthing, you realize that some of the servers crashed after writing the row to the database but before writing to Redis, so now the counts are just wrong forever.

If you really care about the project task counts being accurate, I guess this is where you pull in Kafka and make everything retriable and idempotent so that your different systems can stay in sync.
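The shape of that, roughly: publish a “task changed” event alongside every write, and have a consumer that recomputes the count from the database, so retries and redeliveries are harmless. A hedged sketch using kafkajs (the topic name and event shape here are made up):

const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'task-app', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'task-count-sync' });

async function runTaskCountSync() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'task-events' });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const { projectId } = JSON.parse(message.value.toString());
      // Idempotent by construction: recompute from the source of truth instead
      // of applying a delta, so processing the same event twice changes nothing.
      const count = await db.query(
        'select count(1) from tasks where project_id = $1',
        [projectId],
      );
      await redis.set(getProjectTaskCountKey(projectId), count);
    },
  });
}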

Or maybe you store these counts in your SQL database and update them in the same database transaction? That’s what the A stands for, after all. It’s probably fast enough for that. Having logic in your database is out of vogue these days, but you could even use triggers so that the database guarantees that whenever an insert happens on the tasks table, it runs your SQL code to increment the count.
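For the trigger route, a rough sketch (PostgreSQL assumed; the project_task_counts table and function name are invented for illustration, and only the insert path is shown; deletes and moves would need matching triggers):

await db.query(`
  create table if not exists project_task_counts (
    project_id bigint primary key,
    count bigint not null default 0
  )
`);
await db.query(`
  create or replace function bump_project_task_count() returns trigger as $$
  begin
    insert into project_task_counts (project_id, count)
    values (new.project_id, 1)
    on conflict (project_id) do update
      set count = project_task_counts.count + 1;
    return new;
  end;
  $$ language plpgsql
`);
await db.query(`
  create trigger tasks_count_after_insert
  after insert on tasks
  for each row execute function bump_project_task_count()
`);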

I don’t know about you, but for me it’s pretty annoying when the correctness of my system today depends not only on the code being correct right now but also on my code having done the correct thing at every point in the past. Way too easy to end up with little bugs whose errors accumulate and come back to bite you months later!

Isn’t there something better?

I miss the code that we started with. It was one line of code instead of dozens and also it was actually correct, without room for subtle errors and without me needing to be an expert in writing distributed systems.

In most applications I’ve worked on, there are thousands and thousands and thousands of lines of code just doing this sort of “derive some data and keep it in sync with the source” type of thing. It’s tedious and also it makes it way harder to see what the essential complexity of the application actually is and to refactor over time.

There are a few startups these days peddling a newfangled technology called “incremental view maintenance” or “differential dataflow”. Basically the way it works is you just say “hey, I’d like to keep track of how many tasks each project has” by writing any SQL query you want:

create materialized view projects_task_count as
select project_id, count(1) as count
from tasks
group by project_id

And then ✨ by magic ✨ the results of this query will just always exist and be up-to-date. You can just query it and it’s instant; the database doesn’t need to iterate over every task in the project in order to produce your answer.

select count from projects_task_count where project_id = $1

The “magic” is actually really cool. Basically the SQL query is analyzed to produce a DAG of the data flow with different nodes for filters, groups, joins, etc, and then each node “knows” how to map any change in its input to the appropriate change in the output.

In this particular example, the graph will know that if you insert a task, it needs to increment the corresponding count; for deletes, a decrement; for updates, it’s a no-op unless project_id changes, in which case it knows that the row now gets assigned to a new group and updates the counts accordingly.
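To make that concrete, here’s a toy version of what the count-by-group node does with a single change (my own illustration, not any particular engine’s API):

function applyTaskDelta(countsByProject, delta) {
  // delta is one of:
  //   { kind: 'insert', projectId }
  //   { kind: 'delete', projectId }
  //   { kind: 'update', oldProjectId, newProjectId }
  const bump = (projectId, by) =>
    countsByProject.set(projectId, (countsByProject.get(projectId) ?? 0) + by);

  if (delta.kind === 'insert') bump(delta.projectId, +1);
  if (delta.kind === 'delete') bump(delta.projectId, -1);
  if (delta.kind === 'update' && delta.oldProjectId !== delta.newProjectId) {
    bump(delta.oldProjectId, -1); // the row leaves its old group...
    bump(delta.newProjectId, +1); // ...and joins the new one
  }
  return countsByProject;
}

Every node in the DAG exposes a mapping like this, and the engine chains them together so a change at the source flows through to the view without recomputing anything from scratch.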

I don’t know yet if the implementations of this are good enough to use at scale. Maybe they’re slow or maybe the bugs aren’t ironed out yet. But obviously it’s possible to build a system that takes an arbitrary declarative, stateless query and does this sort of static analysis and incremental computation behind the scenes, and it should be possible to make it fast and reliable. And if you can make it good, it’s obviously extremely useful. You get to cut out all of that application code that’s dealing with keeping things in sync and make dealing with the stateful data updates someone else’s job. And ideally you can change the performance characteristics (e.g. how much to store in memory, whether the result needs to be updated immediately or can be deferred, etc.) without rewriting all your code that actually computes the answer.

It’s too good of an idea for it to not succeed. Certainly if I was in charge of databases at AWS, this would be a major tentpole for my roadmap! I figure that a decade from now, most database systems will have a version of this built in.

I can’t wait until they do.
