pg_timeseries：PostgreSQL 的开源时间序列扩展

pg_timeseries：PostgreSQL 的开源时间序列扩展
pg_timeseries: Open-source time-series extension for PostgreSQL

原始链接: https://tembo.io/blog/pg-timeseries

PG Timeseries 是 PostgreSQL 的一个扩展，旨在增强处理时间序列数据时的用户体验。此扩展允许创建时间序列表、配置旧数据保留和压缩、监视分区以及使用简单的语法执行复杂的时间序列分析函数。 PG Timeseries 在 PostgreSQL 许可证下运行，可以合并到现有的 PostgreSQL 设置中，也可以在 Tembo Cloud 上的 Timeseries Stack 中使用。时间序列数据是指与各个时间戳相关的数据点的集合。示例包括股票价格、设备/传感器读数和网络流量统计数据。对时间序列数据的查询通常涉及根据新近度进行过滤并通过分析进行汇总。 PostgreSQL 的灵活性使其适合处理不同的数据工作流程。因此，Tembo 的目标是简化利用 PostgreSQL 处理时间序列工作负载的流程。最近，Tembo 开发了许多基于 PostgreSQL 的解决方案，以促进分析、人工智能和操作任务，而无需第三方许可限制。客户普遍要求的是专用的时间序列堆栈。由于 TimescaleDB 许可证的限制，排除了压缩、增量物化视图和扩展存储容量等基本功能，Tembo 选择构建其独特的 PostgreSQL 许可扩展 - pgTimeseries。高效时间序列数据处理所需的核心功能包括无缝处理大量数据、提供时间范围查询的快速答案以及提供高级分析功能。 PostgreSQL 提供了关键元素，如本机分区、各种索引、物化视图和窗口/分析函数。此外，兼容的扩展（例如 pg_partman、pg_cron、columnar、pg_ivm 和 pg_tier）进一步扩展了平台以满足特定要求。 pgTimeseries 通过结合集成元素的功能，成为管理 PostgreSQL 中时间序列数据的最简单解决方案。入门包括建立一个具有类似时间列的表并对其进行相应的分区。之后，使用“enable_ts_table”命令启用扩展就足够了。功能包括通过专用视图查看有价值的表和分区信息。分区策略，包括压缩设置和保留时间，可以根据需要进行调整

BSL、SSPL 等许可证在将内部项目转换为公共领域时带来了挑战。虽然允许私人使用，但公开提供此类产品可能会导致与所使用的第三方组件（例如 TimescaleDB）相关的法律问题。由于诉讼涉及的成本和风险，这种不确定性阻碍了创始人商业化或开源计划。此外，考虑到与许可相关的可能障碍，潜在投资者可能会犹豫不决。关于查询要求 - 获取 B 区域内各种 A 型传感器的平均值（由客户在过去三个月内汇总） - 假设每分钟 1 个读数；与仅需要定位和循环数据的直接时间序列分析相比，这通常需要相当大的处理能力和时间来在大型服务器上进行 SQL 查询。

原文

We are excited to launch pg_timeseries: a PostgreSQL extension focused on creating a cohesive user experience around the creation, maintenance, and use of time-series tables. You can now use pg_timeseries to create time-series tables, configure the compression and retention of older data, monitor time-series partitions, and run complex time-series analytics functions with a user-friendly syntax. pg_timeseries is open-sourced under the PostgreSQL license and can be added to your existing PostgreSQL installation or tried as a part of the Timeseries Stack on Tembo Cloud.

Put simply, time-series data is a collection of data points, each with a timestamp attached. These could be stock prices recorded throughout a trading day, temperature and availability data returned from devices and sensors, or web traffic on a website. Time-series workloads typically include queries filtering by time (generally for some degree of recency) and aggregation queries to summarize the data for analytics.

time-series

Using PostgreSQL for Time-series workloads

We believe that PostgreSQL can be used to power any data workload thanks to its extensions and ecosystem tools. Therefore, at Tembo, we aim to make it easy for users to use the entire PostgreSQL ecosystem to reduce the complexity of the modern data stack.

In the last year, we’ve built several stacks and extensions that help you run analytics, AI, and operational workloads on PostgreSQL. However, our customers’ most requested stack has been one able to store and act upon their time-series data. That way, they can store all of their data with a single PostgreSQL provider who can meet all their needs.

You may already be asking: “why not just power the stack using TimescaleDB?” The Timescale License would restrict our use of features such as compression, incremental materialized views, and bottomless storage. With these missing, we felt that what remained would not provide an adequate basis for our customers’ time-series needs. Therefore, we decided to build our own PostgreSQL-licensed extension.

Building blocks for pg_timeseries

To efficiently store and query time-series data, there are a few requirements:

Easily manage time-series data
Deal with high-throughput ingest
Answer range queries fast
Efficiently store large amounts of data
Run complex analytics functions

PostgreSQL has several features that provide the right building blocks to solve these requirements. Features such as native partitioning, variety of indexes, materialized views, and window / analytics functions provide the main functionality. Several extensions add additional features to PostgreSQL to make it even better: pg_partman for partition management, pg_cron for scheduling jobs, columnar for compression, pg_ivm for incremental materialized views, and pg_tier for long-term offloading of older partitions.

Making all of these components work together cohesively is a tall order, and is hard for most users. pg_timeseries aims to solve this problem.

pg_timeseries: The simplest way to manage your time-series data in PostgreSQL

pg_timeseries combines the functionality of extensions such as pg_partman, pg_cron, and Hydra’s columnar in order to provide a unified and intuitive interface for managing and querying time-series tables. The PostgreSQL ecosystem has contained many approaches to various aspects of time-series workloads for years and pg_timeseries does the work of making sure all PostgreSQL users experience how these tools work together to make time-series workloads a cinch.

All pg_timeseries requires to get started is a table with a time-like column, partitioned on that column. After that, simply call enable_ts_table:

    CREATE TABLE measurements (
      metric_name text,
      metric_value numeric,
      metric_time timestamptz NOT NULL
    ) PARTITION BY RANGE (metric_time);

    SELECT enable_ts_table('measurements');

The extension includes various views such as ts_table_info and ts_part_info to surface important information about your tables and partitions:

    SELECT table_id, table_size_bytes FROM ts_table_info;

    ┌──────────────┬──────────────────┐
    │   table_id   │ table_size_bytes │
    ├──────────────┼──────────────────┤
    │ measurements │            98304 │
    └──────────────┴──────────────────┘

	SELECT * FROM ts_part_info;

    ┌─[ RECORD 1 ]─────┬───────────────────────────────┐
    │ table_id         │ measurements                  │
    │ part_id          │ measurements_p20240411        │
    │ part_range       │ FOR VALUES FROM ('2024-04-11 …│
    │                  │…00:00:00+00') TO ('2024-04-18…│
    │                  │… 00:00:00+00')                │
    │ table_size_bytes │ 8192                          │
    │ access_method    │ heap                          │
    ├─[ RECORD 2 ]─────┼───────────────────────────────┤
    │ table_id         │ measurements                  │
    │ part_id          │ measurements_p20240418        │
    │ part_range       │ FOR VALUES FROM ('2024-04-18 …│
    │                  │…00:00:00+00') TO ('2024-04-25…│
    │                  │… 00:00:00+00')                │
    │ table_size_bytes │ 8192                          │
    │ access_method    │ heap                          │
    └──────────────────┴───────────────────────────────┘

As partitions “age out” of your time-series tables, you can choose whether to have them compressed using columnar storage, deleted entirely, or both. For example, the following will cause partitions to be compressed if their data is older than 90 days, and will drop them entirely after one year:

    SELECT set_ts_compression_policy('measurements', '90 days');
    SELECT set_ts_retention_policy('measurements', '365 days');

Finally, pg_timeseries includes additional functions like locf, last and date_bin_table to make writing time-series queries easier. We know SQL can sometimes be hard to wrangle into doing what you want with time-series data, so these functions were written to enhance query maintainability. For instance, the following will bin data in the specified range by hour and fill any missing hours with the previous reading. It also includes the name of the metric in each hour with the highest reading:

    SELECT
      locf(avg(metric_value)) OVER (ORDER BY metric_time) avg_val,
      last(metric_name, metric_value) highest,
      metric_time
    FROM date_bin_table(NULL::measurements, '1 hour', '[2024-05-09,2024-06-07]')

    ┌──────────┬─────────┬────────────────────────┐
    │  avg_val │ highest │      metric_time       │
    ├──────────┼─────────┼────────────────────────┤
    │        ∅ │       ∅ │ 2024-05-09 15:00:00+00 │
    │     3.00 │       3 │ 2024-05-09 16:00:00+00 │
    │    55.00 │    temp │ 2024-05-09 17:00:00+00 │
    │    55.00 │    temp │ 2024-05-09 18:00:00+00 │
    │    17.00 │     cpu │ 2024-05-09 19:00:00+00 │
    │    17.00 │     cpu │ 2024-05-09 20:00:00+00 │
    │    23.00 │    disk │ 2024-05-09 21:00:00+00 │
    │    23.00 │    disk │ 2024-05-09 22:00:00+00 │
    │    84.00 │    eth0 │ 2024-05-09 23:00:00+00 │
    │    49.00 │    eth1 │ 2024-05-10 00:00:00+00 │
    │    49.00 │    eth1 │ 2024-05-10 01:00:00+00 │
    └──────────┴─────────┴────────────────────────┘

We’re just getting started

We know building a full-featured time-series extension for PostgreSQL requires a lot of pieces. However, we’re committed to building them in the open, with the community. The current roadmap includes the following features:

Offloading older partitions to cold storage such as S3
Approximate functions for efficient analytics
Incremental materialized views
Roll-up and roll-off of older partitions
Additional analytics helper functions

The entire planned roadmap is present on the GitHub README and features will be prioritized based on demand from users. Our next addition is going to be plugging in pg_tier with pg_timeseries to allow offloading older tables to cold storage such as S3.

The simplest way to try out pg_timeseries is spinning up a free instance of the Timeseries stack on Tembo Cloud. We look forward to your feedback!