配置PostgreSQL以获得读取性能

我们的系统写入大量数据（类似于大数据系统）。写入性能足以满足我们的需求，但读取性能确实太慢。

我们所有表的主键（约束）结构都相似：

 timestamp(Timestamp) ; index(smallint) ; key(integer).

一个表可以有数百万行，甚至数十亿行，并且读取请求通常用于特定时间段（时间戳/索引）和标签。查询返回大约200k行是很常见的。目前，我们每秒可以读取约15,000行，但我们需要提高10倍的速度。

注意：PostgreSQL与我们的软件打包在一起，因此硬件在一个客户端与另一个客户端之间是不同的。

它是使用的VM供测试用。 VM的主机是具有24.0 GB RAM的Windows Server 2008 R2 x64。

服务器规格（虚拟机VMWare）

 Server 2008 R2 x64
2.00 GB of memory
Intel Xeon W3520 @ 2.67GHz (2 cores)

postgresql.conf优化

 shared_buffers = 512MB (default: 32MB)
effective_cache_size = 1024MB (default: 128MB)
checkpoint_segment = 32 (default: 3)
checkpoint_completion_target = 0.9 (default: 0.5)
default_statistics_target = 1000 (default: 100)
work_mem = 100MB (default: 1MB)
maintainance_work_mem = 256MB (default: 16MB)

表定义

 CREATE TABLE "AnalogTransition"
(
  "KeyTag" integer NOT NULL,
  "Timestamp" timestamp with time zone NOT NULL,
  "TimestampQuality" smallint,
  "TimestampIndex" smallint NOT NULL,
  "Value" numeric,
  "Quality" boolean,
  "QualityFlags" smallint,
  "UpdateTimestamp" timestamp without time zone, -- (UTC)
  CONSTRAINT "PK_AnalogTransition" PRIMARY KEY ("Timestamp" , "TimestampIndex" , "KeyTag" ),
  CONSTRAINT "FK_AnalogTransition_Tag" FOREIGN KEY ("KeyTag")
      REFERENCES "Tag" ("Key") MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
  OIDS=FALSE,
  autovacuum_enabled=true
);

查询

查询大约需要30秒才能在pgAdmin3中执行，但我们希望在5秒内获得相同的结果。

 SELECT 
    "AnalogTransition"."KeyTag", 
    "AnalogTransition"."Timestamp" AT TIME ZONE 'UTC', 
    "AnalogTransition"."TimestampQuality", 
    "AnalogTransition"."TimestampIndex", 
    "AnalogTransition"."Value", 
    "AnalogTransition"."Quality", 
    "AnalogTransition"."QualityFlags", 
    "AnalogTransition"."UpdateTimestamp"
FROM "AnalogTransition"
WHERE "AnalogTransition"."Timestamp" >= '2013-05-16 00:00:00.000' AND "AnalogTransition"."Timestamp" <= '2013-05-17 00:00:00.00' AND ("AnalogTransition"."KeyTag" = 56 OR "AnalogTransition"."KeyTag" = 57 OR "AnalogTransition"."KeyTag" = 58 OR "AnalogTransition"."KeyTag" = 59 OR "AnalogTransition"."KeyTag" = 60)
ORDER BY "AnalogTransition"."Timestamp" DESC, "AnalogTransition"."TimestampIndex" DESC
LIMIT 500000;
  
 
解释1 
 
  "Limit  (cost=0.00..125668.31 rows=500000 width=33) (actual time=2.193..3241.319 rows=500000 loops=1)"
"  Buffers: shared hit=190147"
"  ->  Index Scan Backward using "PK_AnalogTransition" on "AnalogTransition"  (cost=0.00..389244.53 rows=1548698 width=33) (actual time=2.187..1893.283 rows=500000 loops=1)"
"        Index Cond: (("Timestamp" >= '2013-05-16 01:00:00-04'::timestamp with time zone) AND ("Timestamp" <= '2013-05-16 15:00:00-04'::timestamp with time zone))"
"        Filter: (("KeyTag" = 56) OR ("KeyTag" = 57) OR ("KeyTag" = 58) OR ("KeyTag" = 59) OR ("KeyTag" = 60))"
"        Buffers: shared hit=190147"
"Total runtime: 3863.028 ms"
 
 
 
解释2 
 
在我最新的测试中，花了7分钟选择了我的数据！参见以下内容：
 
  "Limit  (cost=0.00..313554.08 rows=250001 width=35) (actual time=0.040..410721.033 rows=250001 loops=1)"
"  ->  Index Scan using "PK_AnalogTransition" on "AnalogTransition"  (cost=0.00..971400.46 rows=774511 width=35) (actual time=0.037..410088.960 rows=250001 loops=1)"
"        Index Cond: (("Timestamp" >= '2013-05-22 20:00:00-04'::timestamp with time zone) AND ("Timestamp" <= '2013-05-24 20:00:00-04'::timestamp with time zone) AND ("KeyTag" = 16))"
"Total runtime: 411044.175 ms"
 
 
 
 #1 楼
数据对齐和存储大小
 
实际上，每个索引元组的开销对于元组标头是8字节，对于项目标识符是4字节。 
 
相关内容：
 
 
使用GIN索引位字符串
在PostgreSQL中计算和节省空间
 
我们有三列对于主键：
 
 PRIMARY KEY ("Timestamp" , "TimestampIndex" , "KeyTag")

"Timestamp"      timestamp (8 bytes)
"TimestampIndex" smallint  (2 bytes)
"KeyTag"         integer   (4 bytes)
 
 
结果在：
 
  4 bytes for item identifier in the page header (not counting towards multiple of 8 bytes)

 8 bytes for the index tuple header
 8 bytes "Timestamp"
 2 bytes "TimestampIndex"
 2 bytes padding for data alignment
 4 bytes "KeyTag" 
 0 padding to the nearest multiple of 8 bytes
-----
28 bytes per index tuple; plus some bytes of overhead.
 
 
关于在此相关答案中测量物体尺寸：
 
 
测量PostgreSQL表行的大小
 
多列索引中列的顺序
 
阅读这两个问题和答案了解：
 
 
 
复合索引是否也适合对第一个字段进行查询？ 
在PostgreSQL中使用索引
 
使用索引（主键）的方式，无需进行排序即可检索行，这很吸引人，尤其是LIMIT。但是检索行似乎非常昂贵。
 
通常，在多列索引中，“平等”列应排在最前面，“范围”列应排在最后：
 
 
多列索引和性能
 
因此，请尝试使用具有相反列顺序的其他索引：但是使用millions of row, even billion of rows可能会更快。
 
由于数据对齐和填充，元组的大小大8字节。如果将其用作普通索引，则可以尝试删除第三列"Timestamp"。可能会更快或更慢（因为它可能有助于排序）。
 
您可能希望保留两个索引。根据许多因素，您的原始索引可能会更可取-尤其是LIMIT较小时。 。我确定您正在运行自动抽真空。
 
由于您的表似乎庞大且统计信息对于正确的查询计划很重要，因此我将大幅提高相关列的统计信息目标：
 
数十亿行。最大值为10000，默认值为100。
 
对WHERE或ORDER BY子句中涉及的所有列执行此操作。然后运行ANALYZE。
 
表布局
 
同时，如果您应用所学到的有关数据对齐和填充的知识，则这种优化的表布局应节省一些磁盘空间，稍微提高性能（忽略pk和fk）：使用某个索引（无论是原始索引还是我建议的替代方法），都可以按照索引的物理顺序重写表。 CLUSTER可以做到这一点，但是它具有侵入性，并且在操作过程中需要排他锁。 CLUSTER是一种更复杂的替代方法，无需在表上独占锁定即可执行相同的操作。 pg_repack是以后使用的类似工具（尚未使用过）。
 
这对大型表有很大帮助，因为必须读取的表块少得多。
 
 RAM 
 
通常，2GB的物理RAM不足以快速处理数十亿行。更多的RAM可能会走很长一段路-伴随适当的设置：显然，更大的pg_squeeze开始。
评论
我仅在KeyTag上添加了一个简单的索引，现在看来它很快。我还将应用您有关数据对齐的建议。非常感谢！



 – JPelletier

 2013年6月3日14:35 
 #2 楼
因此，从计划中我看到一件事：您的索引要么过大（然后与基础表一起），要么根本不适用于这种查询（我在上面的最新评论中尝试解决此问题）。 
 
索引的一行包含14个字节的数据（其中一些用于标头）。现在，根据计划中给出的数字进行计算：您从190147页中获得了500,000行-平均而言，每页少于3个有用行，即，每8 kb页大约37个字节。这是一个非常糟糕的比率，不是吗？由于索引的第一列是Timestamp字段，并且在查询中用作范围，因此计划者可以-确实-选择索引以查找匹配的行。但是在TimestampIndex条件中没有提到WHERE，因此在KeyTag上进行过滤并不是很有效，因为这些值据说是随机出现在索引页中的。 br /> 
 CONSTRAINT "PK_AnalogTransition" PRIMARY KEY ("Timestamp", "KeyTag", "TimestampIndex")
 
 
（或者，根据系统的负载，将该索引创建为新索引：
 
 CREATE INDEX CONCURRENTLY "idx_AnalogTransition" 
    ON "AnalogTransition" ("Timestamp", "KeyTag", "TimestampIndex");
 
 
 
这肯定会花点时间，但您仍然可以在此期间进行工作。）
 
另一种可能性是索引页的很大一部分被死行占据，这可能是用吸尘器清除。您使用设置autovacuum_enabled=true创建了表格-但是您是否曾经开始进行自动抽真空？还是手动运行VACUUM？

编程黑洞网

配置PostgreSQL以获得读取性能

#1 楼

评论

#2 楼