Preparing full-text search in Postgres. Part 2

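In the previous article we optimized search in PostgreSQL with standard tools. In this article we continue the optimization with the RUM index and look at its advantages and disadvantages compared with GIN.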

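Introduction

RUM is a Postgres extension that provides a new index type for full-text search. It lets an index scan return results already sorted by relevance. I won't dwell on installation; it is described in the README of the repository.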
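Once the extension has been built and installed on the server as the README describes, it still has to be enabled in the database where the index will live. A minimal sketch, assuming the extension is registered under the name rum:

```sql
-- Assumes the rum extension files are already installed on the server;
-- this only enables the extension in the current database.
CREATE EXTENSION IF NOT EXISTS rum;
```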

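Using the index

The index is created in much the same way as a GIN index, but with some parameters. The full list of parameters can be found in the documentation.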

 CREATE INDEX idx_rum_document
     ON documents_documentvector
     USING rum ("text" rum_tsvector_ops);
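For comparison, the GIN index that appears in the plans further down as idx_gin_document was presumably created along these lines. The article does not show its definition, so this is only a sketch and assumes that "text" is a tsvector column:

```sql
-- Assumed definition of the GIN index used in the comparison;
-- only the index name is taken from the EXPLAIN output in the article.
CREATE INDEX idx_gin_document
    ON documents_documentvector
    USING gin ("text");
```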

Search query for RUM:

 SELECT document_id,
        "text" <=> plainto_tsquery('') AS rank
 FROM documents_documentvector
 WHERE "text" @@ plainto_tsquery('')
 ORDER BY rank;

Query for GIN:

 SELECT document_id,
        ts_rank("text", plainto_tsquery('')) AS rank
 FROM documents_documentvector
 WHERE "text" @@ plainto_tsquery('')
 ORDER BY rank DESC;

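The difference from GIN is that relevance is obtained not with the ts_rank function but with the <=> operator: "text" <=> plainto_tsquery(''). Such an expression returns a distance between the search vector and the search query. The smaller it is, the better the query matches the vector, which is why the RUM query sorts in ascending order while the GIN query sorts by rank DESC.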

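Comparison with GIN

Here we compare the two indexes on a test database of about 500,000 documents to see how search behaves with each of them.

Query speed

Let's see what EXPLAIN shows for GIN on this database: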

 Gather Merge (actual time=563.840..611.844 rows=119553 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort (actual time=553.427..557.857 rows=39851 loops=3)
         Sort Key: (ts_rank(text, plainto_tsquery(''::text)))
         Sort Method: external sort  Disk: 1248kB
         ->  Parallel Bitmap Heap Scan on documents_documentvector (actual time=13.402..538.879 rows=39851 loops=3)
               Recheck Cond: (text @@ plainto_tsquery(''::text))
               Heap Blocks: exact=5616
               ->  Bitmap Index Scan on idx_gin_document (actual time=12.144..12.144 rows=119553 loops=1)
                     Index Cond: (text @@ plainto_tsquery(''::text))
 Planning time: 4.573 ms
 Execution time: 617.534 ms

And RUM?

 Sort (actual time=1668.573..1676.168 rows=119553 loops=1)
   Sort Key: ((text <=> plainto_tsquery(''::text)))
   Sort Method: external merge  Disk: 3520kB
   ->  Bitmap Heap Scan on documents_documentvector (actual time=16.706..1605.382 rows=119553 loops=1)
         Recheck Cond: (text @@ plainto_tsquery(''::text))
         Heap Blocks: exact=15599
         ->  Bitmap Index Scan on idx_rum_document (actual time=14.548..14.548 rows=119553 loops=1)
               Index Cond: (text @@ plainto_tsquery(''::text))
 Planning time: 0.650 ms
 Execution time: 1679.315 ms

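What is this? Is the much-praised RUM really three times slower than GIN, you may ask? And where is the famous sorting inside the index? Note that this plan contains a separate Sort node on top of a bitmap scan, so the index is not delivering the rows in order here.

Keep calm: let's try adding LIMIT 1000 to the query.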
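A sketch of how such a limited query and its plan might be obtained. The plans in this article show actual times without cost estimates, which is consistent with EXPLAIN (ANALYZE, COSTS OFF), although the exact options the author used are an assumption; the search phrase 'query' is the placeholder that appears in the plans:

```sql
-- RUM variant of the search, limited to the 1000 closest matches.
-- EXPLAIN options are an assumption; 'query' is a placeholder search phrase.
EXPLAIN (ANALYZE, COSTS OFF)
SELECT document_id,
       "text" <=> plainto_tsquery('query') AS rank
FROM documents_documentvector
WHERE "text" @@ plainto_tsquery('query')
ORDER BY rank
LIMIT 1000;
```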

EXPLAIN for RUM

 Limit (actual time=115.568..137.313 rows=1000 loops=1)
   ->  Index Scan using idx_rum_document on documents_documentvector (actual time=115.567..137.239 rows=1000 loops=1)
         Index Cond: (text @@ plainto_tsquery('query'::text))
         Order By: (text <=> plainto_tsquery('query'::text))
 Planning time: 0.481 ms
 Execution time: 137.678 ms

EXPLAIN for GIN

 Limit (actual time=579.905..585.650 rows=1000 loops=1)
   ->  Gather Merge (actual time=579.904..585.604 rows=1000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort (actual time=574.061..574.171 rows=992 loops=3)
               Sort Key: (ts_rank(text, plainto_tsquery('query'::text))) DESC
               Sort Method: external merge  Disk: 1224kB
               ->  Parallel Bitmap Heap Scan on documents_documentvector (actual time=8.920..555.571 rows=39851 loops=3)
                     Recheck Cond: (text @@ plainto_tsquery('query'::text))
                     Heap Blocks: exact=5422
                     ->  Bitmap Index Scan on idx_gin_document (actual time=8.945..8.945 rows=119553 loops=1)
                           Index Cond: (text @@ plainto_tsquery('query'::text))
 Planning time: 0.223 ms
 Execution time: 585.948 ms

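~150 ms versus ~600 ms! Already not in GIN's favor, is it? And the sorting has moved inside the index: the RUM plan is now a plain Index Scan with an Order By condition and no separate Sort node.

And what if we ask for LIMIT 100?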

EXPLAIN for RUM

 Limit (actual time=105.863..108.530 rows=100 loops=1)
   ->  Index Scan using idx_rum_document on documents_documentvector (actual time=105.862..108.517 rows=100 loops=1)
         Index Cond: (text @@ plainto_tsquery('query'::text))
         Order By: (text <=> plainto_tsquery('query'::text))
 Planning time: 0.199 ms
 Execution time: 108.958 ms

EXPLAIN for GIN

 Limit (actual time=582.924..588.351 rows=100 loops=1)
   ->  Gather Merge (actual time=582.923..588.344 rows=100 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort (actual time=573.809..573.889 rows=806 loops=3)
               Sort Key: (ts_rank(text, plainto_tsquery('query'::text))) DESC
               Sort Method: external merge  Disk: 1224kB
               ->  Parallel Bitmap Heap Scan on documents_documentvector (actual time=18.038..552.827 rows=39851 loops=3)
                     Recheck Cond: (text @@ plainto_tsquery('query'::text))
                     Heap Blocks: exact=5275
                     ->  Bitmap Index Scan on idx_gin_document (actual time=16.541..16.541 rows=119553 loops=1)
                           Index Cond: (text @@ plainto_tsquery('query'::text))
 Planning time: 0.487 ms
 Execution time: 588.583 ms

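The difference is even more noticeable.

The point is that GIN does not care how many rows you end up fetching: it has to go through every row that matches the query and rank it. RUM does this only for the rows we actually need. If we need a lot of rows, GIN wins, because its ts_rank computes relevance more efficiently than the <=> operator. But on small result sets the advantage of RUM is undeniable.

Usually a user does not need all 50 thousand documents pulled from the database at once. They need just 10 entries on the first page, the second, the third, and so on. This is exactly the case the index is tailored for, and here it gives a substantial boost to search performance.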
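A paged search of that kind could be sketched as follows. The page size of 10 comes from the text above; the use of OFFSET for paging and the placeholder phrase 'query' are assumptions made for illustration:

```sql
-- Third page of results, 10 documents per page, closest matches first.
-- 'query' is a placeholder; OFFSET-based paging is an illustrative assumption.
SELECT document_id,
       "text" <=> plainto_tsquery('query') AS rank
FROM documents_documentvector
WHERE "text" @@ plainto_tsquery('query')
ORDER BY rank
LIMIT 10 OFFSET 20;
```

Because the rows come from the index already ordered by distance, a query like this only needs to read the first LIMIT + OFFSET matches instead of ranking all of them.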

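Join tolerance

What if the search requires joining one or more other tables? For example, to show the document's type or its owner in the results? Or, as in my case, to filter by the names of related entities?

Compare:

Query with two joins for GIN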

 SELECT document_id,
        ts_rank("text", plainto_tsquery('')) AS rank,
        case_number
 FROM documents_documentvector
 RIGHT JOIN documents_document
     ON documents_documentvector.document_id = documents_document.id
 LEFT JOIN documents_case
     ON documents_document.case_id = documents_case.id
 WHERE "text" @@ plainto_tsquery('')
 ORDER BY rank DESC
 LIMIT 10;

Result:

 Limit (actual time=1637.902..1643.483 rows=10 loops=1)
   ->  Gather Merge (actual time=1637.901..1643.479 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort (actual time=1070.614..1070.687 rows=652 loops=3)
               Sort Key: (ts_rank(documents_documentvector.text, plainto_tsquery('query'::text))) DESC
               Sort Method: external merge  Disk: 2968kB
               ->  Hash Left Join (actual time=323.386..1049.092 rows=39851 loops=3)
                     Hash Cond: (documents_document.case_id = documents_case.id)
                     ->  Hash Join (actual time=239.312..324.797 rows=39851 loops=3)
                           Hash Cond: (documents_documentvector.document_id = documents_document.id)
                           ->  Parallel Bitmap Heap Scan on documents_documentvector (actual time=11.022..37.073 rows=39851 loops=3)
                                 Recheck Cond: (text @@ plainto_tsquery('query'::text))
                                 Heap Blocks: exact=9362
                                 ->  Bitmap Index Scan on idx_gin_document (actual time=12.094..12.094 rows=119553 loops=1)
                                       Index Cond: (text @@ plainto_tsquery('query'::text))
                           ->  Hash (actual time=227.856..227.856 rows=472089 loops=3)
                                 Buckets: 65536  Batches: 16  Memory Usage: 2264kB
                                 ->  Seq Scan on documents_document (actual time=0.009..147.104 rows=472089 loops=3)
                     ->  Hash (actual time=83.338..83.338 rows=273695 loops=3)
                           Buckets: 65536  Batches: 8  Memory Usage: 2602kB
                           ->  Seq Scan on documents_case (actual time=0.009..39.082 rows=273695 loops=3)
 Planning time: 0.857 ms
 Execution time: 1644.028 ms

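With three or more joins, query time reaches 2-3 seconds and keeps growing with the number of joins.

But what about RUM? Let's go straight to a query with five joins.

Query with five joins for RUM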

 SELECT document_id,
        "text" <=> plainto_tsquery('') AS rank,
        case_number,
        classifier_procedure.title,
        classifier_division.title,
        classifier_category.title
 FROM documents_documentvector
 RIGHT JOIN documents_document
     ON documents_documentvector.document_id = documents_document.id
 LEFT JOIN documents_case
     ON documents_document.case_id = documents_case.id
 LEFT JOIN classifier_procedure
     ON documents_case.procedure_id = classifier_procedure.id
 LEFT JOIN classifier_division
     ON documents_case.division_id = classifier_division.id
 LEFT JOIN classifier_category
     ON documents_document.category_id = classifier_category.id
 WHERE "text" @@ plainto_tsquery('')
   AND documents_document.is_active IS TRUE
 ORDER BY rank
 LIMIT 10;

Result:

 Limit (actual time=70.524..72.292 rows=10 loops=1)
   ->  Nested Loop Left Join (actual time=70.521..72.279 rows=10 loops=1)
         ->  Nested Loop Left Join (actual time=70.104..70.406 rows=10 loops=1)
               ->  Nested Loop Left Join (actual time=70.089..70.351 rows=10 loops=1)
                     ->  Nested Loop Left Join (actual time=70.073..70.302 rows=10 loops=1)
                           ->  Nested Loop (actual time=70.052..70.201 rows=10 loops=1)
                                 ->  Index Scan using document_vector_rum_index on documents_documentvector (actual time=70.001..70.035 rows=10 loops=1)
                                       Index Cond: (text @@ plainto_tsquery('query'::text))
                                       Order By: (text <=> plainto_tsquery('query'::text))
                                 ->  Index Scan using documents_document_pkey on documents_document (actual time=0.013..0.013 rows=1 loops=10)
                                       Index Cond: (id = documents_documentvector.document_id)
                                       Filter: (is_active IS TRUE)
                           ->  Index Scan using documents_case_pkey on documents_case (actual time=0.009..0.009 rows=1 loops=10)
                                 Index Cond: (documents_document.case_id = id)
                     ->  Index Scan using classifier_procedure_pkey on classifier_procedure (actual time=0.003..0.003 rows=1 loops=10)
                           Index Cond: (documents_case.procedure_id = id)
               ->  Index Scan using classifier_division_pkey on classifier_division (actual time=0.004..0.004 rows=1 loops=10)
                     Index Cond: (documents_case.division_id = id)
         ->  Index Scan using classifier_category_pkey on classifier_category (actual time=0.003..0.003 rows=1 loops=10)
               Index Cond: (documents_document.category_id = id)
 Planning time: 2.861 ms
 Execution time: 72.865 ms

If you cannot do without joins in your search, RUM is a good fit for you.

Disk space

On the test database of about 500 thousand documents and 3.6 GB, the two indexes take up very different amounts of space.

  idx_rum_document | 1950 MB
  idx_gin_document |  418 MB

Yes, disks are cheap. But 2 GB instead of 400 MB is hardly pleasing. Half the size of the database is a lot for an index. Here GIN wins unconditionally.
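Index sizes like these can be read from the statistics views. A minimal sketch, assuming both indexes are built on documents_documentvector; the article does not say which query the author actually used:

```sql
-- Lists every index on the table together with its on-disk size.
SELECT indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'documents_documentvector';
```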

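Conclusion

You need RUM if:

You have a lot of documents, but you serve search results page by page
You need complex filtering of search results
You don't mind the disk space

You will be perfectly happy with GIN if:

Your database is small
Your database is large, but you need to return the whole result set at once and that's it
You don't need filtering with joins
You care about the smallest possible index size on disk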

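I hope this article clears up many of the WTF?! moments that come up when working with and setting up search in Postgres. I will be glad to hear advice from those who know how to configure all of this even better!

In the next parts I plan to say more about RUM in my project: about using additional RUM options and about working with the Django + PostgreSQL stack.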
