Using Partial Indexes to Improve PostgreSQL Performance

Jun 07, 2016, 04:42 PM

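You may not know that PostgreSQL supports partial indexes, that is, indexes over a subset of a table's rows. A partial index speeds up reads of the indexed rows with very little extra overhead. For data that is read repeatedly with the same WHERE clause, indexing just that subset is the best approach. It suits analytics workflows that depend on specific pre-aggregated computations. In this post, I'll walk through an example of optimizing a query with a partial index.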

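Suppose we have an event table. Each event is associated with a user and has an ID, a timestamp, and a JSON document describing the event. The JSON contains the page path, the event type (for example click, pageview, or form submission), and any other properties related to the event.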
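A table matching this description might look like the sketch below. Only the names `event`, `time`, and `data` appear elsewhere in this post; `user_id`, `event_id`, and every column type (including `TIMESTAMPTZ` for `time`) are assumptions made for illustration, as is the sample row.

```sql
-- Sketch of the event table. Only event, time, and data are named in the post;
-- user_id, event_id, and all column types are assumptions.
CREATE TABLE event (
  user_id  BIGINT      NOT NULL,  -- the user the event belongs to
  event_id BIGINT      NOT NULL,  -- the event's ID
  time     TIMESTAMPTZ NOT NULL,  -- when the event happened
  data     JSON        NOT NULL   -- page path, event type, other properties
);

-- Hypothetical row: a form submission on the signup page.
INSERT INTO event (user_id, event_id, time, data)
VALUES (1, 1, '2014-09-02 10:15:00+00',
        '{"type": "submit", "path": "/signup/"}');
```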

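We use this table to store all kinds of event logs. Suppose we have an automatic event tracker that records every click, every pageview, and every form submission so that we can analyze them later. Suppose further that we want an internal dashboard showing a few high-value metrics, such as signups per week or accounts receivable per day. Here is the problem: the events relevant to this dashboard make up only a small fraction of the table. The site gets a lot of clicks, but only a tiny share of them end in a sale. Those few rows are mixed in with everything else, so the signal-to-noise ratio is low.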

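We want to speed up the dashboard queries. Start with signups, which we define as a form submission on the signup page (/signup/). Counting the signups for the first week of September then means counting the events of type submit with path /signup/ whose timestamp falls in that week.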
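Such a count could be written roughly as follows. This is a sketch, not the original query: it assumes `time` is a `TIMESTAMPTZ` column, and the year 2014 is an arbitrary placeholder because the post does not name one.

```sql
-- Count signup events in the first week of September.
-- Assumptions: time is TIMESTAMPTZ; 2014 is a placeholder year.
SELECT COUNT(*)
FROM event
WHERE (data->>'type') = 'submit'
  AND (data->>'path') = '/signup/'
  AND time >= TIMESTAMPTZ '2014-09-01'
  AND time <  TIMESTAMPTZ '2014-09-08';
```

A half-open range (`>=` the first day, `<` the day after the last) avoids counting an event at the boundary twice when consecutive weeks are queried.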

On a dataset of 10 million events, of which only 3,000 are signups, and with no indexes at all, this query takes 45 seconds.

Full Indexes on Single Columns: A Mixed Bag

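A naive way to speed up the query is to create a single-column index on each relevant event attribute: (data->>'type'), (data->>'path'), and time. PostgreSQL can combine the results of the three index scans using a bitmap. If the query selects only a small part of the data and the relevant indexes are still in memory, it becomes fast. The first run takes about 200 ms and later runs drop to about 20 ms, a clear improvement over the 45-second sequential scan.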
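The three full indexes described here could be created along these lines. The index names are hypothetical; the indexed expressions are the ones named above.

```sql
-- One full index per attribute used in the filter (index names are hypothetical).
-- Expression indexes need the extra pair of parentheses around the expression.
CREATE INDEX event_type ON event ((data->>'type'));
CREATE INDEX event_path ON event ((data->>'path'));
CREATE INDEX event_time ON event (time);
```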

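This approach has several drawbacks:

  • Write overhead. Every INSERT/UPDATE/DELETE has to modify all three indexes, which is too expensive for a write-heavy table like the one in this example.

  • Query limitations. This approach also limits how we can define high-value event types. For example, we cannot run anything more complex than a range query on the JSON fields, such as a regular-expression search or a lookup of pages whose path starts with /signup/.

  • Disk usage. The table in this example takes 6660 MB on disk and the three indexes together take 1026 MB, and both numbers keep growing over time.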
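The post does not say how these sizes were measured. One way to inspect comparable figures on your own table, using PostgreSQL's built-in size functions, is sketched below; the table name `event` is taken from the index definitions in this post.

```sql
-- Compare the on-disk size of the table with the total size of its indexes.
-- pg_table_size excludes indexes; pg_indexes_size sums every index on the table.
SELECT pg_size_pretty(pg_table_size('event'))   AS table_size,
       pg_size_pretty(pg_indexes_size('event')) AS index_size;
```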

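Partial Indexes

The signup events we analyze make up only 0.03% of the rows in the table, yet a full index covers every row, which is clearly wasteful. The best way to speed up the query is a partial index.

If we index an unrelated column and restrict the index to rows that match our definition of a signup event, PostgreSQL can easily find the signup rows, and querying them is much faster than with three full indexes on the relevant fields. In particular, we can build the partial index on the time field: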

    CREATE INDEX event_signups ON event (time)
    WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/'

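With this index, the query takes about 200 ms on the first run and about 2 ms on later runs, so queries that are run often benefit the most. More importantly, the partial index addresses the drawbacks of full indexes listed above:

  • The index takes only 96 kB on disk, roughly 1/10,000 of the 1026 MB used by the full indexes.

  • The index is updated only when a new row matches the signup filter. Since only 0.03% of events match, write performance improves greatly: creating and maintaining such an index costs almost nothing.

  • A partial index lets us use the full range of PostgreSQL expressions as the filter. The WHERE clause of the index works just like the one in a query, so we can write complex filters: regular expressions, function results, or the prefix match mentioned earlier.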
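As an illustration of the last point, a prefix-match variant might look like the sketch below. The index name `event_signup_prefix` is hypothetical, and the query again assumes a `TIMESTAMPTZ` `time` column with 2014 as a placeholder year.

```sql
-- Hypothetical index: submissions on any page whose path starts with /signup/.
CREATE INDEX event_signup_prefix ON event (time)
WHERE (data->>'type') = 'submit'
  AND (data->>'path') LIKE '/signup/%';

-- Repeating the index predicate verbatim in the query is the simplest way
-- to let the planner see that the partial index applies.
SELECT COUNT(*)
FROM event
WHERE (data->>'type') = 'submit'
  AND (data->>'path') LIKE '/signup/%'
  AND time >= TIMESTAMPTZ '2014-09-01'
  AND time <  TIMESTAMPTZ '2014-09-08';
```

The planner uses a partial index only when it can prove that the query's WHERE clause implies the index's WHERE clause, which is why the sketch keeps the two predicates textually identical.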

Avoid Indexing Boolean Predicates

I have seen people index the boolean expression directly:

    (data->>'type') = 'submit' AND (data->>'path') = '/signup/'

and put the time field second, for example:

    CREATE INDEX event_signup_time ON event
    (((data->>'type') = 'submit' AND (data->>'path') = '/signup/'), time)

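This is far worse than either of the approaches above, because PostgreSQL's query planner does not treat the indexed boolean expression as a filter. That is, the planner does not recognize that the WHERE clause:

    WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/'

is equivalent to requiring that the indexed field:

    ((data->>'type') = 'submit' AND (data->>'path') = '/signup/')

be true. When we filter events with this index, the rows are read whether the expression is true or false, and the condition is checked only after each row has been loaded.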

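As a result, the query reads many more rows from disk than necessary and also has to evaluate the condition for every one of them. On our example dataset, the query takes 25 seconds on the first run and 8 seconds on later runs. That is slightly worse than simply indexing the whole time field.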
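One way to check which index the planner actually picks, and how much data it reads, is to run the query under EXPLAIN. The sketch below assumes the same query shape as before (a `TIMESTAMPTZ` `time` column, 2014 as a placeholder year); the post itself does not show any plans, so no output is reproduced here.

```sql
-- Show the chosen plan with actual timings and buffer usage.
-- Note: the ANALYZE option really executes the query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM event
WHERE (data->>'type') = 'submit'
  AND (data->>'path') = '/signup/'
  AND time >= TIMESTAMPTZ '2014-09-01'
  AND time <  TIMESTAMPTZ '2014-09-08';
```

In the plan, the name of the index being scanned shows which index was chosen, and a "Rows Removed by Filter" line on a scan node indicates rows that were loaded and then discarded, which is the symptom described above.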

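Partial indexes can greatly speed up queries that use a predicate to select a small part of a table. Judging by traffic in the #postgresql IRC channel, they are considerably underused. Compared with full indexes, they support a greater range of predicates and, with highly selective filters, need far fewer writes and far less disk space. If you often query a small subset of a table, consider a partial index first.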

Starting to love PostgreSQL? To learn more about its features and capabilities, head over to @danlovesproofs.

Interested in making powerful technology easier to use? Email us at jobs@heapanalytics.com.
