
How to Handle Large Amounts of Data

Jun 07, 2016, 02:59 PM


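A way to speed up processing of very large databases: table partitioning

With a huge volume of data, queries are not the only pain point; deletes hurt as well.

Table partitioning helps noticeably here. Deletes in particular become convenient and fast: once the table has been partitioned by rule, you can simply TRUNCATE a partition, and its data and indexes are removed quickly.

As for query speed, indexes are indispensable (how to build efficient indexes is outside the scope of this article).

Load balancing is another option. PostgreSQL combined with PostgresForest can give very good results (the central idea there is also table partitioning).

PostgreSQL supports basic table partitioning. This section describes why you might want table partitioning and how to implement it in your database design.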

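Overview

Partitioning means splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:

Query performance can be improved dramatically for certain kinds of queries.

Update performance can be improved too, since each piece of the table has indexes that are smaller than an index on the entire data set would be. When an index no longer fits entirely in memory, both reads and writes on the index generate progressively more disk accesses.

Bulk deletes can be accomplished by simply dropping one of the partitions, provided that requirement was planned into the partitioning design. DROP TABLE is far faster than a bulk DELETE, not least because it avoids the VACUUM overhead.

Seldom-used data can be migrated to cheaper and slower storage media.

These benefits are normally worthwhile only when a table would otherwise be very large. The exact point at which a table benefits from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.

Currently, PostgreSQL supports partitioning via table inheritance. Each partition must be created as a child table of a single parent table. The parent table itself is normally empty; it exists just to represent the entire data set. You should be familiar with inheritance (see Section 5.8) before attempting to implement partitioning.

The following forms of partitioning can be implemented in PostgreSQL: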

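Range Partitioning

The table is partitioned into "ranges" defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date range, or by ranges of identifiers for particular business objects.

List Partitioning

The table is partitioned by explicitly listing which key values appear in each partition.

Hash partitioning is not currently supported.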

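Implementing Partitioning

To set up a partitioned table, do the following:

Create the "master" table, from which all of the partitions will inherit.

This table will contain no data. Do not define any check constraints on this table, unless you intend them to be applied equally to all partitions. There is no point in defining any indexes or unique constraints on it, either.

Create several "child" tables that each inherit from the master table. Normally, these tables will not add any columns to the set inherited from the master.

We will refer to the child tables as partitions, though they are in every way normal PostgreSQL tables.

Add table constraints to the partition tables to define the allowed key values in each partition.

Typical examples would be: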

CHECK ( x = 1 )

CHECK ( county IN ( 'Oxfordshire', 'Buckinghamshire', 'Warwickshire' ))

CHECK ( outletID >= 100 AND outletID < 200 )

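Ensure that the constraints guarantee that there is no overlap between the key values permitted in different partitions. A common mistake is to set up range constraints like this: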

CHECK ( outletID BETWEEN 100 AND 200 )

CHECK ( outletID BETWEEN 200 AND 300 )

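This is wrong, because BETWEEN includes both endpoints, so it is not clear which partition the key value 200 belongs in.

Note that there is no difference in syntax between range and list partitioning; those terms are descriptive only.

For each partition, create an index on the key column(s), as well as any other indexes you might want. (The key index is not strictly necessary, but in most scenarios it is helpful. If you intend the key values to be unique, then you should always create a unique or primary-key constraint for each partition.)

In addition, you can define a rule or trigger to redirect modifications of the master table to the appropriate partition.

Ensure that the constraint_exclusion configuration parameter is enabled in postgresql.conf. Without this, queries will not be optimized as desired.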

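For example, suppose we are constructing a database for a large ice cream company. The company measures peak temperatures every day as well as ice cream sales in each region. Conceptually, we want a table like this: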

CREATE TABLE measurement (
    city_id    int not null,
    logdate    date not null,
    peaktemp   int,
    unitsales  int
);

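We know that most queries will access just the last week's, month's or quarter's data, since the main use of this table will be to prepare online reports for management. To reduce the amount of old data that needs to be stored, we decide to keep only the most recent three years' worth of data. At the beginning of each month we will remove the oldest month's data.

In this situation we can use partitioning to help us meet all of our different requirements for the measurement table. Following the steps outlined above, partitioning can be set up as follows:

The master table is the measurement table, declared exactly as above.

Next we create one partition for each month: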

CREATE TABLE measurement_yy04mm02 ( ) INHERITS (measurement);
CREATE TABLE measurement_yy04mm03 ( ) INHERITS (measurement);
...
CREATE TABLE measurement_yy05mm11 ( ) INHERITS (measurement);
CREATE TABLE measurement_yy05mm12 ( ) INHERITS (measurement);
CREATE TABLE measurement_yy06mm01 ( ) INHERITS (measurement);

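Each of the partitions is a complete table in its own right, but they inherit their definition from the measurement table.

This solves one of our problems: deleting old data. Each month, all we need to do is perform a DROP TABLE on the oldest child table and create a new child table for the new month's data.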
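The monthly rotation described above could be sketched as follows. This is a minimal illustration that assumes the rotation runs at the start of February 2006 and that the new partition follows the same naming scheme; the name measurement_yy06mm02 is an assumption and does not appear in the original text. If an insert rule still references the old partition, that rule may need to be dropped first.

```sql
-- Hypothetical rotation for February 2006, using the article's naming scheme.
BEGIN;

-- Remove the oldest month in a single cheap operation.
DROP TABLE measurement_yy04mm02;

-- Create the partition for the new month with a non-overlapping,
-- half-open date range (start inclusive, end exclusive).
CREATE TABLE measurement_yy06mm02 (
    CHECK ( logdate >= DATE '2006-02-01' AND logdate < DATE '2006-03-01' )
) INHERITS (measurement);

-- Index the partition key, as for the other partitions.
CREATE INDEX measurement_yy06mm02_logdate ON measurement_yy06mm02 (logdate);

COMMIT;
```

Wrapping both statements in one transaction means readers of the master table never observe a state where the old partition is gone but the new one does not yet exist.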

We must add non-overlapping table constraints, so that our table creation script becomes:

CREATE TABLE measurement_yy04mm02 (
    CHECK ( logdate >= DATE '2004-02-01' AND logdate < DATE '2004-03-01' )
) INHERITS (measurement);
CREATE TABLE measurement_yy04mm03 (
    CHECK ( logdate >= DATE '2004-03-01' AND logdate < DATE '2004-04-01' )
) INHERITS (measurement);
...
CREATE TABLE measurement_yy05mm11 (
    CHECK ( logdate >= DATE '2005-11-01' AND logdate < DATE '2005-12-01' )
) INHERITS (measurement);
CREATE TABLE measurement_yy05mm12 (
    CHECK ( logdate >= DATE '2005-12-01' AND logdate < DATE '2006-01-01' )
) INHERITS (measurement);
CREATE TABLE measurement_yy06mm01 (
    CHECK ( logdate >= DATE '2006-01-01' AND logdate < DATE '2006-02-01' )
) INHERITS (measurement);

We probably need indexes on the key columns too:

CREATE INDEX measurement_yy04mm02_logdate ON measurement_yy04mm02 (logdate);
CREATE INDEX measurement_yy04mm03_logdate ON measurement_yy04mm03 (logdate);
...
CREATE INDEX measurement_yy05mm11_logdate ON measurement_yy05mm11 (logdate);
CREATE INDEX measurement_yy05mm12_logdate ON measurement_yy05mm12 (logdate);
CREATE INDEX measurement_yy06mm01_logdate ON measurement_yy06mm01 (logdate);

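We choose not to add further indexes at this time.

If data will be added only to the latest partition, we can set up a very simple rule to insert data. We must redefine this rule each month so that it always points to the current partition.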

CREATE OR REPLACE RULE measurement_current_partition AS
ON INSERT TO measurement
DO INSTEAD
    INSERT INTO measurement_yy06mm01 VALUES ( NEW.city_id,
                                              NEW.logdate,
                                              NEW.peaktemp,
                                              NEW.unitsales );

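We might want to insert data and have the server automatically locate the partition into which the row should be added. We could do this with a more complex set of rules, as shown below.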

CREATE RULE measurement_insert_yy04mm02 AS
ON INSERT TO measurement WHERE
    ( logdate >= DATE '2004-02-01' AND logdate < DATE '2004-03-01' )
DO INSTEAD
    INSERT INTO measurement_yy04mm02 VALUES ( NEW.city_id,
                                              NEW.logdate,
                                              NEW.peaktemp,
                                              NEW.unitsales );
...
CREATE RULE measurement_insert_yy05mm12 AS
ON INSERT TO measurement WHERE
    ( logdate >= DATE '2005-12-01' AND logdate < DATE '2006-01-01' )
DO INSTEAD
    INSERT INTO measurement_yy05mm12 VALUES ( NEW.city_id,
                                              NEW.logdate,
                                              NEW.peaktemp,
                                              NEW.unitsales );
CREATE RULE measurement_insert_yy06mm01 AS
ON INSERT TO measurement WHERE
    ( logdate >= DATE '2006-01-01' AND logdate < DATE '2006-02-01' )
DO INSTEAD
    INSERT INTO measurement_yy06mm01 VALUES ( NEW.city_id,
                                              NEW.logdate,
                                              NEW.peaktemp,
                                              NEW.unitsales );

Note that the WHERE clause in each rule exactly matches the CHECK constraint for its partition.
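One way to check that the rules route rows as intended is to insert through the master table and then look at where the row was stored. The sketch below assumes the tables and rules above exist; the inserted values are made up for illustration.

```sql
-- Insert through the master table; the matching rule should redirect the row.
INSERT INTO measurement VALUES (1, DATE '2006-01-15', 30, 120);

-- tableoid identifies the physical table each row lives in, so this shows
-- which partition received the row.
SELECT tableoid::regclass AS stored_in, city_id, logdate
FROM measurement
WHERE logdate = DATE '2006-01-15';

-- ONLY restricts the scan to the master table itself. Any rows counted here
-- matched no rule and were stored in the master instead of a partition.
SELECT count(*) FROM ONLY measurement;
```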

 

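As we can see, a complex partitioning scheme could require a substantial amount of DDL. In the above example we would be creating a new partition each month, so it may be wise to write a script that generates the required DDL automatically.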
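A script of this kind could take many forms. One possible sketch is a PL/pgSQL helper that builds the monthly DDL from a date. The function name create_measurement_partition is hypothetical, not part of the original text, and the sketch assumes PL/pgSQL is available in the database.

```sql
-- Hypothetical helper: creates the partition and key index for the month
-- containing month_start, following the article's naming scheme.
CREATE OR REPLACE FUNCTION create_measurement_partition(month_start date)
RETURNS text AS $$
DECLARE
    -- Normalise to the first day of the month, then compute the exclusive end.
    first_day date := date_trunc('month', month_start)::date;
    next_month date := (first_day + INTERVAL '1 month')::date;
    part_name text := 'measurement_yy' || to_char(first_day, 'YY')
                      || 'mm' || to_char(first_day, 'MM');
BEGIN
    -- Child table with a half-open CHECK range so adjacent months never overlap.
    EXECUTE 'CREATE TABLE ' || quote_ident(part_name)
         || ' ( CHECK ( logdate >= DATE '
         || quote_literal(to_char(first_day, 'YYYY-MM-DD'))
         || ' AND logdate < DATE '
         || quote_literal(to_char(next_month, 'YYYY-MM-DD'))
         || ' ) ) INHERITS (measurement)';

    -- Index on the partition key, matching the other partitions.
    EXECUTE 'CREATE INDEX ' || quote_ident(part_name || '_logdate')
         || ' ON ' || quote_ident(part_name) || ' (logdate)';

    RETURN part_name;
END;
$$ LANGUAGE plpgsql;

-- Example call: create the partition for February 2006.
SELECT create_measurement_partition(DATE '2006-02-01');
```

The helper only creates the table and its index. Insert rules or triggers that route rows to the new partition would still have to be updated separately.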

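Also note the following:

There is currently no way to verify that all of the CHECK constraints are mutually exclusive. Care is required on the part of the database designer.

There is currently no simple way to specify that rows must not be inserted into the master table. A CHECK (false) constraint on the master table would be inherited by all child tables, so that cannot be used for this purpose. One possibility is to set up an ON INSERT trigger on the master table that always raises an error. (Alternatively, such a trigger could be used to redirect the data into the proper child table, instead of using a set of rules as suggested above.)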
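The trigger alternative mentioned above might look like the following sketch. It is meant to replace the insert rules rather than sit alongside them. The function and trigger names are hypothetical, and only the two most recent partitions are handled to keep the example short.

```sql
-- Hypothetical trigger function: routes each new row to the partition whose
-- CHECK range contains logdate, and rejects rows that fit no partition.
CREATE OR REPLACE FUNCTION measurement_insert_trigger()
RETURNS trigger AS $$
BEGIN
    IF NEW.logdate >= DATE '2006-01-01' AND NEW.logdate < DATE '2006-02-01' THEN
        INSERT INTO measurement_yy06mm01
        VALUES (NEW.city_id, NEW.logdate, NEW.peaktemp, NEW.unitsales);
    ELSIF NEW.logdate >= DATE '2005-12-01' AND NEW.logdate < DATE '2006-01-01' THEN
        INSERT INTO measurement_yy05mm12
        VALUES (NEW.city_id, NEW.logdate, NEW.peaktemp, NEW.unitsales);
    ELSE
        -- No partition accepts this date, so refuse the row outright.
        RAISE EXCEPTION 'logdate % is outside the range of the existing partitions',
            NEW.logdate;
    END IF;

    -- Returning NULL from a BEFORE ROW trigger skips the insert into the
    -- master table itself, so the master stays empty.
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER insert_measurement_trigger
    BEFORE INSERT ON measurement
    FOR EACH ROW EXECUTE PROCEDURE measurement_insert_trigger();
```

Because the trigger is defined on the master table only, direct inserts into a partition do not fire it and are still checked by that partition's CHECK constraint.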

 

Partitioning can also be arranged using a UNION ALL view:

CREATE VIEW measurement AS
    SELECT * FROM measurement_yy04mm02
    UNION ALL SELECT * FROM measurement_yy04mm03
...
    UNION ALL SELECT * FROM measurement_yy05mm11
    UNION ALL SELECT * FROM measurement_yy05mm12
    UNION ALL SELECT * FROM measurement_yy06mm01;

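Partitioning and Constraint Exclusion

Constraint exclusion is a query optimization technique that improves performance for partitioned tables defined in the fashion described above. As an example: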

SET constraint_exclusion = on;
SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';

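Without constraint exclusion, the above query would scan each of the partitions of the measurement table. With constraint exclusion enabled, the planner examines the constraints of each partition and tries to prove that the partition need not be scanned because it could not contain any rows meeting the query's WHERE clause. When the planner can prove this, it excludes the partition from the query plan.

You can use the EXPLAIN command to show the difference between a plan with constraint_exclusion on and a plan with it off. A typical default plan for this type of table setup is: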

SET constraint_exclusion = off;
EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Aggregate  (cost=158.66..158.68 rows=1 width=0)
   ->  Append  (cost=0.00..151.88 rows=2715 width=0)
         ->  Seq Scan on measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)
         ->  Seq Scan on measurement_yy04mm02 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)
         ->  Seq Scan on measurement_yy04mm03 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)
...
         ->  Seq Scan on measurement_yy05mm12 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)
         ->  Seq Scan on measurement_yy06mm01 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)

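Some or all of the partitions might use index scans instead of full-table sequential scans, but the point here is that there is no need to scan the older partitions at all to answer this query. When we enable constraint exclusion, we get a significantly cheaper plan that delivers the same answer: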

SET constraint_exclusion = on;
EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Aggregate  (cost=63.47..63.48 rows=1 width=0)
   ->  Append  (cost=0.00..60.75 rows=1086 width=0)
         ->  Seq Scan on measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)
         ->  Seq Scan on measurement_yy06mm01 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2006-01-01'::date)

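Note that constraint exclusion is driven only by CHECK constraints, not by the presence of indexes. Therefore it is not necessary to define indexes on the key columns. Whether an index needs to be created for a given partition depends on whether queries that scan the partition will generally scan a large part of it or just a small part. An index will be helpful in the latter case but not the former.

The following caveats also apply:

Constraint exclusion only works when the query's WHERE clause contains constants. A parameterized query will not be optimized, since the planner cannot know which partitions the parameter value might select at run time. For the same reason, "stable" functions such as CURRENT_DATE must be avoided. Joining the partition key to a column of another table will not be optimized, either.

Avoid cross-data-type comparisons in the CHECK constraints, as the planner will currently fail to prove such conditions false. For example, the following constraint will work if x is an integer column, but not if x is a bigint: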

CHECK ( x = 1 )

For a bigint column we must use a constraint like:

CHECK ( x = 1::bigint )

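The problem is not limited to the bigint data type; it can occur whenever the default data type of the constant does not match the data type of the column to which it is being compared. Cross-data-type comparisons in the supplied queries are usually OK, just not in the CHECK conditions.

Currently, UPDATE and DELETE commands against the master table do not perform constraint exclusion.

All constraints on all partitions of the master table are considered for constraint exclusion, so large numbers of partitions are likely to increase query planning time considerably.

Don't forget that you still need to run ANALYZE on each partition individually. A command like

ANALYZE measurement;

will only process the master table.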
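Two of the caveats above can be illustrated against the measurement tables from this article. This is a sketch only: no plan output is shown, because the plans actually produced depend on the PostgreSQL version and on the data.

```sql
SET constraint_exclusion = on;

-- A literal constant in the WHERE clause can be compared against each
-- partition's CHECK constraint while the query is being planned.
EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';

-- Per the caveat above, a stable function such as CURRENT_DATE is not a
-- plan-time constant, so this form should not be relied on for exclusion.
EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= CURRENT_DATE - 7;

-- Statistics must be gathered per partition; only two partitions are
-- listed here, and the same command is needed for every other one.
ANALYZE measurement_yy05mm12;
ANALYZE measurement_yy06mm01;
```

Comparing the two EXPLAIN outputs on a real system is a quick way to see whether a given predicate form allows partitions to be excluded.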

 

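Note that constraint exclusion is currently not supported for partitions arranged with the UNION ALL view described earlier (the view definition whose last branch is UNION ALL SELECT * FROM measurement_yy06mm01;). Also, the need to recreate the view adds an extra step to adding and dropping individual partitions of the data set.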
