
Ceph object storage must be understood before going online

Mar 12, 2018, 09:14 AM

As open source distributed storage software, Ceph can pool the local storage resources of x86 servers into one or more resource pools and provide unified storage services on top of them, including block, object, and file storage. It meets enterprise requirements for high reliability, high performance, and high scalability, and is increasingly favored by enterprises. Extensive production practice has proven that Ceph's design is advanced, its features comprehensive, and its use flexible. However, these strengths are also a double-edged sword: wielded well, Ceph serves the enterprise well; used without sufficient skill or an understanding of Ceph's temperament, it can occasionally cause a great deal of trouble. Below is exactly such a case that I would like to share with you.


Company A deployed a Ceph object storage cluster to provide public cloud storage services, along with an SDK to help customers quickly move unstructured data such as images, videos, and APK installation packages to the cloud. Before the service officially went live, thorough functional, exception, and performance testing was performed on Ceph.

The cluster was not very large. It ran community version 0.80 on 30 servers in total, each configured with 32GB of memory, ten 4TB SATA disks, and one 160GB Intel S3700 SSD. The 300 SATA disks formed a data pool (by default, a pool named .rgw.buckets) storing object data; the 30 SSDs formed a metadata pool (by default, a pool named .rgw.buckets.index) storing object metadata. Anyone with Ceph object storage operations experience knows this layout is standard practice: because Ceph object storage supports multi-tenancy, when multiple users PUT objects into the same bucket (a logical namespace belonging to a user), the objects' metadata is written to the bucket index object. Since this index object is shared, it must be locked on each access. Placing bucket index objects in a pool backed by high-performance SSDs shortens each index access, improves IO performance, and increases the overall concurrency of the object store.
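Directing the index pool onto the SSDs is done through CRUSH. A minimal sketch, assuming a running 0.8x cluster where the SSD OSDs already sit under their own CRUSH root (the rule name, root name, and rule id below are illustrative):

```shell
# Create a rule that selects only OSDs under the "ssd" CRUSH root
ceph osd crush rule create-simple ssd_rule ssd host
# Inspect the new rule to learn its ruleset id
ceph osd crush rule dump ssd_rule
# Bind the bucket index pool to that rule (0.8x uses crush_ruleset; id 1 is illustrative)
ceph osd pool set .rgw.buckets.index crush_ruleset 1
```

These commands require a live cluster with admin credentials, so no runnable test is given here.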

After the system went live, customer data flowed steadily into the Ceph object storage cluster. For the first three months everything ran normally. SATA disk failures did occur during this period, but Ceph's own fault detection and repair mechanisms handled them easily, and the operations team felt quite relaxed. Starting in May, the operations engineers occasionally complained that the OSD backing an SSD would sometimes become very slow and cause business stalls; their simple and effective fix was to restart that OSD, after which things returned to normal. This happened sporadically a few times, and the operations team asked whether there was anything wrong with our use of the SSDs. After analysis we concluded there was nothing unusual about the SSD usage — apart from switching the disk scheduler to deadline, which had already been done — so we did not pay much attention to the matter.
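The deadline scheduler change mentioned above is the usual sysfs switch; a sketch, with the device name (sdb) illustrative and root privileges assumed:

```shell
# Show the available schedulers; the active one appears in brackets
cat /sys/block/sdb/queue/scheduler
# Switch the disk to the deadline elevator (not persistent across reboots)
echo deadline > /sys/block/sdb/queue/scheduler
```

To make the change survive reboots it is typically also added to the kernel command line or a boot script.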

At 21:30 on the evening of May 28, the on-call engineers received a system alarm on their phones: a small number of file writes were failing. They logged in immediately and found the cause was slow reads and writes on the OSD backing the SSD of one server. Based on past experience, restarting that OSD process would restore normal operation, so they restarted it without hesitation and waited for the system to recover. This time, however, the SSD's OSD started very slowly, which caused the SATA OSD processes on the same server to stall and lose their heartbeats. After a while, the SSD OSD processes on other servers also began to slow down. Restarting the SSD OSDs on those servers produced the same symptoms, and after repeated restarts more and more SSD OSD processes failed to start. The operations team immediately reported the situation to engineering and asked for urgent support.

After arriving at the office and hearing the operations team's account, we logged in to the servers, tried to start the OSD processes of several SSDs, and repeatedly observed and compared the startup process:

1. top showed that once started, the OSD process allocated memory furiously — up to 20GB or sometimes even 30GB; sometimes system memory was exhausted and swap was used; and even when the process did eventually come up, the OSD still held as much as 10GB of memory.

2. The OSD log showed that output stopped after entering the FileJournal::_open stage, and only after a long wait (over 30 minutes) did it move on to the load_pg stage; after load_pg there was another long wait, and even though load_pg completed, the process still committed suicide and exited.

3. During these long startups, pstack was used to view the process call stacks. In the FileJournal::_open stage the stack showed OSD journal replay, with levelDB record deletions being performed during transaction processing; in the load_pg stage the stack showed levelDB files being repaired from the levelDB log.

4. Sometimes one SSD OSD process started successfully, but after running for a while, the OSD process of another SSD would die abnormally.

All these symptoms pointed to levelDB. Was the huge memory allocation related to it as well? Digging into the levelDB-related code, we found that when a levelDB iterator is used inside a transaction, memory is continuously allocated as the iterator walks the records, and none of it is released until the iterator is destroyed. It follows that if the iterator touches a very large number of records, a huge amount of memory is allocated during iteration. With that in mind, we checked the object counts of the buckets and found several buckets holding 20 million, 30 million, even 50 million objects — and the bucket index objects of these oversized buckets lived on exactly the SSD OSDs that were failing. The cause of the huge memory consumption appeared to be found: a major breakthrough. By then it was 21:00 on the 30th. Over those two days users had begun calling to complain, and everyone felt this was "big trouble". After nearly 48 hours of firefighting, eyes red and swollen, the team had to stop and rest, or some of them would collapse before dawn.

At 8:30 on the 31st, the team went back into battle.

The next problem: some OSDs went through the long startup only to commit suicide and exit after load_pg completed. Reading the Ceph code confirmed that certain threads killed themselves on timeout after not being scheduled for a long time (probably because levelDB threads monopolized the CPU). Ceph has a filestore_op_thread_suicide_timeout configuration parameter, and testing verified that setting it to a large value avoids this kind of suicide. Another glimmer of light — the clock read 12:30.

Even after some processes started, they still held as much as 10GB of memory. Left unsolved, this meant that even with the SSD OSDs up, the SATA OSDs on the same server would suffer from memory pressure. Keep going — this was the darkness before dawn. Some searched the documentation, some read code, and at 14:30 we finally found in the Ceph documentation a command to force memory release: ceph tell osd.* heap release. Running it after a process starts releases the excess memory held by the OSD. Everyone was thrilled; we tested it immediately and verified that it worked.

As for why, after one SSD OSD started and ran for a while, the OSD processes of other SSDs would exit: per the analysis above, this was mainly driven by data migration. An OSD involved in migration deletes the related records, which triggers levelDB to delete object metadata records; once it hits an oversized bucket index object, levelDB walks the object's metadata records with an iterator, consuming enormous memory and taking down OSD processes on that server.

Based on this analysis, and after nearly two hours of repeated discussion and argument, we formulated the following emergency measures:

1. Set the noout flag on the cluster to prevent PG migration. Once PG migration occurs, any OSD that has a PG moved off it deletes that PG's object data afterward, triggering levelDB deletes of object metadata records; if the PG contains an oversized bucket index object, the iterator traverses its metadata records and consumes a huge amount of memory.

2. To save the SSD OSDs and restore the system as soon as possible, start each SSD OSD process with the filestore_op_thread_suicide_timeout parameter set to a large value. When a faulty OSD is brought up, levelDB work hogs the CPU and blocks thread scheduling. Ceph's thread-deadlock detection declares a deadlock if a thread is still not scheduled within the time this parameter configures, and the process then commits suicide; setting the parameter avoids that.

3. With memory so tight, abnormal OSD startups spill into swap. To speed up OSD startup, move the swap partition onto the SSD.

4. Set up a scheduled task that periodically runs ceph tell osd.* heap release to force OSDs to release memory.

5. When an SSD's OSD runs into trouble, follow these steps:

a) First stop all OSD processes on that server to free up memory.

b) Then start the OSD process with the filestore_op_thread_suicide_timeout parameter set to a large value, such as 72000.

c) Watch the OSD start up; as soon as load_pgs completes, immediately run ceph tell osd.N heap release manually to force it to release memory.

d) Watch the cluster status; when all PGs return to normal, start the OSD processes of the other SATA disks.
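Put together as commands, the emergency measures above look roughly like this. A sketch only: the OSD id, the init-script invocation, and the timeout value are illustrative, and a running 0.8x cluster is assumed:

```shell
# 1. Forbid data migration while OSDs bounce up and down
ceph osd set noout

# a) Stop all OSD processes on the affected server to free memory
#    (on 0.80-era sysvinit systems this was typically the ceph init script)
/etc/init.d/ceph stop osd

# b) Start the SSD's OSD with a very generous suicide timeout (osd id 12 illustrative)
ceph-osd -i 12 --filestore-op-thread-suicide-timeout 72000

# c) As soon as load_pgs completes, force the OSD to return its heap memory
ceph tell osd.12 heap release

# 4. Periodically release memory across all OSDs (e.g. from a cron job)
ceph tell osd.\* heap release
```

Each step depends on a live cluster and admin keyring, so no standalone test is given.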

Following these steps, we began restoring the OSD processes one by one at 17:30. During recovery, the very large bucket index objects took a long time to backfill, and every request to those buckets was blocked for the duration, causing application requests to time out — another negative consequence of storing a huge number of objects in a single bucket.

At 23:00 on May 31st, all OSD processes were finally restored. From failure to full recovery, we had fought for 72 nerve-racking hours. We looked at each other, smiled through the exhaustion, and pressed on to discuss and agree on a plan to solve the problem for good:

1. Expand the server memory to 64GB.

2. For new buckets, limit the maximum number of objects that can be stored.

3. After thorough testing of Ceph version 0.94, upgrade to it to address the problem of oversized single-bucket index objects.

4. Optimize Ceph's use of levelDB iterators: in a large transaction, iterate in segments. An iterator records its current position and is released after traversing a fixed number of records; a new iterator is then created to continue from the recorded position, keeping the iterator's memory footprint bounded.

Taking past events as a guide for the future, we summarize the following lessons:

1. The system must be fully tested before going online

Before Company A's system went live, although Ceph was fully tested for functionality, performance, and failure handling, there was no stress test with a large volume of data. Had tens of millions of objects in a single bucket been tested beforehand, this hidden danger might have been discovered in advance.

2. Every anomaly during operations deserves timely attention

In this case, the operations team had already reported the SSD anomalies some time before the problem erupted; unfortunately, we did not take them seriously. Had we analyzed them in depth at the time, we might have found the root cause and formulated avoidance measures in advance.

3. Understand Ceph's temperament

Every software product has its specification limits, and Ceph is no exception. Had we understood Ceph's architecture and implementation principles in depth beforehand — in particular the negative impact of storing a huge number of objects in a single bucket — and planned ahead, the problems in this case would not have occurred. RGW has comprehensive quota support, at both the user level and the bucket level, and the maximum number of objects allowed in a single bucket can be configured.
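For example, a bucket-scope quota can be configured with radosgw-admin. A sketch, assuming an RGW version recent enough to support quotas; the uid and limit are illustrative:

```shell
# Cap each bucket owned by this user at 10 million objects (values illustrative)
radosgw-admin quota set --uid=customer1 --quota-scope=bucket --max-objects=10000000
radosgw-admin quota enable --uid=customer1 --quota-scope=bucket
# Verify the quota settings in the user's metadata
radosgw-admin user info --uid=customer1
```

These commands run against a live RGW deployment, so no standalone test is given.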

4. Always track the latest progress of the community

Ceph version 0.94 supports sharding of bucket index objects: a bucket index can be split across multiple shard objects, which effectively alleviates the problem of oversized single-bucket index objects.
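With 0.94, the shard count for newly created buckets is controlled by the rgw_override_bucket_index_max_shards option. A sketch — the RGW instance section name and shard count are illustrative, and the setting only affects buckets created after it takes effect:

```shell
# Append the sharding option to ceph.conf on the RGW host (sketch; back up first)
cat >> /etc/ceph/ceph.conf <<'EOF'
[client.radosgw.gateway]
rgw_override_bucket_index_max_shards = 8
EOF
# Restart the gateway so newly created buckets get sharded indexes
# (the init mechanism varies by distribution and Ceph packaging)
/etc/init.d/radosgw restart
```

Existing buckets keep their single index object; sharding applies only to buckets created afterwards.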
