
To end this topic: Is it true that operation and maintenance jobs can no longer be done?

Jun 09, 2023 pm 06:57 PM
Operation and maintenance

Last Friday, Ma Chi and Lai Wei held an online discussion on one question: is there really no future left for operation and maintenance jobs? As the host, I both stirred things up and kept the conversation moving :) I learned a lot from listening to the two veterans share their views, so I am writing it down today before I forget. Consider this a recap of the live broadcast.

About the tool platform

The tool platform will replace part of the labor force. This is actually obvious and needs no further explanation.

But who will build the tool platform? This is worth examining. Monitoring systems, CI/CD platforms, chaos engineering platforms, middleware services, and so on are all Platforms, and they are built by Platform Engineers (PE for short). PEs naturally fall into many groups, and each PE group is responsible for a limited number of platforms. These scattered PE groups can be organized into one large team, such as an infrastructure team, or split across several departments. For example, the PE teams for engineering productivity can sit in one department (such as an engineering productivity department), the PE teams for databases and big data in another (such as a data department), and the PE teams for stability assurance in a third (such as the operation and maintenance department).

How the organization is divided varies from company to company and does not matter much. The key question is how a PE team should do its work. At its core, a PE team must do the following:

  • Build a platform that is genuinely useful and lets business R&D teams serve themselves.
  • Make the platform embody best practices. The platform has to satisfy the business, but it must also carry industry best practices. In principle, when business needs conflict with industry best practices, best practices should prevail as far as possible. If that is truly impossible in the short term, there should be a plan to get there step by step. Otherwise, as one-off customizations and anti-patterns pile up, the Platform team becomes more and more uncomfortable until the platform collapses under its own weight and has to be torn down and rebuilt.
  • Find ways to enforce standards through the platform rather than through rules and regulations. Ma Chi gave a good example. His team has a standard that business programs must not store state data on local disks. They did not declare this a red line or issue a decree. Instead, they told the business side plainly that containers would be restarted regularly so that they drift to other hosts. Anyone who has used AWS knows that AWS virtual machines sometimes restart without warning. It is the application developer's responsibility to build highly available applications on top of unreliable infrastructure.
  • Rely on a COE (domain experts) to guide the evolution of the Platform, because an architect who is good at databases may not be good at Hadoop, one who is good at Hadoop may not be good at observability systems, and one who is good at observability systems may not be good at chaos engineering.

But Platforms are not built overnight. What should we do if we do not have them yet? The company should recruit a COE first and let the COE act as a consultant to the business while building up the Platform's capabilities. If the business is growing fast and building the Platform in-house is too slow, the company can also turn to external suppliers. The COE itself may bring in external solutions, depending on the situation.

About external suppliers

It is widely felt that European and American companies are more willing to buy SaaS services, while domestic companies prefer to build their own services on top of open source. Is that because domestic companies have the wrong mindset? Not really. The core problem is that many domestic fields lack reliable ToB companies and products. Imagine a ToB company that could offer Party A (the customer):

  • An excellent, advanced methodology
  • Stable, easy-to-use products
  • An excellent, stable customer success team that helps customers put best practices into effect
  • A price lower than what Party A would pay to hire staff and build in-house

Any CXO in their right mind would bring in such a supplier. But does such a ToB company exist? That is a big question mark. We founded Kuaimao Nebula to provide customers with observability products, and we strive to become such a supplier. I hope our ToB peers in the industry will work toward this together!

Extending this to career choices: even if a given niche has no good supplier today, what about three years from now? Five years? Have overseas vendors already taken the lead? Are there suppliers with real potential in China? If there are, do you still dare to keep betting your career on that niche? Should you be planning ahead?

Of course, our estimates of the future are usually either too optimistic or too pessimistic, and our sense of timing either too early or too late. In the end, it comes down to your own judgment.

About emergency fault handling

Should OnCall fault response be handled by R&D or by operation and maintenance? This is an interesting question. Ma Chi believes that 80% of online faults are related to changes. Changes are made by R&D, and R&D is obviously more familiar with them. Having R&D respond to OnCall faults means that 80% of problems get a faster response.

This applies beyond business R&D. Database changes, basic network changes, and access layer changes work the same way. It seems more reasonable for the person who makes a change to respond to fault alarms from their own service.

In practice, this depends on two premises:

  1. Monitoring and observability are good enough that problems caused by changes can be discovered promptly through the platform. I hope every company gets to the point of having a complete observability system.
  2. Problems introduced by a change show up immediately. If a problem only surfaces a week after the change, the person who made it is unlikely to suspect their own change.
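The second premise can be made concrete with a small sketch. It assumes a hypothetical change log in which each record carries a service name, an author, and a deployment time; the `Change` type and the `suspect_changes` helper are illustrative names, not part of any platform mentioned in the broadcast. Given an alarm, it lists the changes to that service deployed within a lookback window before the alarm fired.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One record in a hypothetical change log."""
    service: str
    author: str
    deployed_at: datetime


def suspect_changes(changes, service, alert_time, window=timedelta(hours=1)):
    """Return changes to `service` deployed within `window` before the alarm, newest first."""
    hits = [
        c for c in changes
        # Keep only changes that happened before the alarm and inside the window.
        if c.service == service
        and timedelta(0) <= alert_time - c.deployed_at <= window
    ]
    return sorted(hits, key=lambda c: c.deployed_at, reverse=True)
```

With a short window, a change made a week earlier never appears in the list, which is exactly why delayed problems are hard to trace back to the person who introduced them. Widening the window brings the old change back, but along with every other change made in between.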

In fact, there are two situations to distinguish. Monitoring a service's stability right after a change is the responsibility of the person who made the change. Daily OnCall is a different scenario and should be handled separately. So who should take daily OnCall? Those who can directly take part in locating the fault and stopping the loss. The reason is obvious: if the OnCall person receives an alarm and then has to go and find someone else, the loss is stopped far too late.

So, first of all, alarms should be classified and handled by category, with different people on call for different alarms. Handing all alarms to R&D, or all of them to operation and maintenance, is too absolute an approach and is unreasonable.
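A minimal sketch of category-based routing, assuming each alarm arrives as a dictionary with a `category` field. The category names and rotation names below are hypothetical placeholders chosen for illustration, not ones named in the broadcast.

```python
# Hypothetical mapping from alarm category to the rotation that can
# directly locate the fault and stop the loss for that category.
ROUTES = {
    "application": "rd-oncall",
    "database": "dba-oncall",
    "network": "network-oncall",
    "access-layer": "gateway-oncall",
}

# Fallback rotation for alarms without a known category.
DEFAULT_ROUTE = "ops-oncall"


def route_alert(alert: dict) -> str:
    """Pick the on-call rotation for an alarm based on its category."""
    return ROUTES.get(alert.get("category"), DEFAULT_ROUTE)
```

The point of the table is that the recipient of each category is someone who can act on it directly, so no hand-off is needed between receiving the alarm and starting to stop the loss.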

About change release

There is consensus on the ultimate goal: let business R&D release versions freely. But we also want releases to be controlled and safe, and we want business continuity protected while a release is in progress. This places extremely high demands on the CI/CD system.

Without those requirements, changing the underlying system is just a matter of running a script across a batch of machines. Once they are added, the job becomes much harder and turns into a systems engineering effort.

On the business R&D side, services need observability instrumentation, and the monitoring system must detect problems promptly and, ideally, block the release process automatically after an alarm. Blue-green and canary release mechanisms are needed, along with automated code scanning and security scanning. When the tooling is incomplete, it is unreasonable to simply demand that R&D guarantee every change is safe and can be rolled back. A company's CI/CD capability is a fairly good indicator of its technical strength.
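The idea of blocking a release automatically after an alarm can be sketched as a staged (canary-style) rollout loop. This is a simplified illustration under stated assumptions: `deploy`, `has_alarm`, and `rollback` are hypothetical callables standing in for whatever the CI/CD and monitoring systems actually provide.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    has_alarm: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Roll out in increasing traffic percentages; stop and roll back on any alarm.

    Returns True if every stage completed without an alarm, False otherwise.
    """
    for percent in stages:
        deploy(percent)      # expose the new version to `percent` of traffic
        if has_alarm():      # ask monitoring whether this stage raised an alarm
            rollback()       # block the release and restore the old version
            return False
    return True
```

A call such as `staged_rollout([5, 50, 100], deploy, has_alarm, rollback)` would expose the new version to 5% of traffic first and only proceed while monitoring stays quiet. A real system would also wait for an observation period at each stage before checking for alarms.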

If your company still has R&D submit tickets to operation and maintenance, and operation and maintenance then carries out the online operations, you should consider whether this is reasonable. Of course, the approach above is Internet-company oriented and may not suit every company. The live broadcast only offers one way of thinking, and you have to weigh it yourself.

Of course, how do we reach this ideal state, and how should we proceed step by step before it is reached? Because of time constraints, this was not discussed in the live broadcast. If the company's business is suited to running on Kubernetes, it is relatively easy to build such a system on Kubernetes, and you can start as soon as possible. If the business must run on physical or virtual machines, first build a unified change release platform, then fill in the gaps and improve it gradually.

About cost optimization

The two guests did not say much on this topic, but both were very cautious about it. A few reminders:

  1. People are more expensive than hardware. Never spend 50 million on manpower to save 40 million in hardware costs.
  2. Leave the business enough redundant computing power. If resources are too tight, a budget request is not approved in time, and a capacity shortfall causes a failure, then the customer experience suffers, public opinion turns negative, and the loss outweighs the gain.
  3. A cautionary example: spending 30 million on buying traffic, then failing to handle that traffic because 3 million was saved on hardware. That is a real loss.
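The arithmetic behind these reminders can be written down as a simple check. The sketch below is only an illustration: `net_saving` and `worth_doing` are hypothetical helper names, and treating the whole 30 million as the expected loss in the last example is a simplifying assumption.

```python
def net_saving(hardware_saved, labor_cost, expected_outage_loss=0):
    """Net benefit of a cost-optimization project, in the same currency unit."""
    return hardware_saved - labor_cost - expected_outage_loss


def worth_doing(hardware_saved, labor_cost, expected_outage_loss=0):
    """A project is only worth doing if it saves more than it costs."""
    return net_saving(hardware_saved, labor_cost, expected_outage_loss) > 0
```

For the first reminder, `net_saving(40_000_000, 50_000_000)` is negative by 10 million, so the project fails the check. For the third, saving 3 million while putting a 30 million spend at risk gives `net_saving(3_000_000, 0, 30_000_000)`, which is negative by 27 million.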

Summary

At this stage, while the platform system is not yet complete, building the operation and maintenance system on an architecture of self-service Platform + COE + BP (Business Partner) seems reliable and practical. In the future, when the Platform is good enough, BP headcount can be reduced (by then, BPs will have gradually acquired the ability to work as COE). If the Platform keeps maturing, COE headcount can be reduced as well. After that, well, perhaps neither operation and maintenance nor R&D will be needed.

