Data center operation and maintenance

Drawing on many years of operation and maintenance experience, AtHub divides operation and maintenance management into several core modules for detailed management.

AtHub operation and maintenance management concept

Operation and maintenance of data center infrastructure means ensuring that the data center environment meets customer SLA requirements across all the facilities and equipment that computer equipment needs to run normally, including the machine room's power supply and distribution system, air conditioning system, fire protection system, security system, etc. As large-scale Internet data centers grow exponentially, Internet businesses depend more and more on the data center, and the technical characteristics of data centers keep changing, so infrastructure operators need to take targeted action and manage infrastructure operation and maintenance in detail.

Efficient and reliable operation and management of a data center has to work in two dimensions, point and aspect: individual points are managed in depth, and together they cover every aspect of the whole.
From the point management perspective, operation and maintenance management needs to dissect the operating condition of every equipment sub-module, analyze the operating parameters of each sub-module, build parameter benchmarks, and so achieve proactive operation and maintenance management.
From the aspect management perspective, operation and maintenance management needs to cover all professional systems of the data center and clarify the logic of, and relationships between, those systems.
To manage the data center lifecycle well from both point and aspect, AtHub draws on many years of operation and maintenance experience and divides operation and maintenance management into several core modules for detailed management. These modules include:
  • Safety management
    • Event handling
    • Problem management
    • Modification management
    • Staff access management
    • Equipment access management
    • Knowledge management
  • Staff management
    • Duty arrangement
    • Attendance management
    • Staff competence level management
    • Performance management
    • Behavior analysis & emotion management
    • Star level management (incentive management)
    • Training exam management
    • Code of conduct management for outsiders
  • Construction management
    • Construction content management
    • Construction plan arrangement (maintenance plan, exercise plan, preventive maintenance plan FMEA)
    • Construction work order management (emergency repair construction management, temporary worker list)
  • Cost management
    • Electricity bill cost
    • Water bill cost
    • Fuel cost
    • Heating cost
    • Spare part cost
    • Consumables management cost
    • Other cost: man-hour fee etc.
  • Supplier management
    • Verify that supplier delivery matches the contract
    • Implement on-site execution management
    • Implement KPI assessment
  • Customer management
    • Customer satisfaction survey
    • Customer problem follow-up management
    • Service delivery management
    • Customer problem maintenance
  • Charging management
    • Testing cabinet management
    • Cabinet power-on
    • Bandwidth management
    • Over-electricity management
  • Equipment management
    • Basic property management of equipment
    • Static parameter management of equipment
    • Operating parameter management of equipment
    • Cascade relationship management of equipment
    • Cluster relationship management of equipment

Operation and maintenance structure system of data center

The overall data center operation and maintenance organization should be made up of three teams: the operation and maintenance management team, the quality management team and the information management team. The three teams complement each other and all are indispensable. The operation and maintenance management team ensures daily system implementation and quick response, the quality management team ensures quality supervision and risk management of operation and maintenance, and the information management team ensures that the operation and maintenance system is standardized, replicable and measurable in practice.

Operation and maintenance management team

The operation and maintenance management team is mainly responsible for managing and executing daily operation and maintenance, including first-tier and second-tier operation and maintenance support. It handles on-site operation and maintenance, emergency handling, and maintenance of equipment and facilities.

Quality management team

The quality management team is composed of senior operation and maintenance engineers and a lean management team. The senior operation and maintenance engineers are mainly responsible for verification of each data center, major malfunction handling and preventive maintenance, and they oversee the whole operation and maintenance effort as third-tier operation and maintenance support.

Information management team

The information management team is mainly responsible for operation and maintenance management, research and development, and daily maintenance of the big data analysis platform.

Data center advanced operation and maintenance service

Advanced operation and maintenance and quality supervision

Advanced operation and maintenance
Advanced operation and maintenance is carried out by senior maintenance engineers, who are divided into two specialties: HVAC and electricity. HVAC senior maintenance engineers hold a large refrigeration equipment maintenance certificate, have more than twenty years of refrigeration equipment maintenance experience, and can take the lead in repairing most equipment malfunctions. Electricity senior maintenance engineers all have more than twenty years of electrical experience and can take the lead in UPS battery discharge tests and the annual maintenance of diesel generators.

Senior maintenance engineers all have experience in the verification and acceptance of multiple projects, along with a strong ability to find problems and a strong sense of responsibility, which helps verification and acceptance work finish on schedule.
Quality supervision
Quality supervision is mainly responsible for inspecting the daily code of conduct and on-site 6S. It conducts scheduled inspections and unannounced spot checks in accordance with the requirements of the operation and maintenance management system, and combines three channels (on-site, monitoring and platform) to inspect mainly the following aspects:

· Operation and maintenance records: equipment operating record, efficiency record, inspection record, duty log, etc.
· Code of conduct: work discipline, dress code, etc.
· 6S management: machine room cleanliness, orderly placement of items, etc.
· Fire protection security: fire protection inspection record, fire protection equipment inspection, fire hazard investigation, etc.
· Documentation: document list check, inspection of inquiry and duplicate record, confirmation of validity of on-site information

Problems found by senior maintenance engineers and by quality supervision are summarized monthly in a monthly supervision report, whose contents include but are not limited to: problem description, on-site pictures, corrective recommendations and deadlines.
Customer satisfaction is assessed quarterly: customer opinions are collected, and corrective measures and their results are implemented and followed up.

Risk evaluation

Assist in reviewing the data center operation and maintenance SOP/MOP/EOP documents, covering processes such as single-feed mains outage, dual-feed mains outage, ATS switching, breaker trip, routine start-up of the diesel generator room, direct supply mode of refrigeration equipment, operation of plate heat exchangers, and operation of precision power distribution cabinets. A procedure shall be amended if it fails to meet design principles and specifications or affects function, capacity or redundancy requirements.

Identify technical problems found in daily operation and maintenance work, and formulate and implement corresponding solutions to further improve the quality of data center infrastructure operation and maintenance. For malfunctions and alarms discovered during monitoring and inspection that exceed the existing technical capabilities of the operation and maintenance team, assist in major risk evaluation, provide solutions, and, following the event handling procedure or the notification mechanism, distill the lessons into standardized output.

Technology training

In building the operation and maintenance and management team, data center operation and maintenance engineers and managers must have a clear understanding of the entire data center system structure and of their own specialty. Sound professional knowledge plays an important role in identifying and handling risks and, later on, in saving energy and reducing consumption.
The AtHub technology team or senior maintenance engineer team conducts regular and ad hoc basic and thematic theory training for Operation and Maintenance Department staff and frontline operators, and organizes the corresponding tests and assessments. Basic training covers: basic theory of power supply and distribution, principles of electrical load calculation, breaker and cable comparison and selection, sizing calculations for power distribution cabinets and UPS, basic air conditioning theory, air conditioning load calculation, key points in selecting refrigeration equipment, water pumps and plate heat exchangers, basic BA knowledge, typical structure, control logic strategy, an introduction to common gas and water fire protection systems, and basic knowledge of building structure and decoration.
Thematic training covers: an introduction to system design concepts (capacity, redundancy, function), electrical system structure, refrigeration system structure, operating logic of cluster refrigeration stations, design cases and sharing of industry operation and maintenance technology, and an introduction to and exercises on common machine room energy-saving retrofit schemes.
In addition, for the high-frequency problems of each stage and the needs of operation and maintenance work, the team works with senior engineers to analyze and handle problems and to give point-to-point thematic technical guidance.

Major malfunction and technology improvement scheme support

On the operation and maintenance side, when a major malfunction occurs in a data center, AtHub provides on-site support from senior operation and maintenance engineers or technical staff. On the technology side, it focuses on supporting technical retrofit requirements that involve adjusting function, capacity or redundancy. Taking into account the building floor plan, electrical and refrigeration redundancy, the impact on existing machine room business, system maintainability and construction feasibility, it provides the technology improvement scheme, construction drawings, bill of quantities, equipment procurement technical specifications, etc. Where a modification service is involved, it works with operation and maintenance to submit a detailed modification scheme for the customer to review in advance.

Machine room maintenance management staff regularly evaluate the operating condition of machine room electrical equipment and air conditioning, assess parameters that are close to their thresholds and raise early warnings, provide review and evaluation opinions on the related performance and capacity optimization advice, and update the replacement scheme provided by operation and maintenance.
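
The early warning for parameters that are close to their threshold can be sketched as follows. This is only an illustration under stated assumptions: the `Reading` record, the `early_warnings` function, the 90% warning margin and the sample equipment names and values are all hypothetical, and every threshold is assumed to be a positive upper limit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    """One monitored parameter of one piece of equipment (hypothetical record)."""
    equipment: str
    parameter: str
    value: float
    threshold: float  # assumed to be a positive upper alarm limit


def early_warnings(readings: List[Reading], margin: float = 0.9) -> List[Reading]:
    """Return readings that have reached margin * threshold but not the threshold itself.

    Readings at or above the threshold are already alarms, so they are
    left to the normal alarm handling and are not reported here.
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    return [r for r in readings if margin * r.threshold <= r.value < r.threshold]


# Sample data for illustration only; names and limits are made up.
readings = [
    Reading("UPS-A", "load_percent", 85.0, 90.0),         # close to the limit
    Reading("CRAH-3", "supply_air_temp_c", 24.0, 27.0),   # comfortably below
    Reading("Transformer-1", "winding_temp_c", 95.0, 90.0),  # already in alarm
]

for r in early_warnings(readings):
    print(f"{r.equipment} {r.parameter}: {r.value} is close to threshold {r.threshold}")
```

In practice the margin would probably differ per parameter, and parameters with lower limits would need the comparison reversed.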

Energy efficiency management and optimization

AtHub understands energy efficiency management as finding the best optimization that first meets SLA requirements; the core of energy efficiency management is process control.
PUE = total power consumption of data center production / power consumption of IT equipment. PUE optimization is essentially about reducing the numerator (total power consumption of data center production).
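
As a minimal sketch of the PUE ratio, the calculation below shows how lowering the total consumption while the IT consumption stays fixed lowers PUE. The function name `pue` and the monthly kWh figures are hypothetical and chosen only for illustration.

```python
def pue(total_kwh: float, it_kwh: float) -> float:
    """PUE = total data center power consumption / IT equipment power consumption."""
    if it_kwh <= 0:
        raise ValueError("IT equipment consumption must be positive")
    if total_kwh < it_kwh:
        raise ValueError("total consumption cannot be lower than IT consumption")
    return total_kwh / it_kwh


# Hypothetical monthly figures, for illustration only.
it_kwh = 1_000_000.0
before = pue(1_500_000.0, it_kwh)  # before optimization
after = pue(1_400_000.0, it_kwh)   # same IT load, lower total consumption

print(f"PUE before: {before:.2f}, after: {after:.2f}")
```

Because the IT load is set by the customer's business, only the numerator is available to the operator as an optimization lever.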

Energy efficiency management optimization scheme:
Determine the optimization goal: calculate the theoretical PUE value of the project under different natural climate conditions and different IT load ratios;
Data collection and analysis: the operation and maintenance platform collects equipment operating parameters and power consumption, divides the consumption by calculation into four blocks (refrigeration system power consumption, electricity system power consumption, end air conditioning power consumption, and others), and compares each against its theoretical value;
Analyze the report to find problem points;
On-site implementation: based on the problem points, investigate on site, draw up relevant optimization measures, and organize machine room operation and maintenance staff and the equipment supplier to carry out adjustment and optimization;
Effect evaluation: after implementation, analyze the data and evaluate whether the optimization effect meets the expected goal;
Full rollout: summarize the optimization scheme, and train on-site machine room operation and maintenance staff in energy efficiency management awareness and energy efficiency optimization methods.
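
The data collection and comparison steps above can be sketched as follows. The sketch assumes that the four blocks together make up all non-IT consumption, so that PUE equals 1 plus the sum of each block's consumption divided by IT consumption. The class `EnergyBreakdown`, the function `problem_points`, the tolerance and all figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EnergyBreakdown:
    """Measured consumption for one period, split into IT load and four overhead blocks."""
    it_kwh: float
    refrigeration_kwh: float
    electrical_kwh: float
    terminal_ac_kwh: float
    other_kwh: float

    def overhead_factors(self) -> Dict[str, float]:
        """Each block's contribution to PUE: block consumption / IT consumption."""
        return {
            "refrigeration": self.refrigeration_kwh / self.it_kwh,
            "electrical": self.electrical_kwh / self.it_kwh,
            "terminal_ac": self.terminal_ac_kwh / self.it_kwh,
            "other": self.other_kwh / self.it_kwh,
        }

    def pue(self) -> float:
        """PUE under the assumption that the four blocks cover all non-IT consumption."""
        return 1.0 + sum(self.overhead_factors().values())


def problem_points(measured: Dict[str, float],
                   theoretical: Dict[str, float],
                   tolerance: float) -> Dict[str, float]:
    """Return the blocks whose measured factor exceeds the theoretical one by more than tolerance."""
    gaps = {name: measured[name] - theoretical[name] for name in measured}
    return {name: gap for name, gap in gaps.items() if gap > tolerance}


# Hypothetical measured data and theoretical targets, for illustration only.
month = EnergyBreakdown(it_kwh=1000.0, refrigeration_kwh=250.0,
                        electrical_kwh=80.0, terminal_ac_kwh=100.0, other_kwh=20.0)
targets = {"refrigeration": 0.20, "electrical": 0.08, "terminal_ac": 0.09, "other": 0.02}

flagged = problem_points(month.overhead_factors(), targets, tolerance=0.02)
print(f"PUE: {month.pue():.2f}, blocks to investigate on site: {sorted(flagged)}")
```

Expressing each block as a share of IT consumption keeps the comparison meaningful when the IT load ratio changes between periods.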
Optimization method:
Based on the design basis and actual operation and maintenance experience, define reasonable normal operating conditions for the three modes of the refrigeration equipment: refrigerating mode, partly natural cooling mode and totally natural cooling mode;

· Cooling system: cooling tower fan frequency, water pump frequency, temperature control of cooling water and chilled water, precision air conditioning fan frequency, water valve, temperature control, chiller COP optimization, fresh air system control;
· Electricity system: lighting management and control, enabling the energy-saving mode of equipment such as UPS and HVDC;
· Others: maintenance work such as pipeline filter cleaning, cooling tower cleaning, air conditioning filter replacement, and machine room sealing;
· New technology: direct air cooling technology, indirect air cooling technology, plate cooling technology, and liquid cooling technology.
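
The three refrigeration operating modes named under the optimization method (refrigerating mode, partly natural cooling mode, totally natural cooling mode) could be selected along the lines of the sketch below. The text does not say which variable drives the switch, so the use of outdoor wet-bulb temperature is an assumption, and the function name, parameter names and switchover temperatures in the example are hypothetical. Real switchover points would come from the design basis and operating experience.

```python
from enum import Enum


class CoolingMode(Enum):
    REFRIGERATION = "refrigerating mode"
    PARTIAL_NATURAL = "partly natural cooling mode"
    FULL_NATURAL = "totally natural cooling mode"


def select_cooling_mode(wet_bulb_c: float,
                        full_natural_below_c: float,
                        partial_natural_below_c: float) -> CoolingMode:
    """Pick a mode from the outdoor wet-bulb temperature (an assumed control variable).

    Both switchover temperatures are site-specific inputs; none are built in.
    """
    if full_natural_below_c > partial_natural_below_c:
        raise ValueError("full natural cooling limit must not exceed the partial limit")
    if wet_bulb_c < full_natural_below_c:
        return CoolingMode.FULL_NATURAL
    if wet_bulb_c < partial_natural_below_c:
        return CoolingMode.PARTIAL_NATURAL
    return CoolingMode.REFRIGERATION


# Hypothetical switchover temperatures, for illustration only.
for wet_bulb in (5.0, 10.0, 20.0):
    mode = select_cooling_mode(wet_bulb, full_natural_below_c=8.0, partial_natural_below_c=15.0)
    print(f"wet-bulb {wet_bulb} C: {mode.value}")
```

A production control strategy would normally add hysteresis or a minimum dwell time so that the plant does not switch back and forth when the temperature hovers around a limit.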