Responsibilities
1. Ensure the reliability and normal operation of multiple core online computing systems, with attention to system capacity and stability.
2. Build automated operations solutions for large systems, and cooperate with system development teams to ensure system reliability throughout the entire life cycle, from design to launch.
3. Improve system visibility by monitoring component availability and performance indicators, and help system development teams quickly locate faults.
4. Drive improvements in service reliability, scalability, and performance to meet system SLAs, and participate in the design and implementation of an automation platform that supports rapid iteration of large-scale online clusters.
5. Based on business usage scenarios, deeply optimize services and provide best practices for service governance, including but not limited to analyzing performance bottlenecks on critical paths, locating and troubleshooting business problems, and promoting the transformation and upgrade of high-availability system architecture.
6. Participate in online service optimization projects, such as improving resource utilization, improving commercialization structure, and reducing costs.
Qualifications
1. Bachelor's degree or above in a computer-related major.
2. 1-3 years of work experience in R&D, with solid fundamentals in computer software and an understanding of the principles of the Linux operating system, storage, network I/O, etc.
3. Familiarity with one or more programming languages or automation tools, such as Python, Go, Java, Shell, or Ansible.
4. Ability to solve problems systematically, good communication skills, and a sense of ownership.
5. Experience with relevant computing, distributed, or big data systems, such as Kubernetes, Docker, Spark, or Flink, is preferred.
6. Algorithmic thinking and strong data structure and system design skills are preferred.