The mission of the SRE (Site Reliability Engineering) team is to keep Shopee running efficiently and sustainably 24x7, and to build and maintain large-scale, highly available, high-performance distributed systems, with system availability and performance as the guiding criteria. The role combines traditional software engineering with technical operations. The SRE team needs to dive deep into the Shopee development lines to ensure that systems remain highly scalable as they evolve rapidly. From the perspective of stability and performance, its scope covers the design of business applications, components of the basic platform (middleware, container scheduling, caching, object storage, etc.), OS optimization, and data center and network optimization. We use engineering and service-oriented approaches to streamline the inefficient and complicated work of the traditional operations and maintenance model, and are committed to building a sound monitoring system to improve the efficiency of incident handling.

Job Description:

  • Dive deep into development lines to learn and understand the mechanism of every application component, and promote product scalability, stability and performance
  • Set up, manage and maintain Shopee product/middleware/big-data applications and services
  • Perform regular and ad-hoc server-side deployments, performance fine-tuning and troubleshooting
  • Design and develop an automated technical operations platform
  • Manage capacity and resources
  • Run full-chain stress tests to improve application performance and remove redundancy
  • Prepare routine operation documentation

Requirements:

  • Bachelor’s or higher degree in Computer Science, Engineering, Information Systems or related fields
  • Extensive hands-on knowledge of Linux operating systems (Ubuntu, CentOS, etc.)
  • Knowledge of computer networking (TCP/IP, DNS, etc.), computer organisation and operating systems
  • Hands-on experience with at least one of the following programming languages: Bash, Python, Go
  • Strong analytical and problem-solving skills, with the ability to perform well in difficult and stressful situations
  • Passion and high sense of responsibility for work
  • Fast learner and a good team player
  • Detail-oriented, cautious and prudent
  • Open to fresh graduates who are passionate about the technical operations of internet products, Linux and open source

The skills below are optional but preferred:

  • Experience with automation tools such as Ansible or SaltStack
  • Experience with monitoring tools such as Prometheus, Zabbix or Grafana
  • Experience with load balancing tools such as LVS, Nginx, OpenResty or HAProxy
  • Experience with container technology such as Docker, Kubernetes
  • Experience with high-availability system design and server deployment processes
  • Experience with SRE
  • Experience with an Ops PaaS platform or Ops automation platform (e.g. CMDB)