Site Reliability Engineer

💰 Hiring bonus: KRW 500,000

Tech Stack

AWS
GCP
Azure
Python
Grafana
Prometheus

Key Responsibilities

• Monitoring and Alerting: Creating and maintaining robust monitoring systems to proactively identify and resolve issues before they impact customers. Implementing effective alerting mechanisms to ensure timely response to critical events.
• Incident Response: Participating in on-call rotations and leading incident response efforts to minimize downtime and restore service quickly.
• Automation: Developing and implementing automation tools and scripts to streamline operations, reduce manual effort, and improve efficiency.
• Capacity Planning: Forecasting resource needs, optimizing resource utilization, and ensuring the customers’ infrastructure can handle increasing workloads.
• Performance Optimization: Identifying and resolving performance bottlenecks, optimizing system performance, and improving response times.
• Collaboration: Partnering with software engineers, data scientists, and other teams to ensure alignment and efficient operations.
• Customer Focus: Working closely with the AI program manager and Technical Account Manager to understand customer issues, provide technical support, and improve customer satisfaction.
• Continuous Improvement: Driving a culture of continuous improvement by identifying opportunities to enhance system reliability, performance, and efficiency.

Requirements

• Bachelor's degree in computer science, engineering, or a related discipline.
• 5+ years of industry experience as a Site Reliability Engineer.
• Experience with cloud platforms (e.g., AWS, GCP, Azure).
• Experience with scripting languages (e.g., Python).
• Experience with monitoring and alerting tools (e.g., Prometheus, Grafana).
• Experience in ticket management, issue resolution, and troubleshooting.
• Strong problem-solving and troubleshooting skills.
• Ability to work independently and as part of a team.
• Excellent customer communication and interpersonal skills.

Preferred Qualifications

• Knowledge of containerization technologies (Docker, Kubernetes).
• Knowledge of AI/ML infrastructure and workloads.
• Knowledge of big data technologies (Hadoop, Spark).
• Fluency in verbal and written English.

Benefits

• Annual medical check-up
• Group accident insurance
• Gym support
• Overtime support
• Documentation support (Grammarly)
• Self-development allowance
• Growth support

Hiring Process and Application Notes

Application Review > Phone Interview > (Virtual) Onsite Interview > CEO Interview and Core Values Interview > Offer

Experience: 5 to 10 years

Education: Bachelor's degree (4-year university) or higher

Application deadline: 2025-03-13

Work Location

7F, Aju Building, 201 Teheran-ro, Gangnam-gu, Seoul

About the Company and Service

Gauss Labs: Site Reliability Engineer

Gauss Labs is seeking a highly skilled Site Reliability Engineer to join our team. As an SRE at Gauss Labs, you will play a critical role in ensuring our industrial AI platform's reliability, performance, and scalability. You will be responsible for building and maintaining a robust solution that supports our growing business at the customer site.
