Position Details
- Tech Stack
GCP
AWS
Prometheus
Python
Azure
Grafana
- Key Responsibilities
• Monitoring and Alerting: Creating and maintaining robust monitoring systems to proactively identify and resolve issues before they impact customers. Implementing effective alerting mechanisms to ensure timely response to critical events.
• Incident Response: Participating in on-call rotations and leading incident response efforts to minimize downtime and restore service quickly.
• Automation: Developing and implementing automation tools and scripts to streamline operations, reduce manual effort, and improve efficiency.
• Capacity Planning: Forecasting resource needs, optimizing resource utilization, and ensuring the customers’ infrastructure can handle increasing workloads.
• Performance Optimization: Identifying and resolving performance bottlenecks, optimizing system performance, and improving response times.
• Collaboration: Partnering with software engineers, data scientists, and other teams to ensure alignment and efficient operations.
• Customer Focus: Working closely with the AI program manager and Technical Account Manager to understand customer issues, provide technical support, and improve customer satisfaction.
• Continuous Improvement: Driving a culture of continuous improvement by identifying opportunities to enhance system reliability, performance, and efficiency.
- Qualifications
• Bachelor's degree in computer science, engineering, or a related discipline.
• 5+ years of industry experience as a Site Reliability Engineer.
• Experience with cloud platforms (e.g., AWS, GCP, Azure).
• Experience with scripting languages (e.g., Python).
• Experience with monitoring and alerting tools (e.g., Prometheus, Grafana).
• Experience in ticket management, issue resolution, and troubleshooting.
• Strong problem-solving and troubleshooting skills.
• Ability to work independently and as part of a team.
• Excellent customer communication and interpersonal skills.
- Preferred Qualifications
• Knowledge of containerization technologies (Docker, Kubernetes).
• Knowledge of AI/ML infrastructure and workloads.
• Knowledge of big data technologies (Hadoop, Spark).
• Fluency in verbal and written English.
- Benefits
• Annual medical check-up
• Group accident insurance
• Gym support
• Overtime support
• Documentation support (Grammarly)
• Self-development allowance
• Growth support
- Hiring Process and Other Application Notes
Application Review > Phone Interview > (Virtual) Onsite Interview > CEO Interview and Core Values Interview > Offer
Company/Service Introduction
Gauss Labs is seeking a highly skilled Site Reliability Engineer to join our team. As an SRE at Gauss Labs, you will play a critical role in ensuring the reliability, performance, and scalability of our industrial AI platform. You will be responsible for building and maintaining a robust solution at customer sites that supports our growing business.