SITE RELIABILITY ENGINEER
Job Description
- We are seeking a skilled and experienced Site Reliability Engineer (SRE) to join our team.
- The successful candidate will apply SRE principles and practices across the organization as part of a global, round-the-clock reliability team. Some shift-based working will likely be required.
- They will perform a variety of technical activities on the Cloud Platform, such as patching and maintaining Kubernetes clusters on at least one of GKE, EKS, or AKS, and will need experience in a scripting language (Bash, Python or PowerShell).
- They should also have experience with Terraform, Helm and ArgoCD, with optional but beneficial experience in Istio, Rancher, Grafana, Grafana Mimir, Prometheus, OpenSearch and OpenTelemetry. Additionally, the candidate should have experience patching security vulnerabilities in workloads, monitoring and responding to alerts, and optimizing our estate.
- The ideal candidate will have experience working in a DevOps environment, a strong understanding of cloud computing and Kubernetes, and a demonstrated ability to implement SRE principles.
Responsibilities and Duties:
- Applying SRE principles and practices within the organization.
- Patching and maintaining Kubernetes clusters on at least one of GKE, EKS, or AKS.
- Patching security vulnerabilities in workloads.
- Monitoring and responding to alerts generated by our monitoring systems.
- Optimizing our Kubernetes estate to improve performance and efficiency.
- Collaborating with development teams to improve deployment processes and practices.
- Troubleshooting issues with Kubernetes clusters and workloads and owning solutions through to resolution.
- Creating and maintaining documentation on Kubernetes configurations and processes.
- Participating in the on-call rotation and responding to incidents as needed.
Key Skills:
- Strong experience in managing Kubernetes clusters in production environments.
- Experience with GKE, EKS, or AKS is required.
- Strong understanding of cloud computing and Kubernetes.
- Experience with monitoring and alerting tools such as Prometheus and Grafana.
- Familiarity with containerization technologies such as Docker.
- Experience with Bash, Python or PowerShell scripting and automation.
- Experience with Terraform, Helm, and ArgoCD.
- Optional but beneficial experience with Istio, Rancher, Prometheus, Grafana Suite, OpenSearch and OpenTelemetry.
- Experience adhering to SRE principles and practices.
- Strong problem-solving and analytical skills.
- Good verbal and written communication skills.
Benefits:
- These may include training, health, insurance, commuting support, lunch service, etc.