Site Reliability Engineer
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. SRE ensures that Gryphon’s services, both internally critical and externally visible, have reliability and uptime appropriate to users’ needs and a fast rate of improvement. SREs also keep a watchful eye on the capacity and performance of our systems. Much of our software development focuses on optimizing existing systems, building infrastructure, and eliminating work through automation.
On the SRE team, you’ll have the opportunity to manage complex challenges while using your expertise in coding, complexity analysis, and large-scale system design.
SRE’s culture of diversity, intellectual curiosity, problem-solving, and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Ensure compliance with security best practices and continually assess potential vulnerabilities.
- Adhere to customers’ contractual SLAs.
- Bachelor’s degree in Computer Science, a related technical field involving software/systems engineering, or equivalent practical experience.
- Experience in Infrastructure as Code (Ansible or Terraform)
- PLUS: Experience coding in at least one of the following languages: Python or Go
- Experience with business-critical cloud applications requiring 24/7 uptime
- Knowledge of corporate IT, data centers, ticketing system implementations, monitoring software implementation, troubleshooting, and continuous improvement approaches
- Skill and knowledge in Incident Management, Service Requests, Event Management, Access Management, Change Management, Knowledge Management and Escalated Incident Management
- Expertise in implementing and troubleshooting large-scale distributed systems.
- Ability to debug, optimize code, and automate routine tasks.
- Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
- Understanding of Unix/Linux operating systems.
- Google Cloud Certification a plus
- Exceptional work ethic, with strong values and principles – takes all opportunities to go above and beyond the basic expectations
- A self-motivated, highly innovative and adaptable individual who can work under stress and meet deadlines
- An absolute commitment to customer service
- Availability for occasional off-hours work to maintain 24×7 uptime and availability of the product suite; willingness to support team members who have on-call coverage expectations
The Site Reliability Engineer will work under the guidance of Gryphon’s VP, Operations & Network Services. Gryphon fosters a strong team environment, and the successful candidate is expected to embrace this culture.