OVERVIEW
Highly skilled, hands-on technical engineer with demonstrable success maintaining high-availability, large-scale enterprise/cloud services. Innovative problem solver with proven leadership and mentoring abilities. Long track record of delivering substantial return on investment to employers and clients. Committed to keeping up to date with the latest developments in the industry.
EXPERIENCE
Senior Site Reliability Engineer AJW Group |
2018 - present Sussex, UK |
Responsibilities
• Work alongside a geographically distributed team of Developers and Infrastructure Engineers for AJW Group, a world-leading independent specialist in the global management of commercial and business aircraft spares
• Led and developed the SRE culture within the organisation; implemented automated incident management across services
• Lead development of tools and automation to maintain production system uptime and achieve product SLAs
• Define SLAs / SLOs for services
• Feature development and enhancements for the Kubernetes PaaS platform
• On-call duties, incident management and postmortems for the platform
• Lead one-click deployment of PaaS infrastructure and auto-remediation / repair of infrastructure
• Ensure health of production systems, investigate anomalous behaviour and triage outages, shepherd code changes from development to production, develop and enhance automation and monitoring tools
• Provide technical leadership in cross-organizational projects
• Serve as escalation point for troubleshooting critical problems and unexpected operational issues
Accomplishments
• Documented achievement of service availability exceeding 99.99%
• Produced detailed service metrics, allowing consistently accurate utilization projections; variance from norm in metrics used as an early-warning mechanism for detecting problems/changes in behaviour
• Developed benchmarking tools for system analysis and optimization; allowed detailed performance testing of new hardware and software configurations outside of actual production environment
• Established a common monitoring and reporting framework which facilitated the rapid development and deployment of new services
• Established a configuration management toolkit for enforcing operational best-practices throughout the organization
• Originally joined AJW Group as a Cloud Engineer; promoted to Senior SRE within a year of hire
• Successfully transitioned production deployment and on-call/triage responsibilities to SRE team; created documentation for SRE ramp-up and critical job functions, including prod deployment process/checklist
• Managed successful delivery of new production cloud architecture; developed system validation and performance benchmarking tools; streamlined validation and deployment processes
• Became highly proficient with Kubernetes, an open-source system for automating deployment, scaling, and management of containerized applications; helped develop Kubernetes best practices, identified bugs and suggested new features
• Implemented standards for incident tracking, documentation, and post-mortems
• Awarded Certified Kubernetes Administrator (CKA) certification
Cloud Engineer AJW Group |
2017 - 2018 Sussex, UK |
Responsibilities
• Architected new services, re-architected existing services, and conceived new features and functionality
• Incident Management and resolution
• Troubleshooting and triaging operational and application issues and fixing them within the defined SLA
• Infrastructure Capacity Management
• Infrastructure and application monitoring / logging
• Production upgrades / updates / patching
• Ensuring that support calls were logged and handled effectively / efficiently within agreed Service Level Agreements using ITIL compliant service desk applications
Accomplishments
• Implemented monitoring, alerting, and code delivery mechanisms which stabilized service reliability and reduced downtime by an order of magnitude within one month of taking over responsibility for AWS.
• Led effort to establish common Terraform infrastructure for all AJW Group cloud services.
Support Engineer Equinix / Telecity |
2013 - 2017 London, UK |
Responsibilities
• Ensuring that support calls were logged and handled effectively / efficiently within agreed Service Level Agreements using ITIL compliant service desk applications.
• Worked as part of a 24/7 network operations centre team for Equinix, a global managed services provider, supporting mission-critical datacenter infrastructure across the globe.
• Ensuring health of production systems, investigating anomalous behaviour and triaging outages.
• Monitoring the progress of live support tickets with third-party maintenance contract suppliers.
• Monitoring of internal and customer hardware, working with external hardware vendors and internal teams to remediate hardware and configuration issues.
• Working with network carriers to troubleshoot customer and internal networks. Carried out configuration changes on a broad range of core Cisco network equipment, including ASR Service Provider border routers and access switches.
• Rule checks on customer security hardware including Cisco and Check Point firewalls.
• Deployment of new physical and virtual servers. OS patching, configuration and troubleshooting of VMware ESXi hypervisors and virtual infrastructure management for both internal and customer environments.
• DDoS attack mitigation and threat management of customer and internal IP Networks.
Accomplishments
• Implemented standards for source code management using Git and GitLab.
EDUCATION
CNCF - Certified Kubernetes Administrator (CKA) |
2020 |
Arborventure LTD, UK - (CS38) Tree Climbing and Aerial Rescue |
2006 – 2007 |
Solent University, UK - Cisco Certified Network Associate |
2005 – 2006 |
University of Portsmouth, UK - Computer Network Management & Design |
2003 – 2005 |
TECHNICAL EXPERTISE
Software - Git, GitHub, GitLab, Docker, Kubernetes, Terraform, CloudFormation, Grafana, Prometheus
Operating Systems - Linux, Docker, Windows Server, macOS
Programming - Go, Bash, Python, SQL, HTML, CSS
Cloud Vendors - AWS, GCP, Azure