Cornell University seeks an experienced applications programmer for arXiv to serve as Service Reliability Engineer for its next-generation system. arXiv is the premier open access platform serving scientists in physics, mathematics, computer science, and other disciplines. For over 25 years, arXiv has enabled scientists to share papers within scientific communities and to publish "pre-prints": scientific papers shared before formal publication. Around the world, arXiv is recognized as an essential resource for the scientists it serves.
The arXiv team is adopting practices that improve the reliability of deployment, the ability to respond to increases in traffic, and the ability to observe, monitor, and troubleshoot the system. The primary activity involved in these changes is designing, coding, testing, and debugging highly complex programs that control the infrastructure and configurations that form the backbone of the arXiv platform. This work also has strong ties to site security, including detecting malicious activity in the system and ensuring compliance with security and data protection best practices.
This programmer will be a full-time member of the arXiv IT team, reporting to the IT Team Lead, and will collaborate with team members on the design and implementation of programs, configurations, and workflows to test, deploy, monitor, and scale the arXiv software system.
- Design, code, test, debug, document, and maintain highly complex systems for deploying, monitoring, and scaling software that supports the arXiv platform.
- Design and validate test routines and schedules for deployment, monitoring, and scaling processes.
- Analyze and support end-user needs and experiences related to service reliability.
- Analyze system errors, including security faults and missed performance goals. Plan and implement solutions to address these system faults.
- Identify risks and threats related to the availability and security of software components and the arXiv system as a whole, and design and implement strategies to mitigate those risks.
- Collaborate with members of the IT team to research, identify, plan, and lead the adoption of practices and technologies that increase service reliability and reduce the complexity of operating the arXiv system.
- A Bachelor’s degree or equivalent experience
- 5-7 years of relevant experience (developing, deploying, and monitoring web applications)
- Demonstrated aptitude for collaboration and open communication
- Experience developing and deploying production services based on open source technologies and tools
- Demonstrated aptitude for quickly learning new tools and technologies
- Proficiency with managing AWS services to support production systems
- Experience with DevOps practices and tools, such as Ansible, Puppet, and Terraform
- Experience running Docker containers in a production environment
- Experience operating Kubernetes in a production environment
- Knowledge of security best practices for distributed online systems
- Knowledge of or experience with the ELK (Elasticsearch, Logstash, Kibana) stack and related technology
- Knowledge of or experience with Helm
- Experience developing and deploying web applications using Python web frameworks such as Flask, Django, or Pyramid
Last updated: Friday, May 17, 2019 14:46 UTC