How do you keep a data-intensive, real-time service that monitors hundreds of thousands of servers up and running around the clock?

How do you respond to infrastructure failures or performance issues in a high-volume, low-latency computing environment?

What should the infrastructure look like when Datadog monitors millions of servers and containers? If these are problems that you find interesting and want to work on, apply to join the SRE team!

What you will do

What we're looking for

Bonus Points