What is Metrist and why did you build it?
Our mission at Metrist is to give software developers and
IT leaders the same level of visibility into the third-party cloud products that
they build on as they have with the software that they build themselves. Apps
are built on top of other apps. That can mean anything from a dozen different
services at AWS to APIs like Twilio, Stripe, and EasyPost and cloud tools like
CircleCI and GitHub. If one of those tools goes down, you risk going down or at
least having a degraded user experience or an inability to ship code.
The problem that we identified is twofold. First, it's really hard to find out,
or verify, that it's a third party causing the problem and not you, your code,
and the things you control. Second, it's really
hard to hold your vendors accountable to their SLAs. Metrist empowers
people to monitor the services that they rely on. We put the health of all of
your third-party cloud dependencies into a single dashboard, alerting you about
outages typically 10 to 20 minutes before a status page gets updated. We provide
you enough detail to answer not only the question "is it me or is it them?" but
also "what is the problem, will it impact me, and is there anything I can
do about it?"
One of the reasons I was excited to start this was because while working at
PagerDuty, talking to people about incident response and observability, I just
kept hearing over and over again from people that their downtime is tied back to
a third party, not their own software. But I didn't see monitoring tools
changing or adapting to focus more on those things. The New Relic and Datadog
agents can tell you that there's a problem with a call to a third party, but
there was still this sense of uncertainty over "is it me or is it them?"
Current synthetic monitoring tools such as Datadog, Grafana, and New Relic hit a
URL, check whether it returns a certain status code, and maybe run some extra
logic to call another endpoint. We go a step further: we stitch together an
end-to-end workflow of what to expect. If you are creating a bucket in S3, we
verify that the bucket exists, then upload files to that bucket,
delete them, and finally remove the bucket itself. If an endpoint is supposed to
send you an email or send you a webhook, we wait for those things and report
back how long it took to receive.
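As a rough illustration of the workflow idea described above, here is a minimal Python sketch. It is not Metrist's implementation. `FakeObjectStore` is a hypothetical in-memory stand-in for a real object storage client, and `run_workflow`, `check`, `wait_for`, and the bucket name are likewise names invented for this example. Each step is timed, the run stops at the first failure, and `wait_for` shows one way to measure how long an expected side effect (such as a webhook) takes to arrive.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: Optional[str] = None


def run_workflow(steps: List[Tuple[str, Callable[[], None]]]) -> List[StepResult]:
    """Run steps in order, timing each one; stop at the first failure."""
    results = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            action()
            results.append(StepResult(name, True, time.perf_counter() - start))
        except Exception as exc:
            results.append(
                StepResult(name, False, time.perf_counter() - start, str(exc))
            )
            break
    return results


def wait_for(predicate: Callable[[], bool], timeout: float = 5.0,
             interval: float = 0.01) -> float:
    """Poll until predicate() is true; return elapsed seconds or raise on timeout."""
    start = time.perf_counter()
    while not predicate():
        if time.perf_counter() - start > timeout:
            raise TimeoutError("expected event did not arrive")
        time.sleep(interval)
    return time.perf_counter() - start


class FakeObjectStore:
    """Hypothetical in-memory stand-in for a real object storage client."""

    def __init__(self):
        self.buckets = {}

    def create_bucket(self, bucket):
        self.buckets[bucket] = {}

    def bucket_exists(self, bucket):
        return bucket in self.buckets

    def put_object(self, bucket, key, body):
        self.buckets[bucket][key] = body

    def delete_object(self, bucket, key):
        del self.buckets[bucket][key]

    def delete_bucket(self, bucket):
        if self.buckets[bucket]:
            raise RuntimeError("bucket not empty")
        del self.buckets[bucket]


def check(condition: bool, message: str) -> None:
    """Fail the current step if an expectation does not hold."""
    if not condition:
        raise AssertionError(message)


def object_store_workflow(store, bucket="synthetic-check-bucket"):
    """Create, verify, upload, delete, and clean up, in that order."""
    return [
        ("create_bucket", lambda: store.create_bucket(bucket)),
        ("verify_bucket", lambda: check(store.bucket_exists(bucket), "bucket missing")),
        ("upload_file", lambda: store.put_object(bucket, "probe.txt", b"hello")),
        ("delete_file", lambda: store.delete_object(bucket, "probe.txt")),
        ("delete_bucket", lambda: store.delete_bucket(bucket)),
    ]


store = FakeObjectStore()
results = run_workflow(object_store_workflow(store))
for r in results:
    status = "ok" if r.ok else f"FAILED ({r.error})"
    print(f"{r.name}: {status} in {r.seconds:.6f}s")
```

The difference from a single status-code probe is that a failure here names the step that broke (for example, the bucket was created but could not be verified), which is the kind of detail that helps answer "what is the problem?" rather than only "is something wrong?".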
And then the bigger problem was holding vendors accountable. How do I know if they
hit their SLA last month or last quarter? We aim to solve that visibility
problem that is becoming a bigger piece of the developer's operational workload.
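To make the SLA question concrete, here is a small Python sketch of how measured availability could be compared against a target. It assumes availability is simply the fraction of successful checks in the period, which is a simplification: real vendor SLAs each define availability, measurement windows, and exclusions differently. The names `CheckResult`, `availability`, `meets_sla`, and `allowed_downtime_minutes` are hypothetical, as are the sample numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    ok: bool


def availability(checks: List[CheckResult]) -> float:
    """Fraction of checks that succeeded (assumed definition of availability)."""
    if not checks:
        raise ValueError("no checks recorded for this period")
    return sum(1 for c in checks if c.ok) / len(checks)


def meets_sla(checks: List[CheckResult], target: float) -> bool:
    """True if measured availability is at or above the SLA target (e.g. 0.999)."""
    return availability(checks) >= target


def allowed_downtime_minutes(target: float, days: int) -> float:
    """Downtime budget, in minutes, implied by a target over a period of `days`."""
    return (1.0 - target) * days * 24 * 60


# Sample data: 1000 checks in the period, 2 of which failed.
checks = [CheckResult(ok=(i >= 2)) for i in range(1000)]

print(f"availability: {availability(checks):.4f}")
print(f"meets 99.9% target: {meets_sla(checks, 0.999)}")
print(f"30-day budget at 99.9%: {allowed_downtime_minutes(0.999, 30):.1f} min")
```

With two failures in 1000 checks, measured availability is 99.8%, which falls short of a 99.9% target; over a 30-day month that target leaves a budget of roughly 43 minutes of downtime.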