Stripe’s Veneur: A distributed, fault-tolerant pipeline for observability data

109 points by federicoponzi 6 years ago

This was a pleasant surprise to see on Hacker News this morning! I work on the Observability team at Stripe and have been the PM for Veneur (and the rest of our metrics & tracing pipeline work) pretty much since we released it ~2 years ago.

If you're interested in learning more about how Veneur works and why we built it, I gave a talk at Monitorama last year that explains the philosophy behind Veneur[0]. In short, a massive company like Google is able to build their own integrated observability stacks in-house, but almost any other smaller company is going to be relying on an array of open-source tools or third-party vendors for different parts of their observability tooling[1]. When using different tools, there are always going to be gaps between them, which leads to incomplete instrumentation and awkward (inter-)operability. By taking control of the pipeline that processes the data, we're able to provide fully integrated views into different aspects of our observability data.

The Monitorama talk is a year old at this point, so it doesn't cover some of the newer things Veneur has helped us to accomplish, but the core philosophy hasn't changed. I've given updated versions of the talk more recently at CraftConf (in May) and DevOpsDaysMSP (last week), but neither of those videos is online yet.

[0] https://vimeo.com/221049715

[1] e.g. ELK/Papertrail/Splunk for logs, Graphite/Datadog/SignalFx for metrics, and maybe a third tool for tracing if you're lucky.

tchaffee 6 years ago

Am I the only one who is always slightly disappointed that neither the README file on Github nor the landing page at the website tells me why I would want to use the software in question? What problem it solves? Why might "a distributed, fault-tolerant observability pipeline" be interesting to programmers or anyone else? It seems like you've already got to be familiar with the problem space to understand what this is and what need it fulfills.

I'm not picking on this package. I see it all the time.

Can someone here explain to me what the use case is for this software?

jmillikin 6 years ago

The use case: you have more than a hundred machines emitting lots of monitoring data, much of which is uninteresting except in aggregate form. Instead of paying to store millions of data points and then computing the aggregates later, Veneur can calculate things like percentiles itself and forward only those.
It also has a separate, related purpose as a statsd protocol transport. You run it on the statsd port and it receives standard (or DataDog-extended) UDP traffic, then it forwards metrics via TCP to another backend. This has reliability benefits when operating over a network that might drop UDP packets, such as the public internet.
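To make the aggregation idea above concrete, here is a minimal Python sketch, not Veneur's actual code. It parses statsd-style lines of the form `name:value|type|#tags`, keeps timer samples in memory, and emits only a handful of aggregate points per metric. The function names, the `.count`/`.max`/`.p50`/`.p99` suffixes, and the sample metric are all hypothetical, and it computes exact nearest-rank percentiles from a sorted list, where a production aggregator would more likely use an approximate sketch.

```python
import math
from collections import defaultdict


def parse_statsd_line(line):
    """Parse 'name:value|type', optionally followed by '|#tag1,tag2'."""
    name, rest = line.split(":", 1)
    fields = rest.split("|")
    value, metric_type = float(fields[0]), fields[1]
    tags = ()
    for extra in fields[2:]:
        if extra.startswith("#"):
            tags = tuple(sorted(extra[1:].split(",")))
    return name, value, metric_type, tags


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct in 1..100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def aggregate(lines, percentiles=(50, 99)):
    """Collapse raw timer samples into a few aggregate points per metric."""
    samples = defaultdict(list)
    for line in lines:
        name, value, metric_type, tags = parse_statsd_line(line)
        if metric_type in ("ms", "h"):  # timers and histograms only
            samples[(name, tags)].append(value)

    out = {}
    for (name, tags), values in samples.items():
        values.sort()
        out[(name + ".count", tags)] = len(values)
        out[(name + ".max", tags)] = values[-1]
        for pct in percentiles:
            out[(f"{name}.p{pct}", tags)] = percentile(values, pct)
    return out


# Hypothetical input: 100 raw latency samples from one route.
raw = [f"api.latency:{ms}|ms|#route:charge" for ms in range(1, 101)]
summary = aggregate(raw)
```

Here 100 raw data points go in and 4 aggregate points come out, which is the cost argument in a nutshell: the backend stores the summary, not every sample.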
- tchaffee 6 years ago
  
  Thanks. That's pretty good. Every README needs to start with something like this:
  "If you monitor applications, Veneur can save you money and increase reliability"
  
  mrkurt 6 years ago
  
  That sounds like a startup's elevator pitch, not a README for an OSS project.
  
  tchaffee 6 years ago
  
  It kind of does sound like the first line of an elevator pitch, doesn't it? And as a programmer it helps me quickly decide if I should care about it at all, and how to categorize it.
  Take a look at the software this "pipeline" is supposed to integrate with. It's a commercial product. Their landing page could not be clearer:
  "Modern monitoring & analytics. See inside any stack, any app, at any scale, anywhere"
  Perfect. Now I know what they do in plain English, how to mentally categorize their product, and even how to bookmark the page for the future when I might need their product.
  React is an OSS product, and it comes close: "A declarative, efficient, and flexible JavaScript library for building user interfaces."
  Tells me an awful lot with only one jargon word: declarative. But I can still categorize it mentally and bookmark it even if I don't know what that word means.
  Make it easy for people to understand what you do, and you'll get more interest. People will bother to bookmark it. And maybe come back.
abathur 6 years ago

I regularly have the same experience.
When I run into one, I optimistically assume I'd know if I needed it (but, unfortunately, it means I'm very unlikely to remember a specific package if I do eventually run into the problem).
If they don't already exist, maybe there's room in the universe for some README best-practices (like those floating around for writing commit messages, user stories, issue reports, etc.) that might nudge more maintainers to include at least one lucid example of a problem it could solve.
SuperKlaus 6 years ago

Thanks for asking this. I came back here after looking at the README in the repo and was still clueless as to why I'd want to use Veneur.
mrkurt 6 years ago

It's not always possible to describe a problem in a way you will understand if you aren't already familiar with it. And that's ok! Reading something like "distributed, fault-tolerant pipeline for observability data" and ¯\_(ツ)_/¯ could be a good response both for you and the people who built the thing. It's definitely ok that you have to dig a little more to wrap your head around it.
In short, it's reasonable to expect that people who see a project already understand the problem space and write for the ones who can say "yep I need this".
This particular project is probably only useful to people who know what observability data means.
- tchaffee 6 years ago
  
  I agree it's not always possible. In this case it was possible and someone did explain it in a sentence or two in another response to my question. In most cases it's possible.
  Here's the opportunity: I've been programming for decades. I'm always on the lookout for new and better solutions. I might not have a use for the software today, but if I can understand what it does and bookmark it, there is a good chance I will come back when I do need to solve that problem.
  This means a lot to me and allows me to bookmark it both mentally and in my browser:
  "This package helps reduce DataDog (<-- link) costs by aggregating data before sending and increases monitoring reliability by being distributed and fault tolerant."
  Pretty much plain English, and now I can bookmark it because I know I will be evaluating my monitoring needs in a few months.
  I read the entire README and had no idea what the software did. At first I thought it might have something to do with RxJS and observables...
  > This particular project is probably only useful to people who know what observability data means.
  Sorry if I'm repeating myself, but no, it's useful to other people. Like me. The software might be useful to me because I am interested in monitoring an app pretty soon. I've never heard someone call it "observability data". The Wikipedia article on observability left me even more confused.
  This happens a lot, and it's a mistake that can usually be fixed, which is why I wanted to point it out.
  If it can't be fixed and the software truly is extremely niche, then zero problem.
- luckydata 6 years ago
  
  Complete hogwash. It's always possible to describe a problem in a way that every reasonably educated person familiar with the field of application can understand, as long as the author makes a reasonable attempt at summarizing the following:
  1. What problem does it solve
  2. How it solves it
  3. How is it better than the most common alternatives to solve the same problem
  Not bothering with doing that is just OSS malpractice.
- aaronbrethorst 6 years ago
  
  Even though Albert Einstein seemingly didn't actually say this, I still think this aphorism is appropriate:
  If you can't explain it simply, you don't understand it well enough.
  
  mrkurt 6 years ago
  
  Is it? Because there's an awful lot of base knowledge you need to understand any kind of OSS project. Can you explain what nginx does to someone who doesn't have a smart phone in any kind of meaningful way?
  
  aaronbrethorst 6 years ago
  
  I have fifteen years of professional experience in software development and I didn't have the slightest clue what Veneur did from its description of "A distributed, fault-tolerant pipeline for observability data."
  - Distributed - Got this one, but it's an adjective and doesn't mean much on its own.
  - Fault-tolerant - Same.
  - Pipeline - OK, great, totally meaningless noun. I guess stuff goes in it?
  - Observability data - What the heck is observability data?
  Nginx is a web server, plus a heck of a lot more. Their website does a terrific job of explaining it. https://www.nginx.com/resources/glossary/nginx/
  
  mrkurt 6 years ago
  
  The scope of software development is so broad I think you could spend 100 years doing it and still not understand everything.
  For example, _I_ know what observability data is, but I'd have a difficult time explaining the problem Redux tackles. If you've spent most of your time building user facing apps + web apps, how would you immediately understand a problem that someone working on large scale payment infrastructure has to solve?
  
  detaro 6 years ago
  
  > but I'd have a difficult time explaining the problem Redux tackles.
  Do you think you couldn't understand a short description of the problem it tackles? Because if you could, then a short description in a readme would be valuable to you, and presumably the developers of the thing are able to explain the problem.
  
  jmillikin 6 years ago
  
  > Can you explain what nginx does to someone who doesn't have a smart phone in any kind of meaningful way?
  Nginx was released in 2004 and the iPhone in 2007, so there's at least three years of proof that people without smartphones are still able to comprehend the purpose of a low-overhead HTTP server.
  
  mrkurt 6 years ago
  
  I didn't say you need to have a smartphone to get nginx. But someone who doesn't have a smartphone _now_ probably won't have any idea what the heck nginx does. And there's no way to Einstein it up and simplify it for them. You'd have to have a conversation.

roskilli 6 years ago

It’s definitely interesting to see the different systems being built for monitoring across the different tech co’s.

M3 aggregator, Uber’s metrics aggregation tier, is similar, except it has inbuilt replication and leader election on top of etcd to avoid any SPOF during deployments, failed instances, etc. Also, it uses Cormode-Muthukrishnan for estimating percentiles by default, and it has support for T-Digest too. These days, though, submitting histogram bucket aggregates all the way from the client to the aggregator and then to storage is more popular, as you can estimate percentiles across more dimensions and time windows at query time quite cheaply. You need to choose your buckets carefully though.

It too is open source, but needs some help to make it plug into other stacks more easily: https://github.com/m3db/m3aggregator
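The histogram-bucket approach mentioned above can be sketched in a few lines of Python. This is an illustration under assumed bucket boundaries, not M3's or Veneur's implementation; `BOUNDS`, `observe`, `merge`, and `estimate_quantile` are hypothetical names. The key property is that bucket counts from different hosts or time windows just add, so a percentile can be estimated after merging, at query time.

```python
from bisect import bisect_left

# Assumed bucket upper bounds in milliseconds; the last bucket is open-ended.
BOUNDS = [10, 25, 50, 100, 250, 500, float("inf")]


def observe(counts, value):
    """Increment the first bucket whose upper bound is >= value."""
    counts[bisect_left(BOUNDS, value)] += 1


def merge(*histograms):
    """Bucket counts from different hosts or windows simply add up."""
    return [sum(column) for column in zip(*histograms)]


def estimate_quantile(counts, q):
    """Estimate the q-quantile by linear interpolation inside one bucket."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    target = q * total
    seen = 0
    for i, count in enumerate(counts):
        if count and seen + count >= target:
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            upper = BOUNDS[i]
            if upper == float("inf"):
                return lower  # open-ended bucket: only its lower edge is known
            return lower + (upper - lower) * (target - seen) / count
        seen += count
    return BOUNDS[-2]


# Two hypothetical hosts: one sees 1..100 ms, the other 101..200 ms.
host_a = [0] * len(BOUNDS)
host_b = [0] * len(BOUNDS)
for ms in range(1, 101):
    observe(host_a, ms)
for ms in range(101, 201):
    observe(host_b, ms)

merged = merge(host_a, host_b)
p50 = estimate_quantile(merged, 0.5)
p99 = estimate_quantile(merged, 0.99)
```

With these buckets the merged p50 estimate lands at 100 ms, close to the true median, but the p99 estimate comes out around 247 ms even though no sample exceeds 200 ms, because everything above 100 ms falls into one wide 100-250 bucket. That is the "choose your buckets carefully" caveat in practice.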

dswalter 6 years ago

It always makes me happy to see approximate algorithms/data structures like hyperloglog being used.

ambicapter 6 years ago

"probabilistic" is probably the word you're looking for, and yes I agree, I'm fascinated by the idea of trading off a little bit of accuracy for massive performance gains.
- dswalter 6 years ago
  
  That's a good term. I'm sure you've found this collection of links on 'streaming algorithms'. It's a gold mine of resources in this space: https://gist.github.com/debasishg/8172796

ebikelaw 6 years ago

When I'm evaluating a system like this what I want to read about is how is it hardened against client stupidity. For example, someone deploys an application in my datacenter and it emits metrics that have gibberish in their names (consider a common Java bug where a class lacks a toString, so the metric gets barfed out as foo.bar.0xCAFEBABE.baz). How does the system cope with this enormous, hyper-dimensional input?

noncoml 6 years ago

Why is Go so popular in the industry at the moment? What's the decision process for choosing Go?

gphat 6 years ago

Hello! I'm the original author of Veneur @ Stripe and continue to work on it with a host of marvelous teammates.
I can't speak for the industry, but here's why I chose it for this project:
* I hadn't yet written anything in Go and wanted to try it on a side project / experiment
* I knew that my eventual deploy target (if the project turned out useful) would be lots of machines and I wanted to minimize the deployment requirements. Static binaries are good for that.
* I wanted to distribute the work across many cores and felt that Go's channels would be a useful mechanism.
* I benchmarked my initial PoC against some other implementations (Python and JVM/Scala) and found no major reason to not use Go
The contributions of Stephen Jung (https://github.com/tummychow) and Aditya Mukerjee (https://github.com/chimeracoder) elevated it from a glimmer in my eye to a system you can trust across your infrastructure.
So, in summary it was a confluence of interest and convenience with a strong hint of "if I use this, it needs to be easy to deploy" and here we are 2 years later. :)
ebikelaw 6 years ago

I doubt anyone can answer this for you, but why shouldn't it be? It is a very sensible language and toolchain. When writing source, it is easy to write tests and testable code and to run the tests as part of your build. At build time, it's fast and it produces easy-to-deploy statically-linked applications. At runtime, it's pretty fast and compact ... compared to python, Go is most of the way to C++-level performance.
- dozzie 6 years ago
  
  > I doubt anyone can answer this for you, but why shouldn't it be [popular in the industry at the moment]?
  Its build system is poor if you want to rebuild the binary from the same sources (you can't precompile the libraries used), and the statically linked binary may be nice for a one-off deployment ("fire and forget" mode), but for repeated deployments and multiple versions running at the same time it reminds me of the Chinese torture "death by a thousand cuts" in what you can't do with such a binary (dozens of small things that are hard to remember, each on its own not being enough to go away from static linking, but boy, they do add up).
  
  ebikelaw 6 years ago
  
  Can you remember even one? I'm interested. I've used Go at Extremely Large Scale (tm) and never thought it was terribly troublesome.
  
  dozzie 6 years ago
  
  Sure I can remember. Checking libraries the binary uses ("why this cURL fails on HTTPs? oh, it's linked against GnuTLS, that explains everything"), injecting your own library, intercepting a function (maybe syscall, or rather its libc wrapper), tracing function's execution (ltrace). All these things merely annoy if you used to have them but now you don't and it's hard to remember them all, but there's a lot of them.
  And then there's also sharing memory between different processes that use the same library. You don't have that for a statically compiled binary.
  
  ebikelaw 6 years ago
  
  All those things sound like antipatterns to me. If your application is supposed to fetch HTTPS pages then there should be an integration test for that, so you would never have to debug it in production. Having shared objects on the machine actually makes this impossible because your tests are running with different libraries than in production. Shared libraries on a machine are a non-hermetic input to your build and are to be avoided. In addition, runtime shared objects (especially of something performance-critical like crypto) inhibit all of the most important compiler optimizations like inlining. The savings from sharing text segments is small in my experience. As for ltrace, there's a million ways to trace function calls these days, like uprobe or perf.
  
  dozzie 6 years ago
  
  > All those things sound like antipatterns to me. If your application is supposed to fetch HTTPS pages then there should be an integration test for that [...]
  It's not something to regularly rely on, but something that helps in debugging and troubleshooting. Not for a programmer, but for a sysadmin.
  > Having shared objects on the machine actually makes this impossible because your tests are running with different libraries than in production.
  In such a case you have your deployment process broken. And if your testing and production environments differ in this matter, they differ enough to bite your ass even with your statically linked binary.
  > Shared libraries on a machine are a non-hermetic input to your build and are to be avoided.
  This is merely stating a generic opinion. I want to see a concrete, coherent, technical argument supporting this.
  > In addition, runtime shared objects (especially of something performance-critical like crypto) inhibit all of the most important compiler optimizations like inlining.
  Especially crypto should not be called in a tight loop, but passed a large chunk of data. Otherwise you inhibit all of the most important defences against side channel attacks, and I guarantee that you are not competent enough to defend against that on your own.
  > As for ltrace, there's a million ways to trace function calls these days, like uprobe or perf.
  So let's break one of them for no good reason?
  And still, lack of any of the mentioned things is merely an annoyance once you hit it, but as I said, they are numerous and add up, while the other option, static linking, provides little benefit apart from supporting broken workflows (like different environment in testing and production).
- noncoml 6 years ago
  
  > why shouldn't it be?
  I didn't imply that it shouldn't be.
  I am genuinely interested to know the thought process and why it seems to be the de facto language these days.
  Your arguments are strong, thanks.

pinko 6 years ago

Know of anyone using this in production outside Stripe?

chimeracoder 6 years ago

> Know of anyone using this in production outside Stripe?
In addition to Intercom (mentioned in a sibling comment), some other companies off the top of my head who are also using Veneur: Sentry, Bluecore, Axiom, Quantcast, and at least two more that I'm not sure if I have permission to name publicly.
There are more too - since Veneur is an OSS project, we only find out when people submit PRs to the project or happen to contact us for another reason. It makes for a pleasant surprise when we find out!
otterley 6 years ago

We use it at Segment as a replacement for Datadog's Python-based (and dog slow) dogstatsd server. Veneur is not only faster on a per-CPU basis, but scales to multiple CPUs as well.
I consider it absolutely essential for collecting metrics on large instances that run hundreds of busy tasks emitting hundreds or thousands of metrics per second.
galaktor 6 years ago

Intercom use it heavily in production.
Source: am engineer at Intercom.
- chimeracoder 6 years ago
  
  > Intercom use it heavily in production. Source: am engineer at Intercom.
  This is awesome to hear! Are you referring to this Intercom? https://www.intercom.com/
  
  galaktor 6 years ago
  
  Yes, that's the one.

madspindel 6 years ago

It's from 2016: https://stripe.com/blog/introducing-veneur-high-performance-...

chimeracoder 6 years ago

> It's from 2016
2016 was the first public release, but the project has grown a lot in that time. You can take a look at the changelog to see what's new, ever since we switched to a six-week release cycle last year: https://github.com/stripe/veneur/blob/master/CHANGELOG.md
Source: I work on the Observability team at Stripe and I am the PM for Veneur.
- martinpinto 6 years ago
  
  yep, I can confirm. This is a very stable project!
mhluongo 6 years ago

Last commit on master was 5 days ago, and it otherwise appears to be maintained. A lot can change in a repo in 2 years.

amelius 6 years ago

What do they mean by "observability data"?

Is this a fancy way of saying "privacy-sensitive user data"?

Jarred 6 years ago

More like diagnostic stats about servers - memory usage, CPU usage and many more things like that.