Cloud infra

S04 E11


Cloud infra - a devtools discussion with Kurt Mackey (Console DevTools Podcast: Episode 11, Season 4).

Episode notes

In this episode, we speak with Kurt Mackey, CEO of Fly.io. We discuss what it's like running physical servers in data centers around the world, why they didn't build on top of the cloud, and the philosophy behind the focus on pure compute, networking, and storage primitives. Kurt sheds light on the regions where Fly.io is most popular, why they’re adding GPUs, and the technology that makes it all work behind the scenes.

Things mentioned:

About Kurt Mackey

Kurt Mackey is the CEO of Fly.io, a company that deploys app servers close to your users for running full-stack applications and databases all over the world without any DevOps. He began his career as a tech writer for Ars Technica and learned about databases while building a small retail PHP app. He went to Y Combinator in 2011, where he joined a company called MongoHQ (now Compose) that hosted Mongo databases, which he sold to IBM before turning his attention to building Fly.io.


Kurt Mackey: The original thesis for this company was there's not really any good CDNs for developers. If you could crack that, it'd be very cool. The first thing we needed was servers in a bunch of places and a way to route traffic to them. What we wanted was AnyCast, which is kind of a part of the core internet routing technology. What it does is it offloads getting a packet to probably the closest server, to the internet backbones almost. You couldn't actually do AnyCast on top of the public cloud at that point. I think you can on top of AWS now. So we were sort of forced to figure out how to get our IPs, we were sort of forced into physical servers for that reason. For a couple of years, it felt like we got deeply unlucky because we had to do physical servers. You’d talk to investors, and they'd be like, “Why aren’t you just running on the public cloud and then saving money later?” Then last year, that flipped. Now, we're very interesting because we don't run on the public clouds.

Kurt Mackey: I think there's another thing that we've probably all reckoned with since 2011; a lot of the abstractions were wrong. As the front end got more powerful, I think we tried a lot of different things for— and what we ended up doing was inflicting this weird distributed systems problem on frontend developers. So I think that, in some ways, we just have the luxury of ignoring a lot of things that people have been trying to figure out for 10 years because we probably think that's wrong at this point. So we happen to be doing well at a time when server-side rendering is all the rage in a front-end community, which is perfect for us and nobody really cares about shipping static files around in the same way. I think it's just evolutionary. We kind of have a different idea of what's right now and can do simpler things and then we'll probably get big and complicated in 10 years and be in the same situation again.

David Mytton [00:00:05]: Welcome to another episode of the Console DevTools Podcast. I'm David Mytton, CEO of Console, a free weekly email digest of the best tools and beta releases for experienced developers.

Jean Yang [00:00:16]: And I'm Jean Yang, CEO of Akita Software, the fastest and easiest way to understand your APIs.

David Mytton [00:00:22]: In this episode, Jean and I speak with Kurt Mackey, CEO of Fly.io. We start with what it's like running physical servers in data centers around the world, why they didn't build on top of the cloud, what the philosophy is behind the focus on pure compute, networking and storage primitives, and the technology that makes it all work behind the scenes. We're keeping this to 30 minutes. So let's get started.

David Mytton [00:00:46]: We're here with Kurt Mackey. Let's start with a brief background. Tell us a little bit about what you're currently doing and how you got here.

Kurt Mackey [00:00:53]: Hi. So currently, we've made the bold move of trying to build a new public cloud, which is not always the best choice, but it has its moments of being amazing. How I got here is— We're getting into decades, which is always embarrassing to talk about. But basically, you know that formative moment you remember. I remember being on the internet in 1999. It was interesting and new. I was like – you could overclock your Celeron CPUs and get better gaming performance if you knew what the internet was. I was sort of drawn to it, I think.

In hindsight, what's interesting is I didn't want to just read things, I actually wanted to make things. So I actually found a site called Ars Technica back when it was two Ph.D. students at Harvard. It had a little bit of a community. We were working on stuff. I was like, “I think I want to be a writer.” So I wrote a review of the first Mozilla browser on there, which was notable because I said I think browser tabs are stupid, and they're probably not going to go anywhere, which was not my best prediction.

It kind of grew from there. I’d just find stuff to work on and play with – different parts of what I now recognize as internet infrastructure, without having any idea what I was doing. I think I built a retail PHP app, so Monica Lewinsky could sell handbags at one point. I learned about databases in PHP. I feel like I just ended up following the infrastructure down.

Ars Technica got pretty big. So I got to build some interesting backend stuff there. We were doing things like MongoDB replicas before anyone knew what MongoDB was. I tried to build a content management system for that, which is the worst product on the planet. Nobody should build a CMS as far as I can tell. Or I shouldn't and somebody else is just much better than me.

Then I got to the point where I wanted to do my own thing. So I went to Y Combinator in 2011. I ended up joining a company called MongoHQ during Y Combinator, and we did hosted Mongo databases. Most of what I'm doing now just came directly from that. We basically hosted several hundred thousand Mongo databases, built a pretty decently successful startup, sold it to IBM, did a bunch of other databases. I left and I was like, “I kind of liked that job, so let's do it again with a different take on things.” That's how we got here to where we're buying physical servers and putting RAM in them and GPUs and things. It’s very, very interesting.

Jean Yang [00:03:05]: Cool. Yeah, we rarely get to chat with people who are working with physical hardware these days. Some of my favorite war stories are AWS people who were shipping their data on trucks because it was faster than transferring it on the wire. Or I don't know if you remember during Hurricane Sandy, I would watch which servers were going down at the same time, and what was going on for the cloud. But we would love to hear about what it's been like buying, racking, and running servers in real data centers. It's such a rarity these days.

Kurt Mackey [00:03:37]: Yes. Actually, somewhere in that history, which is non-complete, I worked for a company called Server Central in Chicago. We were basically doing colo, shared hosting, dedicated servers – the stuff that you needed in 2006 to do anything. I don't remember which hurricane it was, but there was a hurricane where some data center company was live streaming, trying to keep their servers up. It might have been Sandy, except this was downtown New Orleans. It was not my war story, but it's a really good concept.

Jean Yang [00:04:05]: Yes, the servers during hurricanes. I don't know. I didn't realize anybody else watched this.

Kurt Mackey [00:04:09]: It’s a whole genre, I think.

Jean Yang [00:04:14]: It’s fascinating.

Kurt Mackey [00:04:15]: Yeah, it's super interesting. My war story from that point: we had a lot of servers with a giant data center provider called Equinix – if you've heard about servers, you've probably heard about Equinix, whether you know it or not. They built these data centers that are supposed to be on overlapping power grids, and the idea is that no single thing could fail and turn off the power. They've got redundant generators. They've got UPSs.

I can't remember exactly what happened. But, obviously, the whole data center went down. The thing that's not supposed to happen happened. What I remember from that is it took two days to actually recover people’s stuff because rebooting things is hard, and it never works the way you want. When you've had servers that have had 10 years of uptime, and they reboot, bad things happen. The most vivid memory I have of that is actually what we ended up doing was a lot of work and then a lot of sitting around waiting for an alert or someone to come and be like, “My stuff's still broken. Can you go look at it?”

We sat in the conference room. It was 4 am. I was delirious. We watched Harold & Kumar Go to White Castle in the Equinix data center conference room. Yes, for some reason, I'm flippant about this. But I think what's interesting about the company we're doing now is a lot of the reason we can do physical hardware is it seems approachable because we've done this before. I think that, in some ways, I'm lucky to be a middle-aged internet person for that reason.

Jean Yang [00:05:34]: What's it like today? What's changed? What's easier? What's still hard?

Kurt Mackey [00:05:39]: The biggest difference by far is just the size of the servers you can buy. It's actually – if you imagine even a high-volume application, we're putting in servers with something like 35 terabytes of NVMe, two-plus terabytes of RAM. They have 128 CPU cores. That's eight startups. That’s more than most companies ever need. That's kind of interesting and fascinating because what it does is that, actually, our company runs servers all over the world. So we're in 37 cities right now. But what it does is it means we don't have to go buy a whole floor of a data center to actually have a good footprint. We have to buy – literally three servers in some places is really all we need, which makes this possible. It's like we're fortunate because when it's expensive to set up data centers, you can't do a company like this.

The new hard thing I've learned is entirely about logistics and customs in different countries – things like, if you ship a server to South Africa with FedEx, it will never arrive, because South Africa FedEx and US FedEx are two different companies, and they don't seem to talk. You end up having to do very specialized shipping to get things places. If you ship things to Brazil in a certain way, you end up paying more than their value in taxes, because a lot of countries don't want American companies selling things there. They'd rather have local sellers – it makes sense.

The most interesting thing about servers right now is just pure logistics – things like shipping individual servers as separate packages, so that if one gets hung up at customs, maybe the others still get through. That's that kind of physical resiliency. It's an interesting set of problems.

David Mytton [00:07:08]: Are you shipping people as well to rack them, or do the data centers do that?

Kurt Mackey [00:07:11]: The data centers do that. We work with a couple of companies, and then they work with companies that have – they call this “remote hands”. So in some countries, the data centers will do that. In some countries, they actually have contractors, they hire directly. I have not seen any of our servers in real life, which is weird because it's all sort of abstracted through a few layers of humans. It's like the truck with data. It's like very slow self-service, basically. It’s like Mechanical Turk cloud.

David Mytton [00:07:37]: How are you deciding which locations to go to? Are you finding that some are much larger than others? You mentioned just three servers in some locations.

Kurt Mackey [00:07:45]: Yeah, we’ve learned a lot. I feel like we have a lot more to go there. I've been surprised by what the big regions are. We're getting a lot of people who want to deploy in Germany because of German privacy laws, so Frankfurt is a top-three region for us, I think. We get a shocking number of people in Sydney because, one, it's an English-speaking country. Well, Sydney's a city, but you know. Australia's English-speaking. Developers in Australia aren't used to having Heroku, for example. Almost nobody's built good developer UX in some of these English-speaking countries, so it's very intuitive for them when we happen to work there.

The most interesting thing I've learned is where fiber runs in the world. A lot of the reason we go into different cities is because we want a human being to use one of our customers’ apps and get less than 20-millisecond round trips to servers. South America's not well-interconnected. Sao Paulo is not really connected to other cities in South America. To send a packet between Santiago, Chile and Sao Paulo, you probably have to route through Miami.

What happens is we end up having to – I’d call it “overbuild”. We need a lot more cities in South America than we needed in the US because the US is heavily interconnected. We can be in two cities on the East Coast, and that's lots of millions of people that have less than 20-millisecond access. But even in Brazil itself, we actually have to be in three or four cities to get substantial coverage of the Brazilian population in under 20 milliseconds, which has been incredibly interesting to me.
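The 20-millisecond budget maps onto distance in a simple way: light in fiber covers roughly 200 km per millisecond, so a 20 ms round trip caps the fiber path at about 2,000 km each way. A rough back-of-the-envelope sketch (the path lengths here are illustrative, not Fly.io's actual routes):

```python
# Light in fiber travels at roughly 2/3 the speed of light in vacuum,
# i.e. about 200 km per millisecond.
KM_PER_MS = 200.0

def min_rtt_ms(fiber_path_km: float) -> float:
    """Lower bound on round-trip time for a given one-way fiber path length."""
    return 2 * fiber_path_km / KM_PER_MS

# A 2,000 km fiber path is about the furthest a server can be and still
# answer inside a 20 ms round trip.
print(min_rtt_ms(2000))    # 20.0

# Detouring Santiago -> Miami -> Sao Paulo stretches the path to very
# roughly 13,000 km of fiber (illustrative figure), so the RTT floor
# balloons past 100 ms no matter how fast the servers are.
print(min_rtt_ms(13000))   # 130.0
```

This is why two well-placed US cities cover millions of people, while Brazil alone needs three or four.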

David Mytton [00:09:12]: Yeah, I've been to the data center in Miami, the NAP of the Americas, and seen that fiber cable that connects down there. This was 10 years ago, so I'm sure there are more now. It was funny, just looking at that cable.

Kurt Mackey [00:09:25]: They're pretty cool.

Jean Yang [00:09:27]: I love seeing physical manifestations of the internet. When I was an intern at Google, we would hear these stories of someone digging in the ground, their shovel hitting one of the cables, and a whole data center going down. I thought it was very funny.

Kurt Mackey [00:09:41]: Yes. Oh, digging is the worst. Digging takes out so much of the internet. Just nobody should dig as far as I can tell.

Jean Yang [00:09:46]: Yes. But I want you to know, when I was a kid, I just used the internet without thinking about where did the cables go.

Kurt Mackey [00:09:52]: Yes. Equinix makes this a marketing thing. So you can probably get a tour of Equinix without a lot of work. You go in there, and they have this whole sophisticated security thing you go through. They call it a “man trap” because, I guess, gender. But you have to do a handprint to get into an airlock, like a security airlock, and then another handprint to get into the data center. Then they have all the lighting set up to be as dramatic as possible. You'll walk by cages of servers with the lights off in there for privacy. The companies don’t want you looking at their blinking lights or whatever.

The funniest thing is then you go to the back door and all pretension is gone. It's like this is the place trucks arrive with equipment, nobody goes through a mantrap with a stack of servers. It’s a totally different concept. Everybody loves looking at the glamorized version of infrastructure. I do too. It's very cool. They have great marketing photos.

Jean Yang [00:10:42]: An obvious question for you is why not just build on top of AWS or one of the other hyperscalers? I distinctly remember reading The Everything Store about how they built their infrastructure. My first thought, which I'm embarrassed to admit, was, “Why didn't they just build on top of– Oh!”

Kurt Mackey [00:11:00]: Yes. Right, exactly. There's a lot. We actually get customers like that, who will not run on top of Amazon, just because why would you? You're sending them money. It was actually a small decision, which was when we first started this effort— The original thesis for this company was there's not really any good CDNs for developers. If you could crack that, it'd be very cool. The first thing we needed was servers in a bunch of places and a way to route traffic to them.

What we wanted was AnyCast, which is kind of a part of the core internet routing technology. What it does is it offloads getting a packet to probably the closest server, to the internet backbones almost. You couldn't actually do AnyCast on top of the public cloud at that point. I think you can on top of AWS now. So we were sort of forced to figure out how to get our IPs, we were sort of forced into physical servers for that reason. For a couple of years, it felt like we got deeply unlucky because we had to do physical servers. You’d talk to investors, and they'd be like, “Why aren’t you just running on the public cloud and then saving money later?” Then last year, that flipped. Now, we're very interesting because we don't run on the public clouds.
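Mechanically, anycast means announcing the identical IP prefix from every point of presence over BGP and letting backbone routing deliver each packet to the topologically nearest announcement. A minimal sketch of what that could look like in a BIRD routing daemon config (the ASNs and prefix are private-range and documentation values, purely illustrative, not Fly.io's):

```
# bird.conf sketch: announce one anycast prefix to a transit provider.
# The same announcement runs at every PoP; BGP does the rest.
protocol static anycast {
    ipv4;
    route 198.51.100.0/24 blackhole;   # the anycast prefix (TEST-NET-2, illustrative)
}

protocol bgp upstream {
    local as 64512;                    # our ASN (private range, illustrative)
    neighbor 203.0.113.1 as 64513;     # transit router at this PoP (illustrative)
    ipv4 {
        import none;
        export where net = 198.51.100.0/24;
    };
}
```

Because every PoP exports the same prefix, a user's packets land at whichever site their provider's BGP tables consider closest – the "offload to the backbones" Kurt describes.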

I think the big advantage for us now, and the reason I'm happy we lucked into this, is that we have a lot more control over how we spend money. We have a lot more ability to do interesting things and still be profitable without having to price things in a way that's obscene. So our prices are actually pretty close to what you get from AWS, despite us being much smaller, because we just took out that whole layer of margins, basically. But it is a tremendously complicated, difficult thing to do, particularly when you're doing it in multiple regions. So I still think it’s the right idea, and I'm glad we're doing it. Still, sometimes I'm just very unhappy about what we're having to fight.

David Mytton [00:12:42]: Why do you think developers love these types of platforms? I suppose back when you started a few years ago, if you'd asked me – “Well, we're going to build an alternative to AWS” – I would have been like, “No way. AWS has got it sorted, and Azure and GCP definitely if you're not on AWS.” But we've got Vercel and Netlify and all these types of platforms – Fly.io, of course. Why do people love them so much?

Kurt Mackey [00:13:06]: So I think people like things that were built for them just at a core level. So I think if you surveyed the population, you'd find a bunch of people who are on the internet in 2005 as developers really like Fly because we kind of built this thing for us, and that's what we were. So it's intuitive the way this all works for us, and I think that's a start.

I think one thing I started thinking about recently is we don't say “edge”, but we're kind of edge-computing. We don't really say “serverless”, but we're in that area. Vercel’s the same. I don't know if they say “serverless” much, but it's kind of serverless.

Jean Yang [00:13:36]: Yes. I would consider them serverless.

Kurt Mackey [00:13:38]: Yes, they're definitely serverless.

Jean Yang [00:13:38]: Sorry, Vercel, if that’s not what you consider yourself.

Kurt Mackey [00:13:41]: I'm not sure they lead with that with marketing, but it's definitely serverless. We're arguably serverless, and we never say “serverless”.

I think one of the things I've started to notice, as I've looked at these companies, is a lot of where things run and how you build for them. Serverless is almost more of a building construct than it is an underlying thing. But a lot of this is almost the extension of the UX down into the infrastructure. I think what we're all trying to do is figure out how people are going to want to buy computers in 10 years. I think what the traditional public clouds did is they took a data center, and they've made it software. They basically – most of AWS works like a data center did but with APIs instead of humans, basically.

I think it goes deeper than I expected, which is people like it because it works the way they expect. I also think there's probably something deeper there about how a cloud should actually work. That's a big extension of the UX. I think that some of the things we're able to do help people build different types of applications. I think we're all just hoping that we're onto something big here, and we've kind of caught the right wave. But it's a – I also think about this a lot. It’s a very curious question. I was not saying “compete with AWS” a few years ago because nobody would have bought it. So maybe we've gotten to a loudness on the internet where we can get away with that now.

David Mytton [00:14:54]: Yeah. And AWS being basically a data center as an API is kind of reflected in how complicated everything is – there are five different ways of deploying just a container, each as one of their services.

Kurt Mackey [00:15:05]: Yes. I think that, well, AWS is incredible. And EBS and S3 and even EC2 were astonishingly good products. But I think there's an element of that where it's not always the right primitives, just based on where the world's going.

Jean Yang [00:15:20]: Tell us more about Fly.io itself. So you mentioned it was for people who were on the internet in 2005. I would say I’m such a person. But how would you describe such a person more specifically? What are they looking for? What kind of simplicity or what kind of deployment model?

Kurt Mackey [00:15:35]: I think that we made the choice to be CLI first because we all liked CLIs. In fact, you can't even use most of the products without the CLI.

Jean Yang [00:15:41]: Check for me.

Kurt Mackey [00:15:42]: Yes, exactly. Everyone from 2005 is like, “Obviously, I don't want to click things. It's not going to do what I want. I'm going to script this. I have Bash, right?” I think that's in there. One of the things that we did is we've tried to make it easy to punt from our abstractions. So at the root of it, you can actually just get rid of all the stuff we do that's magical for you and do it at a lower level. I don't know if that's because we're from 2005 or because people from 2005 have enough experience to try and take advantage of that. But somebody called this “Gen X infrastructure”, which could be insulting, but I think it's kind of funny. So there's a lot of that.

One of the things I think is happening is we've actually noticed a big divide of developers who want to think about how many CPUs they're using, basically, and developers who don't. So you could call us “Gen X infrastructure” because we tend to attract developers who want to know how many CPUs they're using. But I think what we're seeing actually is that we kind of attract more backendy developers that want to have a database. The number of CPUs actually influences how they build their software. It's like actually kind of a core part of the environment for them.

I think that we have a lot of people that come to deploy on Fly, and they're like, “I don't know how many CPUs I need for this. Can you just take care of this for me?” They are probably frontend people who are working their way back, who've just never had to think in these terms before. It doesn't add value to their lives when they have to. So I think the people we're attracting are probably closer to backend. And a lot of people from back then – there wasn't a frontend in 2005. The app was it. You ran your app processes and your databases on a server, served HTML, and there you go. So these are all things I like to think about.

Jean Yang [00:17:16]: I still make web pages starting from an empty HTML page, and my web pages look much worse than a lot of pages on the internet these days.

Kurt Mackey [00:17:24]: Right, but I feel good about it!

Jean Yang [00:17:27]: I feel really like those people who farm all their food or something like that.

Kurt Mackey [00:17:31]: Yes, that’s exactly what it is sometimes. But I just want to keep reiterating, it's a very big market for anyone who’s listening, so.

Jean Yang [00:17:39]: Kurt, here's something I wonder if you have a thesis on, because the whole story of computing is about abstraction, moving up the stack. Everything's more automated and more abstract; memory is cheap now, CPUs are cheap. So what's the story for going back and exposing control over CPUs and things like that? What's your thesis around who wants that, and why is that a necessary part of what gets exposed to everybody?

Kurt Mackey [00:18:04]: That's a good question. I think that one of the types of customers we've started attracting are the people who build platforms. So there's probably 10 companies on top of Fly right now who would like to be building Vercel. So I don't know if it's a shift as much as there's a market, I think, for the platform builders, and doing it the way we did might be attractive to them. I think there's another thing that we've probably all reckoned with since 2011; a lot of the abstractions were wrong.

Jean Yang [00:18:31]: I agree.

Kurt Mackey [00:18:32]: Yes. Okay, good. As the front end got more powerful, I think we tried a lot of different things for— and what we ended up doing was inflicting this weird distributed systems problem on frontend developers. So I think that, in some ways, we just have the luxury of ignoring a lot of things that people have been trying to figure out for 10 years because we probably think that's wrong at this point. So we happen to be doing well at a time when server-side rendering is all the rage in a frontend community, which is perfect for us and nobody really cares about shipping static files around in the same way.

I think it's just evolutionary. We kind of have a different idea of what's right now and can do simpler things. Then we'll probably get big and complicated in 10 years and be in the same situation again.

Jean Yang [00:19:12]: Yes, that makes a lot of sense. Different abstractions for different times.

Kurt Mackey [00:19:16]: Yes, exactly. It's like a pendulum. The thin client problem is the thing we're just going to all be trained to reckon with for the next 40 years, the same way we have been for the last 40.

David Mytton [00:19:25]: Do you have a view then on what developer experience means? Like that focus on the CLI first, you've started to build a web interface. What's your approach to that?

Kurt Mackey [00:19:33]: We are, I think, fortunate. The nice thing about having customers is you just have to stop making up answers to questions, sometimes, and just do what they ask for. We're getting – a lot of our customers want to use the CLI less basically. A lot of them want to start with GitHub and launch and have a UI email-based view of the infrastructure. I think that we're going to do that, and it's not a giant change. It's more of an expansion of focus than a change in philosophy, if that makes sense.

The CLI was helpful for us because we couldn't do more than that. Obviously, I would have loved to do all the things people wanted at the same time. The thing that is really – I don't know if all of your podcast episodes are turning into AI at the moment – but the thing that's really interesting to me is, you know Simon Willison, who co-created Django and built Datasette? His work in particular, exploring how devs can use AI to do things, is astonishing to me.

I think there's probably – this is when I start to feel old and curmudgeonly. I'm like, “I don't know if I have the brain plasticity to actually adapt to this in the way that makes sense.” I'm like, “I don't know if I can actually absorb this and make any good choices with it because it's so fundamentally interesting and different and maybe a big nothing burger.” But I feel like that's the stuff where we're seeing a lot of demand – not for us to do AI stuff, but we're seeing a lot of people build really interesting things with it, ChatGPT in particular, which I think is just utterly fascinating. That might have diverged from your question there, but we have to talk about AI, so.

David Mytton [00:20:57]: Yes. I think everyone mentions ChatGPT now in every episode.

Kurt Mackey [00:21:01]: Yes, exactly.

David Mytton [00:21:02]: Is that why you're adding GPUs?

Kurt Mackey [00:21:04]: We are. We've gotten a ton of – I think one of the interesting things that's happened with models is that they've gotten fast enough that people have started to care about latency between the user and the server. So if we can shave 200 milliseconds off a model call, that's become valuable. So we're adding GPUs – the hypothesis is people want to do inference close to their users, the same way they want to run CPUs close. So far, that seems right and exciting. But that's basically why.

But it also feels like it's that whole “sell a pickaxe in a gold rush” thing. We’re not actually doing anything with this; we’re just selling what the silicon people happen to want. We also have to figure out how to build a relationship with Nvidia, which is a new, interesting problem that I have never had to think about before.

Jean Yang [00:21:43]: Good place to be.

Kurt Mackey [00:21:45]: Yes.

David Mytton [00:21:45]: This is a question I like to ask people when they talk about the collection of these different services. Back when I was building applications for the back end, you'd put the database as close as possible to the application on a really low latency, probably a physical server next to or on the same one. But now, people are used to doing an HTTP fetch call over the internet to a database, and that can't be fast.

Kurt Mackey [00:22:08]: It's not.

David Mytton [00:22:08]: How do you think about that?

Kurt Mackey [00:22:09]: Or it's variable. Actually, the physical hardware is hurting us at this point. If we were on AWS, it'd be very easy for us to find Crunchy Data or PlanetScale or any of these people as integrators. The problem we have right now is most of our devs still want that zero database latency, and we've created a complicated problem by also having people who run apps in 37 regions. So to actually deliver a good managed database service on Fly, you need to be on our hardware, or at least in the same data center – you could be on your own hardware and make this work. Then you also need to be working in 37 regions.

I think that we're pretty stubborn about this. We've had a bunch of companies that are like, “Well, to integrate, we’ll just run on AWS.” I feel like that would be a poor user experience at very bad times. The thing about that HTTP request you're talking about is it might be fast most of the time, until it's not, and then it gets incredibly difficult to debug what's happening. If we can keep the network somewhat local, we just get rid of a whole class of problems that are hard to debug, by nature of keeping things simpler.

We think about that problem a lot. I think that this is, again, our frontend-backend difference that the developers we attract want to just connect to their database over TCP. The frontend developers want a URL that they can plug their app into that just works. I think these are both equally valid, we’re just kind of optimizing for our set of users.

David Mytton [00:23:26]: That makes sense. So tell us a little bit more about the technology then that makes all this possible. I think you're using Firecracker VMs, and you're in the process of replacing Nomad with your own system that you've built for scheduling.

Kurt Mackey [00:23:38]: Yes. We are using Firecracker, which is an advantage of physical hardware – it's difficult to use Firecracker on top of clouds. Firecracker is a thin layer on KVM, so it's very lightweight, kernel-level virtualization that boots very quickly, which is good for us and good for the customers we've found. You can run this on GCP, but then you get nested virtualization, and it's actually very slow and expensive. You can run it on AWS, but you need to buy their bare metal servers, which are incredibly expensive. This whole topic is going to be like, “When are physical servers good, and when are they bad?” Firecracker is a time when physical servers are good.
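For the curious, Firecracker is driven over a small REST API on a Unix socket: a few PUT requests configure the boot source, root drive, and machine size, and a final one starts the microVM. A rough sketch of that call sequence (file paths and sizing here are illustrative, and this builds the requests rather than talking to a live Firecracker process):

```python
# Sketch of the minimal Firecracker API calls (PUTs against its Unix
# socket, normally /run/firecracker.sock) needed to boot a microVM.
def boot_sequence(kernel="vmlinux", rootfs="rootfs.ext4",
                  vcpus=1, mem_mib=256):
    """Return the (method, path, JSON body) tuples to configure and start a VM."""
    return [
        ("PUT", "/boot-source", {"kernel_image_path": kernel,
                                 "boot_args": "console=ttyS0 reboot=k panic=1"}),
        ("PUT", "/drives/rootfs", {"drive_id": "rootfs",
                                   "path_on_host": rootfs,
                                   "is_root_device": True,
                                   "is_read_only": False}),
        ("PUT", "/machine-config", {"vcpu_count": vcpus,
                                    "mem_size_mib": mem_mib}),
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]

for method, path, body in boot_sequence():
    print(method, path, body)
```

Four requests and the guest kernel is booting, typically in on the order of a hundred milliseconds – the "boots very quickly" property Kurt mentions.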

What we do is we treat each host independently. We have a little service that manages the Firecrackers on that host and manages the volumes on that host and manages networking for those. The networking layer is interesting because it involves BPF, which should check a box on someone's buzzword bingo card, probably.

Jean Yang [00:24:25]: Love BPF. We also use it.

Kurt Mackey [00:24:27]: Yes. I know, right? It seems like it's a very topical thing. Our version of BPF does something different than your BPF use case, though. Our networking layer is actually not even— It's like the minimum viable networking abstraction, where we have a WireGuard overlay that sits on all of our hosts – it's a WireGuard mesh, so any two hosts can talk over an encrypted path. Then basically, we just fiddle around with IPv6 prefixes to create private networks. So we do some interesting things – we'll rewrite IPs with BPF as packets go through.
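The prefix-fiddling Kurt describes – carving per-customer private networks out of IPv6 address space – can be sketched with Python's standard library. The `fd00::/48` ULA base and the placement of the org id are invented for illustration, not Fly's actual addressing scheme:

```python
import ipaddress

# Hypothetical sketch: give each organization its own private /64 by
# embedding an org id in the 16 bits after a shared /48 ULA prefix.
ULA_BASE = ipaddress.IPv6Network("fd00::/48")

def org_network(org_id: int) -> ipaddress.IPv6Network:
    # The org id occupies bits 48-63 of the address, yielding a /64 per org.
    base = int(ULA_BASE.network_address)
    return ipaddress.IPv6Network((base | (org_id << 64), 64))

def org_host(org_id: int, host_id: int) -> ipaddress.IPv6Address:
    # A concrete host address inside that org's private /64.
    return ipaddress.IPv6Address(int(org_network(org_id).network_address) | host_id)
```

With a layout like this, the BPF programs on each host only need to check and rewrite prefix bits to keep one organization's traffic isolated from another's.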

We learned about checksum errors. I can't remember the name of it – the CRC, or checksum, on the packet itself. If you actually just change two octets of an IPv6 address, you don't have to worry about rechecksumming packets, which is good because that's really brittle and complicated when you have to rechecksum packets.

Jean Yang [00:25:12]: Do you save a lot of compute that way too?

Kurt Mackey [00:25:13]: It's less compute. We don't know. I think it's less compute and more just that things break in weird ways when you have to actually update packet checksums. It's very difficult to debug.

Jean Yang [00:25:23]: Okay, interesting.

Kurt Mackey [00:25:24]: We spend less brain compute, technically.

Jean Yang [00:25:26]: Yeah, I never thought of that part. That's interesting. Yeah.
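Kurt's checksum point can be made concrete. IPv6 itself carries no header checksum, but TCP and UDP checksums cover a pseudo-header that includes both addresses, computed as a 16-bit ones'-complement sum. Any address rewrite that preserves that 16-bit sum leaves the transport checksum valid, so the packet never needs re-checksumming. A minimal sketch of the arithmetic (not Fly's actual BPF code):

```python
def ones_complement_sum(words):
    # 16-bit ones'-complement sum, the arithmetic behind TCP/UDP checksums.
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total

def addr_words(addr_hex: str):
    # Split a 32-hex-digit IPv6 address into eight 16-bit words.
    return [int(addr_hex[i:i + 4], 16) for i in range(0, 32, 4)]

original  = "fd000000000000000000000000000001"
# A rewrite that only moves the same 16-bit values around (here, swapping
# the first and last words) preserves the sum, so the transport checksum
# stays correct without recomputation.
rewritten = "0001000000000000000000000000fd00"
```

Rewrites that change the sum would instead need an incremental checksum update in BPF, which is exactly the brittle step Kurt is glad to avoid.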

Kurt Mackey [00:25:29]: Yeah, we stumbled on that one. Then we have a global proxy. In some ways, the joke I make is we're actually a global proxy company that just monetizes by selling compute. Because the thing that takes a user's connection, turns it into an HTTP request, starts a VM if it's not started, and then routes it to the fastest possible option is actually the most interesting thing we've built, I think.

Jean Yang [00:25:52]: What have been the driving forces or the overarching vision behind the networking? Because it sounds like you've done some really interesting networking stuff. Did you do it out of necessity or because there's a thing you're driving towards?

Kurt Mackey [00:26:05]: I think we've done it incrementally for different reasons. So one of the first things we launched was actually a JavaScript runtime that was like Cloudflare Workers, but I had an image API on it. So you could resize images, convert them to WebP, kind of do heftier compute. One of our first customers using that melted our servers in Tokyo. I think a routing change on the internet sent too much traffic to Tokyo. Then we basically built a proxy feature to do what we called “latency shedding”. So the idea is if Tokyo is too full for a given app, we'll send you to Singapore, or Hong Kong, or wherever your app happens to be fast enough. We needed that because we had a crisis, and this was the right way to solve that particular problem. That's actually very powerful now. We have this feature where you can basically turn VMs on and off all over the world based on where your traffic is. It's almost entirely based on that latency shedding that we had to build because two servers were melting in Tokyo.
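The latency-shedding behavior Kurt describes – fall back to the next-fastest region once the nearest one is full – is simple to sketch. The region names, latency table, and capacity model below are invented for illustration:

```python
# Invented latency table: client's edge region -> app region -> RTT in ms.
LATENCY_MS = {
    "syd": {"nrt": 60, "sin": 95, "hkg": 110},
}

def pick_region(client_region, app_regions, load, capacity):
    # Try regions from fastest to slowest, shedding load off full ones.
    ordered = sorted(app_regions, key=lambda r: LATENCY_MS[client_region][r])
    for region in ordered:
        if load[region] < capacity[region]:
            return region
    return ordered[0]  # everything is full; fall back to the fastest anyway
```

So a Sydney request normally lands in Tokyo, but if Tokyo is at capacity it sheds to Singapore, the next region that is "fast enough".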

I think the internal networking stuff, we have kind of a belief that it's 2023, and you shouldn't talk to private services over the public internet if you can help it. Everyone is better off if you just default to private networking. We shipped the minimum viable version of that, I think. It was like, “This is what we believe this should be, and we're not going to do it a different way, so this is the quickest thing we can do.”

Then I think some of the more interesting features are – and I didn't realize this when we ran a database company, but we do have a bunch of people with databases. So we have an automated Postgres service that's not managed. It's pretty good for some things and not great for Heroku users, basically. But one of the interesting things about the database and the proxy both being things we control is that, back in 2020, the ultimate goal was to run boring Rails apps on Fly. The idea was if you can make a boring Rails app work all over the world without creating a bunch of angst for a developer, it's kind of a cool infrastructure.

We ended up solving that with the proxy, because we realized most writes to databases happen in the context of a single HTTP request. If we create a writable version of a database in Chicago, and we run read replicas all over the world, and a request comes in in Sydney that tries to make a write, we can basically replay the request back in Chicago from the proxy and make it reasonably transparent to developers. So I think an increasing number of features are that: what's an interesting way to solve a problem?
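The write-replay pattern Kurt describes can be sketched as a routing decision in the proxy. The request shape and helper names here are hypothetical – Fly's production proxy signals this differently – but the essence is: serve reads from the local replica, replay writes in the region with the writable primary:

```python
from dataclasses import dataclass

PRIMARY_REGION = "ord"  # assumed: Chicago holds the writable primary
READ_METHODS = {"GET", "HEAD", "OPTIONS"}

@dataclass
class Request:
    method: str
    path: str

def route(request: Request, local_region: str) -> str:
    # Reads can be served against the read replica in the local region.
    if request.method in READ_METHODS:
        return local_region
    # Writes must run where the writable primary lives; the proxy replays
    # the whole HTTP request there, transparently to the developer.
    return PRIMARY_REGION
```

The trick works because the unit being replayed is the whole HTTP request, so the app code in Chicago runs exactly as if the request had arrived there in the first place.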

Jean Yang [00:28:19]: Kurt, here's a follow-up question that may reveal my ignorance about proxies. But this sounds like a lot of proxying, which creates challenges for doing things quickly. So have you done some magic there to make things work?

Kurt Mackey [00:28:34]: We have. It depends on your version of “quickly”. The kind of “quickly” problem we've had is almost all state propagation and inconsistent state. I think that we aren't really sensitive to throughput issues, for the most part. I think that, over time, our proxy will just get better at throughput because it's improving continuously.

But the state propagation has been the bane of our existence, because you get this situation where, when you get a request in Sydney, you have to sort of guess. You have an idea of where VMs are running that can handle that request, but by the time you've sent the packet, that idea might be wrong – it's physically, eventually consistent. You cannot possibly make a good choice about where to send a packet from Sydney if the servers are 100 milliseconds away.

Almost all of our proxy issues – or challenges, or scaling issues, or even reliability issues – have been things behaving weirdly when state got out of sync, even if the state was only out of sync for 10 seconds. So I think the proxy is an exercise in reconciling unknown state in a way that people don't notice.

David Mytton [00:29:38]: You've also done quite a bit of work on some very specific technologies publicly like LiteFS for SQLite, and you've been very vocal around Phoenix and Elixir apps. What's your approach to that?

Kurt Mackey [00:29:50]: When I'm talking about this, the whole thing is like a happy accident. Phoenix was a happy accident where we had built this interesting– The problem with the cloud is it's good for everything, which is great until you can't attract users because nobody knows what they need it for. We discovered that Phoenix was a very good fit for the infrastructure we built in 2021, I think. We're like, “Cool, let's just start focusing efforts around Phoenix.” In some ways, it's like we just decided to help with the framework. We decided it's kind of like DevRel marketing, but I feel better about it than I would about a lot of DevRel marketing. We're like, “Let's just go work with the Phoenix community, create content, help, and let them have a good place to deploy their apps.” That's worked pretty well. We're fortunate because it's scalable. There are a lot of full-stack framework communities that we can replicate this with.

LiteFS is interesting because we observed– So we created this infrastructure where you could run a Node app with a Postgres and a Redis, which is how you build a Node app, usually. You can scale it to 17 regions, which means you suddenly have 17 Postgres and 17 Redises and 17 Node processes. We're like, “Wow, that's complicated.” This fails in ways people don't expect. So what got us to LiteFS was that these people don't have huge volumes of traffic. We kind of have this idea of an app that you deploy that continues to work in 18 months. We don't think that 18 Postgres, 18 Redis, and 18 – I don't know what the plurals are for any of these. It's going to sound really goofy. — But I don't think–

Jean Yang [00:31:10]: “Redi”

Kurt Mackey [00:31:11]: Yeah, right? Exactly. We're talking about 36 database servers and 18 app servers, basically. There's no chance of those working in 18 months, because one of those things will run out of RAM just in the normal course of things, or whatever. So in a lot of ways, LiteFS and SQLite came from thinking maybe this is where a lot of these apps will go in the future, where it's closer to an embedded problem, where you have an app and data co-located. It's much simpler with far fewer moving pieces, and we just wanted to come up with some tools that would make that easier, basically. I build everything with LiteFS now. It's great. I love my SQLite.
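The embedded pattern Kurt is pointing at – app and data co-located, no separate database server to keep alive – is what plain SQLite already gives you in-process; LiteFS's contribution is replicating that SQLite file across regions. A minimal stdlib sketch (an in-memory database here, standing in for a LiteFS-backed file):

```python
import sqlite3

# The database runs inside the app process; there is no server to operate.
conn = sqlite3.connect(":memory:")  # with LiteFS this would be a replicated file
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO posts (title) VALUES (?)", ("hello world",))
conn.commit()

titles = [row[0] for row in conn.execute("SELECT title FROM posts")]
```

One process, one file, zero network hops between the app and its data – which is why "18 app servers plus 36 database servers" collapses into something much more likely to still be running in 18 months.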

Jean Yang [00:31:41]: Cool. Related to your comment about developer marketing, it seems like your marketing has been super effective at Fly. People love the long technical blog posts; they do well on Hacker News. What has been your approach to reaching developers?

Kurt Mackey [00:31:59]: That's been hard and iterative. I think that the last developer-focused company I did in 2011 was at a time when devs would try anything because it was all very interesting, and there was like –

Jean Yang [00:32:07]: There wasn't much to try.

Kurt Mackey [00:32:09]: Exactly. You could actually keep up with pretty much everything that was happening and give it all a good go. This time around, people are just fatigued. It's much harder to get devs to actually spend time on anything. So it's interesting because we were better at content than we were at our product for a bit. So back before we launched the current product in 2020, we would get on Hacker News and then not get any users, which I thought was kind of interesting, because we just write about what we find interesting. We've all written enough at this point that it tends to, by default, sound okay to people. We have experience with Hacker News in particular.

Hacker News is like the home run of successful articles, but it's also the thing that propagates content out to other places. So I think we're lucky because we can lean on that so hard. I'm also nervous that it's a single thing that we have to rely on that much. I think the neat thing for us has been we were pretty good at being loud on the internet; we just didn't have the right product. One of the ways we knew we had landed – not by accident, but landed – on the right product is we had a thing go up on Hacker News and a whole bunch of people signed up after they saw it. I was like, “That's interesting. Maybe that's how that's supposed to work?”

Jean Yang [00:33:10]: That's great. Yes.

Kurt Mackey [00:33:11]: Yes. It was very exciting. I just think we remember that from February of 2020. Now, we're just continuously trying to replicate that, which is fun. But we have a whole style guide for writing. It turns out to be about how to write interesting stuff for developers. One of the things is, “Swear early just to loosen up.” You can take it back out later if you want. But a lot of this is trying to lower the bar to writing, for the most part.

David Mytton [00:33:33]: Well, before we wrap up then, I have two lightning questions for you.

Kurt Mackey [00:33:37]: Sure.

David Mytton [00:33:38]: First one is what interesting DevTools are you playing around with at the moment? Or tools in general if you prefer?

Kurt Mackey [00:33:43]: I'm waffling between – Elixir is new to me still. I've never really built a business on Elixir. I've built internal tools in Elixir now. I'm fascinated with Elixir and Phoenix. I think that having a full-stack framework that has native clustering is actually super interesting. I'm not used to having app servers that just know how to communicate with each other without any configuration, which is very cool.

Then the other one, actually, is Deno's Fresh – or just “Fresh”; I can't remember what we're calling it today. The Fresh framework is kind of neat. I like Remix and Fresh a lot. I think that what's cool is to see people's takes on how full stack should work. I think that Phoenix is one take, and Remix and Deno's Fresh are kind of a different take. I like them both, and it's kind of cool to see them evolve in parallel.

David Mytton [00:34:24]: What about your tech setup then? This is the second question. What do you use on a daily basis, hardware, software?

Kurt Mackey [00:34:30]: I have one of the M2 MacBook Airs, I think.

Jean Yang [00:34:33]: M2.

Kurt Mackey [00:34:34]: Yes, right. I just had to get that out. But I feel like the ARM MacBook was just like, well, I guess I don't need to think about this anymore, because the battery lasts all day and it's basically fast enough for what I need to do. The more interesting tech, though: we have servers with eight Nvidia A100 GPUs in them that are very interesting to mess with. I don't get to use them every day, but I do get to fiddle around with them. So that's more exciting than the ARM MacBook, I think.

David Mytton [00:34:56]: Very good. Well, unfortunately, that's all we've got time for. Thanks for joining us.

Kurt Mackey [00:35:01]: Thank you. It was nice to talk to you all.

Jean Yang [00:35:03]: Yeah, this was super fun. Thanks, Kurt.

David Mytton [00:35:06]: Thanks for listening to the Console DevTools Podcast. That’s it for this season. Don't forget to subscribe and rate us in your podcast player. If you're playing around with or building any interesting DevTools, please get in touch. Our email is in the show notes.


David Mytton
About the author

David Mytton is Co-founder & CEO of Console. In 2009, he founded and was CEO of Server Density, a SaaS cloud monitoring startup acquired in 2018 by edge compute and cyber security company, StackPath. He is also researching sustainable computing in the Department of Engineering Science at the University of Oxford, and has been a developer for 15+ years.

About Console

Console is the place developers go to find the best tools. Our weekly newsletter picks out the most interesting tools and new releases. We keep track of everything – dev tools, DevOps, cloud, and APIs – so you don't have to.