Born out of the problems that developers face building secure applications, Anchor was designed and built by a group of developers who have fought these problems for years. This three-part series will explore those problems, what ideal solutions look like, and how Anchor provides those solutions. In our first post, we share our take on the current landscape of security solutions, and why the time is ripe for something new.
The last 15 years of my career were spent working on application hosting platforms like Heroku, helping customers large and small with app scaling, architecture, platform outages, and security incidents. Handling security incidents was often the most harrowing, and involved conversations with engineering leaders, managers, and CTOs. These customers were deploying directly to the big 3 infra providers using Kubernetes, Docker, and a slew of orchestration tools popping in and out of existence. In these conversations, I heard three common perceptions about security: implementing proper security is difficult; VPNs are a panacea; and execs are increasingly concerned with limiting the blast radius of potential incidents.
Difficulties Implementing Security
When developers say something is difficult, we often mean that the problem space lacks developer-centric solutions. Engineering teams typically ignore deeper security problem areas until a security team hands down a requirement. The lack of elegant, developer-centric security solutions means that engineering teams either absorb significant overhead while implementing a solution, roll out a half-baked solution to appease those in power, or ignore the requirement altogether for far too long.
Further, good operational hygiene is rare. Few teams can reliably roll encryption keys across their infrastructure — even when they can roll API tokens and other credentials! If you can’t automatically rotate keys after an incident or exposure (Spectre/Meltdown, Heartbleed, Downfall, etc.), you probably don’t rotate them when offboarding an employee.
At their core, these issues are due to a lack of tooling for developers.
VPNs, VPCs, & Blast Radius
A common first step in implementing security is applying blanket policies by adding encryption to underlying network traffic and at network boundaries. Products like AWS VPC and modern VPNs make this seamless and largely invisible to application developers. This is an effective practice. However, relying solely on these solutions has a major downside: a single network breach exposes everything inside the perimeter.
We’ve seen network breaches play out in the real world far too often, with rideshare companies exposing customer location data, and code hosting companies leaking private source code. When these breaches happen, the blast radius can be extensive and can do lasting damage to the brand and, most importantly, the humans involved.
Dropping in a VPN or a VPC may tick a compliance checkbox, but it doesn’t provide blanket security, and is inherently at odds with limiting blast radius and exposure. In today’s world, a multi-layered approach to application security is a must.
Looking at the Security Landscape: Behind The Firewall
As application developers, we own the application stack and implement security controls at that layer. We often look to our ops friends, infra providers, and security teams to manage security at the other layers of the stack. One of the controls we do own is ensuring our apps encrypt all service-to-service communication.
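To make that concrete, here is a minimal sketch, in Go, of what owning encryption at the application layer can look like: the service terminates TLS itself rather than leaving it to the network. The certificate paths and port are hypothetical placeholders.

```go
// Minimal sketch of application-layer encryption for an internal service:
// the app itself terminates TLS instead of relying on the network to do it.
// Certificate paths and the port are placeholders.
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	srv := &http.Server{
		Addr: ":8443",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("hello over TLS\n"))
		}),
		TLSConfig: &tls.Config{MinVersion: tls.VersionTLS12},
	}

	// The certificate and key have to come from somewhere; provisioning,
	// distributing, and rotating them is most of the hard part.
	log.Fatal(srv.ListenAndServeTLS("/etc/ssl/service.crt", "/etc/ssl/service.key"))
}
```

Everything this snippet takes for granted, the certificate, the key, and a CA that clients trust, has to come from somewhere, which is where the options below come in.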
At the application layer we have a multitude of encryption options.
Custom Solutions
There’s been a burst of products that implement a novel encryption solution or repurpose a popular protocol, like SSH, for application traffic. They can offer great security properties and offload much of the cryptographic expertise needed to build these systems, but that often comes at the high cost of deep application integration. Spending development cycles on integration is risky, and because there are no open standards here, the work is vendor-specific.
Service Mesh
One possible middle ground between opaque network-layer encryption and deeply integrated application-layer encryption is the service mesh pattern. The surge in service mesh products and services points to a real need for an application-layer solution without the cost of deep application integration. They simplify open encryption protocols like TLS, which are typically fraught with complexity and operational costs for internal deployments. A well-tuned service mesh that provides encryption, key rotation, and least-privilege access can feel like magic.
Unfortunately, the service mesh pattern is no silver bullet. These architectures suffer from a “middlebox problem”: they need deep knowledge of the workloads they are serving. Take gRPC streaming: it requires end-to-end HTTP/2, but the mesh terminates connections at its proxies, so application developers must jump through hoops to get around the lack of end-to-end TLS. As the network architecture grows in complexity, the cost of configuring and operating these service mesh middleboxes grows larger.
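To make the hoop-jumping concrete, here is a hedged sketch of the kind of work that lands back on application teams: configuring TLS directly in a Go gRPC client (assuming a recent grpc-go release) so the HTTP/2 stream stays encrypted end to end instead of terminating at a sidecar. The CA path and backend address are hypothetical.

```go
// Sketch: configuring end-to-end TLS for a gRPC client ourselves,
// instead of relying on a service mesh sidecar to encrypt the hop.
// Paths and addresses are placeholders.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func dialBackend() (*grpc.ClientConn, error) {
	caPEM, err := os.ReadFile("/etc/ssl/internal-ca.pem") // hypothetical internal CA
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	creds := credentials.NewTLS(&tls.Config{
		RootCAs:    pool,
		MinVersion: tls.VersionTLS12,
	})

	// The HTTP/2 connection (and the TLS session) now runs end to end,
	// terminating in the peer application rather than a proxy in between.
	return grpc.NewClient("payments.internal.example:8443",
		grpc.WithTransportCredentials(creds))
}

func main() {
	conn, err := dialBackend()
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ... create service stubs from conn and open streams as usual.
}
```

Doing this per workload is exactly the application-level integration the mesh was supposed to spare us.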
Shoehorning WebPKI
If a middlebox approach is not the answer, is there a solution that builds on standard protocols like TLS, avoids the high operational burden of traditional internal TLS deployments, and needs minimal application integration thanks to broad language support? Perhaps WebPKI can fill the void here: recent developments like the ACME protocol make procurement of publicly trusted TLS certificates easy and — crucially — automatable. For all of its faults, encryption on the public web works pretty well for both browser and public API traffic, and proves that large scale TLS deployments are at least possible.
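As a rough illustration of how far ACME lowers the provisioning cost on the public web, here is a sketch using Go’s golang.org/x/crypto/acme/autocert package. The hostname and cache directory are placeholders, and a real deployment still needs the publicly visible DNS and HTTP(S) plumbing discussed next.

```go
// Sketch: ACME-automated certificates for a public-facing service using
// golang.org/x/crypto/acme/autocert. Hostname and cache dir are placeholders.
package main

import (
	"log"
	"net/http"

	"golang.org/x/crypto/acme/autocert"
)

func main() {
	m := &autocert.Manager{
		Prompt:     autocert.AcceptTOS,
		HostPolicy: autocert.HostWhitelist("api.example.com"),
		Cache:      autocert.DirCache("/var/cache/autocert"),
	}

	srv := &http.Server{
		Addr: ":443",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("ok\n"))
		}),
		TLSConfig: m.TLSConfig(), // fetches and renews certificates on demand
	}

	// Empty cert/key paths: certificates come from the autocert manager.
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```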
That said, shoehorning WebPKI into internal encryption has big drawbacks. First, the public CAs have a high bar for proving domain ownership and require new DNS records or publicly accessible HTTP(S) endpoints, which means extra operational costs. Even after taming these extra requirements, the public CAs come with low API rate limits and very little control over certificate lifetimes, names, extensions, and other configuration. To top it off, every certificate provisioned from a public CA ends up in a public Certificate Transparency log for the world to see.
ACME
If custom solutions, service meshes, and shoehorning WebPKI into your backend infrastructure all come with their own drawbacks, where does that leave us? TLS itself is still the right foundation: the protocol can evolve (see TLS 1.3), so we can improve security without the high cost of switching to new protocols. Languages and libraries already have great support for TLS, which means risk and integration cost are very low. All signs point to ACME as the cornerstone of the best solution.
Running your application over HTTPS has historically been notoriously difficult. ACME was introduced to the world by Let’s Encrypt about 8 years ago, with the goal of increasing encryption on the public web. Today, it’s extremely rare to find sites not using HTTPS. By any measure, Let’s Encrypt and the ISRG have been hugely successful in their mission. But their purview is focused on the public web, and things aren’t as rosy when we look behind the firewall.
The ACME Gap
While ACME solves the provisioning problem, it doesn’t address the distribution problem.
At its core, ACME is a protocol that automates certificate provisioning, and it has rightfully become the industry standard. For all TLS deployments, certificate provisioning is now table stakes thanks to ACME. But internal TLS deployments can’t lean on the Root Store Operators, who maintain the publicly trusted certificate bundles and solve trust distribution so well in the WebPKI realm. And recent advancements like Certificate Transparency don’t provide the same benefits to internal deployments. This leaves an unaddressed gap for internal TLS deployments, which at Anchor we call “the distribution problem”.
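For a feel of what distribution means in practice, here is a hedged sketch of the client side of an internal deployment: every client has to be handed the internal CA and told to trust it, something browsers get for free from the root stores. The CA path and URL are hypothetical.

```go
// Sketch of the distribution problem from a client's point of view:
// the internal CA has to reach every client and land in its trust store.
// Paths and URLs are placeholders.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Someone (or something) has to put this file here, keep it current,
	// and replace it when the CA is rotated.
	caPEM, err := os.ReadFile("/etc/ssl/internal-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		log.Fatal("no certificates found in CA bundle")
	}

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: roots},
		},
	}

	resp, err := client.Get("https://billing.internal.example:8443/healthz")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	log.Printf("status=%s body=%q", resp.Status, body)
}
```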
The important insight taken from the success of ACME is that the problems with TLS deployments (especially internal deployments) are problems of tooling. We believe the distribution problem is solvable with better tools and services, without new protocols or application architectures.
Minding the Gap
Internal TLS deployments using ACME for provisioning have enormous potential for delivering application encryption, but the ACME gap remains. Internal TLS deployments lack elegant solutions for client trust store management, certificate lifecycle management, built-in key rotation, certificate and key revocation, and automated renewal. These additional requirements need to be met in a developer-centric way that minimizes operational overhead and enables dev/prod parity for application encryption.
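To make lifecycle management and automated renewal less abstract, here is a hedged sketch of one small piece: a Go server that resolves its certificate at handshake time, so a renewal dropped onto disk takes effect without a restart. The paths are placeholders, and a real implementation would cache and handle errors more carefully.

```go
// Sketch: pick up renewed certificates without restarting the service by
// resolving the certificate at handshake time. Paths are placeholders.
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	getCert := func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
		// Re-read on every handshake; a real implementation would cache
		// the certificate and watch for changes instead.
		cert, err := tls.LoadX509KeyPair("/etc/ssl/service.crt", "/etc/ssl/service.key")
		if err != nil {
			return nil, err
		}
		return &cert, nil
	}

	srv := &http.Server{
		Addr:      ":8443",
		Handler:   http.NotFoundHandler(),
		TLSConfig: &tls.Config{GetCertificate: getCert},
	}

	// Empty paths: the certificate comes from GetCertificate above.
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```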
At Anchor we’ve put a lot of thought and work into closing this gap. In the next two posts, we’re going to describe some properties of a desirable solution and what we’re working on to address them. If you want to share your thoughts or learn more, come join the discussion in our Discord community.