A Standardized Security Approach
Thanks for reading the previous installment (https://number40.medium.com/services-runtime-security-part-2-9d2d6cc7e9bc) in this series.
Picture this: a realm where API security isn’t a labyrinthine maze but a structured, standardized process, ensuring seamless communication among services, robust defense mechanisms, and heightened operational efficiency. Sounds bonkers, I know.
Several architects and authors have, time and again, expressed the importance of such a streamlined approach. I believe the BlueMyst architects can achieve this by focusing on three pivotal areas:
- Services Authorization: ensuring that data requests and exchanges occur only between authenticated and authorized services.
- Inter-Services Communication: ensuring that communication between different services is not only smooth but also secure.
- Services Traffic Analysis: monitoring and analyzing traffic patterns to detect anomalies that might signal potential threats.
This is normally where we would jump to some common wisdom, and you might get an article that parrots some wonderfully written books and online articles. They are not wrong, but they sometimes miss a crucial component: adoption. This article would be pointless if all I offered was the typical viewpoint. I want to present some of the common wisdom first, then give an alternate solution. The concepts and tooling I present are, in my experience, easier to adopt and still provide defense in depth, while keeping a focus on simplicity.
Note: This reflects my experience; there are different ways to accomplish our goals. This article is intended to walk you through the architectural thought process and expose the reader to an alternate approach.
Security Staples
The distributed nature of services and microservices means that there are many more “edges” to defend against potential threats. Numerous security controls and concepts have been popularized to ensure that these architectures remain robust and secure. Below are some of the most commonly used security controls and concepts in a services/microservices ecosystem.
General Concepts
- Zero Trust Architecture: Do not inherently trust any request, regardless of where it comes from. Always verify before granting access or serving a request.
- Patch Management: Regularly update and patch services, dependencies, and underlying platforms to address known vulnerabilities.
- Access Control: Implement Role-Based Access Control (RBAC) or Attribute-Based Access Control (ABAC) to ensure that entities have only the permissions they require (a minimal sketch follows this list).
- Immutable Infrastructure: Ensure that once a service is deployed, it doesn’t change. This helps in maintaining a known, secure state and can prevent tampering.
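To make the access-control idea concrete, here is a minimal RBAC sketch in Go. The role names, permission strings, and in-memory policy map are all hypothetical; a real system would load policy from a store (and often delegate to an engine such as OPA), but the shape of the check is the same.

```go
package main

import "fmt"

// Illustrative policy: map each role to the permissions it grants.
// These names are hypothetical; real systems load this from a policy store.
var rolePermissions = map[string]map[string]bool{
	"reader": {"orders:read": true},
	"editor": {"orders:read": true, "orders:write": true},
}

// hasPermission reports whether any of the subject's roles grant the permission.
func hasPermission(roles []string, permission string) bool {
	for _, role := range roles {
		if rolePermissions[role][permission] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasPermission([]string{"reader"}, "orders:write")) // false
	fmt.Println(hasPermission([]string{"editor"}, "orders:write")) // true
}
```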
Authentication and Authorization
- JWT (JSON Web Tokens): A stateless authentication mechanism where user state is carried by the client rather than stored on the server (see the verification sketch after this list).
- OAuth and OIDC (OpenID Connect): Standard protocols for authorization and authentication that work well in distributed systems.
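As a rough illustration of why JWT auth can be stateless, here is an HS256 signature check written against only Go’s standard library. The token string and shared secret below are placeholders, and production code should use a maintained JWT library and validate claims such as exp, iss, and aud rather than stopping at the signature.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// verifyHS256 checks an HS256 JWT signature and returns the raw claims JSON.
// No server-side session lookup is needed: the token carries the state.
func verifyHS256(token string, secret []byte) ([]byte, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("malformed token")
	}
	// The signature covers header.payload exactly as transmitted.
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(parts[0] + "." + parts[1]))
	sig, err := base64.RawURLEncoding.DecodeString(parts[2])
	if err != nil {
		return nil, err
	}
	if !hmac.Equal(sig, mac.Sum(nil)) {
		return nil, errors.New("invalid signature")
	}
	return base64.RawURLEncoding.DecodeString(parts[1])
}

func main() {
	// Placeholder token; a real one arrives in an Authorization header.
	claims, err := verifyHS256("aGVhZGVy.cGF5bG9hZA.c2ln", []byte("hypothetical-shared-secret"))
	fmt.Println(string(claims), err) // fails with "invalid signature" for this fake token
}
```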
Network Security
- mTLS (mutual Transport Layer Security): Ensures two-way authentication between services.
- Firewalls: Segmentation of networks can prevent lateral movement of potential attackers.
- API Gateways: Manages and secures the interactions between clients and services.
- Service Meshes: Tools like Istio and Linkerd provide security controls such as mTLS, authorization policies, and network communication controls for services.
- Rate Limiting and Throttling: Protect services from DDoS attacks or malicious overuse by limiting the number of requests a client can make within a given timeframe (see the sketch after this list).
- Secrets Management: Tools like HashiCorp Vault or Azure Key Vault allow for the secure storage, generation, and management of secrets like API keys, database credentials, and more.
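Here is one minimal way to sketch rate limiting in Go, using the token-bucket limiter from golang.org/x/time/rate. The limits, the choice to key on RemoteAddr, and the unbounded limiter map are all simplifications for illustration; production gateways usually key on an authenticated identity and evict idle entries.

```go
package main

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

// limiterFor returns a per-client token bucket: 5 requests/second, burst of 10.
func limiterFor(client string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[client]
	if !ok {
		l = rate.NewLimiter(rate.Limit(5), 10)
		limiters[client] = l
	}
	return l
}

// rateLimit rejects requests once a client exhausts its bucket.
func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiterFor(r.RemoteAddr).Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	app := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", rateLimit(app))
}
```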
Container Security
- Image Scanning: Ensure container images don’t have known vulnerabilities.
- Runtime Security: Monitor container behavior during runtime to detect anomalies.
- Orchestration Controls: Tools like Kubernetes have built-in controls to manage which nodes can run which containers, pod security policies, etc.
Logging and Monitoring
- Centralized Logging: Aggregate logs to central platforms (like ELK Stack, Splunk, or Graylog) to monitor for suspicious activities.
- Distributed Tracing: Tools like Jaeger or Zipkin help in tracing requests through services, which can be essential for both performance tuning and security forensics.
- Intrusion Detection Systems (IDS): Monitor network traffic for suspicious activity or violations.
API Security
- Input Validation: Ensure that the inputs to your services are validated, sanitized, and properly escaped to prevent issues like SQL injection (see the sketch after this list).
- WAF (Web Application Firewall): Protects against web threats by filtering and monitoring HTTP traffic between a web app and the Internet.
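To ground the input-validation point, here is a sketch of the standard two-step defense in Go: validate the shape of the input, then pass it to the database as a parameter rather than splicing it into the SQL string. The table, columns, and pattern are hypothetical, and the snippet assumes a database/sql driver is registered elsewhere.

```go
package store

import (
	"database/sql"
	"fmt"
	"regexp"
)

// orderIDPattern accepts only the shape we expect (a short numeric ID),
// rejecting hostile input before it ever reaches the database.
var orderIDPattern = regexp.MustCompile(`^[0-9]{1,10}$`)

// GetOrderStatus demonstrates validate-then-parameterize.
// The orders table and its columns are hypothetical.
func GetOrderStatus(db *sql.DB, id string) (string, error) {
	if !orderIDPattern.MatchString(id) {
		return "", fmt.Errorf("invalid order id %q", id)
	}
	// The placeholder keeps the value out of the SQL text entirely, so it can
	// never be interpreted as SQL syntax (closing quotes, UNION SELECT, etc.).
	var status string
	err := db.QueryRow("SELECT status FROM orders WHERE id = ?", id).Scan(&status)
	return status, err
}
```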
Secure Service-to-Service Communication
- Service Discovery Controls: Ensure only valid services can discover and communicate with each other.
- Network Policies: Dictate which services can talk to which other services.
- Data at Rest Encryption: Secure data stored in databases, caches, and filesystems.
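To make the data-at-rest item concrete, here is a minimal Go sketch using AES-256-GCM from the standard library. It deliberately ignores the hard part, key management: in practice the key would come from a secrets manager such as Vault or Azure Key Vault, not be generated inline as it is here for illustration.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"io"
)

// seal encrypts plaintext with AES-GCM, prepending the random nonce so it can
// be recovered at decryption time. GCM also authenticates the ciphertext, so
// tampering with the stored bytes is detected on decrypt.
func seal(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // 32-byte key selects AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

func main() {
	key := make([]byte, 32) // illustration only; fetch from a secrets manager
	io.ReadFull(rand.Reader, key)
	out, err := seal(key, []byte("customer record"))
	fmt.Println(len(out), err)
}
```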
By leveraging these security controls, organizations can fortify their services/microservices ecosystems against a wide range of threats and vulnerabilities. These are very good controls. The fight is usually over the how: how do you implement these controls? The answer everyone hates, but which is unfortunately the best one, is: it depends.
One of our security architects has proposed a secure design that can meet our requirements. In this design we would use the API Gateway to enforce authorization, mTLS to secure service-to-service communication, and, in front of it all, a WAF so we can inspect traffic for threats. Not bad. But before jumping to implementation, let’s pick on the design a bit and see if it has any flaws.
API Gateway
To tackle our first area, Services Authorization, one of our security architects has proposed that we use the API Gateway as the place to perform authorization enforcement. Not bad, but let’s break this down.
The Case for API Gateway Authorization
API gateways provide a centralized point to manage security policies, which simplifies management, monitoring, and updates. By abstracting authorization logic away from the service itself and placing it at the gateway, microservices can focus purely on business logic, leading to cleaner and more maintainable codebases. With certain API gateway technologies, you can ensure consistent application of security policies across various services, regardless of their underlying frameworks or languages. And with all authorization checks happening at one layer, it becomes easier to centralize logs, making audits more straightforward and more comprehensive. Not bad actually, but we may have introduced some issues; let’s explore.
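To make the centralization argument tangible, here is a toy Go sketch of a gateway that wraps a reverse proxy in a single authorization middleware. The upstream address is hypothetical and the header check is a stand-in for real token validation or a policy-engine call; the point is simply that every route passes through one choke point.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// requireAuth is a stand-in for the gateway's authorization check. A real
// gateway would validate a JWT or consult a policy engine here.
func requireAuth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" { // placeholder check only
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Hypothetical downstream service; the gateway is the only public entry point.
	upstream, _ := url.Parse("http://orders.internal:8080")
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// Every route shares one authorization wrapper: this is both the
	// centralization benefit and the single point of failure discussed next.
	http.ListenAndServe(":8080", requireAuth(proxy))
}
```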
The Downsides of API Gateway Security
While the arguments in favor of API gateway-based authorization are strong, it’s essential to recognize the potential pitfalls of this approach. Centralizing authorization at the API gateway introduces a risk: if the gateway is compromised, an attacker may gain unfettered access to all downstream services. Believing that the API gateway has security “covered” might lead developers to overlook or neglect security considerations at the service level, creating potential vulnerabilities. While an API gateway can manage high-level authorization, it might not be adept at handling fine-grained permissions at the resource or action level within a service; some services need more detailed access controls than a gateway can efficiently provide. As the first line of defense, the API gateway also bears the brunt of all incoming traffic. If not scaled or optimized properly, this can lead to performance issues or outages.

In scenarios where services might be called internally (bypassing the gateway) or in hybrid cloud environments, relying solely on the API gateway for authorization can be problematic. In a microservices ecosystem, services often need to communicate with one another; if all authorization lives at the gateway, service-to-service calls might bypass these checks entirely, leading to potential security issues. On top of all this, we have also added a dependency: if for some reason your service does not communicate through the gateway, it is left insecure. This dependency also makes local development and the evaluation of security changes nearly impossible.
While placing authorization checks at the API gateway offers some advantages in terms of centralized management and simplicity, it’s not without its challenges.
Mutual Transport Layer Security (mTLS)
To handle our second area of Inter-Services Communication our security architect has suggested we use mTLS. Mutual Transport Layer Security (mTLS) has been frequently touted as the gold standard for secure communication in microservices ecosystems. mTLS is often heralded for its ability to provide both encryption and strong identity assurance. But like all tools, while mTLS has its merits, it also brings along challenges. Let’s delve into both sides of the coin.
An Advocate for mTLS
Unlike traditional TLS, which only verifies the server’s identity to the client, mTLS ensures that both parties in a communication exchange verify each other’s identities. This two-way handshake assures both the client and the server of each other’s authenticity. mTLS not only encrypts the data being transmitted but also guarantees that the communication is happening between trusted parties. This reduces the risk of man-in-the-middle attacks. By verifying the identity of each microservice, mTLS provides the ability to apply fine-grained access controls, ensuring that services only communicate with other services they are meant to. Since both parties are authenticated, neither can later deny having sent or received a message. This is awesome, sort of.
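For a feel of what this looks like in practice, here is a minimal Go HTTPS server configured for mTLS using only the standard library. The certificate file names are placeholders; it is the ClientAuth setting that turns ordinary TLS into mutual TLS by rejecting any client that cannot present a certificate signed by our CA.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// Trust only client certificates issued by our internal CA
	// (file paths here are hypothetical).
	caPEM, err := os.ReadFile("internal-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	clientCAs := x509.NewCertPool()
	clientCAs.AppendCertsFromPEM(caPEM)

	server := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			ClientCAs: clientCAs,
			// RequireAndVerifyClientCert is what makes the TLS "mutual":
			// the handshake fails unless the client presents a valid cert.
			ClientAuth: tls.RequireAndVerifyClientCert,
		},
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// The verified client identity is available to the application.
			w.Write([]byte("hello, " + r.TLS.PeerCertificates[0].Subject.CommonName))
		}),
	}
	log.Fatal(server.ListenAndServeTLS("server-cert.pem", "server-key.pem"))
}
```

Multiply the certificate files in this snippet by every service in the ecosystem and the lifecycle burden described below starts to become apparent.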
Challenges and Limitations of mTLS
Setting up mTLS, especially in large and dynamic environments, can be incredibly complex. It requires careful management of certificates for each service, which can quickly become unwieldy. Certificates aren’t everlasting; they expire. Managing the lifecycle of certificates, including renewal and revocation, can be a daunting task. The mutual handshake process, while enhancing security, introduces latency. In systems where low latency is crucial, the extra milliseconds added by mTLS can be detrimental. An expired or revoked certificate, or a misconfiguration, can lead to service outages. In a tightly coupled microservices ecosystem, this can have cascading effects.

While mTLS secures communication between services, it doesn’t secure the data once it’s at rest. Also, it doesn’t provide application-layer content validation, potentially leaving systems exposed to attacks like SQL injection or XML bomb attacks. Just because communication is secure doesn’t mean the application is. There’s a risk that developers may become complacent, thinking mTLS has “covered” security, potentially neglecting other crucial security practices. Not all systems or services support mTLS, especially older systems. Integrating such systems in an mTLS-secured environment can be challenging. In large ecosystems, the constant generation, validation, and revocation of certificates can become a bottleneck, affecting the system’s ability to scale fluidly.
mTLS offers us a robust mechanism for secure, bidirectional communication in microservices architectures. But like any technology or tool, it’s essential to weigh its benefits against its challenges. The hidden cost of this control is complexity; we will return to that in a moment.
Web Application Firewall (WAF)
The final area we need to address is Services Traffic Analysis. When diving into the realm of web application security, the mention of a Web Application Firewall (WAF) is inevitable. Lauded by many as a potent line of defense against a myriad of web application vulnerabilities, WAFs are designed to scrutinize, monitor, and filter HTTP traffic to and from web applications. They act as a shield between web applications and the traffic they handle, ensuring that malicious requests never make it to their intended target. However, like any tool or system, they come with certain inherent challenges. Let’s delve into the specifics.
Let’s use a WAF
Traditional perimeter firewalls and intrusion detection systems (IDS) operate at the network layer and can’t decipher the nuances of HTTP traffic. This is where a WAF steps in, operating at the application layer. By examining the content of HTTP requests, WAFs can detect and block malicious inputs, such as those leading to Cross-Site Scripting (XSS) or SQL Injection (SQLi) attacks. With the rise in application-layer attacks, a tool like a WAF is invaluable. It allows administrators to apply a set of tailored rules that can be quickly updated to respond to emerging threats, ensuring that even if an application has a vulnerability, a WAF can prevent its exploitation. Plus, some more advanced WAFs can act as a virtual patch, providing immediate protection even before the actual vulnerability in the application is fixed. Pretty sweet!
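Conceptually, a WAF is inspection middleware sitting in front of the application. The deliberately naive Go sketch below blocks two crude attack signatures; real products, such as those built on the OWASP Core Rule Set, apply hundreds of rules with anomaly scoring, which foreshadows the tuning problems discussed next.

```go
package main

import (
	"net/http"
	"regexp"
)

// A deliberately tiny signature list. Real rulesets are far larger and pair
// pattern matching with anomaly scoring to control false positives.
var suspicious = []*regexp.Regexp{
	regexp.MustCompile(`(?i)<script`),        // crude XSS probe
	regexp.MustCompile(`(?i)union\s+select`), // crude SQLi probe
}

// inspect blocks any request whose query string matches a known signature.
func inspect(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		for _, pattern := range suspicious {
			if pattern.MatchString(r.URL.RawQuery) {
				http.Error(w, "request blocked", http.StatusForbidden)
				return
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	app := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", inspect(app))
}
```

Notice that even this toy version would block a legitimate request that happens to contain the string “union select”, which is the false-positive balance described below.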
Challenges and Limitations of WAF
Deploying and tuning a WAF can be an intricate endeavor. A poorly configured WAF might generate a lot of false positives, potentially blocking legitimate traffic and hampering user experience. There’s a delicate balance to strike between being overly permissive (letting threats in) and overly strict (blocking valid requests). Maintenance is also an ongoing task. As new threats emerge, WAF rulesets need to be updated, which might require constant monitoring and expertise. WAFs can introduce latency, as they examine and process each HTTP request and response. For high-traffic websites, this can be a performance concern. A WAF does not remedy the underlying vulnerabilities within the application; it merely provides a layer of protection against them. Advanced attackers might employ evasion techniques to bypass WAF protections. They can obfuscate their attacks in ways that a WAF might not recognize, especially if the WAF is not regularly updated. Just as with mTLS, while WAFs protect against certain attack vectors, they cannot ensure the security of data at rest or shield against all potential risks.
The Trap
It’s an enticing thought. Set up an API gateway, add mTLS, front all this with a WAF, sprinkle in some networking restrictions, and sleep soundly knowing your microservices architecture is secure. While these technologies undoubtedly offer value, relying solely on them creates a potentially dangerous complacency. I would like for us to do better.
Service Mesh
So now what? Well, as described above, if we implement mTLS and want those security goodies we often read about, the common advice is: just implement a service mesh to handle the complexity. The adoption of service mesh architectures with Kubernetes has been growing due to the increasing complexity of SOA/microservice deployments and the challenges associated with them. Let’s break down why using a service mesh may be a good idea (a toy sketch of the sidecar pattern meshes rely on follows the list).
- One of the significant advantages of a service mesh is the enhanced observability it offers. You get a clearer picture of how traffic flows between services, helping with debugging and performance tuning. You get metrics, logs, and traces out of the box, which can be invaluable for understanding service-to-service interactions.
- Service meshes often provide mutual TLS (mTLS) encryption ensuring secure communication between services.
- Service meshes provide out-of-the-box support for circuit breaking, retries, timeouts, and rate limiting. These features enhance the resilience of your applications, ensuring they can gracefully handle failures and prevent them from cascading through the system.
- With a service mesh, you can apply uniform policies and configurations across a multitude of services, regardless of the programming language or framework they’re built with.
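Under the hood, most meshes deliver these features by placing a proxy next to each service instance. As a toy illustration of that sidecar pattern, with hypothetical ports and none of a real mesh’s mTLS, policy, or telemetry machinery, consider this Go reverse proxy:

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// The real service, reachable only locally within the pod.
	app, _ := url.Parse("http://127.0.0.1:9000")
	proxy := httputil.NewSingleHostReverseProxy(app)

	sidecar := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// In a real mesh the proxy terminates mTLS, enforces authorization
		// policy, and emits metrics and traces here before forwarding.
		proxy.ServeHTTP(w, r)
	})

	// The port the pod actually exposes; all traffic takes this extra hop,
	// which is where the sidecar latency overhead comes from.
	http.ListenAndServe(":15001", sidecar)
}
```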
The Service Mesh Fanboy take is:
“Hands down, integrating a service mesh with Kubernetes for RESTful APIs is a game-changer. Why? Because it’s the culmination of modern infrastructure design! You’re not just ensuring that your microservices communicate fluidly — you’re supercharging them. With unparalleled observability, you’re practically giving yourself x-ray vision into your service interactions. The sophisticated traffic control? It’s like directing an orchestra, ensuring each microservice performs its part at the right moment. And let’s not forget the security — it’s like having an impenetrable fortress guarding your app data. The resilience features make your applications as sturdy as a superhero, able to take a hit and keep going. As for uniformity and reduced code complexity, it’s like magic — streamlining and simplifying in ways we could only dream of a few years back. Trust me, if you’re not using a service mesh with Kubernetes, you’re living in the past. #ServiceMeshForTheWin!”
Wow! Can it drive me to the store and solve world hunger too?
While there are a lot of advantages to using service meshes, they’re not a silver bullet. Introducing a service mesh adds another layer of complexity to your stack: more components to monitor, more configuration to manage, and a steeper learning curve for developers and operators. You need only search “service mesh complexity” to get an eyeful. There’s also a performance overhead introduced by the sidecar proxies used in most service mesh architectures. And while many service mesh solutions are robust and production-ready, they’re still relatively new in the grand scheme of things, and best practices are still being established.
In the world of Kubernetes and microservices, it’s tempting to assume a service mesh is the definitive solution to many of our deployment and communication woes. While the advantages of service meshes like Istio and Linkerd are often extolled, it’s crucial for us to critically evaluate whether they’re the right fit for every scenario.
- Service meshes undeniably introduce another layer of complexity. As organizations scale, managing this added layer becomes increasingly challenging. The overhead of learning and managing a service mesh might not justify its benefits.
- Introducing sidecar proxies for every service can have ramifications in terms of latency and cost.
- Despite their growing popularity, service meshes are relatively new. We may face challenges related to best practices, updates, or unforeseen bugs. Are we ready to handle these uncertainties?
- Adopting a service mesh is not just about integrating a new tool. It’s about embracing a new paradigm, which requires time and effort to understand deeply. Not every organization might have the luxury of time or the necessary skill set in-house.
I think the last point is the biggest. Adopting some technologies then forces you to adopt others. This increases complexity and cost, and it induces fear: you may be ready for one but not the other. In my opinion, the biggest goal of a Security Architect should be the adoption of the controls they design. Your controls are no good if the organization cannot, or does not want to, adopt them.
Note: I am a little hard on service meshes above. I am using this particular technology as an example of how adding one technology can grow complexity. Service meshes are not that bad and can, under the right circumstances, be a valuable tool.
Although we have met our requirements and addressed the areas we set out to cover, we have introduced some complexities, and complexity reduces adoption. In our next installment, let’s see if we can do better! We will explore some other options… https://number40.medium.com/services-runtime-security-part-4-df928bb8e8ca