What Is a Service Mesh and Why Do You Need It?

Written by Marius Rimkus
on January 28, 2020

The rise of microservices as the de facto architectural style of modern organizations, together with the rapid adoption of the cloud-native application model, has brought significant changes to application infrastructure and application management.

While we briefly touched upon the topic in our DevOps Cloud-Native Tool Landscape, this article will delve deeper into the details. To understand the reason behind the service mesh, we first need to understand the cloud-native model. A cloud-native application typically comprises hundreds of microservices and, depending on the size of the application, each service may in turn run as thousands of instances.

When you add orchestration-enabled scheduling into the mix, the resulting microservice structure is highly complex, which makes inter-service communication very difficult to manage. This is where service meshes come in.

 

What Is A Service Mesh?

To put it briefly, a service mesh is a dedicated infrastructure layer that handles communication between the microservices that comprise your application.

As a configurable, low-latency infrastructure layer, a service mesh is designed to handle the high volume of inter-process communication that takes place between application infrastructure services.

With the rise of the cloud-native application model, developers are facing the prospect of an ever-increasing number of microservices. This has resulted in an increased need to make inter-services communications as fast and secure as possible.

In practice, a service mesh is implemented by deploying an array of network proxies (called ‘sidecars’) that run alongside the application code. One sidecar is deployed for each service instance.

Is the application aware of these proxies? In principle, it doesn’t need to be. However, depending on the deployment, the developer may decide to make the application aware of them.

In recent times, organizations have increasingly adopted the service mesh as a vital component of the cloud-native application model. Renowned organizations that have adopted service mesh include PayPal, Ticketmaster, and Credit Karma.

Even the Cloud Native Computing Foundation has accepted Linkerd – the popular open-source service mesh – as an official project. This is a testament to the growing popularity of the service mesh in the modern application landscape.

 

Service Mesh As A Networking Model

A common question is whether the service mesh is a networking model. It is, but it operates at a layer of abstraction above TCP/IP.

This networking model assumes that the underlying L3/L4 network is present and able to deliver bytes from point to point between services. It also assumes that this network is unreliable, which means the mesh must be able to handle network failures.

Similarities With TCP/IP

As mentioned above, the service mesh sits at an abstraction layer above TCP/IP, but it shares several characteristics with that protocol stack. Here are some similarities:

  1. The TCP stack abstracts the mechanics of reliably delivering bytes between network endpoints. Likewise, the service mesh abstracts the mechanics of reliably and securely delivering every request to the right service.
  2. The service mesh is indifferent to the payload and to how it is encoded. This closely mirrors how TCP operates regardless of the payload and the details of its encoding.
  3. Just as TCP/IP has built-in failure handling, the service mesh aims to achieve its assigned objective (for instance, ‘send X from service 1 to service 2’) regardless of any failures along the way.

However, there are prominent differences between the two that place the service mesh a tier above TCP/IP. While TCP/IP is designed simply to get the assigned task done, the service mesh has a broader remit.

In addition to accomplishing assigned objectives, service meshes are responsible for allowing developers enhanced visibility and control of the app runtime. While service communication was more of a back-end task that operated independently, the integration of a service mesh allows it to be monitored and managed more effectively.

 

How Does A Service Mesh Actually Work?

The need for an advanced communication protocol arises from the complexity of modern-day applications. The integration of a service mesh not only streamlines the entire process but also allows the entire team to work more efficiently.

The network proxies (sidecars) attached to each service instance take over functions that would otherwise have to be implemented by each microservice on its own. These functions include inter-service communication and monitoring of security-related activity.

This allows a clear distinction in how the application is managed, as developers can now focus solely on the application code management, which involves development, support, and maintenance. On the other hand, the operations team can effectively manage the service mesh and application running services.

Modern-day service meshes, such as the open-source Linkerd, deal with these problems by utilizing various advanced techniques.

Some of these techniques include:

  • Circuit-breaking
  • Load balancing (that factors in latency requirements)
  • Always available service discovery
  • Deadlines and retries

The job of the service mesh is to make these features work together reliably in the complex environment in which they operate.

A Step-By-Step Guide On How Service Meshes Work

To elaborate on how service meshes operate, we will use an example scenario. For this instance, we will walk through the events that occur once a service mesh (Linkerd) receives a request:

Step 1

Once the request is received, the service mesh must identify which service the request should be directed to. There are various questions that need to be answered:

  • Is the required service in the production stage or the staging stage?
  • Is the service located in the on-premise data center or in the cloud?
  • Does the request refer to a version of the microservice that is still being tested or a version that has been tested and deployed to production?

Such tiered decision-making allows the service mesh to locate the correct destination. Additionally, all of these routing rules are configurable and can be applied both globally and to arbitrary slices of traffic.
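To make Step 1 more concrete, here is a minimal Python sketch of how such a routing decision could be modeled. It is not how Linkerd is actually configured: the `Route` type, the two tables and the `x-traffic-slice` header are all hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical description of a concrete destination."""
    environment: str  # "production" or "staging"
    location: str     # "on-prem" or "cloud"
    version: str      # deployed version or one still under test


# Hypothetical global rules: logical service name -> default destination.
GLOBAL_ROUTES = {
    "users": Route("production", "cloud", "v1"),
    "billing": Route("production", "on-prem", "v1"),
}

# Hypothetical rules for an arbitrary slice of traffic, selected by a header.
SLICE_ROUTES = {
    ("users", "canary"): Route("staging", "cloud", "v2-canary"),
}


def resolve(service, headers):
    """Return the slice-specific route if one matches, else the global route."""
    slice_name = headers.get("x-traffic-slice")
    override = SLICE_ROUTES.get((service, slice_name))
    if override is not None:
        return override
    return GLOBAL_ROUTES[service]
```

In this sketch, a request for `users` carrying the header value `canary` is sent to the version under test in staging, while all other traffic follows the global rule.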

Step 2

Once the exact destination of the required microservice is located, the service mesh must retrieve the server instances which correspond to the details outlined in the request from the discovery endpoint.

If the retrieved data contradicts what the service mesh has observed in practice, it decides which source of information to trust.

Step 3

The final instance is chosen on the basis of its speed of response. How does the service mesh measure that? It tracks the observed latency of recent requests and chooses the instance that has been responding fastest.

Then, the service mesh sends the request to the instance, and if successfully executed, records the latency and response type of the resulting outcome.
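One way to picture the latency-aware choice in Step 3 is the short Python sketch below. It assumes an exponentially weighted moving average (EWMA) as the measure of "recent latency"; the article does not say which measure a given mesh uses, and the `LatencyBalancer` class and its `alpha` parameter are hypothetical.

```python
class LatencyBalancer:
    """Illustrative latency-aware picker; not a real mesh implementation."""

    def __init__(self, instances, alpha=0.3):
        # alpha is an assumed smoothing factor: higher values weight
        # the most recent request more heavily.
        self.alpha = alpha
        self.ewma = {name: None for name in instances}  # None = not measured yet

    def pick(self):
        # Try unmeasured instances first so that each one gets a sample.
        for name, value in self.ewma.items():
            if value is None:
                return name
        # Otherwise choose the instance with the lowest smoothed latency.
        return min(self.ewma, key=self.ewma.get)

    def record(self, name, latency_ms):
        # Fold the latency of a completed request into the moving average.
        previous = self.ewma[name]
        if previous is None:
            self.ewma[name] = latency_ms
        else:
            self.ewma[name] = self.alpha * latency_ms + (1 - self.alpha) * previous
```

The caller would use `pick()` before sending a request and `record()` once the response arrives, so a suddenly slow instance quickly loses traffic.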

Step 4

Sometimes, a server instance may turn faulty which hinders the successful execution of the request. This may be due to an instance that has either failed, become unresponsive or is down for maintenance.

In such a scenario, the service mesh retries the request on another instance. The catch is that it will only do so if the request is idempotent. An idempotent request produces the same result no matter how many times it is executed, so repeating it on another instance is safe.
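The retry rule in Step 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Linkerd's retry logic: it infers idempotency from the HTTP method, and `InstanceError`, `send_with_retry` and `flaky_send` are hypothetical names.

```python
# Assumption: idempotency is inferred from the HTTP method alone.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


class InstanceError(Exception):
    """Hypothetical error raised when an instance fails or is unreachable."""


def send_with_retry(method, instances, send):
    """Try instances in order; move on to the next only for idempotent requests."""
    if not instances:
        raise ValueError("no instances available")
    retryable = method.upper() in IDEMPOTENT_METHODS
    last_error = None
    for instance in instances:
        try:
            return send(instance)
        except InstanceError as exc:
            last_error = exc
            if not retryable:
                break  # a non-idempotent request may already have had side effects
    raise last_error


def flaky_send(instance):
    """Stand-in transport for the example: instance "a" is down."""
    if instance == "a":
        raise InstanceError("instance a is unresponsive")
    return f"handled by {instance}"
```

With these definitions, `send_with_retry("GET", ["a", "b"], flaky_send)` falls through to instance `b`, whereas the same call with `"POST"` raises `InstanceError` after the first failure.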

Step 5

Once the request executes successfully, modern service meshes allow the documentation and observation of key metrics.

They capture every aspect of this behavior in the form of metrics and distributed traces, which are then sent to a centralized metrics system.
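As a rough illustration of Step 5, the Python sketch below records the kind of per-request data a mesh might forward to a metrics system. The `RequestMetrics` class and its method names are hypothetical, and a real mesh would export these values rather than keep them in memory.

```python
from collections import Counter, defaultdict


class RequestMetrics:
    """Illustrative in-memory recorder for response types and latencies."""

    def __init__(self):
        self.responses = Counter()               # (service, "2xx"/"5xx"/...) -> count
        self.latencies_ms = defaultdict(list)    # service -> observed latencies

    def observe(self, service, status, latency_ms):
        # Group HTTP status codes into classes such as "2xx" or "5xx".
        status_class = f"{status // 100}xx"
        self.responses[(service, status_class)] += 1
        self.latencies_ms[service].append(latency_ms)

    def success_rate(self, service):
        # Share of 2xx responses, or None if nothing has been recorded yet.
        total = sum(n for (name, _), n in self.responses.items() if name == service)
        if total == 0:
            return None
        return self.responses[(service, "2xx")] / total
```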

That’s Not All

Remember when I told you that a service mesh has advanced failure troubleshooting ability? If one server instance is constantly returning errors, then the service mesh doesn’t simply ignore that instance.

The service mesh removes such a server instance from the load-balancing pool altogether. This conserves resources and increases efficiency.
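A simple way to picture this ejection behavior is the Python sketch below. It assumes a threshold of consecutive failures as the trigger, which is only one possible policy; the `Pool` class and the `max_consecutive_failures` parameter are hypothetical. Real circuit breakers usually also readmit an instance after a cool-down period, which is left out here for brevity.

```python
class Pool:
    """Illustrative load-balancing pool that ejects persistently failing instances."""

    def __init__(self, instances, max_consecutive_failures=3):
        self.max_failures = max_consecutive_failures  # assumed ejection threshold
        self.failures = {name: 0 for name in instances}
        self.ejected = set()

    def healthy(self):
        # Instances that are still eligible to receive traffic.
        return [name for name in self.failures if name not in self.ejected]

    def record(self, name, ok):
        if ok:
            self.failures[name] = 0  # a success resets the failure streak
            return
        self.failures[name] += 1
        if self.failures[name] >= self.max_failures:
            self.ejected.add(name)   # stop sending traffic to this instance
```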

Here’s another example that demonstrates the efficacy of modern service meshes. When the deadline for a request has elapsed, the service mesh recognizes this and fails the request right away. Instead of continuing to retry it and adding load to the network, the service mesh ensures resources are allocated efficiently.
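The deadline behavior can be sketched as follows. This is a minimal Python illustration, assuming the deadline is an absolute time on a monotonic clock and that transient failures surface as `ConnectionError`; `DeadlineExceeded` and `call_with_deadline` are hypothetical names, and the `now` parameter exists only so the clock can be replaced in tests.

```python
import time


class DeadlineExceeded(Exception):
    """Hypothetical error raised once the deadline of a request has passed."""


def call_with_deadline(attempt, deadline, now=time.monotonic):
    """Retry attempt() on ConnectionError, but fail fast once the deadline passes."""
    while True:
        if now() >= deadline:
            # Stop retrying: more attempts would only add load.
            raise DeadlineExceeded("deadline elapsed; failing the request")
        try:
            return attempt()
        except ConnectionError:
            continue  # transient failure; retry while time remains
```

A caller might pass `deadline=time.monotonic() + 0.5` to give a request half a second in total, however many retries that allows.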

In addition to other functions, service meshes can also execute protocol upgrades, switch traffic dynamically as well as initiate and terminate TLS among other services.

 

Benefits Provided By A Service Mesh

There are various benefits of utilizing service meshes as an abstraction layer. Here are some of them:

Increased Standardization

As modern-day applications continue to get distributed over a large base, their functional behavior is also getting extremely erratic – depending on the underlying supporting network. With such varying behavior across different networks, ensuring round-the-clock availability of an application can become a stern challenge.

With a service mesh, the mesh handles the spindling, folding and mutilating of the application. It makes the data center more standardized and organized.

Enhanced Visibility

Advanced service meshes analyze request behavior to determine which components are requested most often, so that these can be placed where they are easily accessible. Coupled with its troubleshooting ability, this gives the mesh a comprehensive overview of the entire system.

As mentioned before, a service mesh can store this data for further use. Developers can use this data to identify trends and threats, which can go towards improving the entire development and build process.

Advanced Security

Unlike a monolithic architecture, a microservices application consists of many different kinds of services. While some are long-lived, others have a very brief lifecycle. This makes assigning unique identities and enforcing policies increasingly complex.

Instead, developers can use service mesh to implement the related policies on all live instances and across all operational microservices. Thanks to its ability to identify which microservice is operational and its running location, it can apply policies based on their type and behavior.

Not only does this negate the need to assign a unique ID to each service or instance, but it also allows developers to enforce policies consistently.

 

Major Service Mesh Tools

There are various service mesh tools available for developers to use. However, unlike other areas of the cloud-native landscape, the tools available in this field are mostly open-source projects rather than off-the-shelf commercial products.

Here is a list of the most popular ones:

Linkerd

Pronounced ‘linker-dee’, this is the oldest of all service meshes and was released in 2016. It started off as a spinoff of a library developed at Twitter.

Another service mesh, Conduit, was introduced in 2017 and gained widespread popularity. It was later merged into the Linkerd project and became the basis for Linkerd 2.0.

Envoy

Envoy is another project that traces its origin back to a renowned organization, in this case Lyft. Envoy targets the ‘data plane’ segment of a service mesh. In order to provide complete functionality, it needs to be used in combination with a ‘control plane’.

Istio

For a ‘data plane’ component like Envoy to work as a full mesh, organizations needed a ‘control plane’. This is why Istio came into being. A collaborative effort between IBM, Google and Lyft, Istio is designed to work in tandem with Envoy, with the two components together forming a complete service mesh.

HashiCorp Consul

Consul originally operated as a distributed system for service discovery and configuration. However, with the release of Consul 1.2, HashiCorp introduced a feature called Connect, which added service-to-service encryption and identity-based authorization. This turned it into a complete service mesh offering.

 

 As developers continue to deal with the complexity that comes with managing microservices, the service mesh will continue to experience a rise in adoption in the cloud-native application ecosystem.

With a thriving community of users and developers leveraging service meshes, organizations all around the globe have started to integrate a service mesh in their software architecture design. As computing continues to evolve, service meshes will also see a change in their scope in the application environment.