This content originally appeared on Level Up Coding - Medium and was authored by Gregory Pabian
Any distributed system might run into emergencies, either because of an accident or an attack. A supposedly harmless action, like the execution of a script, might wreak havoc on an entire platform, rendering it unusable by end-users. This article shows how software developers can prepare a system for emergencies before they even occur.
A disclaimer: I am not a software security expert but a full-stack software developer, and all information provided in this article comes from my coding experience. I want to discuss the major possibilities for preparing for and dealing with emergencies, but I do not intend to provide an opinion on which approaches are correct or incorrect. In the end, designing a secure distributed system likely requires the expertise of skilled security engineers.
Emergency Protocol
I believe that system designers might consider preparing an emergency protocol, at least to understand how they can protect a distributed system against both external and internal threats. I learned a lot about web application security by reading various OWASP resources.
I think that the emergency protocol should contain answers to, at minimum, the following questions:
- who can initiate the emergency protocol and with which rights,
- what happens during the protocol execution and where does it happen,
- how can somebody initiate it and how long does it last.
For this article, I assume that system administrators can use a CLI to initiate the emergency protocol with the highest rights on the system. Invoking the protocol curbs certain functionalities on multiple backends for an indefinite period.
Timeframe
The protocol might work under two timeframes:
- past and future: the system alters its behavior for specified past events and for all future events,
- future only: the system alters its behavior for all future events.
Naturally, not all distributed systems support retrograde behavior changes. In a system that uses long-running transactions, adhering to such a diversion requires executing compensating transactions that remove the effects of original past transactions explicitly.
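The two timeframes can be sketched as a small record that decides whether a restriction covers a given event. This is only an illustration: the `Restriction` class and its `effective_from` field are names I made up for this example, not part of any library.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Restriction:
    """Hypothetical record describing when an emergency rule takes effect."""

    issued_at: datetime
    # None means "future events only"; a datetime reaches back into the past.
    effective_from: Optional[datetime] = None

    def applies_to(self, event_time: datetime) -> bool:
        start = self.effective_from or self.issued_at
        return event_time >= start


issued = datetime(2021, 4, 18, 12, 0, tzinfo=timezone.utc)
future_only = Restriction(issued_at=issued)
retroactive = Restriction(
    issued_at=issued,
    effective_from=datetime(2021, 4, 17, 0, 0, tzinfo=timezone.utc),
)
earlier_event = datetime(2021, 4, 17, 9, 30, tzinfo=timezone.utc)

print(future_only.applies_to(earlier_event))  # False
print(retroactive.applies_to(earlier_event))  # True
```

Deciding that a past event is covered is the easy part; undoing its effects still requires the compensating transactions mentioned above.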
Solutions
I classified the solutions I know of into three groups: ones that block access in stateless protocols like HTTP, ones that block access in message-based systems like Kafka, and ones that roll back transactions. A system architect might choose to apply some of them, even in parallel, as long as they serve a practical purpose in the designed system. Please leave a comment in the comment section if you can think of more useful solutions.
Blocking access in stateless protocols
Arguably, blocking a specific access token (e.g., a JWT) serves as the most precise way of remedying an emergency, as it restricts access to resources for one particular user. The system can apply this obstruction if an attacker has gained an access token and might use it with a nefarious motive. This solution might affect one or more backends, depending on the exposure.
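A minimal sketch of a token blocklist, assuming that tokens carry the registered `jti` (JWT ID) claim and that signature verification and decoding happen elsewhere. The `TokenBlocklist` class is hypothetical and keeps its state in memory only.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBlocklist:
    """In-memory set of blocked token identifiers (hypothetical helper)."""

    blocked_ids: set[str] = field(default_factory=set)

    def block(self, token_id: str) -> None:
        self.blocked_ids.add(token_id)

    def is_allowed(self, claims: dict) -> bool:
        # Assumes the claims come from an already verified token.
        # Tokens without a "jti" claim are rejected, as they cannot be checked.
        token_id = claims.get("jti")
        return token_id is not None and token_id not in self.blocked_ids


blocklist = TokenBlocklist()
blocklist.block("token-123")

print(blocklist.is_allowed({"sub": "alice", "jti": "token-123"}))  # False
print(blocklist.is_allowed({"sub": "bob", "jti": "token-456"}))    # True
```

In a real system, every backend instance that validates tokens would need access to the same list, which is the distribution problem discussed later in the article.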
The system can block access to a specific access method as well (for instance, an HTTP endpoint or an RPC). A system administrator might choose this solution when they do not know the access tokens of a specific attacker or when an endpoint stopped working correctly due to a recent deployment. I believe, before choosing this way of responding to a problem, one should contemplate the consequences of locking away a whole functionality and how to communicate that choice to end-users.
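Blocking an access method could look like the guard below, which a backend would run before dispatching a request. The `blocked_endpoints` set and the `/payments` path are assumptions made for this sketch; answering with 503 Service Unavailable is one possible choice, not the only one.

```python
from http import HTTPStatus

# Hypothetical set of (method, path) pairs disabled during an emergency.
blocked_endpoints: set[tuple[str, str]] = {("POST", "/payments")}


def guard(method: str, path: str) -> HTTPStatus:
    """Return 503 for blocked endpoints and 200 otherwise."""
    if (method.upper(), path) in blocked_endpoints:
        return HTTPStatus.SERVICE_UNAVAILABLE
    return HTTPStatus.OK


print(guard("POST", "/payments").value)  # 503
print(guard("GET", "/payments").value)   # 200
```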
If blocking some access methods in a service does not work, one might consider shutting down that service. In a distributed system, this translates to either disabling the load balancer that forwards requests to instances of the service in question or terminating all of those instances. Either option will have a significant impact on the system, as many functionalities will no longer work.
Finally, system administrators might decide to shut down the entire system (or, at least, all the backends). The precise algorithm for turning a distributed system off depends heavily on its infrastructure and the redundancies in place. Anyone who takes that possibility into account should think about the consequences, as the system will come to a halt, aside from some frontend features that might work in offline mode.
Blocking access in message-based systems
In a message-based system, system administrators might ban sharing specific messages within a system. In some of them, message brokers can apply such a policy, while in others, this job belongs to backends. The reader can think of such rules as predicate functions that take one parameter, a message, and return a boolean value.
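Such predicate rules can be sketched in a few lines. The example assumes that messages are plain dictionaries and that a rule returns `True` when a message must be blocked; the field names `type` and `sender` are invented for illustration.

```python
from typing import Callable

Message = dict  # assumption: messages are plain dictionaries
Rule = Callable[[Message], bool]  # returns True when the message must be blocked


def is_blocked(message: Message, rules: list[Rule]) -> bool:
    """A message is blocked as soon as any rule matches it."""
    return any(rule(message) for rule in rules)


rules: list[Rule] = [
    lambda m: m.get("type") == "send_email",
    lambda m: m.get("sender") == "compromised-service",
]

print(is_blocked({"type": "send_email", "sender": "user-service"}, rules))   # True
print(is_blocked({"type": "create_user", "sender": "user-service"}, rules))  # False
```

Whether this check runs inside the message broker or inside each backend depends on the messaging technology in use.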
Additionally, one can choose to block a specific topic or a channel from distributing messages. This will indirectly affect backends that rely on such a way of communication. Alternatively, one can forbid a particular service to send or receive messages over a topic or a channel.
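The difference between blocking a whole topic and blocking one service on a topic can be expressed as a small policy object. The `TopicPolicy` class and the topic and service names below are hypothetical, and the sketch covers publishing only.

```python
from dataclasses import dataclass, field


@dataclass
class TopicPolicy:
    """Hypothetical publishing policy consulted before a message is sent."""

    blocked_topics: set[str] = field(default_factory=set)
    # (service, topic) pairs that may not publish to the topic.
    blocked_publishers: set[tuple[str, str]] = field(default_factory=set)

    def may_publish(self, service: str, topic: str) -> bool:
        if topic in self.blocked_topics:
            return False
        return (service, topic) not in self.blocked_publishers


policy = TopicPolicy()
policy.blocked_topics.add("payments")
policy.blocked_publishers.add(("report-service", "notifications"))

print(policy.may_publish("billing-service", "payments"))      # False
print(policy.may_publish("report-service", "notifications"))  # False
print(policy.may_publish("user-service", "notifications"))    # True
```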
Lastly, someone might decide to block the entire message pipeline. This solution, just like shutting down the entire system, has tremendous effects on communication within the system, possibly disabling many features. Again, software developers should figure out which functionalities might break down in such a situation before it happens.
Rolling back transactions
The system might roll back certain transactions just before backends commit them into a relational database. Applying such a rule requires adding a validity check into backend architectures and making sure that backends can receive information about policy changes in time. This solution serves as an example of canceling already-started transactions.
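A minimal sketch of such a validity check, using the standard-library `sqlite3` module as a stand-in for any relational database. The `payments` table and the `blocked_accounts` set are assumptions; in a real backend, the set would be updated whenever a policy change arrives.

```python
import sqlite3

# Hypothetical restriction list, updated when the backend learns about an emergency.
blocked_accounts: set[str] = set()


def record_payment(conn: sqlite3.Connection, account: str, amount: int) -> bool:
    """Insert a payment, but roll back if the account got blocked in the meantime."""
    conn.execute(
        "INSERT INTO payments (account, amount) VALUES (?, ?)", (account, amount)
    )
    # Validity check right before the commit.
    if account in blocked_accounts:
        conn.rollback()
        return False
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (account TEXT, amount INTEGER)")
conn.commit()
```

The check only helps if the restriction reaches the backend before the commit, which is why timely delivery of policy changes matters here.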
The feasibility of rolling back distributed transactions depends on the chosen algorithm. In systems that use the saga pattern, one can start compensating transactions to cancel out the effects of original, long-running transactions. As noted in the Timeframe section, this approach can also revert transactions enacted in the past.
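To illustrate compensating transactions, the sketch below records an undo action for every completed step and replays those actions in reverse order on request. The `SagaLog` class, the `transfer` function, and the in-memory `balances` are all invented for this example; a real saga would persist its log and call other services.

```python
from typing import Callable

Compensation = Callable[[], None]


class SagaLog:
    """Hypothetical registry of compensations for completed saga steps."""

    def __init__(self) -> None:
        self._compensations: dict[str, list[Compensation]] = {}

    def record(self, saga_id: str, compensation: Compensation) -> None:
        self._compensations.setdefault(saga_id, []).append(compensation)

    def revert(self, saga_id: str) -> int:
        """Run compensations in reverse order; return how many ran."""
        compensations = self._compensations.pop(saga_id, [])
        for compensate in reversed(compensations):
            compensate()
        return len(compensations)


balances = {"alice": 100, "bob": 0}
log = SagaLog()


def adjust(account: str, delta: int) -> None:
    balances[account] += delta


def transfer(saga_id: str, source: str, target: str, amount: int) -> None:
    adjust(source, -amount)
    log.record(saga_id, lambda: adjust(source, amount))
    adjust(target, amount)
    log.record(saga_id, lambda: adjust(target, -amount))


transfer("saga-1", "alice", "bob", 30)
print(balances)  # {'alice': 70, 'bob': 30}
log.revert("saga-1")
print(balances)  # {'alice': 100, 'bob': 0}
```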
Distributing a solution
The system can distribute a response to an emergency in two fundamental ways: a centralized one and a decentralized one. The centralized option involves either an emergency orchestrator or an emergency choreography. The decentralized distribution requires a way to transport the information to all involved instances of various backends.
Centralized distributions
An emergency orchestrator manages the crisis response by giving precise instructions in a predefined order to specific backends, serving as the central authority. This way, a single service might contain the entire code responsible for the aforementioned response. For instance, the orchestrator can notify particular backends about a compromised JWT.
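An orchestrator of this kind could be sketched as a function that walks through a predefined plan. I assume here that every backend exposes a callable returning `True` on acknowledgment; in practice these would be network calls, and the service names are made up.

```python
from typing import Callable

# Assumption: each backend exposes a callable that returns True on acknowledgment.
Notify = Callable[[str], bool]


def orchestrate(token_id: str, backends: list[tuple[str, Notify]]) -> list[str]:
    """Notify backends in the given order; return the names that did not acknowledge."""
    failed: list[str] = []
    for name, notify in backends:
        if not notify(token_id):
            failed.append(name)
    return failed


revoked: dict[str, set[str]] = {"user-service": set(), "billing-service": set()}


def make_backend(name: str) -> Notify:
    def notify(token_id: str) -> bool:
        revoked[name].add(token_id)
        return True

    return notify


def unreachable(token_id: str) -> bool:
    return False


plan = [
    ("user-service", make_backend("user-service")),
    ("billing-service", make_backend("billing-service")),
    ("report-service", unreachable),
]

print(orchestrate("token-123", plan))  # ['report-service']
```

Returning the list of failed backends lets an administrator see at a glance where the response did not land.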
An emergency choreography enforces rules of engagement in case of an accident for all the services within an infrastructure, without defining a backend to manage the response. It expects services to exchange information regarding a threat using predefined channels of communication. For example, system administrators can register a stolen JWT using an API exposed by a user service, which will then propagate that information to all backends that rely on authentication.
Decentralized distributions
As mentioned before, software architects could achieve decentralized distribution by creating an emergency message pipeline. On one end, they could access it using a CLI or a GUI. On the other, all the backends within a system could listen to all the messages on that pipeline and choose to react to some of them if necessary.
Designing a unified message pipeline requires thinking of a structure that describes the emergency response in a format understood by backends. I believe such a structure should not contain information about any service in particular, but rather a concise command on what to do. For instance, a request to shut the entire system down concerns every backend, but a command to temporarily ban sending e-mails concerns only an e-mail service.
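One possible shape for such a structure is sketched below, serialized as JSON. The `EmergencyCommand` fields and the action names `shutdown` and `ban_emails` are assumptions; the point is that the command names no service, and each backend decides on its own which actions it handles.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmergencyCommand:
    action: str     # e.g. "shutdown" or "ban_emails" (hypothetical names)
    issued_at: str  # ISO 8601 timestamp
    params: dict


def encode(command: EmergencyCommand) -> str:
    return json.dumps(asdict(command))


def decode(raw: str) -> EmergencyCommand:
    return EmergencyCommand(**json.loads(raw))


# Each backend keeps its own list of actions it reacts to.
EMAIL_SERVICE_ACTIONS = {"shutdown", "ban_emails"}
USER_SERVICE_ACTIONS = {"shutdown"}


def concerns(command: EmergencyCommand, handled_actions: set[str]) -> bool:
    return command.action in handled_actions


ban = EmergencyCommand("ban_emails", "2021-04-18T20:25:28+00:00", {"minutes": 30})
received = decode(encode(ban))

print(concerns(received, EMAIL_SERVICE_ACTIONS))  # True
print(concerns(received, USER_SERVICE_ACTIONS))   # False
```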
Applying a solution
Applying a solution to a backend depends on the solution itself, the software architecture of the backend, and the way the backend received the information about an emergency. Even though I discuss backends, in reality, the solution has to reach each backend instance to work. System designers can achieve this using an emergency message pipeline or a service registry.
In backends that use dependency injection, a list of emergency-based restrictions can reside in a component reused by all request or message handlers. One could achieve similar results with global variables but I discourage such an approach. In the end, handlers can run a check function to verify if some restrictions apply to the currently evaluated request.
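A minimal sketch of this idea with plain constructor injection and no framework. The `EmergencyRestrictions` component, the `RestrictedError` exception, and the `send_email` action name are hypothetical.

```python
class RestrictedError(Exception):
    """Raised when a handler attempts an action that is currently restricted."""


class EmergencyRestrictions:
    """Hypothetical component shared by all handlers of one backend instance."""

    def __init__(self) -> None:
        self._blocked_actions: set[str] = set()

    def restrict(self, action: str) -> None:
        self._blocked_actions.add(action)

    def lift(self, action: str) -> None:
        self._blocked_actions.discard(action)

    def check(self, action: str) -> None:
        if action in self._blocked_actions:
            raise RestrictedError(action)


class EmailHandler:
    def __init__(self, restrictions: EmergencyRestrictions) -> None:
        # The shared component is injected rather than read from a global.
        self._restrictions = restrictions

    def handle(self, recipient: str) -> str:
        self._restrictions.check("send_email")
        return f"sent to {recipient}"


restrictions = EmergencyRestrictions()
handler = EmailHandler(restrictions)

print(handler.handle("alice@example.com"))  # sent to alice@example.com
restrictions.restrict("send_email")
```

After the last line, the next call to `handler.handle` raises `RestrictedError` until someone calls `restrictions.lift("send_email")`.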
If the system uses emergency orchestration, a backend that receives emergency information needs to share it with other parts of the system, either via stateless requests or a message pipeline. A failure of the backend to do so might stop the emergency response from spreading. System designers should find ways to circumvent that obstacle, for instance, by having a previous service waiting on a confirmation of information processing by the next service before continuing.
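The confirmation idea can be sketched as a chain in which each hop must confirm before the next one is contacted. The `propagate` function, the retry count, and the simulated `flaky` services are assumptions made for this example; real deliveries would be requests or messages with timeouts.

```python
from typing import Callable

Deliver = Callable[[str], bool]  # returns True once the service confirms processing


def propagate(
    info: str, chain: list[tuple[str, Deliver]], max_attempts: int = 3
) -> list[str]:
    """Pass info down the chain; return the names of services that confirmed."""
    confirmed: list[str] = []
    for name, deliver in chain:
        # Retry until the service confirms or the attempts run out.
        if not any(deliver(info) for _ in range(max_attempts)):
            break  # later services are not reached through this hop
        confirmed.append(name)
    return confirmed


def flaky(failures: int) -> Deliver:
    """Simulate a service that fails a given number of times before confirming."""
    remaining = [failures]

    def deliver(info: str) -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return deliver


chain = [("user-service", flaky(0)), ("billing-service", flaky(2)), ("email-service", flaky(0))]
print(propagate("token-123 stolen", chain))
# ['user-service', 'billing-service', 'email-service']
```

A shorter result than the chain itself tells the caller exactly where the response stopped spreading.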
Summary
As I hinted at the very beginning of the article, there are multiple ways of dealing with emergencies in distributed systems, as different systems use different patterns for backend-to-backend communication. I believe that trying to fit crisis responses into existing systems can teach a lot about system design. If you have any questions, please leave them in the comment section.
Emergencies in Distributed Systems was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.
Gregory Pabian | Sciencx (2021-04-18T20:25:28+00:00) Emergencies in Distributed Systems. Retrieved from https://www.scien.cx/2021/04/18/emergencies-in-distributed-systems/