Robust software approach

Decoupling with a message queue

As anyone who has tried to build a non-trivial application will tell you: building software is hard. We need to speak to the computer in a language it can understand. We need to make sure that the data flow is right. And that things happen at the right time.

Adding the need for robustness to the complexity already inherent in software makes it even harder. Not only does the application have to do the right thing, it also has to cope with any potential problems. The execution environment is not perfect. Things will break and fail, and the application needs to survive.

Fortunately, the software industry has come up with many approaches to this problem. Proven solutions which make the task much more approachable.

One such approach, inspired by Erlang/OTP, is to split the application into a set of independent processes. They communicate with each other through a persistent message queue.
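To make this concrete, here is a minimal sketch of one such worker process. Everything in it is an assumption made for illustration: the persistent queue is modelled as a SQLite table (in a real system it might be RabbitMQ, Redis Streams, or SQS), and handle() is a placeholder for the actual work.

```python
import json
import sqlite3
import time

def open_queue(path="queue.db"):
    # The "queue" is just a table of messages with a done flag.
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id   INTEGER PRIMARY KEY AUTOINCREMENT,
               body TEXT NOT NULL,
               done INTEGER NOT NULL DEFAULT 0)"""
    )
    conn.commit()
    return conn

def handle(payload):
    # Placeholder for the process's actual work.
    print("processing", payload)

def worker_loop(conn):
    while True:
        row = conn.execute(
            "SELECT id, body FROM messages WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            time.sleep(1)  # queue is empty; poll again
            continue
        msg_id, body = row
        handle(json.loads(body))
        # Mark the message done only after the work succeeded, so a crash
        # mid-task leaves it in the queue to be retried after a restart.
        conn.execute("UPDATE messages SET done = 1 WHERE id = ?", (msg_id,))
        conn.commit()

if __name__ == "__main__":
    worker_loop(open_queue())
```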

There will also be a process manager, a master process which ensures that all the other processes are up. If any of them exits, its job is to start it again, so that the application as a whole continues to function.
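A minimal sketch of such a manager, assuming the workers are standalone scripts started as child processes (the script names below are hypothetical):

```python
import subprocess
import time

# Hypothetical worker entry points; in a real system these would be
# the application's actual processes.
COMMANDS = [
    ["python", "fetcher.py"],
    ["python", "parser.py"],
]

def supervise(commands):
    children = [subprocess.Popen(cmd) for cmd in commands]
    while True:
        for i, proc in enumerate(children):
            if proc.poll() is not None:  # the process has exited
                print(f"worker {i} exited with code {proc.returncode}; restarting")
                children[i] = subprocess.Popen(commands[i])
        time.sleep(1)

if __name__ == "__main__":
    supervise(COMMANDS)
```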

This approach forces each process to be built around certain assumptions.

First of all, in the case of a non-recoverable error, a process should log the error and then simply exit. This ensures that, whatever happens, the rest of the application continues to function. The failure is restricted to a single process.
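In practice this rule can be as simple as a top-level catch that logs the exception and exits with a non-zero code. A rough sketch:

```python
import logging
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("worker")

def main():
    # Placeholder for the worker's real loop.
    raise RuntimeError("simulated non-recoverable error")

if __name__ == "__main__":
    try:
        main()
    except Exception:
        # Record what went wrong, then die; the process manager
        # will start a fresh instance.
        log.exception("non-recoverable error, exiting")
        sys.exit(1)
```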

The second requirement is that any work can be interrupted and later resumed. Ideally, every task should be idempotent.

This second requirement ensures that a crash will not cause any loss of work. From the outside, any task must be either finished or not; an in-between state is not permitted. That way, if the process crashes before finishing, it can pick the task up again when it is restarted.
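One common way to get both properties is to write the task's result and its "finished" marker in a single transaction, and to treat re-delivery of an already finished task as a no-op. A sketch, again using SQLite for illustration and assuming each message carries a unique task id:

```python
import sqlite3

def setup(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS results "
                 "(task_id TEXT PRIMARY KEY, value TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS completed "
                 "(task_id TEXT PRIMARY KEY)")
    conn.commit()

def do_work(payload):
    return payload.upper()  # placeholder for the real task

def process(conn, task_id, payload):
    if conn.execute("SELECT 1 FROM completed WHERE task_id = ?",
                    (task_id,)).fetchone():
        return  # already finished: re-delivery is a no-op
    result = do_work(payload)
    # Result and "finished" marker are committed in one transaction,
    # so from the outside the task is either done or not started.
    with conn:
        conn.execute("INSERT INTO results (task_id, value) VALUES (?, ?)",
                     (task_id, result))
        conn.execute("INSERT INTO completed (task_id) VALUES (?)",
                     (task_id,))

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    setup(conn)
    process(conn, "task-1", "hello")
    process(conn, "task-1", "hello")  # second delivery changes nothing
    print(conn.execute("SELECT * FROM results").fetchall())
```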

This way of structuring an application also takes care of network flakiness. It doesn't matter that a connection was broken: when the process gets restarted it will simply try again, and there is a good chance that the networking problem will have gone away by then.

Another benefit of this approach is that the system, as a whole, can survive the crash of any of its components. The work may back up a bit in the message queue, but at least nothing will be lost. Contrast that with a monolithic application, where any one part failing can bring down the whole application.

It should also be possible to update any individual part without affecting everything else. The caveat is that any change must be either very localized or carefully orchestrated between processes, especially if the communication format has to change.
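One way to orchestrate such a format change is to tag every message with a version field and keep consumers able to decode both the old and the new format during the transition. A sketch, with field names made up for the example:

```python
import json

def decode(raw):
    msg = json.loads(raw)
    version = msg.get("version", 1)  # messages without a version are "v1"
    if version == 1:
        # old, flat format
        return {"user": msg["user"], "action": msg["action"]}
    if version == 2:
        # new format with a nested payload
        return {"user": msg["payload"]["user"],
                "action": msg["payload"]["action"]}
    raise ValueError(f"unsupported message version: {version}")

if __name__ == "__main__":
    old = json.dumps({"user": "alice", "action": "signup"})
    new = json.dumps({"version": 2,
                      "payload": {"user": "bob", "action": "login"}})
    print(decode(old))
    print(decode(new))
```

Once every producer emits the new version and the queue has drained of old messages, the version-1 branch can be deleted.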

The system can be made even more robust if each process (or group of processes) is placed on a separate machine. That way, even catastrophic events like disk failures will have a much smaller impact.

As with everything in software, there are tradeoffs. All that resilience comes at the cost of added complexity. It takes more time to develop, deploy and monitor, and it restricts how each process can be structured. But when the cost is worth it, it's definitely a great approach.