A transient fault in a web application is a communication error caused by a dependency call.
Most often, downstream network failures are classified as transient faults. We have all visited a website that did not return a page on the very first request; on refreshing, the site seems to be back. That is a classic case of a transient fault.
What can cause Transient Faults?
In a service-oriented architecture, transient faults are very common, since applications are hosted on distant servers (spread across regions, states, and countries) and all communication flows through the internet. Messages are broken into smaller packets and sent concurrently; the packets are routed through various regions and reach their destination via best-path routing algorithms.
However, such complex paths are often affected by software glitches and hardware failures, leading to lost packets. The chances of transient faults grow as more and more applications move to the cloud. Say you host your web application in the cloud, and your database is also in the cloud but in a different region or on a different server; in these cases you may encounter transient faults more frequently than a monolithic application would.
Transient faults are a side effect of modern cloud-based web applications.
How to detect Transient Faults?
There is no easy way to differentiate transient faults from application failures. Applications are designed to abstract internal details from clients in case of failure, which leads to even more confusion.
Say you are looking up a product catalog and a 404 error occurs; it may not be clear whether this is a temporary condition or simply a search that returned no results. Only if the application differentiates the two with clear-cut response messages can we detect transient faults.
If possible, study the complete behavior of the web API and record the expected responses for various sets of input data. This may not guarantee 100% detection, but it will at least help identify the parts of the application that may be slowing your overall performance.
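One way to make that recorded behavior actionable is a small classifier over response codes. This is only a sketch: the status-code set below is an assumption, and a real API's documentation should define which responses indicate temporary conditions.

```python
# Status codes that usually signal a temporary condition: request timeout,
# throttling, and gateway/server errors. This set is an assumption; adjust
# it to match what the API you call actually documents.
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}

def is_transient(status_code: int) -> bool:
    """Return True if the status code likely indicates a transient fault."""
    return status_code in TRANSIENT_STATUSES
```

With this in place, a 404 from a catalog lookup is treated as a final answer ("no matching product"), while a 503 is a candidate for a retry.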
Retrying to auto-heal Transient Faults.
Transient faults can be auto-healed by retrying the request when an intermittent failure occurs.
By establishing a retry strategy, such as immediate retries or timestamp-spaced requests, we can heal transient faults.
The queue-centric pattern is one of the front-running technical solutions for this ambiguous problem.
Timestamp spacing may seem like the safer option, but it may not be the most efficient choice for real-time systems.
Retry strategy: Queue-centric pattern.
The queue-centric pattern is a decoupled architecture and a simple idea for retrying dependency calls in the event of a transient fault. Simple queues are used to decouple requests, so that one failure neither slows the others down nor prevents a workflow from completing.
So if a request must be processed through various downstream applications, and it is not considered complete until every stage has passed successfully, then by dropping requests into an individual queue designated for each stage, we protect consecutive requests from failing one another.
Such a scenario is a classic use case for the queue-centric pattern. When a transient fault happens at stage 2 of 5, we put the request back into queue 2 for retrial and can continue processing the rest of the requests asynchronously.
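The per-stage-queue idea can be sketched in a few lines. This is a simplified, single-process illustration (three stages instead of five, and the function names are hypothetical); a real deployment would use a durable message broker rather than in-memory queues.

```python
import queue

STAGES = 3
# One queue per pipeline stage, so a fault at one stage is isolated there.
stage_queues = [queue.Queue() for _ in range(STAGES)]

def submit(request):
    """Enter a new request at stage 0."""
    stage_queues[0].put(request)

def run_stage(stage, handler):
    """Drain one stage's queue; requeue on transient fault, else forward."""
    while not stage_queues[stage].empty():
        request = stage_queues[stage].get()
        try:
            handler(request)
        except ConnectionError:
            # Put the request back in this stage's queue for a later retry
            # and stop draining, so we don't spin on a down dependency.
            stage_queues[stage].put(request)
            break
        if stage + 1 < STAGES:
            stage_queues[stage + 1].put(request)
```

A request that hits a transient fault at stage 1 simply waits in queue 1; requests already past stage 1 continue through the remaining stages untouched.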
The queue-centric pattern has a positive residual effect on scalability.
By writing asynchronous code, we are scaling at the very fabric of the microservice.
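To make the scalability point concrete, here is a minimal asyncio sketch (the handler and its simulated delay are assumptions): because handlers await their downstream calls, a slow or retried request overlaps with the others instead of serializing the whole pipeline.

```python
import asyncio

async def handle(request, delay):
    # asyncio.sleep stands in for an awaited downstream call.
    await asyncio.sleep(delay)
    return f"{request}:done"

async def main():
    # Both requests are in flight concurrently; total time is roughly
    # the slowest call, not the sum of all calls.
    return await asyncio.gather(handle("a", 0.01), handle("b", 0.01))
```

Run it with `asyncio.run(main())`; the same structure scales to many concurrent requests without adding threads.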