Archive for the ‘Retry Pattern’ Category

XFactor

By: Cole Francis, Architect, PSC, LLC


THE PROBLEM

So, what do you do when you’re building a website, and you have a long-running client-side call to a Web API layer. Naturally, you’re going to do what most developers do and call the Web API asynchronously.  This way, your code can continue to cruise along until a result finally return from the server.

But, what if matters are actually worse than that?  What if your Web API Controller code contacts a Repository POCO that then calls a stored procedure through the Entity Framework.  And, what if the Entity Framework leverages a project dedicated database, as well as a system-of-record database, and calls to your system-of-record database sporadically fail?

Like most software developers, you would lean towards looking at the log files, offering traceability and logging for your code.  But, what if there wasn’t any logging baked into the code?  Even worse, what if this problem only occurred sporadically?  And, when it occurs, orders don’t make it into the system-of-record database, which means that things like order changes and financial transactions don’t occur.  Have you ever been in a situation like this one?


PART I – HERE COMES ELMAH

From a programmatic perspective, let’s hypothetically assume that the initial code had the controller code calling the repository POCO in a simple For/Next loop that iterates a hardcoded 10 times.  So, if just one of the 10 iterating attempts succeeds, then it means that the order was successfully processed.  In this case, the processing thread would break free from the critical section in the For/Next loop and continue down its normal processing path.  This, my fellow readers, is what’s commonly referred to as “Optimistic Programming”.

The term, “Optimistic Programming”, hangs itself on the notion that your code will always be bug-free and operate on a normal execution path.  It’s this type of programming that provides a developer with an artificial comfort level.  After all, at least one of the 10 iterative calls will surely succeed.  Right?  Um…right?  Well, not exactly.

Jack Ganssle, from the Ganssle Group, does an excellent job explaining why this development approach can often lead to catastrophic consequences.  He does this in his 2008 online rant entitled, “Optimistic Programming“.  Sure, his article is practically ten years old at this point, but his message continues to be relevant to this very day.

The bottom line is that without knowing all of the possible failure points, their potential root cause, and all the alternative execution paths a thread can tread down if an exception occurs, then you’re probably setting yourself up for failure.  I mean, are 10 attempts really any better than one?  Are 10,000 calls really any better than 10?  Not only are these flimsy hypothesis with little or no real evidence to back them up, but they further convolute and mask the underlying root cause of practically any issue that arises.  The real question is, “Why are 10 attempts necessary when only one should suffice?”

So, what do you do in a situation when you have very little traceability into an ailing application in Production, but you need to know what’s going on with it…like yesterday!  Well, the first thing you do is place a phone call to The PSC Group, headquartered in Schaumburg, IL.  The second thing you do is ask for the help of Blago Stephanov, known internally to our organization as “The X-Factor”, and for a very good reason.  This guy is great at his craft and can accelerate the speed of development and problem solving by at least a factor 2…that’s no joke.

In this situation, Blago recommends using a platform like Elmah for logging and tracing unhandled errors.  Elmah is a droppable, pluggable logging framework that dynamically captures all unhandled exceptions.  It also offers color-coded stack traces with line numbers that can help pinpoint exactly where the exception was thrown.  Even more impressive, its very quick to implement and requires low personal involvement during integration and setup.  In a nutshell, its implementation is quick and it makes debugging a breeze.

Additionally, Elmah comes with a web page that allows you to remotely view the unhandled exceptions.  This is a fantastic function for determining the various paths, both normal and alternate, that lead up to an unhandled error. Elmah also allows developers to manually record their own information by using the following syntax.

ErrorSignal.FromCurrentContext().Raise(ex);

 

Regardless, Elmah’s capabilities go well beyond just recording exceptions. For all practical purposes, you can record just about any information you desire. If you want to know more about Elmah, then you can read up on it by clicking here.  Also, you’ll be happy to know that you can buy if for the low, low price of…free.  It just doesn’t get much better than this.


PART II – ONE REALLY COOL (AND EXTREMELY RELIABLE) RE-TRY PATTERN

So, after implementing Elmah, let’s say that we’re able to track down the offending lines of code, and in this case the code was failing in a critical section that iterates 10 times before succeeding or failing silently.  We would have been very hard-pressed to find it without the assistance of Elmah.

Let’s also assume that the underlying cause is that the code was experiencing deadlocks in the Entity Framework’s generated classes whenever order updates to the system-of-record database occur.  So, thanks to Elmah, at this point we finally have some decent information to build upon.  Elmah provides us with the stack trace information where the error occurred, which means that we would be able to trace the exception back to the offending line(s) of code.

After we do this, Blago recommends that we craft a better approach in the critical section of the code.  This approach provides more granular control over any programmatic retries if a deadlock occurs.  So, how is this better you might ask?  Well, keep in mind from your earlier reading that the code was simply looping 10 times in a For/Next loop.  So, by implementing his recommended approach, we’ll have the ability to not only control the number of iterative reattempts, but we can also control wait times in between reattempted calls, as well as the ability to log any meaningful exceptions if they occur.

 

       /// <summary>
       /// Places orders in a system-of-record DB
       /// </summary>
       /// <returns>An http response object</returns>
       [HttpGet]
       public IHttpActionResult PlaceOrder()
       {
           using (var or = new OrderRepository())
           {
               Retry.DoVoid(() => or.PlaceTheOrder(orderId));
               return Ok();
           }
       }

 

The above Retry.DoVoid() method calls into the following generic logic, which performs its job flawlessly.  What’s more, you can see in the example below where Elmah is being leveraged to log any exceptions that we might encounter.

 

using Elmah;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace PSC.utility
{
   /// <summary>
   /// Provides reliable and traceable retry logic
   /// </summary>
   public static class Retry
   {
       /// <summary>
       /// Retry logic
       /// </summary>
       /// <returns>Fire and forget</returns>
       public static void DoVoid(Action action, int retryIntervallInMS = 300, int retryCount = 5)
       {
           Do<object>(() =>
           {
               action();
               return null;
           }, retryIntervallInMS, retryCount);
       }

       public static T Do<T>(Func<T> action, int retryIntervallInMS = 300, int retryCount = 5)
       {
           var exceptions = new List<Exception>();
           TimeSpan retryInterval = TimeSpan.FromMilliseconds(retryIntervallInMS);

           for (int retry = 0; retry < retryCount; retry++)
           {
               bool success = true;

               try
               {
                   success = true;

                   if (retry > 0)
                   {
                       Thread.Sleep(retryInterval);
                   }
                   return action();
               }
               catch (Exception ex)
               {
                   success = false;
                   exceptions.Add(ex);
                   ErrorSignal.FromCurrentContext().Raise(ex);
               }
               finally
               {
                   if (retry > 0 && success) {
                       ErrorSignal.FromCurrentContext().Raise(new Exception(string.Format("The call was attempted {0} times. It finally succeeded.", retry)));
                   }
               }
           }
           throw new AggregateException(exceptions);
     }
   }
}

As you can see, the aforementioned Retry() pattern offers a much more methodical and reliable approach to invoke retry actions in situations where our code might be failing a few times before actually succeeding.  But, even if the logic succeeds, we still have to ask ourselves questions like, “Why isn’t one call enough?” and “Why are we still dealing with the odds of success?”

After all, not only do we have absolutely no verifiable proof that looping and reattempting 10 times achieves the necessary “odds of success”.  Therefore, the real question is why there should there be any speculation at all in this matter?  After all, we’re talking about pushing orders into a system-of-record database for revenue purposes, and the ability to process orders shouldn’t boil down to “odds of success”.  It should just work…every time!

Nonetheless, what this approach will buy us is one very valuable thing, and that’s enough time to track down the issue’s root cause.  So, with this approach in place, our number one focus would now be to find and solve the core problem.


PART III – PROBLEM SOLVED

So, at this point we’ve relegated ourselves to the fact that, although the aforementioned retry logic doesn’t hurt a thing,  it masks the core problem.

Blago recommends that the next step is to load test the failing method by creating a large pool of concurrent users (e.g. 1,000) all simulating the order update function at the exact same time.  I’ll also take it one step further by recommending that we also need to begin analyzing and profiling the SQL Server stored procedures that are being called by the Entity Framework and rejected.

I recommend that we first review the execution plans of the failing stored procedures, making sure their compiled execution plans aren’t lopsided.  if we happen to notice that too much time is being spent on individual tasks inside the stored procedure’s execution plan, then our goal should be to optimize them.  Ideally, what we want to see is an even distribution of time optimally spread across the various execution paths inside our stored procedures.

In our hypothetical example, we’ll assume there are a couple of SQL Server tables using complex keys to comprise a unique record on the Order table.

Let’s also assume that during the ordering process, there’s a query that leverages the secondary key to retrieve additional data before sending the order along to the system-of-record database.   However, because the complex keys are uniquely clustered, getting the data back out of the table using a single column proves to be too much of a strain for the growing table.  Ultimately, this leads to query timeouts and deadlocks, particularly under load.

To this end, optimizing the offending stored procedures by creating a non-clustered, non-unique index for the key attributes in the offending tables will vastly improve their efficiency.  Once the SQL optimizations are complete, the next step should be to perform more load tests and to leverage the SQL Server Profiling Tool to gauge the impact of our changes.  At this point, the deadlocks should disappear completely.


LET’S SUMMARIZE, SHALL WE

The moral of this story is really twofold.  (1) Everyone should have an “X-Factor” on their project; (2) You can’t beat great code traceability and logging in a solution. If option (1) isn’t possible, then at a minimum make sure that you implement option (2).

Ultimately, logging and traceability help out immeasurably on a project, particularly where root cause analysis is imperative to track down unhandled exceptions and other issues.  It’s through the introduction of Elmah that we were able to quickly identify and resolve the enigmatic database deadlock problems that plagued our hypothetical solution.

Regardless, while this particular scenario is completely conjectural, situations like these aren’t all that uncommon to run across in the field.  Regardless, most of this could have been prevented by following Jack Ganssule’s 10-year old advice, which is to make sure that you check those goesintas and goesoutas!  But, chances are that you probably won’t.

Thanks for reading and keep on coding! 🙂