Friday, December 14, 2012

Task is NOT a Panacea - Cancelled in the void


In a recent code review, I spotted the following code pattern, which, like a deadlock, is hard to pin down but is certain to cause problems eventually:
protected virtual void putIntoLocalCache(string cacheKey, T obj)
{
  // Put the object into the local cache on a thread-pool thread.
  // The started task is assigned to a local but never awaited, returned,
  // or otherwise observed, so nothing notices if it is cancelled or faults.
  var task = Task.Factory.StartNew((object item) =>
  {
    localCache.Put(cacheKey, item);
  }, obj);
}

For background's sake, this method is within a library that is likely called from within IIS, on either an ASP.NET worker thread or a thread from WCF's I/O thread pool, each of which can be cancelled or torn down by the host, thus cancelling any tasks they started (especially ones that no one is waiting on).

Imagine that there is an FxCop rule that will fail the build any time a method starts a task and does not either wait on it, return it, or return a task that waits on it (such as through Task.WhenAll()).
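
A rule-compliant version of the snippet above might simply hand the task back to its caller. This is a sketch that reuses the localCache field and T type parameter from the original; the method name is my own:

protected virtual Task putIntoLocalCacheAsync(string cacheKey, T obj)
{
  // Returning the task makes the caller responsible for it: the caller can
  // wait on it, compose it (e.g., via Task.WhenAll()), or return it further up.
  return Task.Factory.StartNew((object item) =>
  {
    localCache.Put(cacheKey, item);
  }, obj);
}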

Failing to heed this rule will result in tasks being cancelled, and thus in data loss or inconsistencies that we will spend a great deal of time chasing down.  And this, like a deadlock issue, only gets worse with scale.

The guidance this code brings to mind follows:
  • First, do no harm (to the data).  If an optimization can introduce data loss, do not perform the optimization.
  • Yes, you can optimize for the writer, but first, optimize for the reader.  Generally the writer has a much larger vested interest in the data, so is more apt to wait for the certainty that the data has been written.
  • Wherever the writer is willing to fire and forget, write asynchronously, but in a manner that is visible and manageable, i.e., a higher-level task or a queue-driven workflow.
  • Read-through cache does NOT mean write to all caching layers.  Read-through means write to the remote source of truth, then invalidate the caching layers (inverting these steps produces a race condition); writing to the cache concurrently with writing to the source of truth is a race-condition-prone implementation.  That said, if the remote is very distant in terms of the time between the write and the eventual read back, an optimization may be introduced that writes to the caching layers, but it must be accompanied by a short TTL, equivalent to the eventual-consistency SLA timespan (see the sketch after this list).
  • Ensure that the cache supports TTL or forces a short-lived TTL.  Cache is not a permanent store.  Read-through to the source of truth must be fast enough to meet the SLA.  Cache is just an optimization.
  • Ensure that the cache can be flushed by a reconciliation process (a single-item delete is exposed).
  • If the writer is not willing to fire and forget and a write operation exceeds 3s (pluggable SLA number) under expected running conditions, break the write down into multiple component writes, using statuses on the component resources to determine the status of the composite resource.  This is basically a user-interactive workflow, i.e., a wizard.
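
As a concrete sketch of the write-then-invalidate ordering from the read-through bullet above: the IStore and ICacheLayer abstractions below are hypothetical stand-ins invented for illustration, not types from the code under review.

using System;

// Hypothetical store and cache abstractions, purely for illustration.
public interface IStore<T> { void Write(string key, T value); }
public interface ICacheLayer<T>
{
  void Put(string key, T value, TimeSpan ttl);
  void Invalidate(string key);
}

public class ReadThroughWriter<T>
{
  private readonly IStore<T> sourceOfTruth;
  private readonly ICacheLayer<T> cache;

  public ReadThroughWriter(IStore<T> sourceOfTruth, ICacheLayer<T> cache)
  {
    this.sourceOfTruth = sourceOfTruth;
    this.cache = cache;
  }

  public void Write(string key, T value)
  {
    // 1. Write the source of truth first.
    sourceOfTruth.Write(key, value);

    // 2. Then invalidate the caching layers.  Inverting these two steps, or
    //    writing to the cache concurrently with the store, invites the race
    //    condition called out above.
    cache.Invalidate(key);

    // Optional optimization when the source of truth is very distant: prime
    // the cache, but only with a TTL no longer than the eventual-consistency SLA.
    // cache.Put(key, value, TimeSpan.FromSeconds(5));  // assumed SLA number
  }
}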

In this case, I would simply await the cache write.  The cache write is fast enough, and if it is not, that is the problem that should be addressed.  If this were writing to a slow remote store, e.g., AWS Glacier, then I'd figure out a way to make it properly asynchronous.
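
Concretely, the simplest fix under that assumption could look like the following. This is a sketch that keeps the original's localCache field; the point is only that the write is observed before the method returns:

protected virtual void putIntoLocalCache(string cacheKey, T obj)
{
  // The local cache write is fast, so just do it on the calling thread;
  // nothing is left running unobserved for the host to cancel.
  localCache.Put(cacheKey, obj);

  // Or, if the write really must run elsewhere, start it and then wait on it:
  // Task.Factory.StartNew((object item) => localCache.Put(cacheKey, item), obj).Wait();
}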

The alternative is to bleed Task everywhere.  The ICache interface should not be returning void here (well, actually the public method that calls putIntoLocalCache()); it should be returning Task, with the caller waiting only when a sync point needs to be introduced, i.e., when the result is needed.  I'm not a big fan of bleeding internals, but we'll save that for another day.
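
For completeness, here is a sketch of what bleeding Task through the boundary could look like; the ICache shape and the calling method are hypothetical, since the actual interface is not shown here:

public interface ICache
{
  // Expose the write as a Task so the caller chooses where the sync point goes.
  Task PutAsync(string cacheKey, object value);
}

// Somewhere in the calling class:
public async Task SaveAndCacheAsync(ICache cache, string key, object value)
{
  Task cacheWrite = cache.PutAsync(key, value);

  // ... other work that does not depend on the cache write ...

  // Explicit sync point: wait only when the completed write actually matters,
  // and never leave the task unobserved.
  await cacheWrite;
}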
