Vukoje's blog about software development

About data caching

In my Code Project article “Demystifying concurrent lazy load pattern” I explained the common mistakes in lazy load implementations and how to implement a fast concurrent lazy load cache. In this post I will discuss why caching itself is important and what some of the implementation strategies for caching are.


Why caching?


Simply put... performance! A typical business application relies on lots of data, and that data usually resides in a database. Because the data lives outside the application process, accessing it can be much slower. If the external data is on a hard drive (and it usually is), accessing it can be thousands of times slower than accessing data in application memory. The great thing about cached data is that not only will it be a few thousand times faster to access, but it will also scale much better under heavy load than your poor database.

Nowadays, cached data is becoming more and more important with the emergence of high load web sites with many concurrent users. In addition, we see the rise of distributed key-value databases acting as a shared cache.

In our application, we increased the speed of a critical business feature, Order lines import, by about 50 times after introducing data caching. Speed went from 1 row per second to 50 rows per second.

Sounds great, and best of all it is easy: it just takes a little care about your data. Unfortunately, I have seen many problematic implementations of caching that either occasionally break in production or perform very poorly, and those implementations are the main reason for writing this article.


Caching strategies


There are several choices to make when implementing a data cache. Some of the factors that could affect these decisions are:

  • The size of the data.
  • How and how often is the data used?
  • How and how often is the data changed?
  • Can eventual consistency (stale data) be tolerated?


I will give my favorite example. Content Management System (CMS) sites are usually completely dynamic, which means that site menus and page content are created from some model usually defined in a database. This data is rarely changed and is accessed very often. Imagine that this site has a thousand concurrent users and a busy administrator adding new pages every day. It might seem that the data is changed often, but in fact it will be accessed millions of times before it is changed. Once a day might not seem rare, but measured in processor ticks it is an eternity. Also, this data is very small, which means that caching it will not have a big impact on server memory, and if some user does not see new pages for a few minutes it is not that big a deal.

Data loading


In my lazy load pattern article, I was loading all Products at once when building the cache. An alternative to this approach is to load a single Product when it is first requested, if it is not already in the cache. Choosing the right implementation depends on the size of the cached data and its usage pattern.

As I mentioned earlier, Order lines import is a critical process in our application. In this process, we access Supplier data for each Order line being imported, and there can be a few thousand lines per import. This means we will be accessing the Supplier DB table a few thousand times per process. On the other hand, the Supplier table in the DB is quite big, and loading all of it would take too much time and memory. It turns out that in practice, one import process (and all the lines imported in it) needs only a few Suppliers, so caching single Suppliers improved our performance significantly with just 10 lines of code.
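
The single-Supplier approach described above can be sketched roughly as follows. This is only an illustration, not our production code: the Supplier type, the SupplierCache class and the loadFromDb delegate are hypothetical names, and the sketch assumes one cache instance per import process used from a single thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical Supplier type, standing in for the real entity.
public sealed class Supplier
{
    public Supplier(int id, string name) { Id = id; Name = name; }
    public int Id { get; }
    public string Name { get; }
}

// Caches Suppliers one at a time for the duration of a single import run.
// Not thread safe: assumed to be used by one import process only.
public sealed class SupplierCache
{
    private readonly Dictionary<int, Supplier> loaded = new Dictionary<int, Supplier>();
    private readonly Func<int, Supplier> loadFromDb; // assumed DB lookup by id

    public SupplierCache(Func<int, Supplier> loadFromDb)
    {
        this.loadFromDb = loadFromDb;
    }

    public Supplier Get(int supplierId)
    {
        // Hit the database only the first time a given Supplier is requested.
        if (!loaded.TryGetValue(supplierId, out Supplier supplier))
        {
            supplier = loadFromDb(supplierId);
            loaded[supplierId] = supplier;
        }
        return supplier;
    }
}
```

With a few thousand Order lines that reference only a handful of Suppliers, the database is queried once per distinct Supplier instead of once per line.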


Data accessing


Once you load data in cache, how do you access it? Here are some of the alternatives:

  • Sequentially - by iterating through the whole list of Products and looking for a matching Product.
  • Directly - by building some kind of memory index. For example, once we have loaded Products we could build a hash table in which the key is the Product Name and the value is a reference to the Product instance with that name. This would allow us direct access to the Product instance by using its name.
  • Combined - by splitting our Product list into smaller lists by some criteria. For example, we access the list of Products with some Product Type directly using a memory index and then iterate over it. Combined access is a great choice when you have various criteria for data searching and direct access cannot be applied, or when preparing all the memory indexes would be too time or memory consuming.
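
As a rough illustration of the three access styles listed above, here is a minimal sketch. The Product type with Name, Type and Price, and the ProductIndex class, are assumptions made up for this example, and the sketch assumes Product names are unique.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Product type used only for this illustration.
public sealed class Product
{
    public Product(string name, string type, decimal price)
    {
        Name = name; Type = type; Price = price;
    }
    public string Name { get; }
    public string Type { get; }
    public decimal Price { get; }
}

public sealed class ProductIndex
{
    private readonly List<Product> all;
    private readonly Dictionary<string, Product> byName; // index for direct access
    private readonly ILookup<string, Product> byType;    // index for combined access

    public ProductIndex(IEnumerable<Product> products)
    {
        all = products.ToList();
        byName = all.ToDictionary(p => p.Name); // assumes names are unique
        byType = all.ToLookup(p => p.Type);
    }

    // Sequential: scans the whole list on every call.
    public Product FindSequential(string name)
    {
        return all.FirstOrDefault(p => p.Name == name);
    }

    // Direct: a single hash lookup; returns null when the name is unknown.
    public Product FindDirect(string name)
    {
        return byName.TryGetValue(name, out Product product) ? product : null;
    }

    // Combined: jump straight to the Type bucket, then scan only that bucket.
    public Product FindCheapestOfType(string type)
    {
        return byType[type].OrderBy(p => p.Price).FirstOrDefault();
    }
}
```

Each index costs extra memory and build time, so it pays off only for lookups that are actually frequent.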


When choosing your data access approach you have to weigh the performance gain against the added complexity. Here is an example of the performance gain we got after refactoring our Price Calculator cache:

  • Sequential access - 47ms
  • Combined - 0.02ms


Sometimes it is important for some application logic to work with the data from one point in time because otherwise inconsistent data might appear in the application. If this is the case, logic executing over cached data must be aware that the cache might reload at any time. 

For example, say we have a rule that our database must always contain product A or product B; at least one of them must exist. Let us say that at time t0 the DB contains only product B, and at time t1 a transaction occurs in the DB that adds product A and removes product B.

Below is the code demonstrating the potential problem in this case. 


Code sample 1: Inconsistent data cache

// t0: Product A doesn't exist
if (!this.HasA(Cache.GetProducts()))
{
    // t1: cache is reset (A added, B removed)
    if (this.HasB(Cache.GetProducts()))
    {
        // expected path: B exists whenever A does not
    }
    else
    {
        // Product B no longer exists, so the else branch is executed.
        // Reaching this code means that the DB contains neither
        // product A nor B, and this case should not be possible
    }
}

The solution is again simple. When it matters, we store the cached list in a local variable to make sure we work with data from one point in time.


Code sample 2: Consistent data cache

// t0: Product A doesn't exist
List<Product> localProducts = Cache.GetProducts();
if (!this.HasA(localProducts))
{
    // t1: cache is reset (A added, B removed)
    // but it doesn't affect our local variable
    if (this.HasB(localProducts))
    {
        // B is still found, so the data stays consistent
    }
}


Cache scope


It is important to understand who is sharing the cached data, because this influences the cache implementation. The goal is also to narrow down the scope as much as possible, to get naturally partitioned data with a smaller data load and fewer concurrent reads.

Here are the cache scopes I could think of:

  • Multiple applications - The most complex scenario, where data is shared by users of multiple applications. Consider using a third-party distributed key-value store to handle the complexity of a distributed cache. Alternatively, consider having a separate cache for each application to avoid that complexity altogether; the potential problems of this approach are inconsistent data across applications or too much data being loaded.
  • Single application - Data is shared by users of a single application.
  • Database - If you have a single application that connects to different databases depending on which user logs in, then you should probably cache data per database instance and not per application instance.
  • Session - Usually implies user-specific data. The nice thing about this cache is that if you store it in the application session store, it will naturally expire once the session expires (the user logs out or is inactive for some time).
  • Use case - In this case, the cache is usually reset once the form is closed.


Resetting cache



Usually we can reset the cache either by reloading the data or by setting it to null and letting it lazy load on the first data request.

In our examples, we have implemented cache resetting by setting the cached Product list to null.

There are a few reasons why I do not recommend reloading the data when resetting the cache:

  1. It is slower
  2. Loading logic is duplicated
  3. It is a waste of resources. Instead of loading data as needed, we are always loading it.
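
A minimal sketch of the "set to null and lazy load" approach is shown below. The LazyCache class and its members are names invented for this illustration; for the concurrency details of a production-grade implementation see the lazy load article mentioned at the beginning of this post.

```csharp
using System;

// Holds one cached value that is loaded on first use and can be reset cheaply.
public sealed class LazyCache<T> where T : class
{
    private readonly object sync = new object();
    private readonly Func<T> load; // the single place where loading logic lives
    private volatile T data;       // null means "not loaded yet"

    public LazyCache(Func<T> load)
    {
        this.load = load;
    }

    public T Get()
    {
        T snapshot = data; // read the shared field only once
        if (snapshot != null)
        {
            return snapshot;
        }
        lock (sync)
        {
            snapshot = data;
            if (snapshot == null)
            {
                snapshot = load();
                data = snapshot;
            }
            return snapshot;
        }
    }

    // Resetting loads nothing; the next Get() reloads on demand.
    public void Reset()
    {
        lock (sync)
        {
            data = null;
        }
    }
}
```

Reset() never touches the database, and the loading logic exists only in the delegate passed to the constructor, which addresses all three points from the list above.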




The easiest caching scenario is the one where only the application implementing the cache alters the cached data in the DB. In this case, the cache should be reset whenever the application changes the data.

If the data is changed by some other application, then caching becomes trickier. You can consider caching data for some time interval if you can tolerate inconsistent data in between. For example, can we tolerate that our product list is refreshed every 30 minutes and can contain stale data?
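
Interval-based caching like this could be sketched as follows. ExpiringCache is a hypothetical class written for this post, and the optional utcNow parameter is only an assumption that makes the expiry easy to test without waiting.

```csharp
using System;

// Reloads the cached value once it is older than maxAge.
public sealed class ExpiringCache<T> where T : class
{
    private readonly object sync = new object();
    private readonly Func<T> load;
    private readonly TimeSpan maxAge;
    private readonly Func<DateTime> utcNow; // injectable clock (assumption for testing)
    private T data;
    private DateTime loadedAt;

    public ExpiringCache(Func<T> load, TimeSpan maxAge, Func<DateTime> utcNow = null)
    {
        this.load = load;
        this.maxAge = maxAge;
        this.utcNow = utcNow ?? (() => DateTime.UtcNow);
    }

    public T Get()
    {
        lock (sync)
        {
            DateTime now = utcNow();
            // Load on first use, or when the cached copy has become too old.
            if (data == null || now - loadedAt >= maxAge)
            {
                data = load();
                loadedAt = now;
            }
            return data;
        }
    }
}
```

For the 30 minute example above, the cache would be created with TimeSpan.FromMinutes(30); readers may then see data that is up to 30 minutes stale. The simple lock also makes every reader wait while a reload is in progress, which may or may not be acceptable under heavy load.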

In our application, we have a service synchronizing our database with external data sources once or a few times a day. Once this service is done with synchronization, it calls a web service to notify the application that synchronization is done. We use this notification to reset the cached data that might have been updated by the service.

There were several attempts in .NET to have a cache that resets automatically once some table in MS SQL Server changes. We tried this approach a few years ago and it caused us many headaches.




Caching is something you will probably need in your applications, although caching is not perfect and has its downsides. One of the downsides is increased application complexity. You should always weigh the added complexity against the performance gain and add caching only if it significantly increases your performance or scalability.

Do not underestimate the power of a DB query engine and DB indexes, and do not use slow queries as an excuse for caching everything. If your DB data access is slow, maybe you should double check the DB indexes. If you are sure you need the full edition of Oracle to get good DB performance, check the indexes again. Bear in mind that you do not want to end up building your own in-memory query language, because you will fail. A DB is a better option than a cache for large data and for complex data matching criteria.

In our application, we have around 500,000 Products in the DB, and guess what? We are not caching them, because building the cache would take too much time and memory. In addition, we have some advanced Product search queries in T-SQL that we did not want to code in C#. By the way, once we “optimized” our critical Product search DB queries we got an average execution time of 1 s. Once we dug into the indexes and really optimized the queries, we got an average execution time of 0.01 s.


Software requirements


In my earlier posts I wrote about why software should be documented and what we should document, and today I will write about software requirements. Requirements are written documents that describe the system to be developed and serve as a communication tool between customers and developers. Requirements are also thinking tools that help you understand what you need to build, so you don't waste money building the wrong thing.

"The very act of writing a specification forces you to think through the design you thought you had in your head, and helps you see the flaws in it quickly so that you can iterate and try more designs. Teams that use functional specifications have better designed products, because they had the opportunity to explore more possible solutions quickly. They also write code faster, because they have a clearer picture when they start of what’s going to be needed." [Joel on Software]


Approach to writing requirements


You can organize your requirements in a more or less formal or agile fashion, but the main point of requirements isn't the document templates and complex diagrams. The main point is information. I learned this from the first few chapters of Writing Effective Use Cases (Alistair Cockburn). I was expecting to find a template for Use Cases that would help me write better documentation, and instead I found out that the approach to writing documentation is more important than the document template.

We programmers usually see use cases as the boring part of the job that holds back the real work, and we have the urge to begin coding as soon as possible. What happens is that we don't really analyze requirements; we just write them down and discover errors in them while coding, when we have already spent a solid amount of our time building the wrong thing ("Hmmm, this drop down list shouldn't be here..."). If we took the time to think about requirements before implementing them, evaluated hidden problems and scenarios, and validated the basic business values defined in the project vision, we could detect errors earlier and have more stable requirements, which in the end leads to better code and a happier programmer life.

Amount of written requirements


Having no requirements is not a good idea, and you cannot use agile methodologies as an excuse for it. Writing a functional specification is at the very heart of agile development, because it lets you iterate rapidly over many possible designs before you write code. At the other extreme, gold-plating requirements leads to waterfall software development and many of its problems. As always, there is no silver bullet; you have to find the solution that works for you.

What should we document?

There are lots of things that can be documented in software development. That doesn't mean that you should document them all. You should document things that are important and specific, things that everybody working on the project should know and that you will forget if you don't write them down. The most important things that should be documented are:

  1. project vision
  2. requirements
  3. architecture and code

Project Vision

"The Vision summarizes the "vision" of the project. It serves to communicate the big ideas regarding why the project was proposed, what the problems are, who the customers are, what they need, and what the proposed solution looks like. The vision defines the customer’s view of the product to be developed, specified in terms of the key needs and features. Containing an outline of the envisioned core requirements, it provides contractual basis for the more detailed technical requirements." [Craig Larman, Applying UML and Patterns]

The project vision can be a one-page document describing why the project is being built, what the customer needs are and what the firm’s benefit is. Once the project vision is clear to all team members, it will be easier for everyone to focus on the project's business value, and everyone will be able to contribute to the project. After all, we programmers are not there to put buttons on forms; our task is to solve the customer’s problems.

Steve McConnell said it best in his book Code Complete:

"Programmers who remember to consider the business impact of their decisions are worth their weight in gold."

A project vision is so easy to create that it may seem too obvious to write down, but what is obvious now won't be in one year, and what is obvious to the project leader may not seem so obvious to the rest of the team. Writing a one-page document to explain a 6-month project to 20 people shouldn't be a problem.

I will be covering other documentation types in the next posts.

How should we organize software documentation?

In my previous blog post I said something about the reasons for documenting software. Now the question is how to do it. People don't like to write documentation; documentation is often hard to find and maintain, and it is usually obsolete. We need to organize documentation in such a way that it is meaningful, easy to maintain, easy to find and slow to become obsolete. You could say that the right solution is somewhere between full bureaucracy and anarchy.

Process oriented solution

Some people believe that with detailed organization and business processes they can completely control their business. This approach is something like programming humans, who then execute the business process. In this environment a programmer would have to follow an exact documenting process and typically fill big documentation templates with concrete information. This procedure can guarantee that people will avoid documenting at all costs, and it doesn't guarantee that the information filled into the templates will be useful enough. It may give the false impression of fully controlling this vital activity, but in practice it isn't so.

Programmers dislike (hate) writing software documentation because it is not interesting or creative. If they are furthermore crippled by a heavy formal process, programmers will do whatever they can just to finish that part of the work, filling templates with low quality and incomplete information.

If it is not easy for a programmer to find and update some information, he simply won’t do it (unless he is explicitly asked to). As the software evolves and knowledge of the real system increases, documentation becomes obsolete because programmers are not updating it. After enough time has passed, you are faced with massive obsolete documentation that at the end of the day does more harm than good, because you can never know whether the information is accurate.

People oriented solution

My opinion is that a company Wiki site (e.g. ScrewTurn Wiki) is a much better solution for a knowledge base than a bunch of filled-in Word templates resting on some server's file system. Pages on a Wiki site are easier to search and easier to alter, and they can automatically notify all participants when documentation changes. This approach can seem frightening to management because everyone can alter any document they want. In general, employees shouldn't be constrained in their work to the point where they cannot make a mistake. Not being able to make a mistake means that you aren't doing any creative work and that you should be replaced with a machine. Instead, the company should introduce quality checks for activities that are more likely to introduce an error or where an error can be very expensive. Some dedicated person (e.g. the project leader) could inspect all incremental documentation changes from the previous week and verify them. Bear in mind that almost all Wikis automatically store document versions and track changes. This way the person who made a mistake can always be easily identified, and the document can easily be rolled back to a previous version.

After you have created the infrastructure for easy and fast document manipulation, you still need to create a company climate (culture) that motivates people to contribute to the knowledge base. This is the only way that the firm's knowledge will grow and be effectively passed between coworkers. After all, what are knowledge workers without a good knowledge base?

In my next post I will write about what should be documented and how, so if you are interested, subscribe to my blog feed and stay tuned.
What are your experiences with writing software documentation?

Should software be documented?


The first theme I will write about in this blog is software documentation. It feels like a natural starting point, or at least it should be. Documentation must exist in some shape and volume before you start with development. The purpose of documentation is to capture knowledge and protect it from misconception and forgetting. It is important for all participants in the software development life cycle, from the customer, through the PM, developers and testers, to the users. All of this sounds logical, and we have all been taught these things... so what's the problem?


The Problem


The problem with documentation is that creating it can be boring, and documentation can easily become out of date and useless. Documentation also takes a lot of time, which is taken from development. After all, you cannot compile and run your documentation. Because of this, people can easily forget the importance of documenting a highly complex system, where misunderstandings happen on a daily basis, and will easily push it into the background when time becomes a critical factor... and time is often critical. At the end of the day (or deadline), if the needed functionality doesn't exist, the existence of documentation is irrelevant. It is interesting that writing documentation is a problem not only for developers longing for freedom and creativity; it is also a problem for the customer. The customer will usually say that of course he wants his software documented, but at the end of the day he usually won't be willing to participate in its creation, updating and clarification.


The Need


Now that I have covered some of the problems with documentation, the question is "Why should we bother with writing documentation?". One of the main attributes of software is complexity. Software is usually built to automate some business processes and handle some system complexity transparently. We can say that there are several kinds of complexity in software development:

  1. Domain Complexity - the natural complexity of the domain for which the software is made.
  2. Application Complexity - the complexity of the application design. This is artificial complexity that can be added for greater reusability or maintainability, or can be added by accident.
  3. Technology Complexity - the complexity of the tools and technology used for development.


So we already have three sorts of complexity. That is a lot, and any one of them alone can be pretty large. You should also note that the domain can be fairly unknown to the customer himself. He could also fail to explain the domain to the developers, or the developers could misunderstand him.

Application complexity can be considerable even in the initial design on paper, before any coding. As the system grows, requirements change, code changes, and as time runs out developers hack their way through the code... In the worst case, application complexity can reach such a level that no changes are possible and the project is dropped.
As for technology complexity, we all know that technology keeps changing, which doesn't mean it is advancing.


From all of the above we can conclude two things:

  1. Software development is hard.
  2. Software documentation is very much needed.


In the next post I will be covering some advice on writing documentation, so subscribe to the blog feed.