CHETHIYA ABEYSINGHE

@chethiyaa

Writing an LRU cache for a new language

Nov 05, 2019

LRU Cache

Caching is quite a useful concept, especially if you have relatively heavy computations that might not need to be done every single time. Among many different types of caches, LRU caches are a quite popular choice. The basic idea is to retain the most recently used n outcomes that can fit inside the memory allocated for the cache, so those cached outcomes can be used to serve subsequent queries without recomputing them. This is quite a good fit for most practical use cases when:

  • the same query is likely to occur within a reasonably small time window,
  • computed outcomes for queries are time sensitive, i.e. an outcome depends on the time it's computed.

I was recently asked to implement an LRU cache during a coding interview, and I was quite happy that interviewers do ask non-hypothetical questions too. Anyway, this article is not about interviews. It's about an LRU cache implementation for a new programming language called Ballerina.

Current cache implementation of Ballerina in its Standard Library

I had wanted to give Ballerina a try for a long time, but I didn't get a chance. Finally I got some time to dig into their codebase and try it out. I was quite impressed by the fact that they have included an LRU cache implementation in their standard library, and it's shipped together with the compiler. It looks like that caching implementation is consumed in a few other places inside other standard libraries, like the HTTP module.

I looked at their caching implementation and found these issues with it:

  • Its runtime complexity is not optimal for an LRU cache. If the size of the buffer is N and there are M cache queries (assuming the GET/SET ratio is constant), then its runtime complexity is O(NM). An LRU cache can be implemented to run in O(M).
  • It has a parameter called eviction factor f; once the cache is full, it removes the N*f least recently used items. This behavior never utilizes 100% of the memory allocated for the cache; the average utilization is something like (1-f)*100 + (f*100/2) percent (see the short derivation after this list).
  • It compares the timestamps of all items to compute the N*f least recently used items to be evicted. But this depends on the precision of the timestamp, which is currently stored in milliseconds. So if all items in the cache are accessed within the same millisecond, the above computation would be inaccurate.
  • It has a separate task running every 5 seconds to clean up expired cache entries. The way it's implemented, this also introduces a shared lock between all cache instances. Altogether this cleanup process is unnecessary work, as the check can be done during the GET operation alone. It also doesn't make much sense to do this regular cleanup just to save memory. Usually LRU caches are expected to be full all the time; if not, the consumer should consider reducing the amount of memory (i.e. N) allocated for the cache.
  • It seems to be written in a non-thread-safe manner. See the comment I made in an issue here.
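
As a quick aside on the eviction factor bullet above, here is the averaging argument behind that utilization figure (my own back-of-the-envelope reasoning, not something stated in the Ballerina source). Right after an eviction the cache holds (1-f)N items, and it fills back up to N items before the next eviction, so on average it holds

$$
\frac{(1-f)N + N}{2} = \left(1 - \frac{f}{2}\right) N
\quad\Longrightarrow\quad
(1-f)\cdot 100 + \frac{f \cdot 100}{2}\ \text{percent of capacity.}
$$

For example, with f = 0.25 the average utilization is only about 87.5%.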

Here is how the current implementation takes O(NM) time:

Once the cache reaches its capacity, it computes the N*f least recently used entries in the cache. That's done in O(N * N*f), using two nested loops. This could have been done much faster using a heap, but that approach won't help us much to get the overall time complexity down to O(M).

Out of M cache accesses, if we assume the ratio between GET and SET operations is constant, then the above eviction is going to happen O(M/(N*f)) times. Note that the "Big O" eliminates that constant factor.

So the total runtime complexity is O((M/(N*f)) * (N * N*f)) = O(NM).
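
Writing the same calculation as one labeled expression (just restating the factors above):

$$
\underbrace{O\!\left(\tfrac{M}{N f}\right)}_{\text{number of evictions}} \times \underbrace{O(N \cdot N f)}_{\text{cost per eviction}} = O(N M).
$$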

How to do it in O(M)

We should be able to implement an LRU cache so that its time complexity is independent of the cache size. The key idea is to use a separate linked list that keeps entries sorted by their access time. Each entry in the map holds a pointer to the corresponding item in the linked list, so whenever you access an item, you can move it to the front of the list. That way the least recently accessed item is always the tail of the linked list.
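
To make the idea concrete, here is a minimal sketch in Java (the language Ballerina currently runs on top of) of the map plus doubly linked list approach. The names and structure are purely illustrative; this is not the code of my Ballerina module.

import java.util.HashMap;
import java.util.Map;

// Sketch of an O(1)-per-operation LRU cache: a HashMap for lookup, and a
// doubly linked list ordered by recency of access. Evicting means dropping
// the tail node; touching an entry means moving its node to the front.
class LruSketch<K, V> {
    private static final class Node<K, V> {
        K key;
        V value;
        Node<K, V> prev, next;
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    private final int capacity;
    private final Map<K, Node<K, V>> index = new HashMap<>();
    private Node<K, V> head; // most recently used
    private Node<K, V> tail; // least recently used

    LruSketch(int capacity) { this.capacity = capacity; }

    V get(K key) {
        Node<K, V> node = index.get(key);
        if (node == null) return null;
        moveToFront(node);            // O(1): just a few pointer updates
        return node.value;
    }

    void put(K key, V value) {
        Node<K, V> node = index.get(key);
        if (node != null) {           // existing key: update value, mark as used
            node.value = value;
            moveToFront(node);
            return;
        }
        if (index.size() == capacity) {
            index.remove(tail.key);   // evict exactly one entry: the tail
            unlink(tail);
        }
        node = new Node<>(key, value);
        index.put(key, node);
        linkAtFront(node);
    }

    private void moveToFront(Node<K, V> node) {
        if (node == head) return;
        unlink(node);
        linkAtFront(node);
    }

    private void linkAtFront(Node<K, V> node) {
        node.prev = null;
        node.next = head;
        if (head != null) head.prev = node;
        head = node;
        if (tail == null) tail = node;
    }

    private void unlink(Node<K, V> node) {
        if (node.prev != null) node.prev.next = node.next; else head = node.next;
        if (node.next != null) node.next.prev = node.prev; else tail = node.prev;
        node.prev = null;
        node.next = null;
    }
}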

Here is a general explanation of how LRU caches are implemented.

You can find my LRU cache implementation on Ballerina here.

A few important points regarding this specific implementation:

  • It uses time:nanoTime() not because we need nanosecond precision; in fact, there is no guarantee that the underlying hardware clock has nanosecond accuracy. I'm using it simply because time:currentTime().time seems to be 10x slower than time:nanoTime().

By the way, time:nanoTime() does not return the current timestamp since the epoch. It's just a wrapper around System.nanoTime in Java, which returns the time since some arbitrary reference point. It's safe to use here as long as this program is not going to run for 100 years continuously without a restart.

  • Right now Ballerina maps are thread-safe, so their synchronization on map operations is redundant here, as all LRU cache operations need to be synchronized using lock anyway (see the sketch after this list).
  • I initially implemented CacheItem as a Ballerina closed record. But looking at the Ballerina code, it seems records are implemented as hashtables. In the caching implementation, the attributes of CacheItems are accessed all the time, so having to go through a hashtable for every attribute access doesn't make much sense; ideally it would be implemented using something like a C struct. Therefore I changed the CacheItem implementation to use Ballerina objects, which are similar to Java classes.
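
To illustrate the redundant-synchronization point above, here is a tiny Java sketch of the pattern I mean (again illustrative, not the module's actual code): when one coarse lock guards every cache operation, the inner map does not need to be thread-safe itself.

import java.util.HashMap;
import java.util.Map;

// With every public operation guarded by one coarse lock, a plain HashMap is
// enough; a thread-safe map would only add a second, redundant layer of locking.
class CoarseLockCache<K, V> {
    private final Map<K, V> entries = new HashMap<>();

    synchronized V get(K key) {
        // the whole lookup (plus any recency bookkeeping) is atomic under this lock
        return entries.get(key);
    }

    synchronized void put(K key, V value) {
        entries.put(key, value);
    }
}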

Performance

Here's a comparison of the run time of my O(M) implementation and the O(NM) implementation in the Ballerina Standard Library:

Capacity (N) | O(M) Cache run-time (s) | O(NM) Cache run-time (s)
-------------|-------------------------|--------------------------
5            | 11.876                  | 21.133
10           | 15.760                  | 17.318
20           | 13.688                  | 18.829
40           | 16.180                  | 26.507
80           | 17.271                  | 46.618
160          | 17.502                  | 78.205
320          | 17.851                  | 142.091
640          | 18.129                  | 275.218
1000         | 18.435                  | 423.605

You can find the program used to measure these numbers here.

Measured times are real elapsed time, not CPU time, which I couldn't find a way to measure in Ballerina.

As you can see, the run-time of the cache implementation in the Ballerina Standard Library grows linearly with N, but it stays quite stable for smaller N, especially for values below 20 or so. I guess this is probably due to CPU caching, since all the operations happen within a small array when it computes the entries to evict.

It looks like there's a slight increase in time as N increases for the O(M) implementation too. To verify this I tried a few different values of N and got the following results:

Capacity (N)                  | O(M) Cache run-time (s)
------------------------------|----------------------------
10^2                          | 10.470
10^3                          | 10.545
10^4                          | 13.518
10^5                          | 19.555
10^6 (with default heap size) | GC overhead limit exceeded
10^6 (with 256MB)             | 44.177
10^6 (with 512MB)             | 21.131
10^6 (with 1024MB)            | 18.156

For N = 10^6 with the default heap size, it crashes with the following error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

I increased the Java heap size and got the rest of the results after the failed test. Here we observe that as the heap size increases, the run-time performance also improves. I think this is due to the additional work needed to allocate space for objects within a small heap versus a larger heap.

So the slight increase in run-time as N grows is due to the additional memory allocation work it has to do within the same heap size.

In my opinion it would have been nice if the compiler could handle 1 million records (about 60 bytes per record) in a map within a 128MB heap. It would be quite interesting to implement the same cache in Java and see whether it can handle 1 million items within 128MB. Anyway, it's not a big issue, as the language doesn't seem to worry about those aspects too much right now.

How to use

You can find my implementation as a module at Ballerina Central.

You can add it as a dependency in your project by editing its .toml file, e.g.

[dependencies]
"chethiya/cache" = "0.2.1"

Run the following command to pull the module:

ballerina pull chethiya/cache

Now you are ready to use LRUCache in your code. Simply import the module and create an instance of it in your .bal file, e.g.

import ballerina/http;
import ballerina/log;
import chethiya/cache;

// Keep 1000 search results and expire them 1 minute after setting
// the search results.
cache:LRUCache cachedResults = new(1000, 60000, false);

service hello on new http:Listener(9090) {
  resource function sayHello(http:Caller caller, http:Request req) {
    var query = getQuery(req);
    SearchResult? result = cachedResults.get(query);
    if (result is ()) {
      result = computeResult(query);
      cachedResults.put(query, result);
    }
    var res = caller->respond(result);
    if (res is error) {
        log:printError("Error sending response", res);
    }
  }
}

If you have any issues regarding this implementation, please report them here.

LRU means Least Recently Used. That's the eviction policy, the policy used to remove items from the cache to make room for new items.

If outcomes are not time sensitive, then caching most frequently used outcomes might be a better choice.

You can find some examples here. See how easy it is to create a simple service with it here. It'll run in a pre-allocated thread pool, and you don't have to worry about setting those up.

So this implementation doesn't scale well as the cache size grows, i.e. the higher the number of outcomes you want to cache, the slower it'll perform.

This is the total time taken for eviction, which is the dominant part of the total time.

GeeksforGeeks is a really good resource if you want to revise or learn algorithms and data structures. It's also quite a good resource if you are preparing for coding interviews.

time:currentTime() creates a record by allocating memory, and right now records are hashtables. time:nanoTime() just returns an integer, so it is much faster.

I think the default heap size of Ballerina is 128MB.

Do export JAVA_OPTS="-Xmx512m" to increase the heap size.

A CacheItem takes ~60 bytes assuming strings have 4 bytes for each character.

It's a public package repository for Ballerina.

Note that the initializer is a little different here compared to the Stdlib Cache.

Note how it can expire entries based on time, considering only the SET time, irrespective of GET times.
