Cache coherency in RingIO

From Texas Instruments Wiki

Overview

RingIO is one of the more complex methods the SysLink software product provides for passing data between processors. This document describes cache-related rules and guidelines when using RingIO. For reference, see the RingIO API Reference Guide.

There are three shared buffers owned by any given RingIO instance:

  • The control buffer, which contains state information about a RingIO instance
  • The attribute buffer, which stores the attributes set by the writer
  • The data buffer, which is the ring buffer that holds the data exchanged between the writer and the reader.

Each of these buffers may reside in a memory region that is cached. Cache coherency of the control buffer is automatically maintained by the module since the buffer is purely used by the module internally. Coherency of the attribute buffer is similarly maintained by the RingIO module since attributes are physically copied into and out of the attribute buffer during write and read operations. Where things get tricky is when dealing with the data buffer, as the user application has direct access to this buffer. Care must be taken to avoid data corruption introduced due to cache usage.

Note that this discussion targets the case where the data buffer is in cached memory. If it resides in non-cached memory, cache coherency operations are not necessary. In that case, simply clear the 'RingIO_DATABUF_MAINTAINCACHE' bit in the 'flags' field of the RingIO_openParams structure passed to RingIO_open(); the rest of this discussion then does not apply.

In cases where the reader and writer reside on different cores, it is possible that only one client operates on cached memory. In such cases follow the rules in the discussion only for the side where cache is enabled.

Supported use-cases

Here are the rules to follow when using RingIO with the data buffer in cached memory:

1. Work using cache-line-multiple sizes. This applies to all RingIO_acquire() and RingIO_release() function calls, and to the size of the data buffer itself. It is by far the *cleanest* use-case, where the reader and writer can maintain cache coherency within their acquired buffers with no side-effects on adjacent data.

If #1 is not possible, and arbitrary buffer sizes need to be acquired/released:

2. When reader and writer are on the same core (i.e. sharing a common cache), turn the 'RingIO_DATABUF_MAINTAINCACHE' bit off for both clients. The application can then manage the cache by itself, which is simpler when a single core is involved.

3. When reader and writer are on different cores, turn the 'RingIO_DATABUF_MAINTAINCACHE' bit on for the clients that access the data buffer using the CPU. Ideally, the writer should access the RingIO buffers it acquires either exclusively with the CPU or without using the CPU at all, but not both. Otherwise, it needs to ensure RingIO_acquire() and RingIO_release() are called with sizes that are a multiple of the cache line size at all times.

4. When in-place processing is performed in the reader, in addition to rule 3, here are some additional rules when reader and writer are on different cores:

  • The writer must call RingIO_acquire() and RingIO_release() with cache-aligned sizes at all times.
  • If the reader accesses acquired RingIO buffers directly using the CPU, it needs to call RingIO_acquire() with multiple-of-cache-line sizes at all times as well.

If both reader and writer are on the same core, you should do at least one of the following beyond rule 2:

  • Stick to using strictly CPU-only accesses or non-CPU accesses when accessing the data buffer
  • Call RingIO_acquire()/RingIO_release() with sizes that are a multiple of the cache line size.

5. When using a footbuffer, in addition to rules 2, 3, and 4, the following additional conditions must be satisfied:

  • For reader and writer on same core: if either client accesses the data buffer using non-CPU accesses, both reader and writer must call RingIO_acquire()/RingIO_release() with sizes that are a multiple of cache line size, and the application must perform cache writeback and invalidation of the acquired buffer prior to the non-CPU access. Given this may be restrictive, consider leaving the data buffer in non-cached memory instead as an alternative.
  • For reader and writer on different cores: if cache is enabled on the reader's core, create a data buffer whose size is a multiple of the cache line size, and do *not* configure the writer with 'RingIO_NEED_EXACT_SIZE'.

6. Exercise caution when the application performs cache operations on an acquired buffer whose size is not a multiple of the cache line size. Such operations should be avoided whenever possible. If they are absolutely necessary, think through their effects, as they may impact other acquired buffers, in either client, that share the same cache line(s).


If you are interested in how these rules came about, read on.

What does RingIO do internally to assist with cache coherency of the data buffer?

When the bit 'RingIO_DATABUF_MAINTAINCACHE' in the 'flags' field of the RingIO_openParams structure passed to RingIO_open() is set, RingIO performs the minimal/basic cache operations to ensure writer and reader can reside on separate cores. Only two functions in the RingIO API perform cache operations on the data buffer:

  • RingIO_acquire(), when called by the reader client
  • RingIO_release(), when called by the writer client

RingIO_acquire() (when called by the reader) performs cache invalidation on the read portion of the buffer prior to returning it to the application, whereas RingIO_release() (when called by the writer) performs cache writeback and invalidation on the written portion of the buffer to commit the data to shared physical memory.

RingIO also performs supplementary cache coherency operations when a footbuffer is used. Refer to the discussion regarding the footbuffer for more details.

Any additional cache coherency operations, if required, become the application's responsibility.
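The acquire/release behaviour described above can be illustrated with a toy model. The sketch below is not SysLink code: it simulates a single cache line, one private copy per core, and every name in it (ToyLine, toy_wbinv(), toy_inv(), toy_exchange()) is hypothetical. It only shows why a writeback on the writer side and an invalidation on the reader side are both needed before the reader can see new data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_LINE_SIZE 8u   /* toy cache line size in bytes (assumption) */

/* One core's private copy of a single cache line (toy model). */
typedef struct {
    uint8_t bytes[TOY_LINE_SIZE];
    bool    valid;
    bool    dirty;
} ToyLine;

/* The one line of shared physical memory that both cores map. */
static uint8_t toy_shared[TOY_LINE_SIZE];

/* Load the line from shared memory if this core has no valid copy. */
static void toy_fill(ToyLine *c)
{
    if (!c->valid) {
        memcpy(c->bytes, toy_shared, TOY_LINE_SIZE);
        c->valid = true;
        c->dirty = false;
    }
}

static uint8_t toy_read(ToyLine *c, size_t off)
{
    assert(off < TOY_LINE_SIZE);
    toy_fill(c);
    return c->bytes[off];
}

static void toy_write(ToyLine *c, size_t off, uint8_t value)
{
    assert(off < TOY_LINE_SIZE);
    toy_fill(c);
    c->bytes[off] = value;   /* lands in this core's cache only */
    c->dirty = true;
}

/* Writeback and invalidate: the operation attributed to the writer's release. */
static void toy_wbinv(ToyLine *c)
{
    if (c->valid && c->dirty)
        memcpy(toy_shared, c->bytes, TOY_LINE_SIZE);
    c->valid = false;
    c->dirty = false;
}

/* Invalidate only: the operation attributed to the reader's acquire. */
static void toy_inv(ToyLine *c)
{
    c->valid = false;
    c->dirty = false;
}

/* Returns the byte the reader sees after the writer stores 42 at offset 0. */
static uint8_t toy_exchange(bool maintain_cache)
{
    ToyLine writer = {0};
    ToyLine reader = {0};
    memset(toy_shared, 0, sizeof toy_shared);

    (void)toy_read(&reader, 0);   /* reader already caches the old value 0 */
    toy_write(&writer, 0, 42);
    if (maintain_cache) {
        toy_wbinv(&writer);       /* writer side, at release time */
        toy_inv(&reader);         /* reader side, at acquire time */
    }
    return toy_read(&reader, 0);
}
```

In this model, toy_exchange(true) lets the reader observe the new value, while toy_exchange(false) leaves it reading its stale cached copy.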

Best practice

To keep things simple, it is best to use data sizes that are multiples of the cache line size in RingIO_acquire() and RingIO_release() calls. For example, if the cache line size is 128 bytes, then RingIO_acquire()/RingIO_release() should be called with sizes that are a multiple of 128, such as 256 or 1024. This is because cache operations are performed on entire cache lines: when a cache invalidation or writeback is done, it affects any data sharing the same cache line(s) as the region being acquired or released. Furthermore, in cases where the application wants to perform cache maintenance on acquired buffers, doing so on a per-buffer basis becomes more straightforward.
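As a minimal sketch of this practice, the helpers below round a requested size up or down to a multiple of the cache line size before it is passed to an acquire or release call. The function names are hypothetical and not part of the RingIO API; the line size is assumed to be supplied by the application.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of line_size that is >= size. */
static size_t round_up_to_line(size_t size, size_t line_size)
{
    assert(line_size != 0);
    size_t rem = size % line_size;
    return rem == 0 ? size : size + (line_size - rem);
}

/* Largest multiple of line_size that is <= size. */
static size_t round_down_to_line(size_t size, size_t line_size)
{
    assert(line_size != 0);
    return size - (size % line_size);
}
```

With a 128-byte line, a writer that needs room for 200 bytes would request round_up_to_line(200, 128), i.e. 256 bytes, while a reader that finds 1000 valid bytes could limit itself to round_down_to_line(1000, 128), i.e. 896 bytes.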

If the cache line size differs between the two clients (e.g. 128 bytes on the writer's core and 1024 bytes on the reader's core), use the least common multiple of the two sizes, which is 1024 in this example.
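The least common multiple can be computed once at start-up. The sketch below is a plain C helper with hypothetical names; it assumes both line sizes are known to the application and non-zero.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor (Euclid's algorithm). */
static size_t gcd_size(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Least common multiple of the writer's and reader's cache line sizes. */
static size_t common_line_multiple(size_t writer_line, size_t reader_line)
{
    assert(writer_line != 0 && reader_line != 0);
    return (writer_line / gcd_size(writer_line, reader_line)) * reader_line;
}
```

Since cache line sizes are normally powers of two, the result is simply the larger of the two sizes in practice.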

Guidelines when interacting with the data buffer when data sizes cannot be a multiple of cache line size (no in-place processing on reader side)

In cases where non-cache-line-multiple sizes are used, the application needs to be careful when calling RingIO_acquire() (from reader) and RingIO_release() (from writer). It is important to make sure the 'RingIO_DATABUF_MAINTAINCACHE' bit is correctly set when opening the writer and/or reader. Let's first examine what happens in the standard case where the reader only reads from the ring buffer without doing any in-place processing that requires writing to it.

Remote case: writer and reader on different cores not sharing the same cache

In the remote case, the writer performs RingIO_release() on one core and the reader performs RingIO_acquire() on another core. Let us consider the most generic case where the writer has acquired a small buffer that is smaller than the cache line size, and that this buffer is somewhere in the middle of the cache line, so there is space that is not part of the buffer both at the beginning and the end of the cache line.


Fig. 1 Example scenario where cache coherency could be problematic
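To make the scenario concrete, the sketch below computes which bytes a cache operation really touches when it is applied to a buffer that does not start or end on a line boundary. The names are hypothetical, and offsets are assumed to be relative to a data buffer whose base address is itself cache-line aligned.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open byte range [start, end) covered by whole cache lines. */
typedef struct {
    size_t start;
    size_t end;
} LineSpan;

/* Range of bytes affected by a cache operation on [offset, offset + size). */
static LineSpan maintenance_span(size_t offset, size_t size, size_t line_size)
{
    LineSpan span;
    assert(line_size != 0);

    if (size == 0) {              /* nothing is touched */
        span.start = offset;
        span.end = offset;
        return span;
    }
    span.start = offset - (offset % line_size);      /* round start down */
    size_t last = offset + size;
    size_t rem = last % line_size;
    span.end = rem == 0 ? last : last + (line_size - rem);  /* round end up */
    return span;
}
```

For a 32-byte buffer at offset 40 with 128-byte lines, the span is [0, 128): the 40 bytes before the buffer and the 56 bytes after it are written back or invalidated together with it.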

When the buffer is released by the writer, cache writeback is performed on the cache line when the 'RingIO_DATABUF_MAINTAINCACHE' bit is set for the writer client. The content outside of the buffer that resides on the same cache line is written back as well. The space before/after the buffer is in one of three states: (1) acquired by the reader, (2) acquired by the writer, or (3) not acquired by either reader or writer.

1. If space is acquired by reader, it may only read from it (assuming no in-place processing is done; in-place processing on the reader side is discussed later in this document). This data was released previously by the writer, hence it must already have been written back and is coherent on the writer side. So there is no issue with writing back valid data once more.

2. If space is acquired by the writer itself, the cache writeback may corrupt data that the writer has already placed there (e.g. via DMA). Hence it is the responsibility of the writer to ensure cache coherency here with respect to itself. Ideally, it should access the RingIO buffers it acquires either exclusively with the CPU or without using the CPU at all, but not both. Otherwise, it needs to ensure RingIO_acquire() and RingIO_release() are called with sizes that are a multiple of the cache line size at all times. If the writer only ever accesses the data buffer using DMA, consider placing the data buffer in memory that is non-cacheable from the writer's side for best performance.

3. If space is not yet acquired by either the reader or writer, it may either contain valid data or be empty. If it is empty, there is no issue with the writeback. If it contains valid data, the reader has not yet acquired and used the space, so writing back a coherent cache line causes no issue.
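Case 2 above, where the writer mixes CPU and DMA accesses within one cache line, can be reproduced in a toy model. The sketch below is purely illustrative and uses hypothetical names: memory is two 8-byte lines, the CPU holds a cached copy of line 0 and writes to it, a simulated DMA transfer writes straight to memory, and the release then writes the whole of line 0 back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_LINE_SIZE 8u   /* toy cache line size in bytes (assumption) */

/* Returns the byte found in memory at dma_offset after the writeback. */
static uint8_t toy_release_with_dma(size_t dma_offset)
{
    uint8_t memory[2 * TOY_LINE_SIZE] = {0};
    uint8_t cached_line0[TOY_LINE_SIZE];

    assert(dma_offset < sizeof memory);

    /* CPU touches buffer A at offset 2: line 0 is loaded into the cache. */
    memcpy(cached_line0, memory, TOY_LINE_SIZE);
    cached_line0[2] = 0xAA;               /* CPU write, held in cache only */

    /* DMA fills buffer B, bypassing the cache. */
    memory[dma_offset] = 0xBB;

    /* Releasing buffer A writes back the whole of line 0. */
    memcpy(memory, cached_line0, TOY_LINE_SIZE);

    return memory[dma_offset];
}
```

When the DMA target shares line 0 with the CPU-written buffer (for example offset 5), the stale cached bytes overwrite it. When it lies in the next line (for example offset 13), which is what line-multiple sizes guarantee, the DMA data survives.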

Now let's examine the case where the reader is acquiring a small buffer in the middle of a cache line. Cache invalidation is automatically performed on the entire cache line. The space that is not owned by the buffer is in one of three states: (1) acquired by the writer, (2) acquired by the reader, or (3) not acquired by either reader or writer.

1. If space is acquired by writer, it may be working on it. But there is no issue since the invalidation does not affect the writer side.

2. If space is acquired by reader, then the reader may be in the midst of reading valid data from it. Again no issue here except for a minor performance degradation if the reader reads from a cached location that got invalidated indirectly.

3. If space is neither yet acquired by the reader nor the writer, then invalidation of the empty/valid space is not an issue.

Hence the cache invalidation is not problematic in this case.

Local case: writer and reader on same core (or cores sharing the same cache)

In the local case, the writer performs RingIO_release() on the same core as where the reader performs RingIO_acquire() (or at least both share the same cache). If both writer and reader access the data buffer either exclusively with the CPU or without using the CPU at all, then the 'RingIO_DATABUF_MAINTAINCACHE' bit should be cleared. This is because cache maintenance operations are not necessary in either case.

If there is a mixture of accesses, it is best to leave the bit cleared as well. The application then has to manage cache coherency on its own, just as it normally would when CPU and DMA accesses are performed outside of the context of RingIO. As this cache coherency maintenance can get quite complex and costly (in some cases synchronization between reader and writer may be unavoidable), it is best to avoid this scenario whenever possible, or to simplify things by having RingIO_acquire() and RingIO_release() operate on multiples of the cache line size at all times, as described in the Best practice section.
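The flag guidance for the remote and local cases (without in-place processing) can be condensed into a small decision helper. This is only a sketch of the recommendations in this document: the function and parameter names are hypothetical, and it does not call the RingIO API. Its result would decide whether the application sets or clears the 'RingIO_DATABUF_MAINTAINCACHE' bit in the flags passed to RingIO_open() for that client.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true if, following the guidelines above, a client should open
 * RingIO with the cache-maintenance bit set for the data buffer.
 */
static bool should_set_maintain_cache(bool data_buffer_cached,
                                      bool shares_cache_with_peer,
                                      bool accesses_with_cpu)
{
    if (!data_buffer_cached)
        return false;   /* non-cached memory: no coherency operations needed */
    if (shares_cache_with_peer)
        return false;   /* local case: the application manages the cache */
    return accesses_with_cpu;   /* remote case: set for CPU-accessing clients */
}
```

For example, a reader on a different core that reads the cached data buffer with the CPU gets true, while either client in the same-core case gets false regardless of how it accesses the buffer.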

Guidelines when there is in-place processing on reader side

In general, in-place processing leads to more complex cache coherency issues, hence it should be avoided unless absolutely necessary. When reader does in-place processing and performs write access to acquired buffers, additional requirements are needed to ensure cache coherency on both sides.

Remote case: writer and reader on different cores not sharing the same cache

In the remote case, the writer performs RingIO_release() on one core and the reader performs RingIO_acquire() on another core. Let us consider the extreme case where the writer has acquired a small buffer that is smaller than the cache line size, and that this buffer is somewhere in the middle of the cache line, so there is space that is not part of the buffer both at the beginning and the end of the cache line. When the 'RingIO_DATABUF_MAINTAINCACHE' bit is set for both clients and the buffer is released by the writer, cache writeback is performed on the cache line. The content outside of the buffer that resides on the same cache line is written back as well. The space before/after the buffer is in one of three states: (1) acquired by the reader, (2) acquired by the writer, or (3) not acquired by either reader or writer.

1. If space is acquired by reader, it may be working on it. When the RingIO_release() call does a cache writeback, it may corrupt the data on the reader side. Hence the writer must call RingIO_release() with cache-aligned sizes at all times to avoid this condition. RingIO_acquire() should also be called with cache-aligned sizes as cache lines may be written back by the system during routine cache eviction.

2. If space is acquired by the writer itself, then the reader cannot affect this cache line via in-place processing, since it is effectively owned by the writer.

3. If space is not yet acquired by either the reader or writer, it may either contain valid data or be empty. If it is empty, no issue with the writeback. If it contains valid data, since the reader has not yet acquired and used the space, there shouldn't be any effect introduced by in-place processing in the reader.

Now let's examine the case where the reader is acquiring a small buffer in the middle of a cache line. Cache invalidation is automatically performed on the entire cache line. The space that is not owned by the buffer is in one of three states: (1) acquired by the writer, (2) acquired by the reader, or (3) not acquired by either reader or writer.

1. If space is acquired by writer, it may be working on it. But the invalidation does not affect the writer since it is not sharing the same cache.

2. If space is acquired by reader, then the reader may have written data to it. The invalidation might then discard that cached data before it reaches memory. Hence it is the responsibility of the reader to ensure cache coherency here with respect to itself, so that it does not inadvertently invalidate data it is working on when calling RingIO_acquire(). If it accesses acquired RingIO buffers directly using the CPU, it needs to write back the data prior to doing a new RingIO_acquire() call. Alternatively, it is also safe if it ensures RingIO_acquire() is called with sizes that are a multiple of the cache line size at all times.

3. If space is not yet acquired by the reader nor the writer, then invalidation of the empty/valid space is not an issue.

After acquiring the space, however, the reader may perform in-place processing and write to the acquired area using the CPU. In this case, the cache line becomes dirty and will eventually be written back. During the writeback, if the writer has acquired and written to an area sharing the same cache line as the data being written back, data corruption may result. In short, the reader too has to call RingIO_acquire() with multiple-of-cache-line sizes if it uses the CPU to write to the acquired buffers.
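Whether a reader-acquired region and a writer-acquired region are exposed to this hazard comes down to whether they share a cache line. The sketch below, with hypothetical names, checks this for two byte ranges given as offsets into a data buffer that is assumed to start on a cache line boundary. It could serve as a debug-time check in an application.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if [a_off, a_off + a_size) and [b_off, b_off + b_size) touch a common cache line. */
static bool regions_share_line(size_t a_off, size_t a_size,
                               size_t b_off, size_t b_size,
                               size_t line_size)
{
    assert(line_size != 0);
    if (a_size == 0 || b_size == 0)
        return false;             /* an empty region touches no line */

    /* Index of the first and last cache line touched by each region. */
    size_t a_first = a_off / line_size;
    size_t a_last  = (a_off + a_size - 1) / line_size;
    size_t b_first = b_off / line_size;
    size_t b_last  = (b_off + b_size - 1) / line_size;

    return a_first <= b_last && b_first <= a_last;
}
```

With 128-byte lines, a reader region of 100 bytes at offset 0 and a writer region of 50 bytes at offset 100 share line 0, so a writeback from either side can clobber the other. Two back-to-back regions of 128 bytes each do not share a line.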

Local case: writer and reader on same core (or cores sharing the same cache)

In the local case, the writer performs RingIO_release() on the same core as where the reader performs RingIO_acquire() (or at least both share the same cache).

As in the case where there is no in-place processing, the recommendation is to avoid setting the 'RingIO_DATABUF_MAINTAINCACHE' bit when opening the clients. The application then manages cache coherency on its own, if required. Again, keep things simple by sticking to strictly CPU-only access or non-CPU access, and/or by calling RingIO_acquire()/RingIO_release() with sizes that are a multiple of the cache line size. In the first case, no cache coherency operations are needed, while in the second case they are certain to be safe.

Cache coherency when foot buffer is used

When a RingIO instance is configured to use a footbuffer, RingIO_acquire() on the reader side may internally copy data from the top of the data buffer to the footbuffer. When that happens, and the 'RingIO_DATABUF_MAINTAINCACHE' bit for the reader is set, RingIO_acquire() performs the following sequence of operations:

  • invalidate the top of the data buffer
  • if the destination address does not start on a cache line boundary, invalidate the full cache line containing the destination address
  • perform the copy using the CPU
  • write the footbuffer contents from cache back to memory and invalidate its cached counterpart.
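The sequence above can be sketched as follows. This is not the RingIO implementation: real cache operations are replaced by a hypothetical toy_record() stub that merely logs each step, all names are invented for illustration, and the layout (destination given as an offset into one contiguous array that holds the data buffer followed by the footbuffer) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { OP_INV, OP_COPY, OP_WBINV } ToyOp;

typedef struct {
    ToyOp  op;
    size_t offset;
    size_t size;
} ToyOpRecord;

static ToyOpRecord toy_log[8];
static size_t      toy_log_len;

/* Stand-in for a real cache operation: only records what would be done. */
static void toy_record(ToyOp op, size_t offset, size_t size)
{
    assert(toy_log_len < sizeof toy_log / sizeof toy_log[0]);
    toy_log[toy_log_len++] = (ToyOpRecord){ op, offset, size };
}

/* Copies copy_size bytes from the top of buf to buf + dst_offset. */
static void toy_fill_footbuffer(uint8_t *buf, size_t dst_offset,
                                size_t copy_size, size_t line_size)
{
    assert(line_size != 0);
    assert(dst_offset >= copy_size);      /* source and destination are disjoint */
    toy_log_len = 0;

    /* 1. Invalidate the top of the data buffer. */
    toy_record(OP_INV, 0, copy_size);

    /* 2. Unaligned destination: invalidate the line that contains it. */
    if (dst_offset % line_size != 0)
        toy_record(OP_INV, dst_offset - (dst_offset % line_size), line_size);

    /* 3. Perform the copy using the CPU. */
    memcpy(buf + dst_offset, buf, copy_size);
    toy_record(OP_COPY, dst_offset, copy_size);

    /* 4. Write back and invalidate the footbuffer contents. */
    toy_record(OP_WBINV, dst_offset, copy_size);
}
```

With an 8-byte line, a destination offset of 20 triggers the extra invalidation of the line starting at offset 16, whereas an aligned destination such as 24 skips step 2.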

Let's examine the different scenarios we have covered so far, but with the addition of the footbuffer.

Both writer and reader are local and share the same cache

In this case, our recommendation was to turn the 'RingIO_DATABUF_MAINTAINCACHE' bit off for both reader and writer. So the cache maintenance will not be done. One important thing to pay attention to here is that the read from the top of the data buffer will cause the content of external memory to be pulled into the cache, and the write to the footbuffer area will go into the cache rather than in external memory. So if the writer and/or reader accesses the data buffer using non-CPU accesses, it would require the application to perform cache writeback and invalidation of the accessed area prior to the data transfer.

However, this cache maintenance leads to other problems. For instance, consider the case where the reader writes back and invalidates an area in the footbuffer that shares a cache line with a buffer acquired by the writer. The writeback may still corrupt the writer's buffer if the latter's content changes between the time the footbuffer is copied and the time it is written back by the application.

The solution in this case is to either work with sizes that are a multiple of the cache-line size, or of course to put the data buffer into non-cacheable memory if that is an option.

The writer and reader are on different cores

In this case, our recommendation was to turn the 'RingIO_DATABUF_MAINTAINCACHE' bit on for both reader and writer, provided the cache is turned on for both clients. In particular, if the cache is enabled on the reader side, the cache maintenance is performed during the call to RingIO_acquire() on the reader core. In some cases, the cache operations may affect neighboring cache lines. Unless the user follows certain rules, writing back data from cache to external memory may cause stale data to be flushed out to an area acquired by the writer, or by the reader doing in-place processing with non-CPU accesses. It is therefore important that the cache writeback occurs on a cache line boundary. If reader and writer work with arbitrary sizes, the only way to ensure a safe writeback is to satisfy both of the following conditions:

1. The total data buffer size must be a multiple of cache line size;

2. The footbuffer cannot be copied while there is an early end condition introduced by specifying 'RingIO_NEED_EXACT_SIZE' on the writer side.

To understand why this is necessary, look at the following figure:


Fig. 2 The benefit of ensuring data is copied to foot buffer on cache boundary

By following 1 and 2, we ensure the writeback occurs on a cache line boundary as illustrated in the bottom part of the diagram.

We are less concerned regarding the cache invalidation operations, as the reader typically reads from the buffer in normal operation. If in-place processing is done on the reader side, the in-place processing rules discussed previously would need to be followed.

Summary

So to reiterate, based on our analysis, here are the rules you should follow when cache is enabled for the data buffer (keep in mind that using non-cached memory would lift these restrictions, so that is always an option):

1. Work using cache-line-multiple sizes. This applies to all RingIO_acquire() and RingIO_release() function calls, and to the size of the data buffer itself. It is by far the *cleanest* use-case, where the reader and writer can maintain cache coherency within their acquired buffers with no side-effects on adjacent data.

If #1 is not possible, and arbitrary buffer sizes need to be acquired/released:

2. When reader and writer are on the same core (i.e. sharing a common cache), turn the 'RingIO_DATABUF_MAINTAINCACHE' bit off for both clients. The application can then manage the cache by itself, which is simpler when a single core is involved.

3. When reader and writer are on different cores, turn the 'RingIO_DATABUF_MAINTAINCACHE' bit on for the clients that access the data buffer using the CPU. Ideally, the writer should access the RingIO buffers it acquires either exclusively with the CPU or without using the CPU at all, but not both. Otherwise, it needs to ensure RingIO_acquire() and RingIO_release() are called with sizes that are a multiple of the cache line size at all times.

4. When in-place processing is performed in the reader, in addition to rule 3, here are some additional rules when reader and writer are on different cores:

  • The writer must call RingIO_acquire() and RingIO_release() with cache-aligned sizes at all times.
  • If the reader accesses acquired RingIO buffers directly using the CPU, it needs to call RingIO_acquire() with multiple-of-cache-line sizes at all times as well.

If both reader and writer are on the same core, it is recommended to do at least one of the following beyond rule 2:

  • Stick to using strictly CPU-only accesses or non-CPU accesses when accessing the data buffer
  • Call RingIO_acquire()/RingIO_release() with sizes that are a multiple of the cache line size.

5. When using a footbuffer, in addition to rules 2, 3, and 4, the following additional conditions must be satisfied:

  • For reader and writer on same core: if either client accesses the data buffer using non-CPU accesses, both reader and writer must call RingIO_acquire()/RingIO_release() with sizes that are a multiple of cache line size, and the application must perform cache writeback and invalidation of the acquired buffer prior to the non-CPU access. Given this may be restrictive, consider leaving the data buffer in non-cached memory instead as an alternative.
  • For reader and writer on different cores: if cache is enabled on the reader's core, create a data buffer whose size is a multiple of the cache line size, and do *not* configure the writer with 'RingIO_NEED_EXACT_SIZE'.

6. Exercise caution when the application performs cache operations on an acquired buffer whose size is not a multiple of the cache line size. Such operations should be avoided whenever possible. If they are absolutely necessary, think through their effects, as they may impact other acquired buffers, in either client, that share the same cache line(s).