Cache Management

From Texas Instruments Wiki

What Should I know about Cache Coherence?

The following are common customer issues in DaVinci and OMAP environments, though they're present in any multi-core system which utilizes cache.

General Conventions

  • Definitions:
    • cache writeback - Wb
    • cache invalidate - Inv
    • cache writeback-invalidate - WbInv
  • The application and/or framework manages cache. The XDAIS algorithm does not directly manage cache, although it may indicate how it accessed memory so the app/framework knows how to behave.
  • An XDAIS algorithm precondition for cached buffers is that all buffers it receives (IN, OUT, and IN/OUT) must be cache-evicted. Stated another way, the physical memory must contain the buffer's "latest" contents, and the cache must be Inv'd.

Input Buffers (application captures/generates the buffer and passes it to the algorithm)

  • Before calling into an algorithm (e.g. process(), control(), any IMOD fxn):
    • The application (e.g. GPP app) must make the physical memory coherent with any cache before each process()/control() call. It must also Inv any input buffers. (Note that if any buffers are not cached to the application, these cache operations are not needed.)
      • If the input buffer was filled with CPU writes, the buffer must be WbInv.
      • If the input buffer was filled by HW that doesn't go through cache (e.g. DMA writes by a driver), the buffer need only be Inv.
    • Note that Codec Engine doesn't perform any of these steps in any of the VISA stubs. Primarily, this is because Codec Engine doesn't know how the buffer was filled, so it can't know the Right Thing to Do.
      • If this is not done correctly, the remote algorithm CPU/DMA accesses may access incoherent data in external memory, with no ability to Wb the application processor's cache.
    • If the algorithm runs on a different processor than the app, the algorithm processor must Inv the input buffers before each process()/control() call.
    • Note that Codec Engine performs this Inv in the VISA skeletons.
      • If this is not done correctly, the algorithm's CPU reads may obtain stale data from its cache.
  • After the call into the algorithm returns, input buffers need no further cache maintenance. The rationale is that input buffers are read-only to the algorithm, so there will be no writes in the cache lines.
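
The application-side choice between Inv and WbInv described above can be sketched as a small decision function. This is only an illustration: the enum and function names are hypothetical and are not part of XDAIS or Codec Engine, and the actual cache calls depend on the OS or framework in use.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; not part of XDAIS or Codec Engine. */
typedef enum { FILLED_BY_CPU, FILLED_BY_DMA } FillKind;
typedef enum { CACHE_OP_NONE, CACHE_OP_INV, CACHE_OP_WBINV } CacheOp;

/* Pick the app-side cache operation for an input buffer
 * before a process()/control() call. */
static CacheOp input_buffer_op(bool cached_to_app, FillKind fill)
{
    assert(fill == FILLED_BY_CPU || fill == FILLED_BY_DMA);

    if (!cached_to_app) {
        return CACHE_OP_NONE;  /* uncached buffers need no maintenance */
    }
    if (fill == FILLED_BY_CPU) {
        return CACHE_OP_WBINV; /* CPU writes may still sit in dirty lines */
    }
    return CACHE_OP_INV;       /* DMA bypassed the cache; drop stale lines */
}
```

The application would then issue the returned operation over the buffer's full address range before calling into the algorithm.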

Output buffers (algorithm processes the input buffers and generates output buffers for the application)

  • Before calling into an algorithm, the application must Inv any output buffers. (Again, if any buffers are not cached to the application, these cache operations are not needed.)
  • After the algorithm returns:
    • The application must make the output buffers coherent with respect to physical memory and cache.
      • If it's unknown how the algorithm accessed the buffer, it's always safe (though possibly unnecessary) to Wb the output buffer. If the algorithm filled the output buffer with CPU writes, the Wb is necessary. If it was filled via HW (e.g. DMA), the Wb will do nothing since the buffer was Inv before the call, and there will be no dirty cache lines.
      • In XDM 1.x, the XDM1_SingleBufDesc data type (and by extension XDM1_BufDesc) provides an .accessMask field describing how the algorithm accessed the buffer. The application can utilize this field to avoid unnecessary Wb.
      • If the algorithm runs on a different processor than the application, the algorithm processor must also Wb the output buffers to provide them to the application processor.
        • Note that, as above, if the algorithm processor knows the algorithm didn't write to the output buffer using CPU (e.g. via XDM1_SingleBufDesc.accessMask), it can avoid this unnecessary Wb.
        • Note that Codec Engine performs this Wb in the remote processor's VISA skeletons. And in the XDM 1.x VISA skeletons, it uses the .accessMask to determine whether the Wb is needed.
      • Note that, if the algorithm runs on a remote processor, the application's Wb call will be a no-op since no cache lines will be dirty.
      • If this is not done correctly, the application may access stale data resident in its cache.
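
The rule for skipping an unnecessary Wb of an output buffer can be sketched as follows. The bit values below are hypothetical stand-ins chosen for this example; the real XDM headers define their own encoding and helpers for .accessMask, which should be used in actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in bits for this sketch. The real XDM headers
 * define their own encoding for the .accessMask field. */
#define ACCESS_CPU_READ  0x1u
#define ACCESS_CPU_WRITE 0x2u

/* Decide whether an output buffer must be written back after the
 * algorithm returns. */
static bool output_needs_wb(bool access_known, unsigned access_mask)
{
    if (!access_known) {
        return true;  /* always safe, though possibly unnecessary */
    }
    /* Only CPU writes can leave dirty lines; DMA writes cannot. */
    return (access_mask & ACCESS_CPU_WRITE) != 0u;
}
```

When the function returns false, the buffer was filled by HW (or not written at all), so the Wb would have been a no-op anyway.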

Input/Output Buffers (application and/or algorithm can read and/or write to the buffer)

  • Generally,
    • Before calling the algorithm, manage the buffers like IN buffers.
    • After calling the algorithm, manage the buffers like OUT buffers.

DMA-Related

If the GPP or DSP uses DMA to access the shared buffers, then there is more work to ensure coherence. XDAIS provides some Rules for frameworks: DMA Rules 6, 7 and 8. (See: http://www-s.ti.com/sc/techlit/spru352)

  • C6000 algorithms must not issue any CPU read/writes to buffers in external memory that are involved in DMA transfers. This also applies to the input buffers passed to the algorithm through its algorithm interface.
  • All "application side input and output buffers" that are passed as arguments to the algorithm process functions must be:
    • aligned on a cache line boundary (128-byte cache line size for L2 Cache on C6000), and
    • a multiple of the cache line length in size.

Note that if these alignment and size constraints are violated, any data object allocated adjacent to the application buffer will share a cache line with a portion of the app buffer. This has the potential to corrupt the portion of the app buffer residing in the shared cache line. Corruption may happen as a side effect of the Cache Controller writing back the shared cache line to evict it while it is dirty due to (valid) CPU write accesses to the non-app-buffer data objects. This would be a very nasty bug to track down!
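
A minimal sketch of checking these two constraints, assuming the 128-byte line size quoted above. The helper names are hypothetical; a real application would typically get suitably aligned buffers from its allocator rather than check after the fact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* L2 line size on C6000, per the text above. */
#define CACHE_LINE_SIZE ((size_t)128)

/* Round a requested size up to a whole number of cache lines. */
static size_t round_up_to_line(size_t nbytes)
{
    return (nbytes + CACHE_LINE_SIZE - 1u) / CACHE_LINE_SIZE * CACHE_LINE_SIZE;
}

/* True if the buffer starts on a line boundary and covers whole lines,
 * so that no other data object can share a cache line with it. */
static bool buffer_is_line_safe(const void *buf, size_t nbytes)
{
    assert(buf != NULL);
    return ((uintptr_t)buf % CACHE_LINE_SIZE == 0u) &&
           (nbytes % CACHE_LINE_SIZE == 0u);
}
```

For example, a 1000-byte payload would be allocated as round_up_to_line(1000), i.e. 1024 bytes, from an allocator that returns 128-byte-aligned memory.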

Common Cache errors

Finally, for completeness, these are common errors frameworks make when dealing with cache coherence:

  • (1) Doing a WbInv for "input" buffers on the remote processor, instead of just Inv, "before" process/control calls:
    • If any of the "current" input buffers have been referenced in a "previous" process/control call, then a stale fragment of that buffer may already be resident in the cache. Any Wb will corrupt the "current" input buffer with stale data from cache!
  • (2) Doing a blind "ALL L2 Cache" WbInv, instead of operating on just the algorithm's own input/output buffers:
    • This will also potentially create problems for other algorithm instances, whose input/output buffers will be affected as in (1).
  • (3) Doing an Inv of all of L2:
    • This discards any data in the cache, which will likely throw away data that shouldn't be discarded. Additionally, it will severely degrade performance for all algorithm instances, due to the resulting cache misses.
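
Error (1) can be illustrated with a toy model of a single write-allocate cache line sitting in front of one line of external memory. This is a simplified simulation written for this page, not a model of any particular TI cache controller; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LINE_BYTES 4

/* Toy model: one cache line in front of one line of external memory. */
typedef struct {
    unsigned char mem[LINE_BYTES];   /* external memory */
    unsigned char line[LINE_BYTES];  /* cached copy */
    bool valid;
    bool dirty;
} ToyCache;

/* Load the line from memory on a miss. */
static void toy_fill(ToyCache *c)
{
    if (!c->valid) {
        memcpy(c->line, c->mem, LINE_BYTES);
        c->valid = true;
        c->dirty = false;
    }
}

static void cpu_write(ToyCache *c, int i, unsigned char v)
{
    assert(i >= 0 && i < LINE_BYTES);
    toy_fill(c);
    c->line[i] = v;
    c->dirty = true;
}

static unsigned char cpu_read(ToyCache *c, int i)
{
    assert(i >= 0 && i < LINE_BYTES);
    toy_fill(c);
    return c->line[i];
}

/* DMA goes straight to external memory and bypasses the cache. */
static void dma_write(ToyCache *c, int i, unsigned char v)
{
    assert(i >= 0 && i < LINE_BYTES);
    c->mem[i] = v;
}

static void cache_wb(ToyCache *c)
{
    if (c->valid && c->dirty) {
        memcpy(c->mem, c->line, LINE_BYTES);
        c->dirty = false;
    }
}

static void cache_inv(ToyCache *c)
{
    c->valid = false;
    c->dirty = false;
}

static void cache_wbinv(ToyCache *c)
{
    cache_wb(c);
    cache_inv(c);
}

/* Returns what the algorithm CPU reads for byte 0 of a fresh input buffer. */
static unsigned char read_new_input(bool use_wbinv)
{
    ToyCache c = {0};
    cpu_write(&c, 0, 0xAA);  /* an earlier call left a dirty line behind */
    dma_write(&c, 0, 0x55);  /* new input arrives in external memory */
    if (use_wbinv) {
        cache_wbinv(&c);     /* stale 0xAA is written over the new input */
    } else {
        cache_inv(&c);       /* stale line is dropped; memory is intact */
    }
    return cpu_read(&c, 0);
}
```

In this model, read_new_input(false) returns the new input byte 0x55, while read_new_input(true) returns the stale 0xAA because the writeback overwrote the freshly delivered data.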

Also please note that the DSP-side L2 Cache line size on DM644x is 128 bytes.

Codec Engine Cache Maintenance

There are some details described above for how some parts of Codec Engine (specifically, the VISA Server-side skeletons) manage cache. This section consolidates some of that information and presents it for the Codec Engine user.

Codec Engine's only cache management is done in the Server-side skeletons - that is, on the remote processor, when managing data buffers for remote algorithms. Codec Engine never manages the cache for application side buffers and/or local algorithms; this is always the responsibility of the application. See the sections above for some reasons why generic frameworks can't do this right in all cases.

In Codec Engine environments, the following general, HW-agnostic statement can be made:

The application must manage the application-side processor's cache for any data buffers that are cached to the application processor.

Historically, the ARM-side data buffers (typically acquired from CMEM or video drivers) were not cached to the application processor, so applications didn't have to manage the cache. This general lack of app exposure to cache details is why it's commonly misperceived that "Codec Engine handles cache". This is a misperception: CE doesn't, in general, handle cache for the application. However, CE will take care of the cache if these data buffers end up being given to remote algs.
