Saturday 8 November 2014

Achieving Better Embedded Software Performance Through Memory Layout Optimization

In Software Engineering for Embedded Systems, Mike Brogioli of Polymathic Consulting provides tips on various memory layout optimization techniques that can be used to improve embedded system performance.

In order to obtain sufficient levels of performance, application developers and software systems architects must not only select the appropriate algorithms to use in their applications, but also the means by which those applications are implemented. Quite often this also crosses the line into data structure design, layout and memory partitioning for optimal system performance.

It is true that senior developers often have insight into both algorithms and their complexity, as well as a toolbox of tips and tricks for memory and data structure optimization. At the same time, the scope of most embedded software engineering projects prohibits hand optimization of all code and data due to time, resource and cost constraints.

As such, developers must often rely on the tools as much as possible to optimize the general use cases, only resorting to hand-level tuning and analysis to tweak performance on those performance-critical bottlenecks after the initial round of development.

This last round of optimization often entails using various system profiling metrics to determine performance-critical bottlenecks, and then optimizing those portions of the application by hand using proprietary intrinsics or assembly code, in some cases rewriting performance-critical kernel algorithms and/or related data structures. This article follows on from the topics discussed in Part 1 and details design decisions that may prove useful to embedded system developers concerned with the issues mentioned above.

Overview of memory optimization
Memory optimizations of various types often benefit both the run-time performance and the power consumption of a given embedded application. As mentioned previously, these optimizations can be performed to varying degrees by the application build tools: compilers, assemblers, linkers, profilers and so forth. Alternatively, it is often valuable for developers to tune the application by hand, or to design with memory-system optimization in mind a priori, whether to meet given performance targets or to make the software architecture amenable to automated-tool optimization in subsequent phases of the development cycle.

In tuning a given application, quite often a baseline or “out of box” version of the application is developed first. Once functionality is brought online, the development team may elect to profile the application for bottlenecks that require further optimization. Often these are known without profiling, for example when certain kernels within the application must execute within a given number of clock cycles, as determined by a spreadsheet or pen-and-paper exercise during system definition. Once these key kernels or key data structures are isolated, optimization is typically carried out by experts with knowledge of software optimization techniques, compiler optimizations, the hardware target and perhaps details of the hardware target’s instruction set.

Focusing optimization efforts
Amdahl’s law plays an important, though not always appreciated, role in the optimization of full application stacks. If only 10% of the dynamic run-time of a given application can benefit from SIMD or instruction-level parallel optimizations, while the other 90% must execute sequentially, then inordinate effort spent parallelizing that 10% will still yield only modest performance improvements. Conversely, if 90% of the total application’s dynamic run-time is spent in code regions exhibiting large amounts of instruction-level and data-level parallelism, it may be worthwhile to focus engineering effort on parallelizing those regions to obtain improved dynamic run-time performance.
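As a minimal sketch of that arithmetic (the 4x speedup factor is assumed purely for illustration), Amdahl’s law gives an overall speedup of 1/((1 - p) + p/s) when a fraction p of the run-time is accelerated by a factor s:

     #include <stdio.h>

     // Amdahl's law: overall speedup when a fraction p of the
     // dynamic run-time is accelerated by a factor s and the
     // remaining (1 - p) runs unchanged.
     static double amdahl_speedup(double p, double s)
     {
          return 1.0 / ((1.0 - p) + (p / s));
     }

     int main(void)
     {
          // 10% of run-time parallelized 4x: ~1.08x overall
          printf("%.2fx\n", amdahl_speedup(0.10, 4.0));
          // 90% of run-time parallelized 4x: ~3.08x overall
          printf("%.2fx\n", amdahl_speedup(0.90, 4.0));
          return 0;
     }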

In determining which portions of the code dominate the dynamic application run-time, and may therefore be the best candidates for hand optimization or for adjustment to suit automated-tool optimization, application developers typically use a software profiler in conjunction with either the silicon target or a software-based system simulator. Intel’s VTune is one such profiling framework; alternatively, the GNU GCC compiler and gprof are open-source solutions that provide dynamic run-time information. Many silicon vendors such as Freescale Semiconductor and Texas Instruments also offer proprietary solutions for their respective silicon targets, allowing traces to be collected either on software-based simulation platforms or, alternatively, at larger application-level scale on the native silicon target.

Vectorization and the dynamic code-to-compute ratio
Vectorization of loops is an optimization whereby computation performed across multiple loop iterations is combined into single vector instructions, effectively increasing the amount of computation performed per instruction issued during the application’s dynamic run-time. Consider the example in Figure 5.

     short a[16], b[16], c[16];
     for (int iter = 0; iter < 16; ++iter)
     {
          // results in a single 16-bit MPY instruction per
          // iteration in the generated assembly listing
          //
          a[iter] = b[iter] * c[iter];
     }

     short a[16], b[16], c[16];
     for (int iter = 0; iter < 16; iter += 4)
     {
          // with high-level compiler vectorization, results in
          // a 4-way parallel 4x16-bit multiply vector
          // instruction, effectively performing the
          // computation of four iterations of the loop in
          // a single atomic SIMD4 instruction
          //
          a[iter:iter+4] = b[iter:iter+4] * c[iter:iter+4];
     }


Figure 5: Loop-level vectorization example

In the first loop nest, we can see that each iteration of the loop contains a single 16-bit by 16-bit multiply instruction whose result is written to the a[] array as output. One multiplication instruction is performed for each iteration of the loop, resulting in 16 16-bit multiplications. The second loop, however, shows pseudocode for how the compiler or application developer might vectorize the loop when targeting an architecture that supports a four-way SIMD multiply instruction over 16-bit integer elements.

In this case, the compiler has vectorized multiple iterations of the loop together into a single multiply instruction, as indicated by the array[start_range:end_range] pseudocode syntax in the second loop nest. Note that the loop counter is now incremented by the vector length on each iteration. Only four iterations of the loop are now needed to compute the resulting output array, as each iteration contains a single vector multiply instruction that computes four elements of the output in parallel.

There are many benefits to vectorizing code in this manner, whether by hand, using intrinsics that are proprietary to the target architecture, or automatically by the compiler. One such benefit is increased performance: the code now exploits dedicated SIMD hardware, often yielding a speedup over the non-vectorized loop on the order of the width of the underlying SIMD vector hardware.
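As an illustration of the hand-vectorized intrinsics route, the sketch below assumes an x86 target with SSE2, whose _mm_mullo_epi16 intrinsic multiplies eight 16-bit elements per instruction; any other SIMD instruction set would follow the same pattern with its own intrinsics, and the function name vec_multiply is purely illustrative:

     #include <emmintrin.h>  // SSE2 intrinsics

     void vec_multiply(short a[16], const short b[16],
                       const short c[16])
     {
          // Each iteration issues one 8-way 16-bit multiply,
          // so two iterations cover all 16 elements instead
          // of 16 scalar multiplies.
          for (int iter = 0; iter < 16; iter += 8)
          {
               __m128i vb = _mm_loadu_si128((const __m128i *)&b[iter]);
               __m128i vc = _mm_loadu_si128((const __m128i *)&c[iter]);
               _mm_storeu_si128((__m128i *)&a[iter],
                                _mm_mullo_epi16(vb, vc));
          }
     }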

Another benefit is a reduction in code size: rather than unrolling loops into long sequences of scalar instructions, denser vector instructions are used. This may have the secondary benefit of reducing the number of instruction-fetch transactions that go out to memory. Lastly, the amount of computation performed per dynamically issued instruction increases as well.

There are a number of challenges for both the development tools and the application developer when trying to vectorize code at the loop level. One such challenge is the code shape of loop nests that are candidates for vectorization. Build tools typically need to understand the iteration space of a loop, so using constant loop bounds rather than run-time-computed values may be beneficial, depending on the sophistication of the underlying compiler’s vectorization technology, as the contrast below illustrates.
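For example (a hypothetical contrast, since actual behavior depends on the compiler in use; the function name scale_arrays is illustrative), a loop whose trip count is a compile-time constant is generally easier to vectorize than one whose bound arrives at run time:

     void scale_arrays(short a[16], short b[16], short c[16], int n)
     {
          // Trip count fixed at compile time: the vectorizer
          // knows exactly 16 iterations execute and can carve
          // them into whole vector operations.
          for (int i = 0; i < 16; ++i)
               a[i] = b[i] * c[i];

          // Trip count known only at run time: the vectorizer
          // may need extra prologue/epilogue code, or may give
          // up entirely, depending on its analysis.
          for (int i = 0; i < n; ++i)
               a[i] = b[i] * c[i];
     }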

Secondly, the types of computation performed within the loop nest must be amenable to vectorization. The example above performs simple 16-bit integer multiplication for a target architecture assumed to support a four-way 16-bit SIMD multiply instruction. If the underlying target architecture only supports 8-bit SIMD multiplication, it may be advantageous to avoid 16-bit multiplication wherever possible if vectorization is desired.

Loop dependence analysis is another concern when vectorizing or parallelizing loop nests, as the compiler must be able to prove the safety of loop transformations. Loop dependence analysis is the means by which the compiler or dependence analyzer determines whether statements within a loop body form a dependence with respect to array accesses and data modifications, various data reduction patterns, simplification of loop-independent portions of the code and management of various conditional execution statements within the loop body.

As an example, consider the fragment of C-language code in Figure 6.

     int iter_a, iter_b;
     for (iter_a = 0; iter_a < LOOP_BOUND_A; ++iter_a)
        for (iter_b = 0; iter_b < LOOP_BOUND_B; ++iter_b)
           a[iter_a+4-iter_b] =
              b[2*iter_a-iter_b] + iter_a*iter_b;


Figure 6: Fragment of C language code

For the loop above, the compiler’s data dependence analyzer will attempt to find all possible dependences between the statements that write to array a[] and read from array b[]. To guarantee safety, the analyzer must be able to prove independence explicitly; in other words, any dependence that cannot be proven false must be assumed to be true.

Data dependence analysis shows independence between references by proving that no two instances of the statements access or modify the same location in array a[]. In the event that a possible dependence is found, loop dependence analysis will attempt to characterize it, as some types of optimizations over the loop nest may still be possible and profitable. It may also be possible to further transform the loop nest so as to remove the dependence.

In summary, writing loop nests so that a minimum of data dependences exists between array references will benefit vectorization and other loop transforms. While the compiler technology for analyzing data dependences and autovectorizing serial code stems from the supercomputing community, improperly written code with troublesome data dependences and loop structure may still thwart the vectorization efforts of even the most advanced tool sets.
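A minimal illustration of the difference: the first loop below carries no dependence between iterations and is a straightforward vectorization candidate, while the second reads a value written by the previous iteration and cannot be vectorized naively (the function name is illustrative):

     void dependence_examples(short a[16], short b[16], short c[16])
     {
          // Independent iterations: no element is both read and
          // written across iterations, so the loop vectorizes.
          for (int i = 0; i < 16; ++i)
               a[i] = b[i] + c[i];

          // Loop-carried dependence: iteration i reads a[i-1],
          // which iteration i-1 wrote, serializing the loop.
          for (int i = 1; i < 16; ++i)
               a[i] = a[i-1] + b[i];
     }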

At a high level, simply writing code which is easiest for humans to understand usually produces code that is easiest for the vectorizer to understand as well, as the vectorizer and data dependence analyzers can easily recognize what the programmer intended. In other words, highly hand-tuned code with a priori knowledge of the underlying target architecture is not the best candidate for automated vectorization at the tools level.

There are a number of things that application developers may want to keep an eye out for when developing code with the intent of autovectorization by the build tools.

Pointer aliasing in C
One challenge for vectorizers and data dependence analysis is the use of pointers in the C language. When data is passed to a function via pointer parameters, it is often difficult or impossible for the data dependence analyzer and subsequent vectorizer to guarantee that the memory regions pointed to by the various pointers do not overlap within the iteration spaces of the loops in which they are used. As the C standard has evolved, support for the restrict keyword has been added (standardized in C99), as can be seen in the example in Figure 7.

     #define LIMIT 64   // illustrative loop bound

     void restrict_compute(int * restrict a,
                           int * restrict b,
                           int * restrict c)
     {
          for (int i = 0; i < LIMIT; ++i)
               a[i] = b[i] * c[i];
     }


Figure 7: C Standard “restrict” keyword

Placing the restrict qualifier on the pointers passed to the procedure is a promise to the compiler that the data accessed through a given restrict-qualified pointer is not aliased by anything else the function may modify through another pointer. Note that this applies only to the function at hand, not to the global scope of the application itself. It permits the data dependence analyzer to recognize that the arrays are not aliased or modified by references with other side effects, allowing more aggressive optimization of the loop nest, including vectorization among other optimizations.
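A hypothetical call site (the surrounding function is illustrative) makes the contract visible: the restrict qualifiers are the programmer’s promise that the three regions do not overlap, and breaking that promise results in undefined behavior:

     int x[LIMIT], y[LIMIT], z[LIMIT];

     void caller(void)
     {
          restrict_compute(x, y, z);    // OK: three disjoint arrays

          // restrict_compute(x, x, z); // would violate the restrict
          //                            // contract: a and b alias the
          //                            // same array, so the compiler's
          //                            // assumptions no longer hold
     }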

Source: http://www.embedded.com/design/debug-and-optimization/4436459/Achieving-better-embedded-software-performance-through-memory-layout-optimization-Part-2
