In this article, excerpted from Software Engineering for Embedded Systems, Mike Brogioli of Polymathic Consulting provides tips on various memory layout optimization techniques that can be used to improve embedded
system performance. In order to obtain sufficient levels of
performance, application developers and software systems architects must
not only select the appropriate algorithms to use in their
applications, but also the means by which those applications are
implemented. Quite often this also crosses the line into data structure
design, layout and memory partitioning for optimal system performance.
It
is true that senior developers often have insight into both algorithms
and their complexity, as well as a toolbox of tips and tricks for memory
optimization and data structure optimization. At the same time, the
scope of most embedded software engineering projects prohibits hand optimization of all code and data, due to time, resource and cost constraints.
As such, developers must often rely on the tools as
much as possible to optimize the general use cases, only resorting to
hand-level tuning and analysis to tweak performance on those
performance-critical bottlenecks after the initial round of development.
This last round of optimization often entails using various
system profiling metrics to determine performance-critical bottlenecks,
and then optimizing these portions of the application by hand using
proprietary intrinsics or assembly code, and in some cases rewriting
performance-critical kernel algorithms and/or related data structures.
This article follows on from the topics discussed in Part 1 and details design decisions that may prove useful for embedded system developers concerned with the issues mentioned above.
Overview of memory optimization
Memory
optimizations of various types are often beneficial to the run-time
performance and even power consumption of a given embedded application.
As was mentioned previously, these optimizations can often be performed
to varying degrees by the application build tools such as compilers,
assemblers, linkers, profilers and so forth. Alternatively, it is often valuable for developers to go into the application and either manually tune performance or design with memory-system optimization in mind a priori, either to meet given performance targets or to make the software architecture amenable to automated-tool optimization in subsequent phases of the development cycle.
In
tuning a given application, quite often the baseline or “out of box”
version of the application will be developed. Once functionality is
brought online, the development team or engineers may elect to profile
the application for bottlenecks that require further optimization. Often
these are known without profiling, if certain kernels within the
application must execute within a given number of clock cycles as
determined by a spreadsheet or pen and paper exercise during system
definition. Once these key kernels or key data structures are isolated, optimization typically begins, carried out by those experts with knowledge of software optimization techniques, compiler optimizations, the hardware target and perhaps details of the hardware target's instruction set.
Focusing optimization efforts
Amdahl’s law plays an interesting role in the optimization of full application stacks, and is not always appreciated by the software system developer. If only 10% of the dynamic run-time of a given application
can benefit from SIMD or instruction-level parallel optimizations versus
the 90% of dynamic run-time that must be executed sequentially, then
inordinate amounts of effort on parallelizing the 10% portion of the
code will still only result in modest performance improvements.
Conversely, if 90% of the total application’s dynamic run-time is spent
in code regions exhibiting large amounts of instruction-level
parallelism and data-level parallelism, it may be worthwhile to focus
engineering effort on parallelizing these regions to obtain improved
dynamic run-time performance.
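To make the arithmetic concrete, Amdahl's law gives the overall speedup as

speedup = 1 / ((1 - P) + P/S)

where P is the fraction of dynamic run-time that benefits from the optimization and S is the speedup achieved on that fraction. Plugging in the 10%/90% split above: with P = 0.1, even an unbounded S caps the overall gain at 1/0.9, roughly 1.11x, while with P = 0.9 and a four-way SIMD speedup of S = 4, the overall gain is 1/(0.1 + 0.9/4), roughly 3.1x.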
In determining those portions of
the code which dominate the dynamic application run-time, and may be
the best candidates for either hand optimization or hand adjustment for
applicability to automated-tool optimization, application developers
typically use a software profiler in conjunction with either the silicon
target or software-based system simulation. Intel's VTune is one such example of a profiling framework; alternatively, the GNU GCC compiler and
GPROF are open-source solutions that provide dynamic run-time
information. Many silicon vendors such as Freescale Semiconductor and
Texas Instruments also offer their own proprietary solutions for use
with their respective silicon targets, allowing for either traces
collected on software-based simulation platforms, or alternatively
larger application-level traces that can be collected on the native
silicon target.
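As a minimal sketch of the GCC/GPROF flow (the toy kernel, file name and flags below are illustrative, not taken from the original text):

/* hotspot.c - toy kernel for demonstrating gprof-based profiling */
#include <stdio.h>

static long accumulate(const short *v, int n)
{
    long sum = 0;
    for(int i = 0; i < n; ++i)
        sum += (long)v[i] * v[i];   /* candidate hotspot */
    return sum;
}

int main(void)
{
    static short data[1 << 16];
    long total = 0;
    for(int rep = 0; rep < 1000; ++rep)
        total += accumulate(data, 1 << 16);
    printf("%ld\n", total);
    return 0;
}

/* Build with profiling instrumentation, run, then inspect the profile:
 *   gcc -pg -O2 hotspot.c -o hotspot
 *   ./hotspot                  (writes gmon.out to the working directory)
 *   gprof hotspot gmon.out     (flat profile and call graph per function)
 */

The flat profile ranks functions by the share of dynamic run-time they consume, which is precisely the input needed to apply Amdahl's law as described above.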
Vectorization and the dynamic code: compute ratio
Vectorization of loops is an optimization whereby computation performed across multiple loop iterations can be combined into single vector instructions, effectively increasing the amount of computation performed per instruction issued within the application's dynamic run-time behavior. Consider the example
in
Figure 5.
short a[16], b[16], c[16];
for(iter=0; iter<16; ++iter)
{
// results in single 16-bit MPY instruction
// generated in assembly listing
//
a[iter] = b[iter] * c[iter];
}
short a[16], b[16], c[16];
for(iter=0; iter<16; iter+=4)
{
// with high level compiler vectorization,
// results in 4-WAY parallel 4x16-BIT multiply
// vector instruction, effectively performing the
// computation of four iterations of the loop in
// a single atomic SIMD4 instruction.
//
a[iter:iter+4] = b[iter:iter+4] * c[iter:iter+4];
}
Figure 5: Loop level vectorization example
In
the first loop nest, we can see that each iteration of the loop
contains a single 16-bit by 16-bit multiply instruction whose result is
written to the a[] array as output. One multiplication instruction is
performed for each iteration of the loop, resulting in sixteen 16-bit
multiplications. The second loop, however, shows pseudocode for how the
compiler or application developer might vectorize the loop when
targeting an architecture that supports a four-way SIMD multiply
instruction over 16-bit integer elements.
In this case, the compiler has vectorized multiple iterations of the loop together into the multiply instruction, as shown by the array[start_range:end_range] syntax in the second loop nest. Note that the loop counter is now incremented by the vector length on each iteration. Clearly only four iterations of the loop are now needed to compute the resulting output array, as each iteration of the loop now contains a single vector multiply instruction that computes four elements of the output vector in parallel.
There are many benefits to vectorizing code in this manner, whether by hand, with the application developer using intrinsics that are proprietary to the target architecture, or by the compiler when it is able to vectorize the code. One such benefit is the increase in performance, as the code now exploits dedicated SIMD hardware, often providing an improvement over the non-vectorized loop on the order of the width of the underlying SIMD vector hardware.
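As a concrete illustration of the hand-intrinsics route, the sketch below uses Intel SSE2 (an assumed target, not one named in this article), whose _mm_mullo_epi16 instruction multiplies eight 16-bit lanes at once rather than the four lanes assumed in Figure 5:

#include <emmintrin.h>   /* SSE2 intrinsics */

short a[16], b[16], c[16];

void mul16_sse2(void)
{
    /* Two eight-wide vector multiplies cover all sixteen elements. */
    for(int iter = 0; iter < 16; iter += 8)
    {
        __m128i vb = _mm_loadu_si128((const __m128i *)&b[iter]);
        __m128i vc = _mm_loadu_si128((const __m128i *)&c[iter]);
        /* low 16 bits of each 16-bit x 16-bit product */
        _mm_storeu_si128((__m128i *)&a[iter], _mm_mullo_epi16(vb, vc));
    }
}

Note the trade-off: the intrinsic version is tied to one instruction set, whereas the loop in Figure 5 remains portable and relies on the compiler to vectorize it.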
Another benefit is a reduction in code size: loops no longer need to be unrolled into long sequences of scalar instructions, which bloats the code; instead, denser vector instructions replace multiple atomic scalar instructions. This may have the secondary benefit of reducing the number of instruction fetch transactions that go out to memory. Lastly, the amount of computation performed per dynamically issued instruction within the application increases as well.
There are a number of challenges facing both the development tools and the application developer when trying to vectorize code at the loop level. One such challenge is the code shape of loop nests that are candidates for vectorization. Typically, build tools need to understand the iteration space of a loop, so using constant loop bounds rather than run-time-computed values may be beneficial, depending on the maturity of the underlying compiler's vectorization technology, as the sketch below illustrates.
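For instance, consider the following contrast between a compile-time trip count and a run-time one (the function names and the constant N are illustrative):

#define N 1024   /* trip count fixed at compile time */

/* Trip count known statically: the vectorizer can choose the vector
   factor and generate any remainder loop at compile time. */
void scale_fixed(short *dst, const short *src)
{
    for(int i = 0; i < N; ++i)
        dst[i] = src[i] * 2;
}

/* Trip count known only at run time: the compiler must emit run-time
   checks and remainder handling, or may decline to vectorize at all. */
void scale_dynamic(short *dst, const short *src, int n)
{
    for(int i = 0; i < n; ++i)
        dst[i] = src[i] * 2;
}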
Secondly, the types of computation performed within the loop nest must be amenable to vectorization. For example, the loop nest above performs simple 16-bit integer multiplication for a target architecture assumed to support a four-way 16-bit SIMD multiply instruction. If the underlying target architecture only supports 8-bit SIMD multiplication, it may be advantageous to avoid 16-bit multiplication wherever possible if vectorization is desired.
Loop dependence analysis is another concern when vectorizing or parallelizing loop nests, as the compiler must be able to prove the safety of loop transformations. Loop dependence analysis is the means by which the compiler or dependence analyzer determines whether statements within a loop body form dependences with respect to array accesses and data modifications; it also encompasses various data reduction patterns, simplification of loop-invariant portions of the code, and management of conditional execution statements within the loop body.
As an example, consider the fragment of C-language code in
Figure 6.
for(iter_a=0; iter_a<LOOP_BOUND_A; ++iter_a)
    for(iter_b=0; iter_b<LOOP_BOUND_B; ++iter_b)
        a[iter_a+4-iter_b] =
            b[2*iter_a-iter_b] + iter_a*iter_b;
Figure 6: Fragment of C language code
For the loop above, the compiler's data dependence analyzer will attempt to find all possible dependences between the statements that write to array a[] and read from array b[]. To guarantee safety, the analyzer must be conservative: any dependence that cannot be explicitly proven false must be assumed to be true.
Data dependence analysis shows independence between references by proving that no two instances of the statements access or modify the same location in array a[]. In the event that a possible dependence is found, loop dependence analysis will attempt to characterize it, as some types of optimization over the loop nest may still be possible and profitable. It may also be possible to further transform the loop nest so as to remove the dependence.
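As a simple illustration of the distinction (these fragments are illustrative, not from the original text):

/* Loop-carried dependence: iteration i reads a[i-1], which the
   previous iteration wrote, so iterations cannot execute in parallel. */
for(int i = 1; i < N; ++i)
    a[i] = a[i-1] + b[i];

/* No loop-carried dependence: each iteration reads and writes
   distinct elements, a clean candidate for vectorization. */
for(int i = 0; i < N; ++i)
    a[i] = b[i] + c[i];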
In summary, writing loop nests so that a minimum of data dependences exists between array references will benefit vectorization and other loop transforms. While the compiler technology used in analyzing data dependences and autovectorizing serial code for vector hardware stems from the supercomputing community, improperly written code with troublesome data dependences and loop structure can still thwart the vectorization efforts of even the most advanced tool sets.
At a high
level, simply writing code which is easiest for humans to understand
usually produces code that is easiest for the vectorizer to understand
as well, as the vectorizer and data dependence analyzers can easily
recognize what the programmer intended. In other words, highly
hand-tuned code with a priori knowledge of the underlying target
architecture is not the best candidate for automated vectorization at
the tools level.
There are a number of things that application
developers may want to keep an eye out for when developing code with the
intent of autovectorization by the build tools.
Pointer aliasing in C
One challenge for vectorizers and data dependence analysis is the use of pointers in the C language. When data is passed to a function via pointers as parameters, it is often difficult or impossible for the data dependence analyzer and subsequent vectorizer to guarantee that the memory regions pointed to by the various pointers do not overlap in the iteration spaces of the loops in which they are accessed. As the C standard has evolved over time, support for the “restrict” keyword has been added, as can be seen in the example in
Figure 7.
#define LIMIT 1024   /* assumed loop bound for this fragment */

void restrict_compute(int * restrict a, int * restrict b,
                      int * restrict c)
{
    for(int i=0; i<LIMIT; ++i)
        a[i] = b[i] * c[i];
}
Figure 7: C Standard “restrict” keyword
By placing the restrict qualifier on the pointers passed to the procedure, the developer guarantees to the compiler that data accessed through a given restrict-qualified pointer is not aliased by anything else the function may modify using another pointer. Note that this applies only to the function at hand, not to the global scope of the application itself. This permits the data dependence analyzer to recognize that the arrays are not aliased or modified by references with other side effects, and allows more aggressive optimization of the loop nest, including vectorization amongst other optimizations.
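It is worth emphasizing that restrict is a promise made by the programmer, not a property the compiler verifies: passing overlapping regions to a restrict-qualified function yields undefined behavior. A hypothetical caller-side sketch, reusing the LIMIT bound assumed in Figure 7:

int x[LIMIT], y[LIMIT], z[LIMIT];

restrict_compute(x, y, z);       /* fine: three distinct arrays        */
/* restrict_compute(x, x + 1, z);   undefined behavior: x and x + 1
                                    overlap, violating the restrict
                                    guarantee made in Figure 7         */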
Source: http://www.embedded.com/design/debug-and-optimization/4436459/Achieving-better-embedded-software-performance-through-memory-layout-optimization-Part-2