Faster Signal Processing with Intel Advanced Vector Extensions 2.0
28 | 2013 | 8th Edition | Embedded Innovator | intel.com/embedded-innovator

Over the past few years, Intel has significantly improved the vector-processing performance of its processors, making them increasingly popular targets for signal and image processing. The Intel® Advanced Vector Extensions (Intel® AVX) 2.0 introduced in the Haswell microarchitecture take these capabilities to a new level, delivering a 2x increase in peak floating-point throughput for an impressive 307 billion floating-point operations per second (GFLOPS) at 2.4 GHz in a quad-core 4th generation Intel® Core™ processor. Fixed-point arithmetic also sees a 2x boost in peak throughput, and both fixed- and floating-point algorithms benefit from new vector gather, scatter, and permute operations.

This article discusses the value of implementing signal and image processing on the Haswell microarchitecture and highlights applications that can benefit from Intel AVX 2.0. We quantify the improvements developers can expect through performance tests provided by N.A. Software, an Affiliate member of the Intel® Intelligent Systems Alliance, and show how libraries from Alliance members can help developers achieve these gains. In addition, we explain how to use the tools in Intel® System Studio to code for Intel AVX 2.0 and shorten development time.

SIMD Enhancements in Intel® Architecture (IA)
Signal and image processing software written for Intel® architecture (IA) processors can use single instruction, multiple data (SIMD) instructions to process data in parallel, a technique known as vector processing. With this technique, multiple data values are loaded into SIMD registers so that operations can be performed on all data elements at once (see Figure 1). Performing signal and image processing solely on an IA processor, as opposed to requiring a companion digital signal processor (DSP), has numerous benefits.
For example, eliminating the DSP enables developers to:
• Run their solutions on a single processor, shrinking system size and bill-of-materials costs
• Reduce programming effort through a single code base and best-in-class tools
• Minimize the risk of data-stream timing issues and other runtime problems associated with communications between the IA processor and DSP

Starting with the advent of Intel AVX in 2011, Intel has significantly improved vector-processing performance in each generation of its processors, enabling a growing range of signal and image processing applications to migrate to IA (Figure 2). This year, Intel AVX 2.0 takes vector performance to a new level with improvements that include:
• Fused multiply-add (FMA) instructions for 2x higher peak floating-point throughput, up to 307 GFLOPS at 2.4 GHz in a quad-core 4th generation Intel Core processor
• Extension of most integer instructions to 256 bits for 2x higher peak integer throughput
• New vector gather, shift, and cross-lane permute functions that enable more vectorization and more efficient loads and stores

Figure 1. SIMD vectorization loads multiple data values into SIMD registers to perform operations on all data elements at once; here the eight additions c[i] = a[i] + b[i] execute as one vector operation.

Signal & Image Processing
Faster Signal Processing with Intel® Advanced Vector Extensions 2.0: Simplify Coding with Vectorizing Compilers and Libraries
By Noah Clemons, Technical Consulting Engineer, Embedded Computing & Debuggers, Intel Corporation; Peter Carlston, Platform Architect, Intelligent Systems Group, Intel Corporation; and David Murray, Technical Director, N.A. Software

Many improvements have been made in the Haswell microarchitecture to allow applications to realize this greater performance potential:
• The memory pipeline can now perform two loads and a store on each cycle
• L1 cache bandwidth has doubled to 96 bytes/cycle (64-byte read plus 32-byte write)
• L2 cache bandwidth has also doubled, to 64 bytes/cycle

These upgrades, along with the internal Last Level Cache, the 320 GBps ring bus, and DDR3 dual-channel memory (with a peak memory bandwidth of 25 GBps at 1,600 MHz), help keep the processor fed for maximum performance.

The applications most likely to benefit from Intel AVX 2.0 are those that are CPU-bound and those that spend significant time in vectorizable loops with:
• Iteration counts ≥ the vector width (i.e., ≥ 8 integers, 8 floats, or 4 doubles)
• Integer arithmetic and bit manipulation (e.g., video processing)
• Floating-point operations that can make use of FMAs (e.g., linear algebra)
• Non-contiguous memory access (i.e., loops that can use the new gather and permute instructions)

Performance Benchmarks
N.A. Software (NAS) develops and licenses advanced radar algorithms and low-level DSP libraries, including the Vector, Signal, and Image Processing Library (VSIPL). This open application programming interface (API) provides portable computational middleware for image and signal processing functions as defined by the VSIPL Forum. VSIPL supports multithreading and is typically used on large multi-core and shared-memory systems, providing scalable performance for large problems.
NAS has produced a highly optimized Intel AVX 2.0 VSIPL library that is especially well optimized for complex vector multiply operations, sine/cosine (when the data is not range reduced), and split complex FFTs. The NAS library is standalone code that does not rely on any third-party software, enabling it to be recompiled for any operating system quickly and easily to get the most out of the Intel AVX 2.0 instruction set.

As noted above, NAS is an Affiliate member of the Intel Intelligent Systems Alliance. VSIPL implementations are available from other Alliance members, including Associate member GE Intelligent Platforms and General member Curtiss-Wright Controls Defense Solutions. These companies are just some of the 250+ global members of the Alliance. From modular components to market-ready systems, Intel and members of the Alliance collaborate closely to provide the performance, connectivity, manageability, and security developers need to create smart, connected systems.

Figure 2. Intel has significantly upgraded vector-processing performance in recent processor generations: 128-bit vectors since 1999; Intel® AVX (256-bit floating-point vectors, 2x peak FLOPS) with the Sandy Bridge microarchitecture (32 nm Tock, 2011); Intel® AVX (Float 16) half-float support and random numbers with the Ivy Bridge die shrink (22 nm Tick, 2012); and Intel® AVX 2.0 (fused multiply-add for 2x peak FLOPS, 256-bit integer vectors for 2x peak throughput, and gather/shift/permute) with the Haswell microarchitecture (22 nm Tock, 2013).
NAS recently used VSIPL to benchmark Intel AVX 2.0. As shown in Figure 3, Intel AVX 2.0 produces speedups reaching over 2x, depending on the function. Note that NAS also benchmarked the processors with the Intel® Math Kernel Library (Intel® MKL). NAS found that Intel MKL was much better at 2D FFTs with non-square matrix data, but the NAS library was more optimized for other operations and lengths. Figure 4 provides more detail on the 1D FFT routine using split complex data.

NAS also benchmarked Intel AVX 2.0 with its Synthetic Aperture Radar and Moving Target Indication (SARMTI) advanced radar processing algorithm. SARMTI extracts high-resolution data with the positions of all slow- and fast-moving objects directly from the SAR image itself, so a separate MTI radar is not needed. Four different sets of image sizes and numbers/locations of moving objects were studied. The results showed Intel AVX 2.0 performed faster than Intel AVX across all scenarios, with speedups of 1.26x to 1.52x.[i]

Coding for Intel® Advanced Vector Extensions
SIMD code has a well-earned reputation for being time-consuming to write and rarely portable. To overcome these obstacles, Intel and members of the Alliance have created compilers and libraries that streamline programming and enable straightforward porting across IA processors.
In addition to the VSIPL libraries mentioned earlier, Intel offers Intel® System Studio, which incorporates:
• The Intel® C++ Compiler 13.0 for on-the-fly generation of Intel AVX 2.0 code
• Hand-tuned, vectorized, and threaded implementations of commonly used embedded signal/image and math-intensive processing functions, found in two performance libraries: Intel® Integrated Performance Primitives (Intel® IPP) and the Intel® Math Kernel Library (Intel® MKL)

The auto-vectorization capabilities of the Intel C++ Compiler provide an excellent way to generate Intel AVX 2.0 code. This compiler and its libraries can:
• Generate faster code through speed optimizations
• Enable shorter execution times for low-power code
• Support GNU* cross-builds and integration into the Eclipse* CDT and the Yocto Project* Application Development Toolkit

The Intel C++ Compiler supports cross-compilation and integration with Poky-Linux* or OpenEmbedded*-compatible GNU* cross-toolchains as used for Wind River Linux*, the Yocto Project*, and many other custom GNU* cross-build toolchains (Wind River is an Associate member of the Alliance). It comes with predefined compiler environment files that make cross-development a simple matter of applying a compiler switch.

Intel System Studio provides access to the Intel C++ Compiler and other tools through common integrated development environments (IDEs) such as the Eclipse* IDE. These tools include a guided auto-parallelism selection tool and a unique environment file editor that enables integration of the Intel C++ Compiler into a GNU cross-build toolchain from within the IDE.

Vectorization works the same as for the earlier Intel® Streaming SIMD Extensions (Intel® SSE) and Intel AVX instruction sets. To include older code paths, developers can set a compiler flag. For example, the developer can set flags to generate code for both Intel AVX 2.0 and Intel AVX, and have the code target the appropriate version automatically at runtime.
Figure 3. N.A. Software benchmarked different versions of Intel® Advanced Vector Extensions to show the performance gain.[i]

VSIPL Function                    Parameters [columns*rows]   Intel® AVX 2.0 vs. Intel® AVX
1D FFT                            256 – 512K                  1.45 – 1.77x
Multiple 1D FFT                   256*256 – 2K*2K             1.43 – 1.6x
2D FFT, non-square matrices       256*256 – 128K*20           1.14 – 1.22x
2D FFT, smaller square matrices   64*64 – 2K*2K               1.39 – 1.61x
Complex Matrix Transpose          256*256 – 2K*2K             1.09 – 1.48x
Vector Multiply                   256 – 128K                  1.28 – 1.57x
Vector Sine                                                   1.90 – 2.32x
Vector Cosine                                                 1.90 – 2.05x
Vector Square Root                                            1.21 – 1.29x
Vector Scatter                                                1.24 – 1.34x
Vector Gather                                                 1.21 – 1.56x

Figure 4. Intel® Advanced Vector Extensions 2.0 shows performance gains across a range of vector lengths.[i] The chart plots GFLOPS rate against data length (1K – 512K) for the NAS vsip_ccfftip_f routine with split complex data, comparing Intel® AVX 2.0, Intel® AVX (Float 16), and Intel® AVX. MFLOPS = 5 N log2(N)/(time for one FFT in microseconds).

Performance Libraries within Intel® System Studio
Two libraries in Intel System Studio enable highly optimized data and signal processing on IA:
• Intel IPP is an extensive library of highly optimized software building blocks for the most demanding signal, data, and multimedia applications
• Intel MKL provides highly optimized, threaded math routines

The libraries are designed for portability and run across different generations of the IA architecture. They detect the host processor at runtime and deploy the correct optimized code. Particularly advantageous, this hand tuning means critical functions do not need to be redesigned for successive hardware platform upgrades. Intel IPP offers an extensive C library of highly optimized building blocks for a wide variety of domains and data types (see Figure 5).
Intel IPP is supplied as a sequential library for higher efficiency on smaller data sets, for latency-constrained applications, or for better control via application-level threading. Intel also makes dynamic link libraries available that provide internal multithreading; to support multithreading, Intel IPP is fully thread-safe. C-language calling conventions and linkage (undecorated function names) allow Intel IPP functions to be called from almost all programming languages and compilers.

In addition to the libraries, Intel IPP provides supporting code such as an example driver, along with an application requesting this driver's services. Such an example may help in developing code that runs in kernel mode (ring 0). To support such code, Intel IPP provides libraries that are not position-independent ("nonpic").

For floating-point data types, Intel MKL is available in single- and multi-threaded implementations. Linear algebra is one of the primary domains of Intel MKL (see Figure 6), and these functions see a particularly large benefit from the fused multiply-add instructions introduced in Intel AVX 2.0. The Fast Fourier Transforms (FFTs) in Intel MKL support sophisticated descriptors, giving the forward/backward functions a slightly simpler interface than Intel IPP's. Along with threading, Intel MKL includes sophisticated optimizations. Massive workloads up to the 64-bit integer index space are supported ("ilp64"). Intel MKL focuses on high throughput; however, application-level threading can be served by sequential implementations, similar to Intel IPP. Note that calling multi-threaded Intel MKL functions from multiple threads is thread-safe.

Take Advanced Signal and Image Processing to the Next Level
As the results cited from NAS show, Intel AVX 2.0 delivers excellent vector-processing performance gains. To realize these gains, developers can generate Intel AVX 2.0 code on-the-fly with the Intel C++ Compiler and can call on hand-tuned functions in Intel IPP, Intel MKL, or VSIPL.
As developers of signal and image processing equipment look for ways to eliminate DSPs and run their entire solution on the latest IA processors, these tools and libraries will continue to grow in importance, but fortunately not in complexity. These best-in-class tools streamline SIMD code development, making one-processor solutions that simplify programming with a single code base all the more attractive.

Contact Intel
From modular components to market-ready systems, Intel and the 250+ global member companies of the Intel® Intelligent Systems Alliance (intel.com/intelligentsystems-alliance) provide the connectivity, manageability, security, and performance developers need to create smart, connected systems. GE Intelligent Platforms and Wind River are Associate members of the Alliance, N.A. Software Ltd. is an Affiliate member, and Curtiss-Wright Controls Defense Solutions is a General member.

Figure 5. Intel® Integrated Performance Primitives cover many domains:
• Signal Processing (1D): transforms (e.g., wavelet), convolution/correlation, filtering (e.g., IIR, FIR), statistics
• Image Processing (2D): transforms (e.g., rotation), (non-)linear filters (e.g., noise), FFT, DFT, DCT, statistics
• Vector/Matrix: logical, shift, and conversion operations; trigonometric functions; decomposition and eigenvalues; transpose
• Color Conversion: color space conversion, patterns (e.g., Bayer), brightness/contrast, resampling
• Integrity/Compression/Cryptography: error correction (Reed-Solomon), compression (entropy, dictionary), MD5, (T)DES, RSA, DSA, random number generators
• More domains: video and picture coding, audio (e.g., speech coding), string processing, utilities

Figure 6. Intel® Math Kernel Library domains:
• Linear Algebra: BLAS, Sparse BLAS, LAPACK solvers, sparse solvers (DSS, PARDISO), iterative solver (RCI), ScaLAPACK, PBLAS
• Fast Fourier Transforms: multidimensional, FFTW interfaces, cluster FFT, trigonometric transforms, Poisson solver, convolution via VSL
• Vector Math: trigonometric, hyperbolic, exponential, logarithmic, power/root
• Random Number Generators: congruential, Wichmann-Hill, Mersenne Twister, Sobol, Niederreiter, non-deterministic
• Summary Statistics: kurtosis, variation coefficient, quantiles, order statistics, min/max, variance-covariance
• Data Fitting: spline-based interpolation, cell search