ARM NEON Optimization: Interleaving/De-Interleaving

Introduction

In this article we will look at basic interleaving and de-interleaving operations implemented with ARM NEON, and evaluate the performance improvement on an Android-based mobile device in comparison with standard OpenCV code.
ARM Neon
ARM's NEON technology is a 64/128-bit hybrid SIMD architecture designed to accelerate multimedia and signal processing code.
SIMD technology allows multiple data elements to be processed with a single instruction, saving time for other computations. In image processing, a set of pixels is processed at a time.
One way to achieve this is to write assembly code, which has a steep learning curve and requires knowledge of the processor architecture, instruction set, etc.
Instead of using low-level instructions directly, there are special functions called intrinsics, which can be used like regular functions but operate on multiple input data elements simultaneously.
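To make the idea of an intrinsic concrete, here is a minimal sketch of a saturating byte-wise addition. The function name add_saturate_u8 is hypothetical and not taken from the article's code; the scalar loop is an assumed fallback so that the sketch also compiles on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = min(a[i] + b[i], 255) for n bytes. */
void add_saturate_u8(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
{
    size_t i = 0;
#if SKETCH_HAVE_NEON
    /* Each intrinsic call works on 16 bytes held in one 128-bit register. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);       /* load 16 bytes from a */
        uint8x16_t vb = vld1q_u8(b + i);       /* load 16 bytes from b */
        vst1q_u8(dst + i, vqaddq_u8(va, vb));  /* saturating add, then store */
    }
#endif
    /* Scalar tail; also the full fallback when NEON is not available. */
    for (; i < n; ++i) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

The intrinsic calls read like ordinary C function calls, but on a NEON target each one typically maps to a single SIMD instruction.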
De-Interleaving and Interleaving the Channels of an Image
NEON structure load instructions read data from memory into 64-bit NEON registers, with optional de-interleaving. Structure stores work similarly, re-interleaving data from registers before writing it to memory.
A set of NEON intrinsics is provided for these operations.
They simultaneously pull data from memory and separate it into different registers; this is called de-interleaving.
The element size specified in the instruction determines how the data is interleaved or de-interleaved.
The OpenCV functions split and merge are ported to ARM NEON, and their performance is compared with that of the OpenCV code.
De-Interleaving
The de-interleave operation separates adjacent elements in memory into separate registers.
The BGR values of an image are stored in adjacent memory locations. The VLD3 instruction de-interleaves the BGR channels of the image into 3 different registers.
The contents of these registers are then stored to the destination memory locations, one per channel.
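The de-interleaving step described above can be sketched with the vld3_u8 and vst1_u8 intrinsics. This is an illustrative sketch, not the code from the repository: the function name split_bgr_u8 is hypothetical, it assumes 8-bit channels, and the scalar loop is an assumed fallback for the tail pixels and for builds without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SPLIT_HAVE_NEON 1
#else
#define SPLIT_HAVE_NEON 0
#endif

/* Hypothetical helper: split a packed BGR buffer into three planes. */
void split_bgr_u8(const uint8_t *bgr, uint8_t *b, uint8_t *g, uint8_t *r,
                  size_t pixels)
{
    size_t i = 0;
#if SPLIT_HAVE_NEON
    /* vld3_u8 reads 24 bytes (8 pixels) and de-interleaves them into
       three 64-bit registers: val[0] = B, val[1] = G, val[2] = R. */
    for (; i + 8 <= pixels; i += 8) {
        uint8x8x3_t px = vld3_u8(bgr + 3 * i);
        vst1_u8(b + i, px.val[0]);
        vst1_u8(g + i, px.val[1]);
        vst1_u8(r + i, px.val[2]);
    }
#endif
    /* Scalar tail; also the full fallback when NEON is not available. */
    for (; i < pixels; ++i) {
        b[i] = bgr[3 * i + 0];
        g[i] = bgr[3 * i + 1];
        r[i] = bgr[3 * i + 2];
    }
}
```

A 128-bit variant would use vld3q_u8 and vst1q_u8 to handle 16 pixels per iteration; the 64-bit form is shown here because it matches the register width mentioned in the text.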
NDK BUILD
Since the application is being developed for Android, the Android NDK toolchain is used for cross-compilation.
The ndk-build utility is used to build the application.
The ndk-build utility requires that the base build directory contain a directory called \textbf{jni}.
The jni directory contains all the source files as well as \textbf{Android.mk} and \textbf{Application.mk}, which are the makefiles for the build process.
In the present application the jni directory contains the source files \textbf{neon.cpp}, \textbf{helloneon-intrinsics.c} and \textbf{helloneon-intrinsics.h}.
The build process is initiated by running ndk-build in verbose mode.
This generates the helloneon binary in the \textbf{libs/armeabi-v7a} directory.
That directory is transferred to \textbf{/data/tmp/local/NEON_TEST} on the Android mobile device.
Files created under the \textbf{/data/tmp/local} directory can be given execute permission. I could not find any other sub-directory in the file system that allowed binaries to be executed or to be given execute permission.
The script a.ksh exports the basic environment variables and then executes the binary.
The performance of the NEON intrinsic function is compared with that of the standard OpenCV split function:
OPENCV : 15 ms
NEON : 11 ms
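The article does not show how the timings were collected. One possible way to obtain such numbers is sketched below, under stated assumptions: time_split_ms and split_scalar are hypothetical names, the standard C clock() function is used as the timer (it measures CPU time, which is adequate for a single-threaded loop), and the result is averaged over several runs.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature shared by the split implementations being compared. */
typedef void (*split_fn)(const uint8_t *bgr, uint8_t *b, uint8_t *g,
                         uint8_t *r, size_t pixels);

/* Plain C reference implementation used as a baseline in this sketch. */
void split_scalar(const uint8_t *bgr, uint8_t *b, uint8_t *g, uint8_t *r,
                  size_t pixels)
{
    for (size_t i = 0; i < pixels; ++i) {
        b[i] = bgr[3 * i + 0];
        g[i] = bgr[3 * i + 1];
        r[i] = bgr[3 * i + 2];
    }
}

/* Hypothetical helper: average time per call in milliseconds. */
double time_split_ms(split_fn fn, const uint8_t *bgr, uint8_t *b, uint8_t *g,
                     uint8_t *r, size_t pixels, int runs)
{
    clock_t start = clock();
    for (int k = 0; k < runs; ++k) {
        fn(bgr, b, g, r, pixels);
    }
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC / runs;
}
```

Passing a NEON-based split function and the baseline to the same helper, with the same buffers and run count, gives timings that can be compared directly.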
The NEON optimization does not yield a very significant improvement.
Several references, as well as the disassembly output of the compiler, indicate that the main reason is that the ARM compiler is not able to generate optimized assembly code.
The compiler generates heavily unoptimized code that takes a larger number of cycles than required.
The compilation commands were taken from the verbose ndk-build output, and the -c flag was replaced with -S to generate the assembly code.
The modified command generates the file \textbf{helloneon-intrinsics.s} in the present directory.
Many unnecessary instructions can be observed in the assembly code.
The assembly code corresponding to these functions was optimized and recompiled.
For compilation, the verbose build output of the ndk-build process was again modified so that the \textbf{helloneon-intrinsics.o} object file is compiled from \textbf{helloneon-intrinsics.s}, and the helloneon binary is compiled and linked from all source files.
The results of the optimization process are as follows:
OPENCV : 15 ms
NEON : 8 ms
NEON OPTIMIZED : 6 ms
Thus a speedup factor of about 1.4 was observed, and the total performance improvement over OpenCV after optimizing the assembly code is 2.5x (15 ms versus 6 ms).
This still does not motivate the use of assembly-level coding, since the development effort may outweigh the optimization benefits.
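The speedup figures are simple ratios of the measured times. A small sketch of the arithmetic, using a hypothetical helper named speedup and the timings quoted in this article:

```c
/* Hypothetical helper: how many times faster the optimized version is. */
double speedup(double baseline_ms, double optimized_ms)
{
    return baseline_ms / optimized_ms;
}
```

With the timings above, speedup(15.0, 6.0) is 2.5 for the optimized assembly against OpenCV, speedup(15.0, 11.0) is roughly 1.36 for the first intrinsics measurement against OpenCV, and speedup(8.0, 6.0) is roughly 1.33 for the optimized assembly against the intrinsics build.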
Interleaving
The interleaving operation combines 3 independent channels of an image into a single multi-channel image.
Corresponding elements of the independent channels are stored in adjacent locations in the multi-channel image.
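The interleaving step can be sketched with the vld1_u8 and vst3_u8 intrinsics, mirroring the de-interleaving case. As before, this is an illustrative sketch rather than the repository code: the function name merge_bgr_u8 is hypothetical, 8-bit channels are assumed, and the scalar loop is an assumed fallback for the tail pixels and for builds without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define MERGE_HAVE_NEON 1
#else
#define MERGE_HAVE_NEON 0
#endif

/* Hypothetical helper: merge three planes into a packed BGR buffer. */
void merge_bgr_u8(const uint8_t *b, const uint8_t *g, const uint8_t *r,
                  uint8_t *bgr, size_t pixels)
{
    size_t i = 0;
#if MERGE_HAVE_NEON
    /* Load 8 values per channel, then let vst3_u8 interleave them
       as B0 G0 R0 B1 G1 R1 ... while writing 24 bytes to memory. */
    for (; i + 8 <= pixels; i += 8) {
        uint8x8x3_t px;
        px.val[0] = vld1_u8(b + i);
        px.val[1] = vld1_u8(g + i);
        px.val[2] = vld1_u8(r + i);
        vst3_u8(bgr + 3 * i, px);
    }
#endif
    /* Scalar tail; also the full fallback when NEON is not available. */
    for (; i < pixels; ++i) {
        bgr[3 * i + 0] = b[i];
        bgr[3 * i + 1] = g[i];
        bgr[3 * i + 2] = r[i];
    }
}
```

Here the interleaving happens in the store instruction itself, so no separate shuffle step is needed.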
The performance is as follows :
OPENCV : 9 ms
NEON OPTIMIZED : 3 ms
The interleaving process shows a performance improvement of about 3x.
Thus, by using NEON intrinsics we can achieve performance improvements with respect to standard C code, and by optimizing the assembly code further performance benefits can be achieved.
It should be noted that the OpenCV code may itself be compiled with SIMD optimizations enabled, so the actual speedup over plain C code may be higher.
However, optimizing the assembly code did not yield a large additional speedup in the interleaving and de-interleaving operations.
Code
The code can be found in the Git repository https://github.com/pi19404/OpenVision in the POC/ARM subdirectory.
The jni subdirectory contains the source files as well as the makefiles.
The script \textbf{generate_assembly.ksh} generates the helloneon-intrinsics.s file in the ARM directory. After modifying the file, copy it to the jni sub-directory.
The script \textbf{compile_assembly.ksh} compiles helloneon-intrinsics.s and also builds the binary.
The binary requires the OpenCV library files, which need to be transferred to the Android mobile device.
The link "Execution Cycle computation" shows the number of execution cycles taken by ARM assembly code, which can be used to check the performance of compiler-generated and optimized code.