How can I optimize the performance of library-free C/C++ code generated from deep learning networks?

Jack Ferrari - 2023-03-30
Question: How can I optimize the performance of library-free C/C++ code generated from deep learning networks?

I am generating code for a deep learning network with coder.DeepLearningConfig(TargetLibrary = 'none'). Which code generation configuration settings should I use to optimize the performance of the generated code?

Expert Answer

Prashant Kumar answered 2025-11-20

Vectorization and multi-threading are techniques that can improve the performance of embedded applications. Both allow processors to make more efficient use of available resources and complete tasks faster, either by executing the same instruction on multiple data elements simultaneously (vectorization) or by dividing a workload into threads for concurrent execution across several cores (multi-threading).
With MATLAB Coder, you can take advantage of vectorization through SIMD (Single Instruction, Multiple Data) intrinsics available in the code replacement libraries for ARM Cortex-A and Cortex-M targets. On Intel and AMD CPUs, enable SIMD through the AVX2 or AVX512 instruction set extensions. For processors that support multi-threading, enable OpenMP.
 
Additionally, as of R2023a, you can enable bfloat16 compression of network learnables. For deep learning networks that are resilient to precision loss, compressing learnables from single precision to the bfloat16 data type greatly reduces memory usage with little change in inference accuracy, and it can also increase inference speed. This process does not require calibration data. Any hardware that supports single-precision floating-point data types can benefit from bfloat16.
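
If you want to gauge the accuracy impact of bfloat16 compression before deploying, one option is to build a MEX version of your entry-point function with the same deep learning configuration and compare its output against the original single-precision result in MATLAB. In the sketch below, the entry-point name dlPredict and the 224x224x3 input size are placeholders for your own network (a sketch of such an entry-point function is shown at the end of this answer).

>> cfg = coder.config('mex');
>> cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
>> cfg.DeepLearningConfig.LearnablesCompression = 'bfloat16'; % Requires R2023a or later
>> codegen -config cfg dlPredict -args {ones(224,224,3,'single')}
>> x = single(rand(224,224,3));
>> yRef  = dlPredict(x);        % original network, single precision
>> yBf16 = dlPredict_mex(x);    % generated code with bfloat16 learnables
>> max(abs(yRef(:) - yBf16(:))) % worst-case difference between the outputs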
 
Note: these settings are general guidelines. Depending on your specific application and hardware target, adjusting additional configuration settings may yield further performance gains.
 

Raspberry Pi

 

>> cfg = coder.config('lib');
>> cfg.Hardware = coder.Hardware('Raspberry Pi');
>> cfg.CodeReplacementLibrary = "GCC ARM Cortex-A";
>> cfg.EnableOpenMP = true;
>> cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
>> cfg.DeepLearningConfig.LearnablesCompression = 'bfloat16'; % Requires R2023a or later

Generic ARM Cortex-A

>> cfg = coder.config('lib');
>> cfg.HardwareImplementation.ProdHWDeviceType = 'ARM Compatible->ARM Cortex-A';
>> cfg.CodeReplacementLibrary = "GCC ARM Cortex-A";
>> cfg.EnableOpenMP = true;
>> cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
>> cfg.DeepLearningConfig.LearnablesCompression = 'bfloat16'; % Requires R2023a or later

Generic ARM Cortex-M

>> cfg = coder.config('lib');
>> cfg.HardwareImplementation.ProdHWDeviceType = 'ARM Compatible->ARM Cortex-M';
>> cfg.CodeReplacementLibrary = 'ARM Cortex-M';
>> cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
>> cfg.DeepLearningConfig.LearnablesCompression = 'bfloat16'; % Requires R2023a or later

x86 Hardware (Intel and AMD)

Intel

>> cfg = coder.config('lib');
>> cfg.HardwareImplementation.ProdHWDeviceType = 'Intel->x86-64 (Linux 64)'; % If deploying on Linux
>> cfg.InstructionSetExtensions = 'AVX512F'; % or 'AVX2' if 'AVX512F' is not available
>> cfg.EnableOpenMP = true;
>> cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
>> cfg.DeepLearningConfig.LearnablesCompression = 'bfloat16'; % Requires R2023a or later

AMD

>> cfg = coder.config('lib');
>> cfg.HardwareImplementation.ProdHWDeviceType = 'AMD->x86-64 (Linux 64)'; % If deploying on Linux
>> cfg.InstructionSetExtensions = 'AVX512F'; % or 'AVX2' if 'AVX512F' is not available
>> cfg.EnableOpenMP = true;
>> cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
>> cfg.DeepLearningConfig.LearnablesCompression = 'bfloat16'; % Requires R2023a or later
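
Finally, here is a minimal sketch of how one of the configurations above is used end to end. The entry-point function name (dlPredict), the network file (myNet.mat), and the input size are assumptions for illustration; replace them with your own network and input dimensions.

% dlPredict.m (hypothetical entry-point function)
function out = dlPredict(in) %#codegen
persistent net;
if isempty(net)
    net = coder.loadDeepLearningNetwork('myNet.mat'); % load your trained network
end
out = predict(net, in);
end

>> codegen -config cfg dlPredict -args {ones(224,224,3,'single')} -report

Passing -args with a representative input specifies the input size and type for code generation, and -report generates a code generation report so you can inspect the generated C code.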

 

