Low-Level C++ Programming Tutorial: SIMD, x86, Vectorization, Arrays, POD

Introduction

This tutorial teaches low-level C++ programming concepts, focusing on POD data structures, array operations, x86 inline assembly and intrinsics, vectorization, and SIMD, without using the STL or modern C++ features. We'll progress from basics to advanced topics, with explanations, examples, and exercises. Complete the exercises in the Godbolt iframes and compile to check the output. Use the -O2 flag to enable optimizations.

Section 1: POD Data Structures

Example

Plain Old Data (POD) structures are simple, C-compatible data types with contiguous memory layout, enabling efficient copying and cache access. They have no constructors, destructors, or virtual functions, making them ideal for performance programming.

Exercise

Complete the subtract_vectors function to subtract two vectors element-wise. Compile and run to check if the output is 15.


// Completed subtract_vectors
struct Vector3 {
    float x, y, z;
};

Vector3 subtract_vectors(Vector3 a, Vector3 b) {
    Vector3 result;
    result.x = a.x - b.x;
    result.y = a.y - b.y;
    result.z = a.z - b.z;
    return result;
}

// Main function to test subtract_vectors (sum 15).
int main() {
    Vector3 a = {6.0, 7.0, 8.0};
    Vector3 b = {1.0, 2.0, 3.0};
    Vector3 result = subtract_vectors(a, b);
    volatile float out = result.x + result.y + result.z;
    return (int)out;
}
                        

Section 2: Array Operations

In low-level C++, arrays are either fixed-size or dynamically allocated with malloc and released with free; memory is managed manually for performance. With primitive element types they behave like POD, enabling efficient memory access and loop optimizations.

Example

Exercise

Complete the max_array function to find the maximum value in the array. Compile and run to check if the output is 4.


// Completed max_array
const int SIZE = 4;
// Note: the array parameter decays to float*; the [SIZE] is documentation only.
float max_array(float arr[SIZE]) {
    float max_val = arr[0];
    for (int i = 1; i < SIZE; i++) {
        if (arr[i] > max_val) {
            max_val = arr[i];
        }
    }
    return max_val;
}

// Main function to test max_array (4).
int main() {
    float arr[SIZE] = {1.0, 4.0, 3.0, 2.0};
    volatile float out = max_array(arr);
    return (int)out;
}
                        

Section 3: x86 Intrinsics

Inline assembly gives direct access to CPU instructions from C++ code, providing fine-grained control for performance-critical routines. This section uses GCC extended inline assembly for basic arithmetic; true intrinsics, compiler-provided functions that map to specific instructions, appear in Section 5.

Example

Exercise

Complete the subtract function using inline assembly. Compile and run to check if the output is 2.


// Completed subtract
int subtract(int a, int b) {
    int result;
    __asm__ (
        "movl %1, %%eax;"
        "subl %2, %%eax;"
        "movl %%eax, %0;"
        : "=r"(result)
        : "r"(a), "r"(b)
        : "%eax"
    );
    return result;
}

// Main to test (2).
int main() {
    volatile int out = subtract(5, 3);
    return out;
}
                        

Section 4: Vectorization with Arrays

Vectorization is a compiler optimization that converts scalar operations into SIMD instructions for parallel execution. Use pragmas such as #pragma omp simd (enabled with -fopenmp-simd or -fopenmp in GCC/Clang) to hint the compiler, and keep arrays aligned and loop bodies simple.

Example

Exercise

Add the pragma to hint vectorization for the dot_product function. Compile and run to check if the output is 30.


// Completed dot_product
const int SIZE = 4;
float dot_product(float a[SIZE], float b[SIZE]) {
    float sum = 0.0;
    #pragma omp simd
    for (int i = 0; i < SIZE; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Main to test (30).
int main() {
    float a[SIZE] = {1.0, 2.0, 3.0, 4.0};
    float b[SIZE] = {1.0, 2.0, 3.0, 4.0};
    volatile float out = dot_product(a, b);
    return (int)out;
}
                        

Section 5: SIMD with x86 Intrinsics

SIMD (Single Instruction, Multiple Data) uses x86 intrinsics to perform the same operation on multiple data elements in parallel. Using SSE intrinsics from the <immintrin.h> header, we can process 4 floats simultaneously, significantly boosting performance for array operations.

Example

Exercise

Complete the multiply_arrays_simd function using _mm_mul_ps. Compile and run to check if the output is 30.


// Completed multiply_arrays_simd
#include <immintrin.h>

void multiply_arrays_simd(float* a, float* b, float* result, int n) {
    // _mm_load_ps/_mm_store_ps require 16-byte-aligned pointers,
    // and n must be a multiple of 4.
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        __m128 vr = _mm_mul_ps(va, vb);
        _mm_store_ps(&result[i], vr);
    }
}

// Main to test (30).
int main() {
    const int SIZE = 4;
    // 16-byte alignment so the aligned loads/stores above are safe.
    __attribute__((aligned(16))) float a[SIZE] = {1.0, 2.0, 3.0, 4.0};
    __attribute__((aligned(16))) float b[SIZE] = {1.0, 2.0, 3.0, 4.0};
    __attribute__((aligned(16))) float result[SIZE];
    multiply_arrays_simd(a, b, result, SIZE);
    volatile float out = 0;
    for (int i = 0; i < SIZE; i++) out += result[i];
    return (int)out;
}