In the blog post from August 2014 we compared the modularization of image processing algorithms with template-based policies against the strategy pattern based on dynamic polymorphism in terms of speed. This time we compare the run times of (almost) the same pixel-wise operations on images again, but we consider different functional modularization techniques. More precisely, we pass function pointers, lambdas, and derived functors to algorithms with corresponding argument types, such as a template parameter and std::function, among others.

Our algorithm is a binary operation on two images. It simply traverses the two input images and one output image: a pixel-wise binary operation is applied to each pair of input pixels and the result is stored in the output image. The pixel-wise binary operation is passed as a first-class citizen to one of the following argument types:

  • std::function
  • template argument
  • function pointer
  • base class of functors using dynamic polymorphism

As input for these function arguments we use lambda functions, function pointers, and derived functors.

Functional Modularization

In this section we very briefly describe how to apply modularization by passing functions around. Time measurements are the subject of the next section.

The simple image traversal algorithm is depicted in the following:

void binary(SomeFuncArgType op, const Image& im_in1, const Image& im_in2,
            Image& im_out, unsigned width, unsigned height)
{
    for (unsigned y = 0u; y < height; y++)
        for (unsigned x = 0u; x < width; x++)
            im_out[x + y * width] = op(im_in1[x + y * width], 
                                       im_in2[x + y * width]);
}

where SomeFuncArgType is a placeholder for the subsequent types and Image is just a vector of floats, i.e.,

using Image = std::vector<float>;

Subsequently, we will only show the changed signature, since the bodies are identical.

std::function

The type std::function is a generic function wrapper. You can use std::function, e.g., to store function pointers or functors/lambdas. The usage of std::function for the algorithm’s modularization is depicted in the following signature.

void binary_func(std::function<float(float, float)> op, 
                 const Image& im_in1, const Image& im_in2, 
                 Image& im_out, unsigned width, unsigned height);

Template Argument

As with std::function, you can pass functors/lambdas and function pointers to the algorithm via the template signature.

template <typename F>
void binary_template(F op, const Image& im_in1, const Image& im_in2,
                     Image& im_out, unsigned width, unsigned height);

Raw Pointer

The raw pointer signature is C-compatible and given by

void binary_pointer(float (*op)(float, float), const Image& im_in1, 
                    const Image& im_in2, Image& im_out, unsigned width, 
                    unsigned height);

One can pass a free function either by taking its address explicitly or by naming it directly, i.e.,

binary_pointer(&free_func, ...);
binary_pointer(free_func, ...);

and even lambdas, as long as they do not capture anything. We ignore the latter possibility in our experiments.

Base Class of Functors using Dynamic Polymorphism

Given the abstract base class BaseFunctor the signature results in

void binary_dyn_poly(const BaseFunctor& op, const Image& im_in1, 
                     const Image& im_in2, Image& im_out, unsigned width,
                     unsigned height);

Remark

The standard library algorithms take their predicates by value. Hence, this is the only case where we could not replace the raw for loops with std::transform. However, the execution speeds reported below are similar for raw loops and std::transform in all other cases.

Binary Pixel-Wise Operations

For two addition operations (a usual and an unnecessarily complicated one) and a maximum operation, we use lambdas

auto add_lambda = [](float a, float b) { return a + b; };
auto add_complicated_lambda = [](float a, float b) 
{ 
    return std::sqrt(a * a) + std::sqrt(b * b); 
};
auto maximum_lambda = [](float a, float b) { return std::max(a,b); };

function pointers of free functions

auto add_func(float a, float b) { return a + b; }
auto add_complicated_func(float a, float b) 
{ 
    return std::sqrt(a * a) + std::sqrt(b * b); 
}
auto maximum_func(float a, float b) { return std::max(a, b); }

and functors derived from the abstract base class

struct AddFunctor : public BaseFunctor
{ 
    virtual float operator()(float a, float b) const override
    {
        return a + b; 
    } 
};
struct AddComplicatedFunctor : public BaseFunctor
{ 
    virtual float operator()(float a, float b) const override 
    {
        return std::sqrt(a * a) + std::sqrt(b * b); 
    } 
};
struct MaxFunctor : public BaseFunctor
{ 
    virtual float operator()(float a, float b) const override
    {
        return std::max(a, b); 
    } 
};

Speed Measurements

We apply the pixel-wise binary operations with equal operands to a std::vector<float> of 1024x1024 zeros 100 times with each approach. As benchmark we take the minimal run time in microseconds ($\mu{\rm s}$) over the 100 runs. We work with an Intel Core i7-6700HQ CPU @ 2.60GHz mobile processor. Compiler-wise, we use GCC 5.4 and Clang 3.9 on Ubuntu 16.04 and Visual Studio 2017 on Windows 10 (all 64 bit). We compiled in release mode.

We consider 8 pairs of argument types (signatures) and passed types, namely:

  • pointer passed as std::function
  • lambda passed as std::function
  • derived functor passed as std::function
  • lambda passed to a templated argument
  • pointer passed to a templated argument
  • derived functor passed to a templated argument
  • pointer passed as pointer
  • derived functor passed as base functor

For each pixel-wise binary operator we rank the pairs by a penalty, which makes comparisons across compilers possible. To this end, we sort the pairs according to their elapsed times. Each pair is assigned a penalty of $(t_{\rm cur} - t_{\rm prev}) // 100$, where $t_{\rm cur}$ is the elapsed time of the current pair, $t_{\rm prev}$ is the elapsed time of the previous, faster pair, and $//$ denotes integer division, i.e., division ignoring the remainder. The final ranking is given by the sum of penalties over the different compilers and is depicted in the following tables. Additionally, the tables contain the minimal run times in microseconds for each compiler.

add

rank  arg type (signature)  passed type      penalty  gcc $\mu$s  clang $\mu$s  vs $\mu$s
1     template              lambda                 0        1549           827        819
2     pointer               pointer               11        2168           827       1552
3     template              pointer               14        2147           829       1866
4     template              derived functor       25        1881          2172       1868
5     base functor          derived functor       26        2064          2175       1890
6     std::function         lambda                49        4137          2021       2464
7     std::function         derived functor       54        4016          2567       2721
8     std::function         pointer               55        3974          2480       2879

complicated add

rank  arg type (signature)  passed type      penalty  gcc $\mu$s  clang $\mu$s  vs $\mu$s
1     template              lambda                 0        7460          5918        846
2     pointer               pointer               63        8264          5948       6494
3     template              pointer               65        8249          5946       6731
4     template              derived functor       94        8230          8677       6947
4     base functor          derived functor       94        8239          8673       6964
6     std::function         lambda               102        8546          8740       7628
7     std::function         pointer              104        8535          8854       7773
8     std::function         derived functor      105        8527          8866       7907

max

rank  arg type (signature)  passed type      penalty  gcc $\mu$s  clang $\mu$s  vs $\mu$s
1     template              lambda                 0        1631           834        904
2     template              pointer               15        2312           837       1922
2     pointer               pointer               15        2556           834       1605
4     template              derived functor       27        2340          2180       1872
5     base functor          derived functor       29        2351          2429       1865
6     std::function         lambda                45        3905          2018       2415
7     std::function         derived functor       60        4642          2586       2828
8     std::function         pointer               63        5000          2552       2835

Oddity

Visual Studio 2017 is unbelievably fast for the complicated add case when the lambda is passed to the templated algorithm implementation. This needs further investigation and might lead to another blog post.