In the blog post from August 2014 we compared the modularization of image processing algorithms with template-based policies against the strategy pattern based on dynamic polymorphism in terms of speed. This time we compare the run times of (almost) the same pixel-wise operations on images again, but we consider different functional modularization techniques. More precisely, we pass function pointers, lambdas, and derived functors to algorithms with corresponding argument types, such as a template parameter and std::function, among others.

Our algorithm is a binary operation on two images. It simply traverses the two input images and one output image: a pixel-wise binary operation is applied to each pair of input pixels and the result is stored in the output image. The pixel-wise binary operation is passed as a first-class citizen to one of the following argument types:

  • std::function
  • template argument
  • function pointer
  • base class of functors using dynamic polymorphism

As input for these function arguments we use lambda functions, function pointers, and derived functors.

Functional Modularization

In this section we very briefly describe how to apply modularization by passing functions around. Time measurements are the subject of the next section.

The simple image traversal algorithm is depicted in the following:

void binary(SomeFuncArgType op, const Image& im_in1, const Image& im_in2,
            Image& im_out, unsigned width, unsigned height)
{
    for (unsigned y = 0u; y < height; y++)
        for (unsigned x = 0u; x < width; x++)
            im_out[x + y * width] = op(im_in1[x + y * width], 
                                       im_in2[x + y * width]);
}

where SomeFuncArgType is a placeholder for the subsequent types and Image is just a vector of floats, i.e.,

using Image = std::vector<float>;

Subsequently, we will only show the changed signature, since the bodies are identical.

std::function

The type std::function is a generic function wrapper. You can use std::function, e.g., to store function pointers or functors/lambdas. The usage of std::function for the algorithm’s modularization is depicted in the following signature.

void binary_func(std::function<float(float, float)> op, 
                 const Image& im_in1, const Image& im_in2, 
                 Image& im_out, unsigned width, unsigned height);

Template Argument

As with std::function, you can pass functors/lambdas and function pointers to the algorithm via the template signature.

template <typename F>
void binary_template(F op, const Image& im_in1, const Image& im_in2,
                     Image& im_out, unsigned width, unsigned height);

Raw Pointer

The raw pointer signature is C-compatible and given by

void binary_pointer(float (*op)(float, float), const Image& im_in1, 
                    const Image& im_in2, Image& im_out, unsigned width, 
                    unsigned height);

One can pass a free function either by taking its address explicitly or by naming it directly, i.e.,

binary_pointer(&free_func, ...);
binary_pointer(free_func, ...);

and even lambdas, as long as they do not capture anything. We ignore the latter possibility in our experiments.

Base Class of Functors using Dynamic Polymorphism

Given the abstract base class BaseFunctor the signature results in

void binary_dyn_poly(const BaseFunctor& op, const Image& im_in1, 
                     const Image& im_in2, Image& im_out, unsigned width,
                     unsigned height);

Remark

The standard library algorithms take their predicates by value. Hence, this is the only case where we could not replace the raw for loops with std::transform. However, the execution speeds reported below are similar for raw loops and std::transform in all other cases.

Binary Pixel-Wise Operations

For two addition operations (a usual and an unnecessarily complicated one) and a maximum operation, we use lambdas

auto add_lambda = [](float a, float b) { return a + b; };
auto add_complicated_lambda = [](float a, float b) 
{ 
    return std::sqrt(a * a) + std::sqrt(b * b); 
};
auto maximum_lambda = [](float a, float b) { return std::max(a,b); };

function pointers of free functions

auto add_func(float a, float b) { return a + b; }
auto add_complicated_func(float a, float b) 
{ 
    return std::sqrt(a * a) + std::sqrt(b * b); 
}
auto maximum_func(float a, float b) { return std::max(a, b); }

and functors derived from the abstract base class

struct AddFunctor : public BaseFunctor
{ 
    virtual float operator()(float a, float b) const override
    {
        return a + b; 
    } 
};
struct AddComplicatedFunctor : public BaseFunctor
{ 
    virtual float operator()(float a, float b) const override 
    {
        return std::sqrt(a * a) + std::sqrt(b * b); 
    } 
};
struct MaxFunctor : public BaseFunctor
{ 
    virtual float operator()(float a, float b) const override
    {
        return std::max(a, b); 
    } 
};

Speed Measurements

We apply the pixel-wise binary operations with equal operands to a std::vector<float> of 1024x1024 zeros 100 times with each approach. As benchmark we take the minimal run time in microseconds ($\mu{\rm s}$) over the 100 runs. We work with an Intel Core i7-6700HQ CPU @ 2.60GHz mobile processor. Compiler-wise, we use GCC 5.4 and Clang 3.9 on Ubuntu 16.04 and Visual Studio 2017 on Windows 10 (all 64 bit). We compiled in release mode.

We consider 8 pairs of argument types (signatures) and passed types, namely:

  • pointer passed as std::function
  • lambda passed as std::function
  • derived functor passed as std::function
  • lambda passed to a templated argument
  • pointer passed to a templated argument
  • derived functor passed to a templated argument
  • pointer passed as pointer
  • derived functor passed as base functor

For each pixel-wise binary operator we rank the pairs by a penalty, which makes comparisons across compilers possible. To this end, we sort the pairs according to their elapsed times. Each pair is assigned a penalty of $(t_{\rm cur} - t_{\rm prev}) // 100$, where $t_{\rm cur}$ is the elapsed time of the current pair, $t_{\rm prev}$ is the elapsed time of the previous, faster pair, and $//$ denotes integer division, i.e., division ignoring the remainder. The final ranking is given by the sum of penalties over the different compilers and is depicted in the following tables. Additionally, the tables contain the minimal run times in microseconds for each compiler.

add

rank  arg type (signature)  passed type      penalty  gcc $\mu$s  clang $\mu$s  vs $\mu$s
1     template              lambda                 0        1549           827        819
2     pointer               pointer               11        2168           827       1552
3     template              pointer               14        2147           829       1866
4     template              derived functor       25        1881          2172       1868
5     base functor          derived functor       26        2064          2175       1890
6     std::function         lambda                49        4137          2021       2464
7     std::function         derived functor       54        4016          2567       2721
8     std::function         pointer               55        3974          2480       2879

complicated add

rank  arg type (signature)  passed type      penalty  gcc $\mu$s  clang $\mu$s  vs $\mu$s
1     template              lambda                 0        7460          5918        846
2     pointer               pointer               63        8264          5948       6494
3     template              pointer               65        8249          5946       6731
4     template              derived functor       94        8230          8677       6947
4     base functor          derived functor       94        8239          8673       6964
6     std::function         lambda               102        8546          8740       7628
7     std::function         pointer              104        8535          8854       7773
8     std::function         derived functor      105        8527          8866       7907

max

rank  arg type (signature)  passed type      penalty  gcc $\mu$s  clang $\mu$s  vs $\mu$s
1     template              lambda                 0        1631           834        904
2     template              pointer               15        2312           837       1922
2     pointer               pointer               15        2556           834       1605
4     template              derived functor       27        2340          2180       1872
5     base functor          derived functor       29        2351          2429       1865
6     std::function         lambda                45        3905          2018       2415
7     std::function         derived functor       60        4642          2586       2828
8     std::function         pointer               63        5000          2552       2835

Oddity

Visual Studio 2017 is unbelievably fast for the complicated add case when the lambda is passed to the templated algorithm implementation. This needs further investigation and might lead to another blog post.