C++11 Convolution

Convolution

Image convolution is one of the easiest techniques to learn in computer vision, and it's a technique that illustrates the power of local feature detection better than weeks of lecture. In principle, convolution involves nothing more than repeatedly applying an NxN matrix to a pixel and its neighbors. Depending on the matrix's scalars, several different effects can be achieved:

  • Blur
  • Edge Detection <--- Let's focus on this
  • Sharpen

Edge detection is one of the fundamental building blocks of image processing, computer vision, and pattern recognition. If you think about it, human vision really boils down to a few key components: edge detection, depth perception, and color. By understanding the boundary between an object and its surroundings we are able to perceive features that inform our mental model of the world and help us classify objects. The same is true for computer vision.

Figure 1: The results of Sobel edge detection (right) on an American quarter (left).

Let's say you've been asked to extract the quarter from the above image (Figure 1). Seems easy enough; there is an intuitive boundary between the quarter and the cloudy background, but how do you convey that to a PC? Maybe look for differences between black and white, but remember that there are greys/colors here too, and a slight black shadow caused by the sheen of the coin. The image on the right of Figure 1 is the same coin after applying a convolution matrix, in this case Sobel's edge filter. It might not seem like it, but the image on the right has much stronger boundaries between regions (no clouds). Going a step further (Figure 2), I'll apply a binary filter to show how the Fallout ranger in the background is almost invisible while the quarter is much more pronounced.

Figure 2: A binary-thresholded image of the quarter from Figure 1.

C++11 Convolution

I don't want to stray too far from my original intent: to show how to do convolutions in C++11. First, I should mention that I'm not implying C++11 contains standard libraries for image processing. Instead, C++11 introduced native support for threading, and it just so happens that convolutions are stupidly easy to parallelize. My my, how convenient for me.

This is a standard implementation of basic edge detection. It requires the header-only CImg library. I'll break the code down afterwards:

typedef CImg<unsigned char> Image;  
const float sobel_mx[3][3] = {{-1,0,1},{-2,0,2},{-1,0,1}};  
const float sobel_my[3][3] = {{1,2,1},{0,0,0},{-1,-2,-1}};

void edge_detection(const std::string& imageName,const float(& mx)[3][3], const float(& my)[3][3])  
{
    Image imageInput(imageName.c_str());
    Image imageOutput(imageInput.width(), imageInput.height(),1,3);

    // CImg's custom for-loop
    cimg_forXY(imageInput,x,y){
      int r=0,c=0;
      int dx=0,dy=0;
      int dx_dy=0;

      if(x == 0 || x == imageInput.width()-1) dx_dy = 0;            //edge pixel, ignore
      else if(y == 0 || y == imageInput.height()-1) dx_dy = 0;    //edge pixel, ignore
      else {
          //dx - boundary changes in the x-axis
          for(r = -1; r < 2; r++){
              for(c = -1; c < 2; c++){
                  dx += imageInput(x+c,y+r,0) * mx[r+1][c+1];
              }
          }
          //dy - boundary changes in the y-axis
          for(r = -1; r < 2; r++){
              for(c = -1; c < 2; c++){
                  dy += imageInput(x+c,y+r,0) * my[r+1][c+1];
              }
          }

          // approximate the gradient magnitude
          dx_dy = abs(dx) + abs(dy);

          // clamp range to [0,255]
          if(dx_dy > 255) dx_dy = 255;
      }
      imageOutput(x,y,0,0) = imageOutput(x,y,0,1) = imageOutput(x,y,0,2) = dx_dy;   //write all three channels
   }
}

//Usage: edge_detection("lena.bmp",sobel_mx,sobel_my);

Let me break the code down quickly. First I pre-populate my convolution matrices with values from the Sobel Filter:

const float sobel_mx[3][3] = {{-1,0,1},{-2,0,2},{-1,0,1}};  
const float sobel_my[3][3] = {{1,2,1},{0,0,0},{-1,-2,-1}};  

Next, in edge_detection, I create the input and output images and start iterating over imageInput:

    Image imageInput(imageName.c_str());
    Image imageOutput(imageInput.width(), imageInput.height(),1,3);

    cimg_forXY(imageInput,x,y){
        //... core here
    }

Sobel's filter is a 3x3 matrix, so for pixels on the image border some neighbors will be out of range along one axis or the other. Border pixels are handled in one of two ways: apply the convolution matrix only to valid neighboring pixels, or just ignore border pixels entirely (the method I use):

      if(x == 0 || x == imageInput.width()-1) dx_dy = 0;        //edge pixel, ignore
      else if(y == 0 || y == imageInput.height()-1) dx_dy = 0;    //edge pixel, ignore

Last is the application of the convolution matrix over the target pixel and its neighbors:

   //dx - boundary changes in the x-axis
   for(r = -1; r < 2; r++){
     for(c = -1; c < 2; c++){
         dx += imageInput(x+c,y+r,0) * mx[r+1][c+1];
     }
   }
   //dy - boundary changes in the y-axis
   for(r = -1; r < 2; r++){
     for(c = -1; c < 2; c++){
         dy += imageInput(x+c,y+r,0) * my[r+1][c+1];
     }
   }
   // approximate the gradient magnitude
   dx_dy = abs(dx) + abs(dy);

   // clamp range to [0,255]
   if(dx_dy > 255) dx_dy = 255;

For offline image processing, the algorithm above is as complicated as it needs to be. It's slow, but many image processing techniques have a lower bound of O(n), and the method above is no different: every pixel needs to be filtered for edge detection to be correct. Any optimization we realize will thus necessitate a shift in how we approach the problem. The key observation you should take away from the algorithm above is:

outputImage pixel(xi,yi) does not depend on any other pixels in outputImage.

If that doesn't scream "parallelize me" I don't know what does. imageInput is treated as read-only, output from the convolution matrix is written directly to imageOutput, and these values are not used by any future iterations of the edge detection algorithm.

Using the edge_detection() function above as a template I'm going to move the core convolution matrix multiplication to a pool of worker threads. Normally this would be an ideal application for the GPU but I promised to use C++11.

// forward declaration; parallelFor is defined below
void parallelFor(const unsigned int size, std::function<void(const unsigned int)> func);

void edge_detection_threaded(const std::string& imageName,const float(& mx)[3][3], const float(& my)[3][3]){
    Image imageInput(imageName.c_str());
    Image imageOutput(imageInput.width(), imageInput.height(),1,3);

    int size = imageInput.width() * imageInput.height();    //don't use imageInput.size(), it also counts depth and spectrum (channels)
    int width = imageInput.width();

    // the lambda captures everything by reference ([&])
    auto cimg_iter = [&](unsigned int threadID){
        int y = threadID / width;
        int x = threadID - (y * width);
        int r=0,c=0;
        int dx=0,dy=0;
        int dx_dy=0;
        int complexity=0;

        if(x == 0 || x == imageInput.width()-1) dx_dy = 0;            //edge pixel, ignore
        else if(y == 0 || y == imageInput.height()-1) dx_dy = 0;    //edge pixel, ignore
        else {
            //dx
            for(r = -1; r < 2; r++){
                for(c = -1; c < 2; c++){
                    dx += imageInput(x+c,y+r,0) * mx[r+1][c+1];
                }
            }
            //dy
            for(r = -1; r < 2; r++){
                for(c = -1; c < 2; c++){
                    dy += imageInput(x+c,y+r,0) * my[r+1][c+1];
                }
            }
            //Complexity.   0:=Approximation  | 1-Inf:=No approximation
            if(complexity == 0) {
                dx_dy = abs(dx) + abs(dy);
            }
            else {
                dx_dy = (dx * dx) + (dy * dy);
                dx_dy = sqrt((float)(dx_dy));
            }
            // clamp range to [0,255]
            if(dx_dy > 255) dx_dy = 255;
        }
        imageOutput(x,y,0,0) = imageOutput(x,y,0,1) = imageOutput(x,y,0,2) = dx_dy;   //write all three channels
    };

    parallelFor(size,cimg_iter);
}
// code adapted from this SO post: http://stackoverflow.com/questions/17235053/c11-alternative-to-openmp-with-clang
void parallelFor(const unsigned int size, std::function<void(const unsigned int)> func) {

    unsigned int nbThreads = std::thread::hardware_concurrency(); //limit threads to CPU concurrency
    if(nbThreads == 0) nbThreads = 4;   //hardware_concurrency() may return 0 when it can't be determined
    std::vector<std::thread> threads;
    for (unsigned int idThread = 0; idThread < nbThreads; idThread++) {
        auto threadFunc = [=]() {
            for (unsigned int i=idThread; i<size; i+=nbThreads) {
                func(i);
            }
        };
        threads.push_back(std::thread(threadFunc));
    }
    for (auto & t : threads) t.join();
}

Normally I wouldn't have a 30-line lambda in my code, but I got lazy with the copy-paste. You can see that I didn't change much: I moved the convolution matrix multiplication block inside a lambda function that is then given to a parallel for-loop. parallelFor spawns as many threads as the machine has cores and off we go.

Here is some quick timing data based on a sample image of 1500x1200 pixels:

  • Average time to detect edges (no threading): 15 seconds
  • Average time to detect edges (w/ threading): 7 seconds

A 2x speedup is not too shabby considering this code is not optimized and was hacked together in 30 minutes.

Full Code:

#include <algorithm>
#include <functional>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <cmath>
#include <thread>
#include <vector>
#include <random>
#include <chrono>
#include <string>
#include <mutex>

#include "CImg.h"

using namespace std;  
using namespace std::chrono;  
using namespace cimg_library;

typedef CImg<unsigned char> Image;  
typedef std::pair<int, int> PixelPair;  
typedef std::map<PixelPair,bool> PixelMap;  
typedef std::chrono::high_resolution_clock Clock;

const float sobel_mx[3][3] = {{-1,0,1},{-2,0,2},{-1,0,1}};  
const float sobel_my[3][3] = {{1,2,1},{0,0,0},{-1,-2,-1}};  
const float prewitt_mx[3][3] = {{-1,0,1},{-1,0,1},{-1,0,1}};  
const float prewitt_my[3][3] = {{-1,-1,-1},{0,0,0},{1,1,1}};

void parallelFor(const unsigned int size, std::function<void(const unsigned int)> func) {

    unsigned int nbThreads = std::thread::hardware_concurrency(); //limit threads to CPU concurrency
    if(nbThreads == 0) nbThreads = 4;   //hardware_concurrency() may return 0 when it can't be determined
    std::vector<std::thread> threads;
    for (unsigned int idThread = 0; idThread < nbThreads; idThread++) {
        auto threadFunc = [=]() {
            for (unsigned int i=idThread; i<size; i+=nbThreads) {
                func(i);
            }
        };
        threads.push_back(std::thread(threadFunc));
    }
    for (auto & t : threads) t.join();
}

// Apply 3x3 convolution matrix(kernel) to image data
//
// Function lacks sophistication, each pixel is treated
// as if it's in vacuum.
//
// Kludge Note:  Only pixels in the RED channel are used
// to calculate edges.  This saves on converting RGB images
// to grayscale.
int edge_detection_kernel(int x, int y, const float(& mx)[3][3], const float(& my)[3][3],Image &imageInput){  
    int r=0,c=0;
    int dx=0,dy=0;
    int dx_dy=0;
    int complexity=0;

    if(x == 0 || x == imageInput.width()-1) dx_dy = 0;          //edge pixel, ignore
    else if(y == 0 || y == imageInput.height()-1) dx_dy = 0;    //edge pixel, ignore
    else {
        //dx
        for(r = -1; r < 2; r++){
            for(c = -1; c < 2; c++){
                dx += imageInput(x+c,y+r,0) * mx[r+1][c+1];
            }
        }
        //dy
        for(r = -1; r < 2; r++){
            for(c = -1; c < 2; c++){
                dy += imageInput(x+c,y+r,0) * my[r+1][c+1];
            }
        }
        //Complexity.   0:=Approximation  | 1-Inf:=No approximation
        if(complexity == 0) {
            dx_dy = abs(dx) + abs(dy);
        }
        else {
            dx_dy = (dx * dx) + (dy * dy);
            dx_dy = sqrt((float)(dx_dy));
        }
        // clamp range to [0,255]
        if(dx_dy > 255) dx_dy = 255;
    }
    return dx_dy;
}

// Apply an edge detection kernel to an image using multiple threads
//
// Input
//    string imageName - Name of the image to load (supports .jpeg, .bmp, .png)
//    float mx[3][3] - 3x3 convolution matrix for pixels along the X axis
//    float my[3][3] - 3x3 convolution matrix for pixels along the Y axis
void edge_detection_threaded(const std::string& imageName,const float(& mx)[3][3], const float(& my)[3][3]){  
    Image imageInput(imageName.c_str());
    Image imageOutput(imageInput.width(), imageInput.height(),1,3);

    int size = imageInput.width() * imageInput.height();    //don't use imageInput.size(), it also counts depth and spectrum (channels)
    int width = imageInput.width();

    auto cimg_iter = [&](unsigned int threadID){
        int y = threadID / width;
        int x = threadID - (y * width);
        int dx_dy=edge_detection_kernel(x,y,mx,my,imageInput);
        imageOutput(x,y,0,0) = imageOutput(x,y,0,1) = imageOutput(x,y,0,2) = dx_dy;   //write all three channels
    };

    parallelFor(size,cimg_iter);

    //Warning: Invalidates timings (obviously)
    #if defined(VALIDATE_EDGE_DETECTION)
        CImgDisplay main_disp(imageOutput,"Edges");
        while (!main_disp.is_closed()) {}
    #endif
}

// Apply an edge detection kernel to an image
//
// Input
//    string imageName - Name of the image to load (supports .jpeg, .bmp, .png)
//    float mx[3][3] - 3x3 convolution matrix for pixels along the X axis
//    float my[3][3] - 3x3 convolution matrix for pixels along the Y axis
void edge_detection(const std::string& imageName,const float(& mx)[3][3], const float(& my)[3][3]){  
    Image imageInput(imageName.c_str());
    Image imageOutput(imageInput.width(), imageInput.height(),1,3);

    cimg_forXY(imageInput,x,y){
        int dx_dy=edge_detection_kernel(x,y,mx,my,imageInput);  
        imageOutput(x,y,0,0) = imageOutput(x,y,0,1) = imageOutput(x,y,0,2) = dx_dy;   //write all three channels
    }

    //Warning: Invalidates timings (obviously)
    #if defined(VALIDATE_EDGE_DETECTION)
        CImgDisplay main_disp(imageOutput,"Edges");
        while (!main_disp.is_closed()) {}
    #endif
}

int main(int argc, char *argv[]) {

    string imageName = (argc==1)?"lena.bmp":argv[1];
    printf("Image: %s\n",imageName.c_str());
    duration<double> time_span_nothread,time_span_thread;
    {
        auto t1 = Clock::now();

        edge_detection(imageName,prewitt_mx,prewitt_my);
        auto t2 = Clock::now();
        time_span_nothread = duration_cast<std::chrono::duration<double>>(t2 - t1);
    }

    {
        auto t1 = Clock::now();

        edge_detection_threaded(imageName,prewitt_mx,prewitt_my);
        auto t2 = Clock::now();
        time_span_thread = duration_cast<std::chrono::duration<double>>(t2 - t1);
    }
    printf("Edge Detection(No Thread):%f\n",time_span_nothread.count());
    printf("Edge Detection( Threaded):%f\n",time_span_thread.count());
    return 0;
}