Illumination Invariant Pre-transformations for Automotive Image Segmentation

Joris Vellekoop
8 min read · Apr 15, 2021

Motivation

In this blog we try to reproduce the paper by Alshammari et al. Doing so leads to a better understanding of the content of the paper. Besides that, it is important for papers to be independently reproducible, meaning reproducible using only the content of the paper and not the code. This ensures that all concepts are explained clearly in the paper and that other people can obtain the same results. If a paper is not reproducible, the claims made in it cannot easily be checked by others, and the results might therefore be invalid.

Goal of the paper

This paper aims to show the influence of pre-transforming images to make them illumination invariant for automotive scene understanding, more specifically scene segmentation. This matters because in practice weather conditions vary, which has a negative effect on the accuracy of state-of-the-art neural networks for automotive scene understanding. Illumination invariant images are supposed to look the same regardless of the lighting conditions, and the paper shows the improvements that result from using them.

Dataset

The study uses the CamVid dataset [1]. This dataset was filmed in Cambridge, originally at a resolution of 960 x 720 and 30 Hz. Frames from 5 sequences were labeled at 1 Hz, yielding 701 labeled images. These are split into train, validation and test sets of 367, 101 and 233 images respectively. The labeling distinguishes 32 semantic classes, which are shown with their respective colours in the figure below. The CamVid dataset was filmed in a variety of weather conditions, so it is well suited for testing the illumination invariant transform methods.

The study only tries to identify the 11 most common classes in the dataset. The remaining 21 classes are therefore not used by the neural network. An example of the labeled images is also shown below.

The labeled classes and their colours, from the CamVid dataset
An example of the labeled images

Network Architecture

The image segmentation is achieved using the deep convolutional neural network SegNet [2]. This network is built for pixel-wise segmentation, which is exactly what we need for our problem. Its encoder is a VGG16 network pre-trained on ImageNet, and the decoder is made of convolution and upsampling layers. The input of the network is a 3-channel image, so the total input dimensions are 3 x 480 x 360. As mentioned before, this study uses 11 of the 32 classes in the segmentation problem: sky, building, pole, road, pavement, tree, sign, fence, car, pedestrian and bicycle.

Furthermore, the network training uses stochastic gradient descent with the following hyper-parameters:

  • Learning rate = 10^-3
  • Weight decay = 5*10^-4
  • Momentum = 0.9
The architecture visualized
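As a concrete sketch of this configuration (assuming a PyTorch implementation, which neither paper prescribes), the optimiser setup looks like this; the single convolution layer is only a placeholder for the SegNet model:

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the SegNet encoder-decoder described in [2].
model = nn.Conv2d(3, 11, kernel_size=3, padding=1)

# Stochastic gradient descent with the hyper-parameters listed above.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-3,            # learning rate = 10^-3
    momentum=0.9,
    weight_decay=5e-4,  # weight decay = 5 * 10^-4
)
```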

We also use median frequency balancing, as mentioned in [2] and explained in [7].
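To illustrate what median frequency balancing does, here is a minimal NumPy sketch of how the class weights can be computed (the helper name `median_frequency_weights` is ours, not from either paper): the frequency of a class is its pixel count divided by the total pixel count of the images in which it appears, and each weight is the median of these frequencies divided by the class frequency.

```python
import numpy as np

def median_frequency_weights(label_images, num_classes=11):
    """Class weights via median frequency balancing, following [7].

    label_images: iterable of 2-D arrays with a class id per pixel.
    """
    class_pixels = np.zeros(num_classes)   # pixels belonging to each class
    image_pixels = np.zeros(num_classes)   # pixels of the images containing the class

    for labels in label_images:
        for c in np.unique(labels):
            if c < num_classes:            # skip the remainder/void class
                class_pixels[c] += np.sum(labels == c)
                image_pixels[c] += labels.size

    freq = class_pixels / np.maximum(image_pixels, 1)
    return np.median(freq) / np.maximum(freq, 1e-12)
```

These weights scale the per-class terms of the cross-entropy loss, so rare classes such as poles and signs are not drowned out by road and sky pixels.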

Pre-transformations

In the paper four different kinds of transforms are discussed. These transforms all use all three channels to compute one illumination invariant channel. The methods are named after their authors: Álvarez, Maddern, Krajnik and Finlayson. These and the regular RGB images are first used to train the CNN. The paper then creates 4 extra kinds of input images by adding 2 channels to the four illumination invariant channels just discussed, namely hue and saturation (taken from the HSV version of the images, which is just another representation of the RGB images). So in total 9 different image representations are tested. We only applied 2 of the 4 transforms, namely Maddern and Álvarez, resulting in 5 different image representations.

Base RGB image

The RGB image already has 3 channels; however, as the SegNet paper suggests, local contrast normalization [3] has to be applied to the image first. This makes the image more robust to variations in light intensity, so it is itself a form of illumination invariant transform. It enforces a sort of local competition between adjacent features in a feature map and between features at the same spatial location in different feature maps. It is essentially an additional convolution layer that makes use of a 9 x 9 Gaussian weighting window summing to 1, producing a change as shown below.

Local contrast normalization example
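Since the paper gives few details here, this is a rough sketch of how we understand the subtractive/divisive local contrast normalization of [3]. The `sigma=2` default reflects our own choice discussed later, and scipy's Gaussian filter stands in for the 9 x 9 weighting window; this is not necessarily identical to the original SegNet pre-processing.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalization(img, sigma=2.0):
    """Rough sketch of subtractive + divisive local contrast normalization [3].

    img: float array of shape (H, W, 3) in [0, 1]; sigma is the std of the
    Gaussian weighting window (the 9 x 9 window from the text; sigma = 2 is
    our own choice, since the paper does not specify it).
    """
    img = img.astype(np.float64)
    # Subtractive step: remove the Gaussian-weighted local mean,
    # taken over space and over the three channels.
    local_mean = gaussian_filter(img.mean(axis=2), sigma)[..., None]
    centered = img - local_mean
    # Divisive step: divide by the local standard deviation, floored at its
    # image-wide mean so that flat regions are not blown up.
    local_var = gaussian_filter((centered ** 2).mean(axis=2), sigma)
    local_std = np.sqrt(local_var)[..., None]
    return centered / np.maximum(local_std, local_std.mean())
```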

Maddern Transform [4]

The second representation, and the first illumination invariant transform that we implemented, was the Maddern transform. This uses the logarithms of the 3 RGB channels to produce a one-channel illumination invariant image, as can be seen below, where α = 0.48.

Maddern Transform example
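A minimal NumPy sketch of the transform as we implemented it (our reading of the formula in [4], with a small epsilon added to avoid taking the log of zero):

```python
import numpy as np

def maddern_transform(img, alpha=0.48):
    """One-channel illumination invariant image following Maddern et al. [4].

    img: float array of shape (H, W, 3) with RGB channels in (0, 1].
    I = 0.5 + log(G) - alpha * log(B) - (1 - alpha) * log(R)
    """
    eps = 1e-6  # avoid log(0)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return (0.5 + np.log(g + eps)
            - alpha * np.log(b + eps)
            - (1 - alpha) * np.log(r + eps))
```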

Alvarez transform [5]

The other illumination invariant transform that we implemented was the Álvarez transform. This uses not only log inputs, but also the angle θ, known as the invariant direction. This angle depends on the camera sensor that is used. Álvarez uses θ = 42.3° in the paper, which we could have used for our dataset; however, since the angle is camera dependent and a different camera was used to record the CamVid data, that value is not applicable here.

We therefore opted to compute this angle ourselves, and as we only had a colour-checker calibration image available for one configuration of the CamVid dataset, we had to do it per image. This can be done by mapping all pixels to a log-chromaticity space, in which we use entropy minimization to determine the angle that fits best. The formula for the channel output can be seen below, as well as an example image.

Álvarez transform example
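Below is a simplified sketch of the per-image procedure we used: map the pixels to a 2-D log-chromaticity space (here taken relative to the green channel, which is one common formulation), project them onto a candidate direction θ, and keep the θ whose projection has the lowest histogram entropy. The angle grid and bin count are our own choices, not values from the paper.

```python
import numpy as np

def alvarez_invariant(img, thetas=np.deg2rad(np.arange(0, 180))):
    """Per-image invariant channel via entropy minimization (our reading of [5]).

    img: float RGB array of shape (H, W, 3) in (0, 1].
    Returns the invariant channel and the chosen angle in degrees.
    """
    eps = 1e-6
    r, g, b = img[..., 0] + eps, img[..., 1] + eps, img[..., 2] + eps
    chi1, chi2 = np.log(r / g), np.log(b / g)   # log-chromaticity coordinates

    def entropy(values, bins=64):
        hist, _ = np.histogram(values, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # Search for the invariant direction as the angle whose projection
    # has minimum histogram entropy.
    projections = [chi1 * np.cos(t) + chi2 * np.sin(t) for t in thetas]
    best = int(np.argmin([entropy(p.ravel()) for p in projections]))
    return projections[best], np.rad2deg(thetas[best])
```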

Adding Hue and Saturation channels [6]

As mentioned before, all four illumination invariant transform methods were also modified by adding hue and saturation channels to the image. The formulas for these channels are shown below.
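As a sketch, using scikit-image's RGB-to-HSV conversion (any standard conversion following [6] will do), the combined 3-channel input can be built like this:

```python
import numpy as np
from skimage.color import rgb2hsv

def add_hue_saturation(invariant_channel, rgb_img):
    """Stack the invariant channel with the hue and saturation channels [6].

    invariant_channel: (H, W) output of one of the transforms above;
    rgb_img: the original (H, W, 3) RGB image in [0, 1].
    Returns a (H, W, 3) array: [invariant, hue, saturation].
    """
    hsv = rgb2hsv(rgb_img)                 # channels: hue, saturation, value
    hue, sat = hsv[..., 0], hsv[..., 1]
    return np.dstack([invariant_channel, hue, sat])
```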

Other transforms

As mentioned before, two other transforms are compared in the paper; however, as we have not applied those, we will not go over them here.

Reproduction Results

The metrics from the original paper and from our reproduction are shown below. Because of time constraints we were only able to train and evaluate with the Maddern transform, even though we had also implemented the Álvarez transform.

The various metrics compared

It can be seen that the RGB baseline from our reproduction actually performed better than in the paper. Furthermore, both the original Maddern transform and its HS counterpart perform far worse in our version than in the paper. Why the results are so different is discussed below.

Discussion

There is a variety of possible reasons why we did not arrive at the same results as the authors of the paper. Many come from insufficient descriptions of the steps they took; some arise from our lack of knowledge on the subject of colour transforms.

The first problem that arose was the lack of explanation the authors give about the classes that are not used. As mentioned before, the segmentation network uses 11 classes, whereas the CamVid dataset includes 32. Should the 21 unused classes be merged into a separate remainder class? Should some classes, like cars and buses, be merged into one class? We do not know, but chose the first option. This could have an influence on the metrics, as we chose not to include the twelfth (remainder) class in the metrics.
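To make this concrete, here is a sketch of the remapping we ended up using; the list of CamVid ids is a placeholder, since the actual id-to-class mapping depends on how the label colours are encoded in your loader.

```python
import numpy as np

# Hypothetical ids of the 11 used classes within the 32-class CamVid labeling.
USED_CLASS_IDS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def remap_labels(labels, used_ids=USED_CLASS_IDS, void_id=11):
    """Map a (H, W) label image with 32 classes onto 11 classes + a remainder
    class (void_id), which is excluded from the metrics."""
    remapped = np.full_like(labels, void_id)
    for new_id, old_id in enumerate(used_ids):
        remapped[labels == old_id] = new_id
    return remapped
```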

Secondly, the makers of SegNet suggest using local contrast normalization on the RGB images, but this is not mentioned in the paper we tried to reproduce, so we were unsure whether we should use it. Nonetheless we did implement it; however, the standard deviation for the Gaussian in the local contrast normalization was not given, so we simply set it to 2. The addition of the local contrast normalization could explain the different results on the RGB images. In hindsight, we think the authors might not have used it, which is why their network performs worse there.

Thirdly, the authors say they use the pre-trained SegNet to train on the various transforms; however, this network is not compatible with the one-channel images that the original transform methods create. As an easy fix, we therefore added two empty channels to make them compatible. A better solution would have been to change the network architecture and not use the pre-trained network, but we were not able to test this in time.
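This quick fix amounts to nothing more than stacking two zero channels onto the invariant channel:

```python
import numpy as np

def to_three_channels(invariant_channel):
    """Pad a single invariant channel with two empty channels so it matches
    the 3-channel input of the pre-trained SegNet (the easy fix described above)."""
    zeros = np.zeros_like(invariant_channel)
    return np.dstack([invariant_channel, zeros, zeros])
```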

Where our experience proved insufficient was in trying to implement all transforms. Three of the four transforms make use of the invariant direction/angle, which is a device-specific variable that can be computed in multiple ways. Which method is used by the authors is not entirely clear to us, so we used a method for which code was already available online, which computes the invariant angle per image.

Task Division

Joris: coding illumination invariant transforms and other image pre-processing, writing blog
Remco: coding network architecture with dataset loader, integrating the different components

References

[1] G. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” Pattern Recognition Letters, vol. 30, no. 2, pp. 88–97, 2009
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence
[3] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?,” in ICCV, pp. 2146– 2153, 2009.
[4] W. Maddern, A. Stewart, C. McManus, B. Upcroft, W. Churchill, and P. Newman, “Illumination invariant imaging: Applications in robust vision-based localisation, mapping and classification for autonomous vehicles,”
[5] J. Alvarez and A. Lopez, “Road detection based on illuminant invariance,” IEEE Trans. on Intelligent Transportation Systems
[6] Wikipedia, “HSL and HSV”, https://en.wikipedia.org/wiki/HSL_and_HSV
[7] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in ICCV, pp. 2650–2658, 2015
