Denoising Video with RNNs – a Digital Signal Processing primer

I will be honest in this post today: I may still need to be more transparent and clear about my goals when writing these Blog posts. For instance, I need to disclose more about my true background. Ok, here it goes: I am a Physics Engineering graduate from a relatively important University in Lisbon, Portugal. I was not a very successful sophomore, but I did manage to graduate, and I did my specialization in the field of Optoelectronics, with a thesis on laser profiling and measurement. So I was involved in questions pertaining to how to properly construct a picture from a light source, in that case a high-power CO2 laser source; that is, it was advanced signal processing. At that time the field of Digital Signal Processing was still very much in its infancy.

All this is to introduce the paper I want to review today in this Blog, which has to do with advances in Digital Signal Processing. In the spirit of more recent posts here in The Information Age, this paper combines precisely those developments with an important topic, video denoising with deep neural network architectures, this time recurrent neural networks (RNNs); last but not least, the paper was published in a publication familiar to me, SPIE (the International Society for Optics and Photonics):

Deep RNNs for video denoising

Abstract:

Video denoising can be described as the problem of mapping from a specific length of noisy frames to clean one. We propose a deep architecture based on Recurrent Neural Network (RNN) for video denoising. The model learns a patch-based end-to-end mapping between the clean and noisy video sequences. It takes the corrupted video sequences as the input and outputs the clean one. Our deep network, which we refer to as deep Recurrent Neural Networks (deep RNNs or DRNNs), stacks RNN layers where each layer receives the hidden state of the previous layer as input. Experiment shows (i) the recurrent architecture through temporal domain extracts motion information and does favor to video denoising, and (ii) deep architecture have large enough capacity for expressing mapping relation between corrupted videos as input and clean videos as output, furthermore, (iii) the model has generality to learned different mappings from videos corrupted by different types of noise (e.g., Poisson-Gaussian noise). By training on large video databases, we are able to compete with some existing video denoising methods. © (2016) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE).

It was obviously a no-brainer to cover this paper. Once again we are dealing with novel approaches and methods in a scientific computing topic, the application of deep learning to video denoising, in this case deep end-to-end RNNs:

Nowadays, deep learning has made great progress in computer vision and pattern recognition applications (e.g., image classification using deep convolutional networks [1]), thanks to its enormous expressive ability for representation and fast execution on Graphics Processing Units (GPUs). How to apply deep learning to exploring time-sequence data and mining temporal information is a popular topic. RNNs are a superset of feedforward neural networks, with the ability to pass information across time steps. They were first widely used in the language processing domain, e.g., speech recognition [2] and image description [3]. In the computer vision domain, Nitish Srivastava et al. [4] verified that RNNs have the ability to learn both motion information and profile features from videos and successfully exploited such representations to perform pattern recognition, which motivates us to propose a novel patch-based deep RNNs model for video sequence denoising. To the best of our knowledge, our method is the first to propose an end-to-end deep RNN for video denoising. The procedure takes a few continuous noisy frames as input and outputs images where the noise has been reduced. Results show that removal of additive white Gaussian noise is competitive with the current state of the art. The approach is equally valid for other types of noise.

Model description

Knowing full well the challenges involved in video denoising, as it is different from image denoising, where approaches such as sparse coding methods, conditional random fields, variational techniques or patch-based methods work pretty well, and where great progress has been achieved by removing the noise with wavelet shrinkage or Wiener filtering, the authors were faced with the need for a different method: namely, one that properly accounts for global motion compensation and high temporal redundancy:

Video denoising is different, as video sequences have motion information and high temporal redundancy to be exploited during the noise-removal procedure. A common method of patch-based image denoising can be applied to video denoising by searching for similar patches on different frames over time. Liu and Freeman also use motion vectors and group patches across adjacent frames, but in a different manner. Instead of comparing patches to the reference patch, these are compared in each frame with the compensated patch of the reference one. NL-means [14] is applied to this group of collected patches. The proposed algorithm adaptively computes the noise model of the sequence, which is an important issue for real applications. Gijiesh and Zhou [15] took whole adjacent frames into account by using motion estimation for global motion compensation. Then a wavelet transform is applied to the piled frames, including the current frame and past/future frames. Mairal et al. [16] learnt multi-scale sparse representations for video restoration. VBM3D [12] evolves from the BM3D method by matching similar patches both within images and over different timestep frames by predictive-search block-matching. VBM4D, the state of the art for white noise removal in video, evolved from VBM3D by exploiting similarity between 3D spatio-temporal volumes instead of 2D patches and grouping similar volumes together by stacking them along an additional fourth dimension, thus producing a 4D structure. Collaborative filtering is realized by transforming each group through a decorrelating 4D separable transform and then by shrinkage and inverse transformation. Low-rank matrix completion is another practicable method combining the patch-based methodology. It groups similar patches in the temporal-spatial field and minimizes the nuclear norm ($\ell_1$ norm of all singular values) of the matrix with linear constraints, leading to state-of-the-art performance in mixed-noise denoising. Another way to utilize spatiotemporal information is to combine it with motion estimation. The model simultaneously captured the local correlations between the wavelet coefficients of natural video sequences across both space and time, strengthened with a motion compensation process.
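As a quick aside, the patch-grouping idea behind the NL-means-style methods mentioned above is easy to make concrete. Below is a minimal, hypothetical NumPy sketch (the function name and parameters are mine, not from the paper or any of the cited methods): it denoises one reference patch by a similarity-weighted average over patches collected from a small search window in the adjacent frames.

```python
import numpy as np

def temporal_nlmeans_patch(video, t, y, x, psize=8, search=5, h=0.1):
    """Denoise one patch by weighted averaging of similar patches
    gathered from adjacent frames (a minimal NL-means-style sketch).

    video: float array of shape (T, H, W), values in [0, 1].
    """
    ref = video[t, y:y + psize, x:x + psize]
    acc, wsum = np.zeros_like(ref), 0.0
    # Search a small spatial neighbourhood in the previous/next frames.
    for ft in range(max(0, t - 1), min(video.shape[0], t + 2)):
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                yy, xx = y + dy, x + dx
                if yy < 0 or xx < 0:
                    continue
                cand = video[ft, yy:yy + psize, xx:xx + psize]
                if cand.shape != ref.shape:
                    continue
                d2 = np.mean((cand - ref) ** 2)   # patch dissimilarity
                w = np.exp(-d2 / (h * h))         # similarity weight
                acc += w * cand
                wsum += w
    return acc / wsum
```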

This paragraph from the paper is a treasure trove of deep cutting-edge topics in Digital Signal Processing, with two or three topics each deserving more than a Blog post, just short of full-time courses on their own. But the focus of the model description of interest to us here is the application of deep recurrent neural network algorithms to achieve better performance on the video denoising task:

As recurrent neural networks (RNNs) can model long-term contextual information for video sequences, a great number of successes have been achieved in RNN-based video processing, ranging from high-level recognition through mid-level to low-level tasks. Recently, Yan Huang made a breakthrough on multi-frame super-resolution by using a bidirectional recurrent convolutional network for efficient multi-frame SR; it is remarkable work for using an RNN model in low-level video processing. To our knowledge, there is currently no video denoising by RNNs, so our work is the first to succeed in proposing an RNN-based model for video denoising.

(…)

Deep Recurrent Neural Networks

Recurrent neural networks are a powerful family of connectionist models that capture time dynamics via cycles in the graph. Information can cycle inside the network across timesteps. Details about recurrent neural networks can be found in a review of RNNs.

At time t, the hidden-layer units $h^{(t)}$ receive activation from the input $x^{(t)}$ at the current timestep and the hidden state $h^{(t-1)}$ from the previous timestep.

The output $y^{(t)}$ is computed from the hidden units $h^{(t)}$ at time t:

$$h^{(t)} = \sigma\left(W_{hx} x^{(t)} + W_{hh} h^{(t-1)} + b_h\right) \tag{1}$$

$$y^{(t)} = \sigma\left(W_{yh} h^{(t)} + b_y\right) \tag{2}$$

The weight matrices $W_{hx}$, $W_{hh}$, $W_{yh}$ and vector-valued biases $b_h$, $b_y$ parametrize the RNN; the activation function $\sigma(\cdot)$ (e.g., the tanh or sigmoid function) operates component-wise. In our model, all activation functions are the hyperbolic tangent, except for the output layer, which uses a linear function. The greatest difference between an RNN and ordinary neural networks is that the recurrent hidden units' states are influenced not only by the current inputs but also by the previous hidden state. As a result, the hidden units can be seen as containers carrying information from previous time steps. A deep recurrent neural network (deep RNN) is an extended, deeper RNN built by stacking one input layer, multiple hidden layers and one output layer.
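To make equations (1) and (2) concrete, here is a minimal NumPy sketch of the single-layer recurrent forward pass. The function name and the zero initial hidden state are my own assumptions; the paper's actual implementation is in Theano.

```python
import numpy as np

def rnn_forward(x_seq, Whx, Whh, Wyh, bh, by):
    """Forward pass of equations (1)-(2) over a sequence.

    x_seq: array of shape (T, n_in); returns outputs of shape (T, n_out).
    Hidden units use tanh; the output layer is linear, as in the paper.
    """
    h = np.zeros(Whh.shape[0])      # h^(0): zero initial hidden state
    ys = []
    for x_t in x_seq:
        # Eq. (1): new hidden state from current input and previous state
        h = np.tanh(Whx @ x_t + Whh @ h + bh)
        # Eq. (2): linear output layer
        ys.append(Wyh @ h + by)
    return np.stack(ys)
```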

In fact, the deep RNNs model stacks different layers in depth in the same way as multilayer perceptrons (MLPs). If we get rid of the cyclic procedure through timesteps, a deep RNN reduces to a multilayer perceptron. Generally speaking, deep RNNs have more hidden layers than a simple RNN. Hidden layer $h_l^{(t)}$ ($l \geq 2$) receives the layer below, $h_{l-1}^{(t)}$, and the previous hidden state at the current layer, $h_l^{(t-1)}$, as input:

$$h_l^{(t)} = \sigma\left(W_{h_l h_{l-1}} h_{l-1}^{(t)} + W_{h_l h_l} h_l^{(t-1)} + b_{h_l}\right)$$
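Correspondingly, the stacking rule above can be sketched in NumPy under the same assumptions as the previous snippet (tanh hidden units, linear output from the top hidden layer; the data layout is illustrative):

```python
import numpy as np

def deep_rnn_forward(x_seq, layers, Wyh, by):
    """Stacked (deep) RNN forward pass: one (W_in, W_rec, b) tuple per
    hidden layer; layer l sees h_{l-1}^(t) from below and its own h_l^(t-1).
    """
    states = [np.zeros(W_rec.shape[0]) for _, W_rec, _ in layers]
    ys = []
    for x_t in x_seq:
        below = x_t
        for l, (W_in, W_rec, b) in enumerate(layers):
            # Combine the layer below at time t with this layer's state at t-1
            states[l] = np.tanh(W_in @ below + W_rec @ states[l] + b)
            below = states[l]            # feed the layer above
        ys.append(Wyh @ below + by)      # linear output from the top layer
    return np.stack(ys)
```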

(…)

Applying Deep RNNs for Video Denoising

To denoise video, we decompose a given noisy video into overlapping patches frame by frame. Then we feed the corresponding cubes (a list of patches in the same position in each frame) into the trained model. In other words, the inputs can be obtained by sliding through the video volume with a 3D window of specific spatial size and temporal step. The denoised video is obtained by placing the denoised patches at the locations of their noisy counterparts, averaging the overlapping regions. When we pick overlapping patches, the stride is set to 3, as denoising performance is almost equally good with a smaller stride.
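As a rough illustration of that decomposition-and-reconstruction loop, here is a hypothetical NumPy sketch. The patch size, the `model` callable and the use of all frames in one cube are my simplifying assumptions; the paper slides a 3D window with a specific spatial size and temporal step.

```python
import numpy as np

def denoise_video(noisy, model, psize=17, stride=3):
    """Patch-based denoising loop: slide a window over the video,
    denoise each cube with `model`, and average overlapping outputs.

    noisy: float array of shape (T, H, W); `model` is assumed to map a
    cube of shape (T, psize, psize) to a denoised cube of the same shape.
    """
    T, H, W = noisy.shape
    out = np.zeros_like(noisy)
    weight = np.zeros_like(noisy)
    for y in range(0, H - psize + 1, stride):
        for x in range(0, W - psize + 1, stride):
            # Cube: patches at the same position across frames.
            cube = noisy[:, y:y + psize, x:x + psize]
            out[:, y:y + psize, x:x + psize] += model(cube)
            weight[:, y:y + psize, x:x + psize] += 1.0
    return out / np.maximum(weight, 1.0)   # average the overlapping regions
```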

 

After the experimental setup is concluded, the proper setting for the scientific computing and data science and engineering work is in place, ready for the implementation with the Python library Theano and the help of an NVIDIA Titan X GPU:

 

Implementation

The Python library Theano [28] is used to construct the deep RNNs (DRNNs) model. All models were trained on a single NVIDIA Titan X GPU. A two-layer deep RNN took about 10 days to converge. In order to make the training procedure more efficient, we apply some common tricks:

  • Data normalization: The pixel values are transformed to have approximately mean zero and variance one. More precisely, we subtract 0.5 from each pixel and divide by 0.2, assuming pixel values between 0 and 1.
  • Weight initialization: The neural net's weights are initialized from a uniform distribution $U\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$, where n is the size of the previous layer.
  • Learning strategy: The learning rate is initialized at 0.03 and stopped at 0.0001, while momentum is initialized at 0.9 and stopped at 0.999. Both are adjusted at the same step in every epoch. To avoid waiting for the maximum number of epochs when the validation error stops improving, we use an early-stopping strategy: we stop training when the validation error has not improved over the latest 200 epochs, and save the model at the epoch where the validation error reached its lowest (see the sketch after this list).
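These tricks are easy to state in code. Here is a minimal, framework-agnostic NumPy sketch of all three; the callables in the early-stopping loop are hypothetical stand-ins, since the paper's actual training code is in Theano.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(pixels):
    """Data normalization: subtract 0.5, divide by 0.2
    (pixel values assumed in [0, 1])."""
    return (pixels - 0.5) / 0.2

def init_weights(n_in, n_out):
    """Weight initialization: U[-1/sqrt(n), 1/sqrt(n)], n = fan-in."""
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_out, n_in))

def train_with_early_stopping(train_epoch, val_error, get_params,
                              max_epochs=10000, patience=200):
    """Early stopping: halt once the validation error has not improved
    for `patience` epochs, keeping the best parameters seen so far.
    The three callables stand in for a real training setup."""
    best_err, best_params, wait = np.inf, None, 0
    for _ in range(max_epochs):
        train_epoch()
        err = val_error()
        if err < best_err:
            best_err, best_params, wait = err, get_params(), 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_params
```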

Conclusion

 

Abbreviated:

Deep recurrent neural networks can exploit the temporal-spatial information of video to remove video noise, and the deep architecture has a large enough capacity to express the mapping between corrupted videos as input and clean videos as output. The results of our method approach state-of-the-art video denoising performance. Our proposed method does not assume any specific statistical properties of the noise and is robust in mapping corrupted input video to clean output video.

 

A wonderful topic for this Blog, and one quite close to a technical and scientific background dear to me and my interests.

 

Featured Image: The Neural Network Zoo – The Asimov Institute
