

Learned Dynamic Guidance for Depth Image Reconstruction

Shuhang Gu1, Shi Guo2, Wangmeng Zuo2, Yunjin Chen3, Radu Timofte1, Luc Van Gool1,5, Lei Zhang4
1 Computer Vision Lab, ETH Zurich, 2 Harbin Institute of Technology, 3 ULSee Inc., 4 The Hong Kong Polytechnic University, 5 KU Leuven.

Abstract—The depth images acquired by consumer depth sensors (e.g., Kinect and ToF) are usually of low resolution and insufficient quality. One natural solution is to incorporate a high resolution RGB camera and exploit the statistical correlation between its data and depth. In recent years, both optimization-based and learning-based approaches have been proposed to deal with guided depth reconstruction problems. In this paper, we introduce a weighted analysis sparse representation (WASR) model for guided depth image enhancement, which can be considered a generalized formulation of a wide range of previous optimization-based models. We unfold the optimization of the WASR model and conduct guided depth reconstruction with dynamically changed stage-wise operations. Such a guidance strategy enables us to dynamically adjust the stage-wise operations that update the depth image, thus improving the reconstruction quality and speed. To learn the stage-wise operations in a task-driven manner, we propose two parameterizations and their corresponding methods: dynamic guidance with Gaussian RBF nonlinearity parameterization (DG-RBF) and dynamic guidance with CNN nonlinearity parameterization (DG-CNN). The network structures of the proposed DG-RBF and DG-CNN methods are designed with the objective function of our WASR model in mind, and the optimal network parameters are learned from paired training data. Such optimization-inspired network architectures enable our models to leverage previous expertise as well as benefit from training data. The effectiveness is validated for guided depth image super-resolution and for realistic depth image reconstruction tasks using standard benchmarks.
Our DG-RBF and DG-CNN methods achieve the best quantitative results (RMSE) and better visual quality than the state-of-the-art approaches at the time of writing. The code is available at https://github.com/ShuhangGu/GuidedDepthSR

1 INTRODUCTION

High quality, dense depth images play an important role in many real world applications such as human pose estimation [1], hand pose estimation [2], [3] and scene understanding [4]. Traditional depth sensing is mainly based on stereo or lidar, which comes with a high computational burden and/or price. The recent proliferation of consumer depth sensing products, e.g., RGB-D cameras and Time of Flight (ToF) range sensors, offers a cheaper alternative for dense depth measurement. However, the depth images generated by such consumer depth sensors are of lower quality and resolution. It is therefore of great interest whether depth image enhancement can make up for those flaws [5], [6], [7], [8], [9], [10], [11]. To improve the quality of depth images, one category of methods [5], [6] utilizes multiple images of the same scene to provide complementary information. These methods, however, rely heavily on accurate calibration and are not applicable in dynamic environments. Another category of approaches [7], [8], [9], [11], [12] introduces structure information from a guidance image (for example, an RGB image) to improve the quality of the depth image. As in most cases the high quality RGB image can be acquired simultaneously with the depth image, such guided depth reconstruction has a wide range of applications [13].

A key issue of guided depth enhancement is to appropriately exploit the structural scene information in the guidance image. By incorporating the guidance image in the weight calculation step, joint filtering methods [14], [15], [16], [17] directly transfer structural information from the intensity image to the depth image [18], [19].
Yet, due to the complex relationship between the local structures of intensity and depth, such simple joint filtering methods are highly sensitive to their parameters, and often copy unrelated textures from the guidance image into the depth estimate. To better model the relationship between the intensity image and the depth image, optimization-based methods [7], [8], [9] adopt objective functions to characterize their dependency. Although the limited number of parameters in these heuristic models has restricted their capacity, these elaborately designed models still capture certain aspects of the joint prior, and have delivered highly competitive enhancement results. Recently, discriminative learning solutions [10], [20], [21], [22] have also been proposed to capture the complex relationships between intensity and depth. Owing to the unparalleled non-linear modeling capacity of deep neural networks as well as the availability of paired training data, deep learning based methods [21], [22] have achieved better enhancement performance than conventional optimization-based approaches.

To deal with the guided depth reconstruction task, recent solutions [20], [21], [22] utilize deep neural networks (DNN) to build the mapping function from the low quality inputs and the guidance images to the high quality reconstruction results. As in other dense estimation tasks [23], [24], [25], an appropriate network structure plays a crucial role in the success of a DNN-based guided depth reconstruction system. Recently, a large number of works [25], [26], [27], [28] have shown that some successful optimization-based models can provide useful guidelines for designing network architectures. By unrolling the optimization process of variational or graphical models, network structures have been designed to solve image denoising [26], [27], compressive sensing [29] and semantic segmentation [25]. These networks employ domain knowledge as well as paired training data and have achieved state-of-the-art performance for different tasks. In this paper, we analyze and generalize previous optimization-based approaches, and propose better network structures to deal with the guided depth reconstruction task.

Fig. 1. Illustration of the unfolded optimization process of a WASR model. The WASR model takes a low quality depth estimate Y and a guidance intensity image G as input, and aims to produce a high quality depth image X. Each step of the optimization process can be termed a stage-wise operation. By dynamically changing the stage-wise operation, we construct the DG-RBF and DG-CNN models for fast and accurate guided depth reconstruction.

Closely related to this paper is the work of Riegler et al. [30], which unrolls the optimization steps of a non-local variational model [31] and proposes a primal-dual network (PDN) to deal with the guided depth super-resolution task. Yet, PDN strictly follows the unrolled formula of the non-local regularization model [31], and only adopts the pre-defined operator (Huber norm) to penalize point-wise differences between depth pixels. As a result, the PDN method [30] has limited flexibility to take full advantage of paired training data. In this paper, we propose a more flexible solution to exploit paired training data as well as prior knowledge from previous optimization-based models. We analyze previous dependency modeling methods and generalize them as a weighted analysis sparse representation (WASR) regularization term. By unfolding the optimization process of the WASR model, we obtain the formula of a stage-wise operation for guided depth enhancement, and use it as the departure point for our network structure design. In Fig. 1, we provide a flowchart of the general formula of the unfolded optimization process of the WASR model.
Each iteration of the optimization algorithm can be regarded as a stage-wise operation to enhance the depth map.

WASR is a generalized model which shares many of the characteristics common to previous optimization-based approaches [7], [32]. Unfolding its optimization process provides us with a framework to leverage the previous expertise while leaving our model enough freedom to take full advantage of training data. With the general formula of the stage-wise operation established, we adopt two approaches to parameterize the operations. The first approach parameterizes the unfolded WASR model in a direct way. Based on the unfolded optimization process, the stage-wise operations consist of simple convolutions and nonlinear functions. We learn the filters and nonlinear functions (parameterized as the summation of Gaussian RBF kernels [26], [27]) for each stage-wise operation, in a task-driven manner. Although this model shares its optimization formula with a simple WASR model, its operations are changed dynamically to account for the depth enhancement. As a result, it can generate better enhancements in just a few stages. In the remainder of this paper, we denote this model as dynamic guidance with RBF nonlinearity parameterization (DG-RBF). An illustration of one stage of the DG-RBF operation can be found in Fig. 2.

Besides the DG-RBF model, we also propose to parameterize the stage-wise operation more loosely. In particular, we analyze the stage-wise operation's formula and divide the operation into three sub-components: the depth encoder, the intensity encoder and the depth decoder.
Instead of using one large filter and one nonlinear function to form the encoder and the decoder in the stage-wise operation, we use several layers of convolutional neural networks (CNN) to improve the capacity of each sub-component. The overall model of this dynamic guidance with CNN nonlinearity parameterization (DG-CNN) is designed based on the unfolded optimization process of the WASR model, while its sub-components are parameterized with powerful CNNs. As DG-CNN builds upon both the conventional optimization-based approach and recent advances in deep learning, it generates better enhancement results than the existing methods. An illustration of a two-stage DG-CNN model can be found in Fig. 3; details of the networks are given in Section 5.

The formula of the WASR model and some experimental results of the DG-RBF method have been introduced in our earlier conference paper [33]. In this paper, we provide more information about the WASR model and the DG-RBF method, and introduce the DG-CNN approach, a new parameterization of the WASR model. Due to its unparalleled nonlinearity modeling capacity, the CNN-based parameterization often generates better enhancement results than the Gaussian RBF based method, especially in challenging cases with large zooming factors. Furthermore, well optimized deep learning toolboxes make the CNN-based method (DG-CNN) more efficient than DG-RBF in both training and testing.

The contributions of this paper are summarized as follows:

• By analyzing previous guided depth enhancement methods, we formulate the dependency modeling of depth and RGB images as a weighted analysis sparse representation (WASR) model. We unfold the optimization process of the WASR objective function, and propose a task-driven training strategy to learn stage-wise dynamic guidance for different tasks. A Gaussian RBF kernel nonlinearity modeling method (DG-RBF) and a special CNN (DG-CNN) are trained to conduct depth enhancement at each stage.
• We conduct detailed ablation experiments to analyze the model hyper-parameters and network architecture. The experimental results clearly demonstrate the effectiveness of the optimization-inspired network architecture design.
• Experimental results on depth image super-resolution and noisy depth image reconstruction validate the effectiveness of the proposed dynamic guidance approach. The proposed algorithm achieves the best quantitative and qualitative depth enhancement results among the state-of-the-art methods that we compared to.
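To make the idea of a stage-wise operation built from "simple convolutions and nonlinear functions" more concrete, the following is a minimal NumPy sketch of one unrolled, gradient-style update whose nonlinearity is a sum of Gaussian RBF kernels. It is an illustration under stated assumptions, not the authors' implementation: the quadratic data term, the helper names (`filter2d`, `rbf_mixture`, `stage`), the toy filter and all parameter values are hypothetical, and the guidance image G, which the WASR model uses in its regularizer, is left out because its exact form is not given in this part of the text.

```python
import numpy as np

def filter2d(x, k):
    """Correlate image x with an odd-sized kernel k ('same' size, symmetric padding)."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)), mode="symmetric")
    out = np.zeros(x.shape, dtype=float)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += k[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def rbf_mixture(z, weights, centers, gamma):
    """phi(z) = sum_j w_j * exp(-(z - mu_j)^2 / (2 * gamma^2)), applied element-wise."""
    d = z[..., None] - centers
    return np.sum(weights * np.exp(-d ** 2 / (2.0 * gamma ** 2)), axis=-1)

def stage(x, y, filters, rbf_weights, centers, gamma, lam):
    """One unrolled gradient-style update of the depth estimate x (guidance omitted)."""
    grad = lam * (x - y)  # assumed data term: 0.5 * lam * ||x - y||^2
    for k, w in zip(filters, rbf_weights):
        response = filter2d(x, k)
        # The rotated kernel approximates the adjoint of filter2d.
        grad += filter2d(rbf_mixture(response, w, centers, gamma), k[::-1, ::-1])
    return x - grad

# Hypothetical toy run: random "depth" input and one horizontal-difference filter.
rng = np.random.default_rng(0)
y = rng.random((8, 8))
filters = [np.array([[0.0, 0.0, 0.0], [-1.0, 1.0, 0.0], [0.0, 0.0, 0.0]])]
centers = np.linspace(-1.0, 1.0, 5)
rbf_weights = [0.1 * centers]  # odd-symmetric influence function (assumption)
x = y.copy()
for _ in range(3):  # parameters are reused here; the paper learns them per stage
    x = stage(x, y, filters, rbf_weights, centers, gamma=0.5, lam=0.5)
```

In a learned model of this kind, the filters and the RBF weights of each stage would be trained from paired data rather than fixed by hand as in this toy run.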

The rest of this paper is organized as follows. Section 2 briefly introduces some related work. Section 3 analyzes previous objective functions of guided depth enhancement approaches, and introduces the task-driven formulation of the guided depth enhancement task. By unrolling the optimization process of the task-driven formulation, Sections 4 and 5 introduce two parameterization approaches, i.e., parameterizing the nonlinear operation in each step with Gaussian RBF kernels, or parameterizing each gradient descent stage with convolutional neural networks. Section 6 conducts ablation experiments to analyze the model hyper-parameters and to show the advantage of the optimization-inspired network architecture design. Sections 7 and 8 provide experimental results of the different methods for guided depth super-resolution and enhancement. Section 9 discusses the DG-RBF and DG-CNN models. Section 10 concludes the paper.

2 RELATED WORK

In this section, we introduce related work. We start by briefly surveying the literature on analysis representation models, and then review prior guided depth enhancement methods. Finally, we discuss previous work on optimization-inspired network architecture design.

2.1 Analysis sparse representation

Sparse analysis representations have been widely applied in image processing and computer vision tasks [26], [27], [34], [35], [36], [37]. Analysis operators [38] act on image patches, while analysis filters [36], [39] act on whole images, to model the local structure of natural images. Compared with sparse synthesis representations, the analysis model adopts an alternative viewpoint for union-of-subspaces reconstruction by characterizing the complement subspace of signals [40], and usually results in more efficient solutions.

Here we only consider the convolutional analysis representation, with one of its representative forms given by:

$$\hat{X} = \arg\min_X L(X, Y) + \sum_l \sum_i \rho_l\big((k_l \otimes X)_i\big), \qquad (1)$$

where $X$ is the latent high quality image and $Y$ is its degraded observation. $\otimes$ denotes the convolution operator, and $(\cdot)_i$ denotes the value at position $i$. The penalty function $\rho_l(\cdot)$ is introduced to characterize the analysis coefficients of the latent estimate, which are generated by the analysis dictionaries $\{k_l\}_{l=1,\dots,L}$ in a convolutional manner. $L(X, Y)$ is the data fidelity term determined by the relationship between $X$