Lecture Notes On Linear Regression - GitHub Pages

4m ago
455.98 KB
6 Pages

Lecture Notes on Linear RegressionFeng [email protected] University, China1Linear Regression ProblemIn regression problem, we aim at predicting a continuous target value given aninput feature vector. We assume a n-dimensional feature vector is denoted byx 2 Rn , while y 2 R is the output variable. In linear regression models, thehypothesis function is defined byh (x) n xn n1 xn 1 · · · 1 x1 0Geometrically, when n 1, h (x) is actually a line in a 2D plane, while h (x)represents a plane in a 3D space when n 2. Generally, when n3, h (x)defines a so-called “hyperplane” in a higher dimensional space. Suppose2323 nxn6 n 1 76xn 1 767676 . 767 6 . 7 , and x 6 . 767674 1 54 x1 5 01the hypothesis function h (x) can be re-written ash (x) T x(1)where 2 Rn 1 is a parameter vector.It is apparent that the hypothesis function is parameterized by . Sinceour goal is to make predictions according to the hypothesis function given anew test data, we need to find the optimal value of such that the resultingprediction is as accurate as possible. Such a procedure is so-called training.The training procedure is performed based on a given set of m training data{x(i) , y (i) }i 1,··· ,m . In particular, we are supposed to find a hypothesis function(parameterized by ) which fits the training data as closely as possible. Tomeasure the error between h and the training data, we define a cost function(also called error function) J( ) : Rn 1 ! R as follows1 X h (x(i) )2 i 1mJ( ) y (i) 2Our linear regression problem can be formulated as1 X T (i) x2 i 1mmin J( ) 1y (i) 2

Figure 1: 3D linear regression.Specifically, we aim at minimizing J( ) over . We give an illustration in Fig. 1to explain linear regression in 3D space (i.e., n 2). In the 3D space, thehypothesis function is represented by a hyperplane. The red points denote thetraining data, and the distance from the (read) training data to the hyperplaneis denoted by T x(i) y (i) .2Gradient DescentGradient Descent (GD) method is a first-order iterative optimization algorithmfor finding the minimum of a function. If the multi-variable function J( ) isdi erentiable in a neighborhood of a point , then J( ) decreases fastest if onegoes from in the direction of the negative gradient of J at . LetrJ( ) [@J @[email protected] T,,··· ,]@ 0 @ [email protected] n(2)denote the gradient of J( ). In each iteration, we update according to thefollowing rule: rJ( )(3)where is a step size. In more details, j j @J( )@ j(4)The update is terminated when convergence is achieved. In our linear regressionmodel, the gradient can be calculated [email protected]( )@ 1 X T (i) ( [email protected] [email protected] j 2 i 1y (i) )2 mX( T x(i)(i)y (i) )xj(5)i 1We summarize the GD method in Algorithm 1. The algorithm usually starts2

Algorithm 1: Gradient DescentGiven a starting point 2 dom Jrepeat1. Calculate gradient rJ( );2. Update rJ( )until convergence criterion is satisfiedFigure 2: The convergence of GD algorithm.with a randomly initialized . In each iteration, we update such that the objective function is decreased monotonically. The algorithm is said to be convergedwhen the gradient has its magnitude less than or equal to a predefined threshold(say "), i.e.krf (x)k2 "where k · k2 is 2 norm, such that the values of the objective function di er veryslightly in successive iterations. Another convergence criterion is to set a fixedvalue for the maximum number of iterations, and the algorithm is terminatedafter the number of the iterations exceeds the threshold. We illustrate howthe algorithm converges iteratively in Fig. 2. The colored contours representthe objective function, and GD algorithm converges into the minimum step-bystep.The choice of the step size actually has a very important influence onthe convergence of the GD algorithm. We illustrate the convergence processesunder di erent step sizes in Fig. 3.3Stochastic Gradient DescentAccording to Eq. 5, it is observed that we have to visit all training data ineach iteration. Therefore, the induced cost is considerable especially when thetraining data are of big size.3

0.6 0.06 0.07 0.0710.5Objective function ationsFigure 3: The convergence of GD algorithm under di erent step sizes.Stochastic Gradient Descent (SGD), also known as incremental gradientdescent, is a stochastic approximation of the gradient descent optimizationmethod. In each iteration, the parameters are updated according to the gradient of the error (i.e., the cost function) with respect to one training sampleonly. Hence, it entails very limited cost.We summarize the SGD method in Algorithm 2. In each iteration, we firstrandomly shu e the training data, and then choose only one training example tocalculate the gradient (i.e., rJ( ; x(i) , y (i) )) to update . In our linear regressionmodel, rJ( ; x(i) , y (i) ) is defined asrJ( ; x(i) , y (i) ) ( T x(i)y (i) )x(i)(6)(i)(7)and the update rule is j j ( T x(i)y (i) )xjAlgorithm 2: Stochastic Gradient Descent for Linear Regression1: Given a starting point 2 dom J2: repeat3:Randomly shu e the training data;4:for i 1, 2, · · · , m do5: rJ( ; x(i) , y (i) )6:end for7: until convergence criterion is satisfiedCompared with GD where the objective cost function is decreased monotonically in each step, SGD does not have such a guarantee. In fact, SGD entails4

more steps to converge, but each step is cheaper. One variants of SGD is socalled mini-batch SGD, where we pick up a small group of training data and doaverage to accelerate and smoothen the convergence. For example, by randomlychoosing k training data, we can calculate the average the gradientk1XrJ( ; x(i) , y (i) )k i 14(8)A Closed-Form Solution to Linear RegressionWe first look at the vector form of the linear regression model. Assume2 (1) T 32 (1) 3(x )y676 . 7.X 4Y 4 . 55.(x(m) )T(9)y (m)Therefore, we haveX 23(x(1) )T 67.Y 45.x(m) )T 23 2y (1)h (x(1) )6 . 7 6.4 . 5 4.y (m)h (x(m) )y (1)y (m)Then, the cost function J( ) can be redefined asmJ( ) 1X(h (x(i) )2 i 1y (i) ) 1(X 2Y )T (X 375Y)(10)To minimize the cost function, we calculate its derivative and let it be zeror J( ) 1r (Y X )T (Y X )21r (Y T T X T )(Y X )21r tr(Y T Y Y T X T X T Y T X T X )21r tr( T X T X ) X T Y21 T(X X X T X ) X T Y2X T X X T YSince X T X X T Y 0, we have (X T X) 1 X T Y . Note that the inverse ofX T X does not always exist. In fact, the matrix X T X is invertible if and onlyif the columns of X are linearly independent.5A Probabilistic InterpretationAn interesting question is why the least square form of the linear regressionmodel is reasonable. We hereby give a probabilistic interpretation. We supposea target value y are sampled from a “line” T x with noise ". Therefore, we havey x "5

We assume " denote the noise and is independently and identically distributed(i.i.d.) according to a Gaussian distribution N (0, 2 ). The density of "(i) isgiven by 1 2f ( ) pexp2 22 Hence, the conditional probability density function of y given x is defined by 1(y T x)2p(y x; ) pexp2 22 It is shown that the distribution of y given x is parameterized by , i.e.,y x; N ( T x,2)Considering the training data {(x(i) , y (i) )}i 1,··· ,m are sampled independently, we define the corresponding likelihood function YY 1(y (i) T x(i) )2(i)(i)pexpL( ) p(y x ; ) 2 22 iiTo calculating the optimal such that the resulting linear regression modelfits the given training data best, we need to maximize the likelihood L( ). Tosimplify the computation, we use the log-likelihood function instead, i.e., ( ) log L( ) mY1(y (i) T x(i) )2plogexp2 22 i mX1(y (i) T x(i) )2log pexp2 22 i11 X (i)m log p(y T x(i) )22 2 i2 Apparently, maximizing L( ) is equivalent to minimizing1 X (i)(y2 i T x(i) )2Now, we conclude that, the rationality of adopting the least square in the linearmodel comes from the fact that the training data are sampled with Gaussiannoise.6