6 Gradient Descent Method
The gradient descent method is an important search strategy in machine learning. In this chapter, we explain the basic principle of gradient descent in detail and improve the algorithm step by step, so that the role of each parameter, especially the learning rate, becomes clear.
We also extend it to two variants, stochastic gradient descent and mini-batch gradient descent, to give a comprehensive view of the gradient descent family.
6-1 What is the gradient descent method

The derivative represents how much J changes per unit change in theta.

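The derivative also indicates a direction: the direction in which J increases. Gradient descent therefore moves theta in the opposite direction, scaled by the learning rate eta.
Not all functions have a unique extreme point (for example, multivariate functions of higher degree), so the search may end at a local minimum rather than the global one.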
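The update rule described above can be sketched on a one-dimensional example. The loss J(theta) = (theta - 2.5)^2 - 1, the starting point, and the values of eta and epsilon are assumptions chosen for illustration; they do not come from the original notes.

```python
def J(theta):
    # Assumed example loss with a single minimum at theta = 2.5
    return (theta - 2.5) ** 2 - 1.0

def dJ(theta):
    # Derivative of the assumed example loss
    return 2.0 * (theta - 2.5)

def gradient_descent_1d(initial_theta, eta=0.1, epsilon=1e-8, n_iters=10_000):
    theta = initial_theta
    for _ in range(n_iters):
        last_theta = theta
        # Move against the derivative, i.e. in the direction in which J decreases
        theta = theta - eta * dJ(theta)
        # Stop once the loss barely changes between two steps
        if abs(J(theta) - J(last_theta)) < epsilon:
            break
    return theta

theta_min = gradient_descent_1d(initial_theta=0.0)
print(theta_min)
```

With a much larger eta the steps can overshoot the minimum and diverge, and with a much smaller one convergence becomes slow, which is why the learning rate matters so much.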
6-3 Gradient descent method in linear regression
If the loss is the sum of squared errors, the magnitude of the gradient grows with the number of samples m, which is unreasonable. We therefore divide by m so that the gradient does not depend on the sample count.
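Written out for linear regression, and assuming that X_b denotes the sample matrix with a leading column of ones and y the vector of targets (notation borrowed from the code later in this chapter), the loss divided by m and its gradient take the following form:

```latex
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - X_b^{(i)} \theta \right)^2,
\qquad
\nabla J(\theta) = \frac{2}{m} \, X_b^{T} \left( X_b \theta - y \right)
```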
6-4 Gradient descent method in linear regression
6-5 Vectorization and data standardization of gradient descent method
6-6 Stochastic gradient descent method
Idea of the simulated annealing algorithm: it imitates the physical process of annealing and exploits the similarity between the annealing of a solid and general optimization problems.
Starting from an initial temperature, the algorithm lowers the temperature continuously and, using probabilistic jumps, searches the solution space at random for the global optimum.
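One common way to carry the annealing idea into stochastic gradient descent is to let the learning rate shrink as the iterations proceed, much like a falling temperature. The sketch below assumes the schedule eta = t0 / (step + t1); the function names, the schedule, and the synthetic data are illustrative assumptions, not the original author's implementation.

```python
import numpy as np

def dJ_sgd(theta, x_i, y_i):
    # Gradient of the squared error on a single sample
    return 2.0 * x_i * (x_i.dot(theta) - y_i)

def sgd(X_b, y, initial_theta, n_epochs=5, t0=5.0, t1=50.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = initial_theta.copy()
    m = len(X_b)
    step = 0
    for _ in range(n_epochs):
        # Visit the samples in a fresh random order in every epoch
        for i in rng.permutation(m):
            # Assumed annealing-style schedule: the step size decays over time
            eta = t0 / (step + t1)
            theta = theta - eta * dJ_sgd(theta, X_b[i], y[i])
            step += 1
    return theta

# Synthetic data (assumed): y = 3 + 4 * x + noise
rng = np.random.default_rng(666)
x = rng.normal(size=(2000, 1))
X_b = np.hstack([np.ones((len(x), 1)), x])
y = X_b.dot(np.array([3.0, 4.0])) + rng.normal(size=2000)

theta = sgd(X_b, y, np.zeros(2))
print(theta)
```

Early steps are large and can jump around, while later steps are small, so the estimate settles instead of oscillating around the minimum.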
6-8 How to verify that the gradient is calculated correctly? Debugging the gradient descent method
```python
# Run in a Jupyter notebook (.ipynb), where a cell's last expression is displayed without print()
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
X = np.random.random(size=(1000, 10))
true_theta = np.arange(1, 12, dtype=float)
X_b = np.hstack([np.ones((len(X), 1)), X])
y = X_b.dot(true_theta) + np.random.normal(size=1000)

print(X.shape)
print(y.shape)
print(true_theta)

def J(theta, X_b, y):
    # Loss function: mean squared error
    try:
        return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)
    except Exception:
        return float("inf")

def dJ_math(theta, X_b, y):
    # Gradient computed from the mathematical formula
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

def dJ_debug(theta, X_b, y, epsilon=0.01):
    # Gradient approximated numerically (central difference), used for debugging
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
    return res

def gradient_descent(dJ, X_b, y, initial_theta, eta=1e-2, epsilon=1e-8, n_iters=1e4):
    theta = initial_theta
    i_iters = 0
    while i_iters < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        # Stop when the loss no longer changes noticeably
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break
        i_iters += 1
    return theta

X_b = np.hstack((np.ones((len(X), 1)), X))
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01

%time theta = gradient_descent(dJ_debug, X_b, y, initial_theta, eta)
theta

%time theta = gradient_descent(dJ_math, X_b, y, initial_theta, eta)
theta
```
Tip
dJ_debug is used to verify the gradient while debugging. It is slow, so take a small number of samples, compute the gradient with dJ_debug as a reference, then derive the mathematical formula and compare its result against that reference.
dJ_debug does not depend on the form of the loss function J. It only evaluates J, so it works for any loss function.
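The comparison described in the tip can be sketched as follows. The small synthetic data set (20 samples, 3 features) and the test point theta are assumptions for illustration; J, dJ_math, and dJ_debug follow the definitions used in this section.

```python
import numpy as np

def J(theta, X_b, y):
    # Mean squared error
    return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)

def dJ_math(theta, X_b, y):
    # Gradient from the derived formula
    return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(y)

def dJ_debug(theta, X_b, y, epsilon=0.01):
    # Central-difference approximation; only needs to evaluate J
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
    return res

# Small assumed sample so that the slow numerical gradient stays cheap
rng = np.random.default_rng(0)
X_small = rng.random(size=(20, 3))
X_b = np.hstack([np.ones((len(X_small), 1)), X_small])
y = X_b.dot(np.array([1.0, 2.0, 3.0, 4.0])) + rng.normal(size=20)
theta = rng.normal(size=4)

# Largest component-wise disagreement between the two gradients
max_diff = np.max(np.abs(dJ_debug(theta, X_b, y) - dJ_math(theta, X_b, y)))
print(max_diff)
```

If the derived formula were wrong, for example with a flipped sign or a missing factor of 2, max_diff would be far from zero, which makes this check a quick way to catch such mistakes.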
6-9 More in-depth discussion on gradient descent method
BGD (batch gradient descent): every step traverses the whole sample set, so each step is guaranteed to follow the direction of steepest descent. It is stable but slow.
SGD (stochastic gradient descent): every step looks at only one sample. The direction of each step is uncertain and may even point the wrong way. It is fast but unstable.
MBGD (mini-batch gradient descent): a compromise between the two extremes that uses k samples per step, where k becomes a hyperparameter.
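The three variants differ only in how many samples each step uses, which a single sketch can show. The function names, the batch size k=16, the fixed learning rate, and the synthetic data below are assumptions for illustration. Setting k to the number of samples gives BGD, and k=1 gives SGD.

```python
import numpy as np

def dJ_batch(theta, X_batch, y_batch):
    # Mean squared error gradient computed on the current batch only
    return X_batch.T.dot(X_batch.dot(theta) - y_batch) * 2.0 / len(y_batch)

def mini_batch_gd(X_b, y, initial_theta, k=16, eta=0.01, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = initial_theta.copy()
    m = len(X_b)
    for _ in range(n_epochs):
        # Shuffle once per epoch, then walk through the data k samples at a time
        idx = rng.permutation(m)
        for start in range(0, m, k):
            batch = idx[start:start + k]
            theta = theta - eta * dJ_batch(theta, X_b[batch], y[batch])
    return theta

# Synthetic data (assumed): y = 3 + 4 * x + noise
rng = np.random.default_rng(666)
x = rng.normal(size=(1000, 1))
X_b = np.hstack([np.ones((len(x), 1)), x])
y = X_b.dot(np.array([3.0, 4.0])) + rng.normal(size=1000)

theta = mini_batch_gd(X_b, y, np.zeros(2))
print(theta)
```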
Summary: some of the related code ran only in VS Code and failed in Jupyter, especially code that calls the hstack() function.