Logistic Regression

Logistic Regression

*note* if you just want to play with the source code check out my GitHub for the final, revised, easy to run code or go to the bottom of the page. For you hardcore mathy folks… just read through

Logistic Regression or Logit Regression is a regression model where the dependent variable is categorical. The model is composed of some set of variable weights, w, and takes in data, x, to make predictions about y, the dependant variable.

Binary Classification

The simplest case of logistic regression is binary classification. That is, where the output can take on one of two values. For some variable y it must belong to a set A which is typically written as:


A could similarly be written to take on values like (‘Pass’,’Fail’), (True,False), (‘Heads’,’Tails’) et cetra.

Logistic Regression tries to estimate y by finding the answer to:


The Sigmoid Function


The first question you might have is: Why do we want to use the sigmoid function? The following python code will plot the sigmoid function for some input z

# Import some useful libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab

# Computes the sigmoid on an array of values elementwise
def sigmoid(z):
    return 1/(1+np.exp(-z))

# Sample the sigmoid function in z
z = np.arange(-10,10,.05)
a = sigmoid(z)

############################## Plot ###############################
# Set some fun Plot Parameters
params = {'legend.fontsize': 'xx-large',
'figure.figsize': (10, 5),
'axes.labelsize': 'xx-large',

# Setup Subplot
fig = plt.figure()
plt.title(r'$\sigma(z)$ vs. $z$')

sigmoid fig1

Upon closer look, we can see that the sigmoid function takes on values continuously between 0 and 1 which is ideal for the creation of the probability function we require for binary classification.

Logistic Regression Computational Graph

A compuational graph of logistic regression can be made that makes it easy to see the steps required to generate estimates. That graph can be seen below


This should make it fairly clear how to compute estimates of y via y hat

Computing the Cost Function

Now that our model can generate estimates, how can we calculate the required updates of our model parameters (w,b) in order to correct the error between our model and the true value?

The function that calculates this error on a sample by sample basis is called the loss function. It generally takes the form of:

This equation takes the estimate of y and y and returns a metric of the difference between the two.

With logistic regression that loss function typically takes on the form


Talking through it, it states that if the label y is equal to 1, then the loss is equal to log(P(y=1)|x,w,b). If the label is zero then the loss is equal to log(P(y=0|x,w,b))

The Cost function is computed over many examples of x and y. Therefore the cost function is an average of the output of the loss function computed as


Updating Parameters From Cost

How do we actually update the parameters of linear regression as a function of the loss? Prefereably we’d take a look at the loss function and try to move the parameters so that the loss function is minimized. This can be achived by finding the gradient of the loss function with respect to the parameters and applying it to the original parameter. We must find:


from the equations in section 4 we can deduce through the chain rule that:


as a result we can surmise that:


and therefore:

Once we’ve computed these gradients, we can update each parameter by nudging the parameters by some step size in the direction of the gradient. This will have the result of reducing the cost the new model would generate if it observed the same input samples. This is just one way in which we can update the logistic regression model, in order to learn more about this process check out the post on optimization algorithms.

Of course, we typically wont update the model off of the basis of a single input sample. We’d want compute the gradient of the cost function. Following from above this simply becomes:


Now we have all the elements required to update our logistic regression model with stochastic gradient descent. The process follows the following flow:

Logistic Regression with Stochastic Gradient Descent:

  1. Wrangle data
  2. Initialize model
  3. Compute model estimates from samples
  4. Compute updates
  5. Apply updates
  6. Repeat steps 3, 4, and 5

this process is reflected in the code below:

# Wrangle Data
n    = 1000 # number of steps
m    = 1000  # Samples per step
k    = 2    # number of features
dims = {'numExperiments':0,'numSamps':1}
class1 = np.random.uniform(-10,10,(k,m))
class2 = np.random.uniform(-5,5,(k,m))
x = np.concatenate((class1,class2),axis=1)
y = np.concatenate((np.ones((1,m)), np.zeros((1,m))),axis = 1)

fig = plt.figure()
plt.title('x vs. y')
plt.scatter(class1[1],class1[0],label='Class 1')
plt.scatter(class2[1],class2[0],label='Class 2')

eps = 1/(10**7) # for numerical stability

# Initialize Model
weights = np.random.randn(k,1) # Initialize the weights to a random value
bias    = 0                    # Initialize bias parameter to 0
learningRate = .01
biasLog = []
weightLog1 = []
weightLog2 = []
costLog   = []

for i in range(n):
    # Compute the model estimates recall z = w'x+b a = sigma(z)
    z = np.dot(weights.transpose(),x)+bias
    a = sigmoid(z)

    # Compute the Updates
    J = (y*np.log(a)+(1.0-y)*np.log(1-a))
    dyhat = (-y/(a+(eps)))+(1.-y)/((1.-a)+(eps))
    dz = a*(1.-a)
    dwn = x

    dw = np.mean(dyhat*dz*x,axis=1,keepdims=True)
    db = np.mean(dyhat*dz,axis=1,keepdims=True)

    # Apply Updates
    weights = weights - learningRate*dw
    bias    = bias-learningRate*db

def prediction(weights,bias,x):
    z = np.dot(weights.transpose(),x)+bias
    a = sigmoid(z)>.5

We can see that these two classes slightly overlap and the stochastic gradient decent algorithm finds the decision line that minimizes the errors.



Impact of Samples per Update

Lets say that the function you use to estimate y is completely random. What would that mean for the cost function? Below we’ve plotted the cost for 1000 experiments that had 2 samples in each experiment. In the figure below you can see the cost of each of these experiments. You can see that a random guesser produces an average cost of around .74 and a fairly large variance of .04.

# Set up variables
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.animation as manimation

# Import movie writer
FFMpegWriter = manimation.writers['ffmpeg']
metadata = dict(title='Movie Test', artist='Matplotlib',
comment='Movie support!')
writer = FFMpegWriter(fps=2, metadata=metadata)

# Compute Loss
def computeLoss(y,yhat):
    return y*np.log(yhat)+(1-y)*np.log(1-yhat)

# Compute sigmoid
def sigmoid(z):
    return 1/(1+np.exp(-z))

# Setup figure
fig = plt.figure()
l, = plt.plot([], [],'o')
plt.xlim(0, 1000)

with writer.saving(fig, "writer_test.mp4", 100):
    n = 1000 # number of experiments
    dims = {'numExperiments':0,'numSamps':1}
    var = []
    mu = []

    for m in range(1,100,4):
        y = np.random.uniform(0,1,(n,m))>.5

        yhat = sigmoid(np.random.uniform(0,1,(n,m)))

        loss = computeLoss(y,yhat)

        cost = -(np.mean(loss,axis=dims['numSamps']))

        mu = np.mean(cost)

        var = np.var(cost)

        plt.title(r'Experiment (%d samples) vs. Cost $\mu$ = %.3f $\sigma$(cost)=%.3f)' % (m,mu,var))

        l.set_data(range(n), cost)


The graph below shows the effect of increasing the number of samples per experiment on the variance of the cost function. As the number of samples goes up, the variance in the ultimate cost estimate drops. This means that you will more reliably move down the cost function in the right direction as you base your decision on more data.

n    = 1000 # number of experiments
mSet = range(1,100,2) # samples per experiment
dims = {'numExperiments':0,'numSamps':1}
var  = []
mu   = []
for m in mSet:
    yhat = sigmoid(np.random.uniform(0,1,(n,m)))
    y    = np.random.uniform(0,1,(n,m))>.5
    loss = computeLoss(y,yhat)
    cost = -(np.mean(loss,axis=dims['numSamps']))

def computeLoss(y,yhat):
    return y*np.log(yhat)+(1-y)*np.log(1-yhat)

# Compute the cost function of each sample

fig = plt.figure()
plt.title(r'Samples Per Experiment vs. $\sigma$(cost)')
plt.ylabel(r'Samples Per Experiment')
fig = plt.figure()
plt.title(r'Samples Per Experiment vs. $\sigma$(cost)')
plt.ylabel(r'Samples Per Experiment')



And that is all I have to say about logistic regression with stochastic gradient decent! There are many different methods to optimize stochastic gradient decent. Maybe we’ll take a look at that in a later post.

If you’re interested in playing around with the source, you can grab the jupyter notebook I used to generate this post on my git hub at GitHub: 3dbabove

# Import Packages
import numpy as np

# Create Data Set
def createDataSetOverlapSquares(k,m):
    Create the data set used in regression
        - k - The number of input features
        - m - The number of samples to generate
        - x - The (k,m) Data
        - y - The (1,m) labels
    assert(k>0 and m>0)

    dims = {'numExperiments':0,'numSamps':1}
    class1  = np.random.uniform(10,20,(k,m))
    class2  = np.random.uniform(5,15,(k,m))
    xTrain = np.concatenate((class1,class2),axis=1)
    yTrain = np.concatenate((np.ones((1,m)), np.zeros((1,m))),axis = 1)

    class1 = np.random.uniform(10,20,(k,m))
    class2 = np.random.uniform(5,15,(k,m))
    xTest = np.concatenate((class1,class2),axis=1)
    yTest = np.concatenate((np.ones((1,m)), np.zeros((1,m))),axis = 1)

    return xTrain,xTest,yTrain,yTest

# Initialize The Model
def initializeModel(k):
    Initializez the weights and bias used in the model
        - k - The number of input features
        - m - The number of samples to generate
        - x - The (k,m) Data
        - y - The (1,m) labels
    weights = np.random.randn(k,1) # Initialize the weights to a random value
    bias    = 0                    # Initialize bias parameter to 0

    return weights,bias

# Define the sigmoid function
def sigmoid(z):
    For arbitrary input size, takes the element wise sigmoid
        - z - the input to the activation function
        - a - value of sigmoid(z) for each element of z
    a = 1/(1+np.exp(-z))
    return a

# compute costs
def computeCostsAndGradients(weights,bias,x,y):
    compute the cost and gradient of the forward pass using x and y
    for a model containing weights and bias
        - weights - the weights of the model
        - bias    - the model bias
        - x - the x input to the model
        - y - the expected label
    # For numerical Stability
    eps = (1/10**7)

    # Compute the model estimates recall z = w'x+b a = sigma(z)
    z = np.dot(weights.transpose(),x)+bias

    a = sigmoid(z)

    # Compute the Updates Chain Rule Intermediates
    cost = -np.mean(y*np.log(a)+(1.0-y)*np.log(1-a))
    dyhat = (-y/(a+(eps)))+(1.-y)/((1.-a)+(eps))
    dz = a*(1.-a)
    dwn = x

    # Compute Updates
    dw = np.mean(dyhat*dz*x,axis=1,keepdims=True)
    db = np.mean(dyhat*dz,axis=1,keepdims=True)

    grads = {"dw": dw,
             "db": db}

    return grads, cost

def optimize(weights,bias,xTrain,yTrain,iterations,learningRate):
    Computes iterations number of optimizations
        - weights - the weights of the model
        - bias    - the model bias
        - xTrain  - the training input
        - yTrain  - the training label
        - iterations - the number of iterations to run
        - learningRate - the step size for the optimization
        - parameters - the weights and bias of the learned model
        - grads - the gradients at each step
        - costs - the cost at each step
    costs = []

    for i in range(iterations):
        grads,cost = computeCostsAndGradients(weights,bias,xTrain,yTrain)

        dw = grads["dw"]
        db = grads["db"]

        #simple stochastic gradient descent
        weights = weights-learningRate*dw
        bias = bias-learningRate*db


    parameters = {"weights":weights,

    grads = {"dw":dw,

    return parameters,grads,costs

def prediction(weights,bias,x):
    Performs a prediction of y for input x and model defined by weights and bias
        - weights - the weights of the model
        - bias    - the model bias
        - x       - the inputs we'd like to predict for
        - a - the prediction from inputx
    z = np.dot(weights.transpose(),x)+bias
    a = sigmoid(z)>.5

def model(xTrain, xTest, yTrain,yTest, iterations = 10000, learningRate = 0.01):

    # Initialize the weights and bias
    weights, bias = initializeModel(xTrain.shape[0])

    parameters,grads,costs = optimize(weights,bias,xTrain,yTrain,iterations,learningRate)

    weights = parameters["weights"]
    bias    = parameters["bias"]

    yPredTest  = prediction(weights, bias, xTest)
    yPredTrain = prediction(weights, bias, xTrain)

    predTestRes = yPredTest == yTest
    predTrainRes = yPredTrain == yTrain

    d = {"weights":weights,
         "predTestRes": predTestRes,

    return d

# Create the data set
k = 5    # Two variables
m = 1000 # 4000 samples per iteration
xTrain,xTest,yTrain,yTest = createDataSetOverlapSquares(k,m)

# Call the model for generation and optimization
modelVar = model(xTrain, xTest, yTrain,yTest, iterations = 5000, learningRate = 0.01)


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s