Unlocking the Mysteries of Logistic Regression: From Predictions to Cost Functions

Delve into the world of Logistic Regression, a cornerstone of classification in machine learning. From understanding the pivotal role of input features and weights to the intricacies of the cost function, this post will guide you through the essence of how logistic regression models make decisions and learn from data.

Logistic Regression:

Logistic Regression is a statistical method for analyzing datasets where the outcome (dependent variable) is binary. It predicts the probability that a given instance belongs to a particular category.

  1. Why not Linear Regression? For classification problems, outputting a linear combination of input features (like in linear regression) isn’t suitable because it can produce values less than 0 or greater than 1 — which doesn’t make sense for probabilities. Hence, logistic regression is used.
  2. Sigmoid Function: Logistic regression uses the sigmoid (or logistic) function to squeeze the output of a linear equation between 0 and 1. The sigmoid function is given by:
    $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
    Where $z$ is the linear combination of input features and weights, i.e., $z = w_1x_1 + w_2x_2 + \dots + w_nx_n + b$.
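As a quick illustration, the sigmoid can be written in a few lines of Python using only the standard library:

```python
import math

def sigmoid(z: float) -> float:
    """Squash a real-valued z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative z -> near 0, z = 0 -> exactly 0.5, large positive z -> near 1
print(sigmoid(-5.0), sigmoid(0.0), sigmoid(5.0))
```

Whatever value $z$ takes, the output is always a valid probability, which is exactly why logistic regression wraps the linear combination in this function.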

The concepts of “input features” and “weights”

The concepts of “input features” and “weights” are fundamental to many machine learning algorithms, including logistic regression. Let’s break them down with an example:

Input Features:

Definition: Input features (often just called “features”) are the variables or attributes from your data that you input into a model to get a prediction.

Suppose you want to build a model that predicts whether a person is likely to purchase a bike. Here are some potential features:

  1. Age: The age of the person (e.g., 25 years).
  2. Income: Monthly income of the person (e.g., $4,000).
  3. Distance from Work: How far the person lives from their workplace (e.g., 10 km).

These features represent individual aspects or characteristics about the data. In many algorithms, they are represented as a vector, where each entry in the vector corresponds to a feature.
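In code, such a feature vector is simply an ordered array of numbers; a minimal sketch using the hypothetical values from the example above:

```python
# Feature vector for one person: [age (years), monthly income (USD), distance from work (km)]
x = [25, 4000, 10]
print(x)
```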


Weights:

Definition: Weights determine the importance or influence of a particular feature on the prediction. In linear and logistic regression, these are coefficients that are multiplied by feature values. The process of “training” a model is essentially finding the best set of weights that results in the most accurate predictions for the given data.

Continuing with our bike purchase prediction, based on data and patterns, the model might determine:

  1. Age Weight: -0.05 (Perhaps older people are slightly less likely to buy a bike)
  2. Income Weight: 0.2 (Higher income might correlate with higher likelihood to buy a bike)
  3. Distance from Work Weight: 0.3 (Those living further from work might be more inclined to buy a bike for commuting)

Given these weights and features for a person (Age: 25 years, Income: 4,000 USD, Distance from Work: 10 km), the linear combination $z$ is computed as:
$$z = (-0.05 \times 25) + (0.2 \times 4000) + (0.3 \times 10) = -1.25 + 800 + 3 = 801.75$$

This value of $z$ is then fed into the logistic function to get a probability score.
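Putting the hypothetical weights and features together, a short sketch of this forward pass (note that this toy example omits the bias term and uses unscaled features):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

weights = [-0.05, 0.2, 0.3]   # age, income, distance-from-work weights
features = [25, 4000, 10]     # age (years), income (USD), distance (km)

# Linear combination z = w1*x1 + w2*x2 + w3*x3
z = sum(w * x for w, x in zip(weights, features))
prob = sigmoid(z)
print(z)     # 801.75
print(prob)  # effectively 1.0
```

Because income sits on a much larger numeric scale than the other features, it dominates $z$ here; in practice, features are usually normalized before training so that the learned weights are comparable.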

In summary:

  • Features are what you know about the data.
  • Weights are what the model learns about the importance of each feature. The process of adjusting these weights (based on error) to improve the model’s prediction is the essence of training a machine learning model.

The term “bias” in machine learning is another essential concept, akin to the intercept in linear equations. Let’s dive in.


Bias:

Definition: Bias is a term in machine learning models that allows for flexibility in fitting the model to the data. It’s similar to the intercept in traditional linear equations. It adjusts the output independently of the input features, allowing the model’s prediction to be shifted up or down.

The equation for the linear combination $z$ that we provided before, when including bias, can be represented as:
$$z = (w_1 \times \text{feature}_1) + (w_2 \times \text{feature}_2) + \dots + b$$

  • $w_i$ are the weights.
  • $\text{feature}_i$ are the input features.
  • $b$ is the bias.

In essence, while weights determine how much influence a feature has on the prediction, the bias allows the model to make predictions when all feature inputs are zero or to adjust the baseline prediction.
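The same linear combination with an explicit bias term can be sketched as follows (the weights and bias here are illustrative values, not learned ones). With all-zero features, the prediction is driven entirely by the bias:

```python
import math

def predict_proba(features, weights, bias):
    """Logistic-regression forward pass: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Zero features: the bias alone sets the baseline prediction
p_no_bias = predict_proba([0, 0, 0], [-0.05, 0.2, 0.3], bias=0.0)
p_with_bias = predict_proba([0, 0, 0], [-0.05, 0.2, 0.3], bias=2.0)
print(p_no_bias)    # 0.5 -- no inclination either way
print(p_with_bias)  # ~0.88 -- positive bias tilts the baseline upward
```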

Example with the Bike Purchase:

Let’s continue with the bike purchase prediction model:

Imagine two individuals with the exact same features: Age, Income, and Distance from Work. Even after multiplying all these features by their corresponding weights, the resulting score might not accurately reflect the probability of a bike purchase. This is where the bias comes in.

Let’s say the bias is set to a positive value. This means that, by default, there’s a positive inclination for people to buy a bike, even before considering their age, income, or distance from work. This could be due to unaccounted factors like general health consciousness, environmental concerns, or a recent trend in cycling.

So, if our linear combination produces a value of 0 (meaning no particular inclination to buy or not buy a bike based on features alone), the positive bias might tilt the balance slightly towards buying. Conversely, a negative bias would tilt it away from buying.

In real-world data and modeling scenarios, there are countless influencing factors that aren’t always captured by the main features in our dataset. The bias helps account for the baseline tendencies in such scenarios.

In summary, while the weights adjust the influence of features, the bias adjusts the baseline or starting prediction, ensuring the model is as accurate as possible across all scenarios.

Logistic Regression Cost Function:

To train a logistic regression model, we need a measure of how well the predictions match the actual labels. This is where the cost function comes in.

  1. Log-Loss (Binary Cross-Entropy):
    For a single training example, the cost is given by:
    $$-[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$
  • $y$ is the actual label (0 or 1).
  • $\hat{y}$ is the predicted probability that the label is 1.
  2. Why this Cost Function? The above cost function penalizes confident but wrong predictions heavily. If the actual label $y$ is 1 but the model predicts $\hat{y}$ close to 0, the cost will be large, and vice versa.
  3. Cost Function for All Training Examples:
    The overall cost function, $J(w, b)$, for logistic regression is the average cost over all training examples:
    $$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)}) \right]$$
  • $m$ is the number of training examples.

The aim during training is to find parameters (weights and bias) that minimize this cost function. Gradient Descent or other optimization algorithms can be used for this purpose.
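As a sketch of how that minimization might look, here is a tiny batch gradient-descent loop on a made-up one-feature dataset (the data, learning rate, and iteration count are all illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-feature dataset: the label is 1 when the feature is positive
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(1000):
    preds = [sigmoid(w * x + b) for x in X]
    # Gradients of the average log-loss with respect to w and b
    dw = sum((p, t, x)[0] * 0 + (p - t) * x for p, t, x in zip(preds, y, X)) / len(X)
    db = sum(p - t for p, t in zip(preds, y)) / len(X)
    w -= lr * dw
    b -= lr * db

print(w, b)  # w ends up clearly positive, matching the pattern in the data
```

Each step nudges the weight and bias in the direction that lowers the average log-loss; a library such as scikit-learn performs a more sophisticated version of this same idea.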