6.2 Car Sales

We have a dataset of 100 cars offered for sale at a car dealership. The variables are:

Car sales: dataset
Variable	Description	Values
Car ID	Identification code	1 - 100
Price	Sale price of the car	Thousands of euros
Age	Age of the car	Months
PinkSlip	Certificate of title	0: No, 1: Yes
Sold	Car sold?	0: No, 1: Yes

Car sales: first 12 observations
Car ID	1	2	3	4	5	6	7	8	9	10	11	12
Price	1	9	0	3	10	2	4	2	2	5	5	2
Odometer	30	20	170	68	12	88	3	41	21	74	41	121
Age	28	40	58	12	3	23	4	13	5	10	62	20
PinkSlip	1	1	0	1	0	0	1	1	1	1	0	1
Sold	1	0	1	1	0	0	0	1	1	1	0	1
Note: http://www.zstatistics.com/

The dataset:

Car sales: data summary
No	Variable	Stats / Values	Freqs (% of Valid)
1	Car ID [integer]	Mean (sd) : 50.5 (29) min < med < max: 1 < 50.5 < 100 IQR (CV) : 49.5 (0.6)	100 distinct values (Integer sequence)
2	Price [numeric]	Mean (sd) : 5.2 (5.1) min < med < max: 0.5 < 4 < 34.5 IQR (CV) : 5.5 (1)	29 distinct values
3	Odometer [numeric]	Mean (sd) : 60.1 (76.7) min < med < max: 0.2 < 30.5 < 452.5 IQR (CV) : 63 (1.3)	100 distinct values
4	Age [integer]	Mean (sd) : 20.2 (16.1) min < med < max: 1 < 15 < 90 IQR (CV) : 19 (0.8)	39 distinct values
5	PinkSlip [integer]	Min : 0 Mean : 0.8 Max : 1	0 : 23 (23.0%) 1 : 77 (77.0%)
6	Sold [integer]	Min : 0 Mean : 0.7 Max : 1	0 : 35 (35.0%) 1 : 65 (65.0%)

Fitting a straight line to Sold as a function of Price gives:

\[ \operatorname{\widehat{Sold}} = 0.8 - 0.03(\operatorname{Price}) \]

Let’s think of the fitted line as estimating the chance of being sold:

\[{\pi_i= Prob(\operatorname{Sold}=1)} = \beta_0 + \beta_1 \operatorname{Price}_i + \epsilon_i\]

What would be the probability of selling a car that costs 45k euros? The fitted line gives $0.8 - 0.03 \times 45 = -0.55$, which is not a valid probability, so we need to transform the dependent variable.
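The problem with the straight line can be checked numerically. The following is a minimal sketch that uses the rounded coefficients 0.8 and -0.03 from the fitted line; the function name `linear_prob` is only illustrative.

```python
# Linear probability model with the rounded estimates from the fitted line.
B0, B1 = 0.8, -0.03

def linear_prob(price):
    """Fitted value of Sold for a price given in thousands of euros."""
    return B0 + B1 * price

for price in (5, 20, 45):
    print(f"Price = {price:>2}k -> fitted value = {linear_prob(price):.2f}")

# Price above which the fitted line falls below zero.
zero_crossing = -B0 / B1
print(f"Fitted value is negative for prices above {zero_crossing:.1f}k")
```

Any price above the zero crossing yields a negative fitted value, so the line cannot be read as a probability over the whole price range.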

\[\dfrac{\pi_i}{1-\pi_i}={\dfrac{Prob(\operatorname{Sold}=1)}{1- Prob(\operatorname{Sold}=1)}} = \beta_0 +\beta_1 \operatorname{Price}_i + \epsilon_i\]

With the odds we avoid negative probabilities, but their distribution is highly skewed (not Normal), so we still need to transform the dependent variable.

Using logs:

\[\log \Bigl ( \dfrac{\pi_i}{1-\pi_i} \Bigr ) =\log \Bigl ( \dfrac{Prob(\operatorname{Sold}=1)}{1- Prob(\operatorname{Sold}=1)} \Bigr ) = \beta_0 +\beta_1\operatorname{Price}_i + \epsilon_i \]
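To make the two transformations concrete, the sketch below maps a few probabilities to odds and to log-odds. The probabilities are arbitrary illustrative values, not taken from the dataset, and the helper names `odds` and `log_odds` are assumptions of this example.

```python
import math

def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds, i.e. the logit of p."""
    return math.log(odds(p))

for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"pi = {p:.2f}  odds = {odds(p):7.3f}  log-odds = {log_odds(p):7.3f}")
```

The odds live on $(0, \infty)$ and are strongly skewed, while the log-odds cover the whole real line and are symmetric around 0 (the value for $\pi = 0.5$).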

The Binomial Logistic Regression is given by

\[\operatorname{logit}(\pi_i)= \log \Bigl ( \dfrac{\pi_i}{1-\pi_i} \Bigr ) = \beta_0 +\beta_1x_{1i}+\ldots+\beta_kx_{ki}\]

It is a model used to predict the probability of a certain class given a set of independent variables.

Binomial: the dependent variable is binary, $\pi_i = Prob(y_i=1)$
Logistic: uses log-odds and the logit function
$\beta_0, \beta_1, \ldots, \beta_k$ are the parameters
$x_1, \ldots, x_k$ are the independent variables or predictors

For example:

\[\log \Bigl ( \dfrac{\pi_i}{1-\pi_i} \Bigr ) = \operatorname{logit} \Bigl [ Prob(\operatorname{Sold}=1) \Bigr ] = \beta_0 +\beta_1\operatorname{Price}_i\]

Observations	100
Dependent variable	Sold
Type	Generalized linear model
Family	binomial
Link	logit

χ²(1)	9.454
Pseudo-R² (Cragg-Uhler)	0.124
Pseudo-R² (McFadden)	0.073
AIC	124.036
BIC	129.246

	Est.	S.E.	z val.	p
(Intercept)	1.386	0.356	3.894	0.000
Price	-0.143	0.053	-2.695	0.007
Standard errors: MLE
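The fit statistics in the output above are related to one another. The sketch below reconstructs them from the reported $\chi^2(1)$ and the 65/35 split of Sold in the data summary, assuming the usual definitions (deviance as $-2$ times the log-likelihood, McFadden and Cragg-Uhler pseudo-$R^2$, AIC and BIC). It is a consistency check on rounded values, not a refit of the model.

```python
import math

n = 100          # observations
n_sold = 65      # cars with Sold = 1 (from the data summary)
k = 2            # estimated parameters: intercept and Price
lr_chi2 = 9.454  # reported chi-square(1)

# Deviance (-2 log-likelihood) of the intercept-only model.
p_bar = n_sold / n
null_deviance = -2 * (n_sold * math.log(p_bar) + (n - n_sold) * math.log(1 - p_bar))

# Deviance of the fitted model: null deviance minus the chi-square statistic.
deviance = null_deviance - lr_chi2

mcfadden = 1 - deviance / null_deviance
cragg_uhler = (1 - math.exp(-lr_chi2 / n)) / (1 - math.exp(-null_deviance / n))
aic = deviance + 2 * k
bic = deviance + k * math.log(n)

print(f"McFadden = {mcfadden:.3f}, Cragg-Uhler = {cragg_uhler:.3f}")
print(f"AIC = {aic:.3f}, BIC = {bic:.3f}")
```

Up to rounding, the results agree with the reported 0.073, 0.124, 124.036 and 129.246.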

\[ \log\left[ \frac { \widehat{P( \operatorname{Sold} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{Sold} = \operatorname{1} )} } \right] = 1.39 - 0.14(\operatorname{Price}) \]

What does -0.143 mean in the estimated model?

\[\widehat{\operatorname{logit}(\pi_i)} =1.386 -0.143 \operatorname{Price}_i\]

For each unit increase in $\operatorname{Price}$, $\operatorname{logit}(\pi)$ decreases by 0.143 units.
What about $\pi$?

From $\operatorname{logit}(\pi_i)$ to $\pi_i$:

we have:

\[ \operatorname{logit}(\pi) = \log \Bigl ( \dfrac{\pi}{1-\pi} \Bigr )=\beta_0 +\beta_1 \operatorname{Price}\]

then:

\[\pi = \dfrac{e^{\beta_0 +\beta_1 \operatorname{Price}}}{1+e^{\beta_0 +\beta_1 \operatorname{Price}}}\]

Or, alternatively (easier):

\[\pi = \dfrac{e^{\operatorname{logit}(\pi) }}{1+e^{\operatorname{logit}(\pi) }}\]

In the example:
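As a minimal sketch, the inverse transformation can be applied to the estimated model $\widehat{\operatorname{logit}(\pi_i)} = 1.386 - 0.143\operatorname{Price}_i$. The prices below are arbitrary illustrative values, and the function name `prob_sold` is an assumption of this example.

```python
import math

B0, B1 = 1.386, -0.143  # reported estimates of the simple logistic regression

def prob_sold(price):
    """Fitted probability of sale for a price in thousands of euros."""
    logit = B0 + B1 * price
    return math.exp(logit) / (1 + math.exp(logit))

for price in (0, 5, 10, 20, 45):
    print(f"Price = {price:>2}k -> pi = {prob_sold(price):.3f}")
```

Unlike the straight line, the fitted value for a 45k car is small but stays strictly between 0 and 1.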

The coefficients determine the curve:
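Two features of the curve follow directly from the coefficients: the price at which $\pi = 0.5$ is $-\beta_0/\beta_1$, and the slope of the curve at that point is $\beta_1/4$, because $d\pi/d\operatorname{Price} = \beta_1\pi(1-\pi)$. The sketch below evaluates both with the reported estimates and compares the curve with a steeper one; the value `B1_STEEP` is hypothetical and is used only for illustration.

```python
import math

B0, B1 = 1.386, -0.143  # reported estimates of the simple logistic regression

def inv_logit(z):
    """Map a log-odds value z to a probability."""
    return 1 / (1 + math.exp(-z))

# Price (in thousands of euros) at which the fitted probability equals 0.5.
midpoint = -B0 / B1

# Slope of the curve at the midpoint: B1 * 0.5 * (1 - 0.5) = B1 / 4.
slope_at_midpoint = B1 / 4

print(f"pi = 0.5 at Price = {midpoint:.2f}k")
print(f"slope at that price = {slope_at_midpoint:.4f} per 1000 euros")

# Hypothetical steeper coefficient, for comparison only (not estimated from the data).
B1_STEEP = -0.5
for price in (5, 10, 15):
    fitted = inv_logit(B0 + B1 * price)
    steep = inv_logit(B0 + B1_STEEP * price)
    print(f"Price = {price:>2}k  fitted: {fitted:.3f}  steeper: {steep:.3f}")
```

A larger $|\beta_1|$ makes the probability fall faster with price, while $\beta_0$ shifts the curve along the price axis.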

Multiple Logistic Regression

We assume that the $\operatorname{logit}$ transformation of the dependent variable has a linear relationship with a set of independent variables.

Let us include one more variable in the model.

\[\log \Bigl ( \dfrac{\pi_i}{1-\pi_i} \Bigr ) =\log \Bigl ( \dfrac{Prob(\operatorname{Sold}_i=1)}{1- Prob(\operatorname{Sold}_i=1)} \Bigr ) = \beta_0 +\beta_1\operatorname{Price}_i +\beta_2\operatorname{PinkSlip}_i\]

Observations	100
Dependent variable	Sold
Type	Generalized linear model
Family	binomial
Link	logit

χ²(2)	18.407
Pseudo-R² (Cragg-Uhler)	0.232
Pseudo-R² (McFadden)	0.142
AIC	117.083
BIC	124.898

	Est.	S.E.	z val.	p
(Intercept)	0.396	0.480	0.824	0.410
Price	-0.173	0.057	-3.044	0.002
PinkSlip	1.555	0.531	2.926	0.003
Standard errors: MLE

\[ \log\left[ \frac { \widehat{P( \operatorname{Sold} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{Sold} = \operatorname{1} )} } \right] = 0.4 - 0.17(\operatorname{Price}) + 1.55(\operatorname{PinkSlip}) \]

Let’s estimate $\pi$
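As a minimal sketch, the fitted probability can be computed from the reported estimates (0.396, -0.173, 1.555) for a given price, with and without a Pink Slip. The price of 5k euros is an arbitrary illustrative value, and the function name `prob_sold` is an assumption of this example.

```python
import math

# Reported estimates of the multiple logistic regression.
B0, B_PRICE, B_PINK = 0.396, -0.173, 1.555

def prob_sold(price, pink_slip):
    """Fitted probability of sale; price in thousands of euros, pink_slip is 0 or 1."""
    logit = B0 + B_PRICE * price + B_PINK * pink_slip
    return 1 / (1 + math.exp(-logit))

price = 5
print(f"Price = {price}k, no Pink Slip:   pi = {prob_sold(price, 0):.3f}")
print(f"Price = {price}k, with Pink Slip: pi = {prob_sold(price, 1):.3f}")
```

For a car priced at 5k euros, the fitted probability is about 0.38 without a Pink Slip and about 0.75 with one.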

Parameter interpretation:

The coefficients of the logistic regression estimate the change in log-odds of the dependent variable given a one-unit increase in the independent variable.

Estimated coefficients
	Coefficient	2.5 %	97.5 %
(Intercept)	0.396	-0.552	1.354
Price	-0.173	-0.295	-0.071
PinkSlip	1.555	0.533	2.632

$\beta_1=-0.173$ : If the price increases by 1000 euros, the log-odds of selling the car decrease, on average, by 0.173, holding the rest constant.
$\beta_2= 1.555$ : If the car has a Pink Slip, the log-odds of selling the car increase, on average, by 1.555, holding the rest constant.

If we apply $\exp$ to the coefficients, we can interpret them as odds ratios.

Estimated coefficients
	OR	2.5 %	97.5 %
(Intercept)	1.486	0.576	3.872
Price	0.841	0.745	0.931
PinkSlip	4.734	1.704	13.903

$\exp(\beta_1)=0.84$: If Price increases by one unit (1000 euros), the odds of being sold (versus not being sold) are multiplied by 0.84, that is, they decrease by about 16%, holding the rest constant.
$\exp(\beta_2)=4.73$: If the car has a Pink Slip (PinkSlip changes from 0 to 1), the odds of being sold (vs. not being sold) are multiplied by 4.73, holding the rest constant.
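The odds-ratio table can be reproduced by exponentiating the coefficient table, including the interval limits. The sketch below does this with the rounded values reported earlier, so the last digit may differ slightly from the table; the dictionary names are assumptions of this example.

```python
import math

# (estimate, 2.5 % limit, 97.5 % limit) on the log-odds scale, as reported.
estimates = {
    "(Intercept)": (0.396, -0.552, 1.354),
    "Price": (-0.173, -0.295, -0.071),
    "PinkSlip": (1.555, 0.533, 2.632),
}

# Exponentiate every entry to move from log-odds to odds ratios.
odds_ratios = {
    name: tuple(math.exp(v) for v in values)
    for name, values in estimates.items()
}

for name, (or_, low, high) in odds_ratios.items():
    change = (or_ - 1) * 100  # percentage change in the odds per unit increase
    print(f"{name:<12} OR = {or_:6.3f}  [{low:.3f}, {high:.3f}]  ({change:+.1f}% odds)")
```

An interval for an odds ratio that excludes 1 corresponds to an interval for the coefficient that excludes 0, as is the case here for Price and PinkSlip.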

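$\beta$ represents the effect of a one-unit increase in $x$ on the log-odds (additive).
$\operatorname{exp}(\beta)$ represents the effect of a one-unit increase in $x$ on the odds (multiplicative), expressed as an odds ratio.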
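One consequence worth checking numerically is that the odds ratio for a one-unit price increase is the same at every price, whereas the change in the probability itself is not. The sketch below uses the reported estimates of the multiple model for a car with a Pink Slip; the prices are arbitrary illustrative values and the helper names are assumptions of this example.

```python
import math

# Reported estimates of the multiple logistic regression.
B0, B_PRICE, B_PINK = 0.396, -0.173, 1.555

def prob_sold(price, pink_slip):
    """Fitted probability of sale; price in thousands of euros, pink_slip is 0 or 1."""
    return 1 / (1 + math.exp(-(B0 + B_PRICE * price + B_PINK * pink_slip)))

def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def odds_ratio_at(price, pink_slip=1):
    """Ratio of the odds at price + 1 to the odds at price."""
    return odds(prob_sold(price + 1, pink_slip)) / odds(prob_sold(price, pink_slip))

for price in (2, 10, 25):
    diff = prob_sold(price + 1, 1) - prob_sold(price, 1)
    print(f"Price {price:>2}k -> {price + 1:>2}k: "
          f"odds ratio = {odds_ratio_at(price):.3f}, change in pi = {diff:+.3f}")
```

The odds ratio equals $\exp(\beta_1)$ at every price, while the change in $\pi$ depends on where on the curve the car sits.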