Continued from part one, where we completed pre-processing the dataset (cleaning, formatting, etc.). Let’s now build our TensorFlow model to help suggest near-perfect used car prices.
Our dataset now looks ready for training the model. But before we commence training, we need to set aside a smaller test set, separate from the training data.
We also need to define the labels (Y).
Let’s go!
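For reference, the code in this part relies on the imports below. The exact aliases are assumptions carried over from part one; tensorflow_docs (used for the training visuals further down) is a separate package installed with pip install git+https://github.com/tensorflow/docs.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf

# tensorflow_docs provides the EpochDots and HistoryPlotter helpers used below.
import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
import tensorflow_docs.plots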
Splitting the dataset
Let’s split the dataset into a training set and a test set. We’ll use the test set only in the final evaluation of the model.
# Randomly sample 80% of the rows for training; the rest become the test set.
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)
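As a quick sanity check (a small sketch, not in the original flow), the two sets should add up to the full dataset in roughly an 80/20 ratio:
# Roughly 80% of the rows should land in the training set.
print(len(train_dataset), len(test_dataset), len(dataset))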
Inspecting the data
Next, using seaborn (sns), let’s have a quick look at the joint distribution of a few pairs of columns from the training set.
sns.pairplot(train_dataset[["Price", "Year", "Engine", "Power"]],
             diag_kind="kde")
Separate the features from the labels
We now separate the target value, or “label”, from the features, and check the overall statistics:
# pop() removes the 'Price' column from the features and returns it as the labels.
train_labels = train_dataset.pop('Price')
test_labels = test_dataset.pop('Price')

# Transpose so each feature's mean and std become columns we can index by name.
train_stats = train_dataset.describe()
train_stats = train_stats.transpose()
train_stats
Normalize the data
A look at the statistics above shows how widely the range of each feature differs. To make training easier, and to keep the model from depending on the particular scale or units of any one feature, we’ll perform feature normalization.
def norm(x):
    # Standardize each feature using the training set's mean and std.
    return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norm(train_dataset)
# The test set is deliberately normalized with the *training* statistics.
normed_test_data = norm(test_dataset)
normed_train_data.head()
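As an optional check, every normalized training feature should now have a mean close to 0 and a standard deviation close to 1:
# After normalization, means should be ~0 and standard deviations ~1.
normed_train_data.describe().transpose()[['mean', 'std']]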
Training our model
We’ll use a Sequential model with eight densely connected hidden layers, and an output layer that returns a single, continuous value: the predicted price.
def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation='relu',
                              input_shape=[len(train_dataset.keys())]),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)
    ])

    optimizer = tf.keras.optimizers.RMSprop(0.001)

    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae', 'mse'])
    return model
Build the model and view the summary
model = build_model()
model.summary()
Let’s now train the model for 1000 epochs, recording the training and validation error metrics (MAE and MSE) in history.
EPOCHS = 1000

history = model.fit(
    normed_train_data, train_labels,
    epochs=EPOCHS, validation_split=0.2, verbose=0,
    callbacks=[tfdocs.modeling.EpochDots()])
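Before plotting, it can be handy to tabulate what was recorded; here’s a small sketch using pandas (imported as pd):
# history.history holds one value per compiled metric per epoch.
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()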
Let’s now visualize the model’s training progress using the stats stored in the history object.
plotter = tfdocs.plots.HistoryPlotter(smoothing_std=2)
Using the MAE (Mean Absolute Error):
plotter.plot({'Basic': history}, metric = "mae")
plt.ylim([0, 5000])
plt.ylabel('MAE [Price]')
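If the validation curve flattens out or starts rising while the training error keeps falling, the model is overfitting. One option, not used in the original run but worth sketching, is to let Keras stop training automatically with an EarlyStopping callback:
# Hypothetical variant: stop once val_loss hasn't improved for 50 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=50)

model = build_model()
history = model.fit(
    normed_train_data, train_labels,
    epochs=EPOCHS, validation_split=0.2, verbose=0,
    callbacks=[early_stop, tfdocs.modeling.EpochDots()])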
Testing our model
Let’s take a single example (a batch of one) from our normalized test data and predict its price:
example_batch = normed_test_data[5:6]
example_batch
example_result = model.predict(example_batch)
print(example_result)
We get [[8259.985]], which means a predicted price of about 8,260 USD.
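A single example is encouraging, but the more telling figure is the average error over the whole test set. Keras’s model.evaluate returns the compiled loss and metrics in order; a minimal sketch:
# Returns [loss (MSE), MAE, MSE] in the order given to model.compile.
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=2)
print("Test set Mean Absolute Error: {:5.2f} [Price]".format(mae))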
Let’s plot the predictions of our regression model for the entire test set against the true values:
test_predictions = model.predict(normed_test_data).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [Price]')
plt.ylabel('Predictions [Price]')
lims = [0, 100000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)  # reference line y = x: perfect predictions fall on it
And finally, we can take a look at its error distribution:
error = test_predictions - test_labels
plt.hist(error, bins=25)
plt.xlabel("Prediction Error [Price]")
_ = plt.ylabel("Count")
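As a final optional check, summary statistics of the error give a quick read on bias and spread (the mean should sit near zero for an unbiased model):
# A mean near zero suggests the model is not systematically over- or under-pricing.
error.describe()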
That does it. See the full Colab notebook here.