Definition

One-hot vectors are a common way to represent categorical data as numerical data in machine learning. In a one-hot encoding, each category is represented as a binary vector, where only one element is 1 (hot), and all others are 0.

Why Use One-Hot Encoding?

  • Categorical Data: Many machine learning algorithms require numerical input, so categorical data must be converted to numerical form.
  • Avoid Ordinal Relationships: One-hot encoding prevents algorithms from assuming a natural ordering between categories.

How One-Hot Encoding Works

Suppose you have a list of categories: ["cat", "dog", "fish"].

  1. Unique Categories: Identify all unique categories.
  2. Binary Vector: Create a binary vector for each category where only the index corresponding to the category is 1.

For example:

  • “cat” → [1, 0, 0]
  • “dog” → [0, 1, 0]
  • “fish” → [0, 0, 1]
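The two steps above can be sketched in a few lines of plain Python (a minimal illustration, independent of any library):

```python
categories = ["cat", "dog", "fish"]

# Step 1: identify the unique categories (sorted for a stable order)
unique = sorted(set(categories))

# Step 2: for each category, build a binary vector that is 1 only at
# that category's index
one_hot = {c: [1 if c == u else 0 for u in unique] for c in unique}

print(one_hot["cat"])   # [1, 0, 0]
print(one_hot["dog"])   # [0, 1, 0]
print(one_hot["fish"])  # [0, 0, 1]
```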

One-Hot Encoding in Python

Using Pandas

Pandas has a built-in method for one-hot encoding called get_dummies.

import pandas as pd
 
# Example DataFrame
data = {'Animal': ['cat', 'dog', 'fish', 'cat', 'fish']}
df = pd.DataFrame(data)
 
# One-hot encoding using get_dummies (dtype=int keeps the 0/1 integer
# output shown below; recent pandas versions return booleans by default)
one_hot_encoded_df = pd.get_dummies(df, columns=['Animal'], dtype=int)
 
print(one_hot_encoded_df)

Output:

   Animal_cat  Animal_dog  Animal_fish
0           1           0            0
1           0           1            0
2           0           0            1
3           1           0            0
4           0           0            1
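get_dummies also accepts drop_first=True, which omits one redundant indicator column; the dropped category is then implied when all remaining indicators are 0. A short sketch using the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['cat', 'dog', 'fish', 'cat', 'fish']})

# drop_first=True omits the first category's column ('cat' here);
# a row of all zeros then implicitly means 'cat'
reduced = pd.get_dummies(df, columns=['Animal'], drop_first=True, dtype=int)
print(reduced)
```

This is often used with linear models, where a full set of indicator columns is collinear (they always sum to 1).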

Using Scikit-Learn

Scikit-learn provides a OneHotEncoder for this purpose.

from sklearn.preprocessing import OneHotEncoder
 
# Example data
data = [['cat'], ['dog'], ['fish'], ['cat'], ['fish']]
 
# Create the encoder; sparse_output=False returns a dense array
# (the older sparse argument was removed in scikit-learn 1.4)
encoder = OneHotEncoder(sparse_output=False)
 
# Fit and transform the data
one_hot_encoded = encoder.fit_transform(data)
 
print(one_hot_encoded)

Output:

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]]
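Conceptually, the encoder maps each category to an integer index and emits the matching row of an identity matrix. A minimal NumPy sketch of that idea (an illustration, not scikit-learn's actual implementation):

```python
import numpy as np

labels = ['cat', 'dog', 'fish', 'cat', 'fish']
unique = sorted(set(labels))                        # ['cat', 'dog', 'fish']
indices = np.array([unique.index(l) for l in labels])

# Indexing an identity matrix by the integer labels yields one-hot rows
one_hot = np.eye(len(unique))[indices]
print(one_hot)
```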

Using TensorFlow

If you are working with TensorFlow, you can use its built-in functionality to create one-hot encodings.

import tensorflow as tf
 
# Example data
categories = tf.constant(['cat', 'dog', 'fish', 'cat', 'fish'])
 
# Integer encode the categories (tf.unique returns the unique values
# and, for each element, the index of its unique value)
_, category_indices = tf.unique(categories)
 
# One-hot encode the indices
one_hot_encoded = tf.one_hot(category_indices, depth=3)
 
print(one_hot_encoded)

Output:

tf.Tensor(
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]], shape=(5, 3), dtype=float32)

One-Hot Encoding and Statistical Power

Statistical power refers to the ability of a model to reliably identify real relationships between features and labels.

One-hot encoding reduces statistical power more than continuous or ordinal features do, because it requires multiple columns, one for each possible categorical value. For example, one-hot encoding a passenger's port of embarkation in the Titanic dataset adds three model inputs (C, Q, and S) instead of one.

A categorical variable is most helpful when the number of categories is substantially smaller than the number of samples (dataset rows), and when it provides information not already available to the model through other inputs.
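The column-count cost can be made concrete with a small hypothetical sketch in pandas (the Embarked column and its C/Q/S values mirror the embarkation example above):

```python
import pandas as pd

df = pd.DataFrame({'Embarked': ['C', 'S', 'Q', 'S', 'C']})

# An ordinal encoding keeps a single column (this mapping is arbitrary)
ordinal = df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# One-hot encoding expands to one column per category
one_hot = pd.get_dummies(df['Embarked'], dtype=int)

print(ordinal.shape)  # (5,)
print(one_hot.shape)  # (5, 3)
```

With three categories the difference is small, but a high-cardinality column (say, hundreds of zip codes) multiplies the input count accordingly.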