Intro
MLH GHW: Introduction to NLP
One Hot Encoding
Definition: One-hot encoding is a method of converting categorical variables into machine-readable, numerical format where each category is represented by a unique binary vector. Each vector is the same length as the number of categories, and only one element is 1 (hot), while all other elements are 0 (cold).
How It Works:
- Suppose you have a vocabulary of 4 words: ["cat", "dog", "fish", "bird"].
- Each word is assigned a unique index.
- The word “cat” would be represented as [1, 0, 0, 0].
- The word “dog” would be represented as [0, 1, 0, 0], and so on.
Characteristics:
- Simplicity: Easy to implement and interpret.
- Sparsity: The vectors are sparse, with most elements being zero.
- Dimensionality: The dimensionality of the representation is equal to the number of unique categories (vocabulary size).
- No Context: Does not capture any information about the relationships or context between categories.
One Hot Encoding using the scikit-learn library:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# sparse_output=False returns a dense NumPy array (the older sparse=
# keyword was deprecated in scikit-learn 1.2 and later removed).
encoder = OneHotEncoder(sparse_output=False)
data = np.array(["cat", "dog", "fish", "bird"]).reshape(-1, 1)
encoded = encoder.fit_transform(data)
# Note: scikit-learn orders categories alphabetically (bird, cat, dog,
# fish), so "cat" encodes as [0, 1, 0, 0] here.
print(encoded)
Multi-hot encoding
- Suppose we have a particular document.
- The simple one-hot (or multi-hot) way of dealing with it is to set a 1 at each word's position in a vector of length V (the vocabulary size) for every word that appears in the document; a sketch follows below.
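A minimal multi-hot sketch, assuming a toy vocabulary and a made-up document: every word that occurs in the document turns its position on, no matter how many times it occurs.
import numpy as np

vocab = ["cat", "dog", "fish", "bird"]            # V = 4
index = {word: i for i, word in enumerate(vocab)}

document = "cat dog dog cat"
multi_hot = np.zeros(len(vocab), dtype=int)
for word in document.split():
    multi_hot[index[word]] = 1                    # mark presence, not count

print(multi_hot)  # [1 1 0 0]: "cat" and "dog" present, counts discarded
Unlike Bag-of-Words below, the repeated "dog" and "cat" leave no trace: multi-hot keeps only presence or absence.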
Bag-of-Words (BoW)
Bag-of-Words is a technique for representing text data where each document or piece of text is represented by a vector of word counts or frequencies, ignoring grammar and word order but keeping track of word occurrences.
How It Works:
- Create a vocabulary of all unique words in the dataset.
- Represent each document as a vector where each element counts the occurrences of a word from the vocabulary in the document.
Characteristics:
- Frequency Information: Captures how often words appear, which can provide some insight into the content of the document.
- Sparsity: Typically results in sparse vectors, especially in large vocabularies.
- Dimensionality: The dimensionality of the representation is equal to the size of the vocabulary.
- No Context: Like one-hot encoding, it does not capture context or relationships between words.
Bag of Words using the scikit-learn library:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'cat cat dog',
'dog dog fish',
'fish bird cat'
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())  # rows = documents; columns = counts for ['bird', 'cat', 'dog', 'fish']
Embeddings


- Context similarity: embeddings map words and sentences to dense vectors so that items with similar meanings land close together in vector space (see the sketch after this list).
- “I hate not eating tacos” is not the same as “I hate eating tacos”: negation and word order change the meaning, which count-based representations like one-hot and Bag-of-Words largely miss.
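A minimal sketch of comparing two embeddings with cosine similarity, assuming NumPy; the two sentence vectors below are made-up numbers for illustration, not the output of a real embedding model.
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 4-dimensional sentence embeddings; real models produce
# vectors with hundreds of dimensions.
hate_tacos = np.array([-0.3, 0.8, 0.5, 0.0])   # "I hate eating tacos"
love_tacos = np.array([0.2, 0.9, -0.4, 0.1])   # "I hate not eating tacos"

print(cosine_similarity(hate_tacos, love_tacos))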

Word2Vec
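Word2Vec learns dense word vectors by predicting words from their surrounding context. Below is a minimal sketch using the gensim library (assuming gensim 4.x); the toy corpus is far too small for the resulting vectors to be meaningful.
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens. Real training needs
# far more text before nearest neighbors become sensible.
sentences = [
    ["i", "hate", "not", "eating", "tacos"],
    ["i", "hate", "eating", "tacos"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects skip-gram; vector_size, window, and min_count are
# illustrative values, not tuned.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv["tacos"][:5])            # first 5 dims of a dense vector
print(model.wv.most_similar("tacos"))   # nearest words by cosine similarity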

Encoder-Decoder
https://www.practicalai.io/understanding-transformer-model-architectures/
https://d2l.ai/chapter_recurrent-modern/encoder-decoder.html
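In an encoder-decoder model, the encoder compresses the input sequence into a hidden representation and the decoder generates the output sequence from it (see the links above for details). A minimal sketch of that structure, assuming PyTorch and a GRU backbone; it is illustrative only, not the architecture from either linked article.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):
        # The final hidden state summarizes the whole input sequence.
        _, hidden = self.rnn(self.embed(src))
        return hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, hidden):
        # Produces output-token scores conditioned on the encoder state.
        output, hidden = self.rnn(self.embed(tgt), hidden)
        return self.out(output), hidden

# Usage with made-up sizes: vocabulary of 1000, batch of 2, length 5.
enc, dec = Encoder(1000, 64), Decoder(1000, 64)
src = torch.randint(0, 1000, (2, 5))
tgt = torch.randint(0, 1000, (2, 5))
logits, _ = dec(tgt, enc(src))
print(logits.shape)  # torch.Size([2, 5, 1000])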





