Natural Language Processing (NLP) is a branch of Artificial Intelligence that enables machines to read, understand, interpret, and generate human language.
Human language is ambiguous, context-dependent, and constantly evolving, which makes it difficult for machines to process directly.
NLP bridges the gap between human communication and machine understanding.
Almost 80% of the world’s data is unstructured text: emails, social media posts, reviews, documents, and chat messages.
NLP is used in machine translation, chatbots and virtual assistants, search engines, spam filtering, and sentiment analysis.
Text preprocessing is the process of cleaning and preparing raw text data so that it can be effectively used by machine learning or deep learning models.
Raw text is messy: it often contains typos, punctuation, emojis, inconsistent casing, and irrelevant symbols.
Preprocessing converts raw text into a structured and consistent format.
Raw text:
“This product is AMAZING!!! 😍🔥”
Processed text:
“this product is amazing”
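The cleaning step above can be sketched with a few regular expressions (a minimal sketch; real pipelines also handle contractions, accents, and spelling variants):

```python
import re

def preprocess(text: str) -> str:
    """Lowercase, strip punctuation and emojis, collapse whitespace."""
    text = text.lower()
    # keep only letters, digits, and spaces (drops punctuation and emojis)
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("This product is AMAZING!!! 😍🔥"))  # → this product is amazing
```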
Tokenization is the process of splitting text into smaller units called tokens.
Tokens can be words, sentences, or subwords.
Word tokenization splits text into words.
Example:
“I love data science”
Tokens:
[“I”, “love”, “data”, “science”]
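A simple regex-based word tokenizer reproduces this (a sketch; library tokenizers handle contractions, hyphens, and punctuation more carefully):

```python
import re

def word_tokenize(text: str) -> list[str]:
    # pull out runs of word characters; punctuation is discarded
    return re.findall(r"\w+", text)

print(word_tokenize("I love data science"))  # → ['I', 'love', 'data', 'science']
```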
Sentence tokenization splits text into sentences.
Example:
“Hello! How are you?”
Sentences:
[“Hello!”, “How are you?”]
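A naive sentence splitter can be written with one regex (a sketch only; abbreviations like “Dr.” break it, which is why libraries use trained sentence models):

```python
import re

def sent_tokenize(text: str) -> list[str]:
    # split after ., !, or ? when followed by whitespace
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(sent_tokenize("Hello! How are you?"))  # → ['Hello!', 'How are you?']
```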
Subword tokenization breaks words into smaller meaningful units.
Used in modern NLP models like BERT and GPT.
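The idea can be sketched as greedy longest-match splitting against a subword vocabulary. The tiny vocabulary here is an assumption for illustration; models like BERT and GPT learn theirs from data (WordPiece, BPE):

```python
def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword split over a toy vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])          # unknown character: emit as-is
            i += 1
    return tokens

vocab = {"un", "happi", "ness", "play", "ing"}
print(subword_tokenize("unhappiness", vocab))  # → ['un', 'happi', 'ness']
print(subword_tokenize("playing", vocab))      # → ['play', 'ing']
```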
Stop words are common words that appear frequently but carry little meaningful information.
Examples: “the”, “is”, “a”, “an”, “in”.
Example:
“This movie is not good”
Removing stop words carelessly may drop “not”, reversing the meaning.
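A simple filter illustrates both the technique and the pitfall. The stop list here is a small sample (assumption: typical library lists do include “not”), and the negation guard shows one common fix:

```python
# small sample stop list; real lists (e.g. NLTK's) have ~180 entries
STOP_WORDS = {"this", "is", "a", "the", "in", "not"}
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(tokens, keep_negations=True):
    # optionally protect negation words so sentiment survives filtering
    kept = NEGATIONS if keep_negations else set()
    return [t for t in tokens if t.lower() not in STOP_WORDS or t.lower() in kept]

tokens = ["This", "movie", "is", "not", "good"]
print(remove_stop_words(tokens, keep_negations=False))  # → ['movie', 'good']  (meaning lost)
print(remove_stop_words(tokens))                        # → ['movie', 'not', 'good']
```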
Words appear in many different forms, for example “play”, “playing”, and “played”.
Normalization reduces them to a common base form.
Stemming removes word suffixes using rule-based heuristics.
Example: “playing”, “played”, “plays” → “play”.
Advantages: fast and simple to implement.
Disadvantages: the output may not be a real word (e.g., “studies” → “studi”).
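A toy suffix stripper shows the idea (a sketch far simpler than the Porter stemmer that libraries actually use):

```python
def stem(word: str) -> str:
    """Rule-based suffix stripping; keeps at least a 3-letter stem."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["playing", "played", "plays"]])  # → ['play', 'play', 'play']
print(stem("studies"))  # → 'stud'  (not a real word — the known drawback)
```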
Lemmatization converts words to their dictionary form (lemma).
Example: “running” → “run”, “better” → “good”, “mice” → “mouse”.
Advantages: produces valid dictionary words and is more accurate than stemming.
Disadvantages: slower than stemming and requires a vocabulary and part-of-speech information.
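A dictionary lookup captures the core idea. The tiny lemma table here is an assumption standing in for a real resource such as WordNet, which is why lemmatization needs vocabulary data:

```python
# toy lemma dictionary; real lemmatizers consult WordNet-style resources
# plus part-of-speech tags
LEMMAS = {"better": "good", "running": "run", "ran": "run", "mice": "mouse"}

def lemmatize(word: str) -> str:
    # fall back to the lowercased word when no lemma is known
    return LEMMAS.get(word.lower(), word.lower())

print(lemmatize("better"))   # → good
print(lemmatize("running"))  # → run
```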
Bag of Words is a text representation technique that converts text into numerical vectors based on word frequency.
Order of words is ignored.
Sentences:
“I love NLP”
“I love AI”
Vocabulary:
[I, love, NLP, AI]
Vectors:
“I love NLP” → [1, 1, 1, 0]
“I love AI” → [1, 1, 0, 1]
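Bag of Words can be computed from scratch in a few lines. The two sentences here (“I love NLP”, “I love AI”) are inferred from the vocabulary above:

```python
def bag_of_words(sentences):
    # build the vocabulary in first-seen order, then count per sentence
    vocab = []
    for s in sentences:
        for w in s.split():
            if w not in vocab:
                vocab.append(w)
    vectors = [[s.split().count(w) for w in vocab] for s in sentences]
    return vocab, vectors

vocab, vectors = bag_of_words(["I love NLP", "I love AI"])
print(vocab)    # → ['I', 'love', 'NLP', 'AI']
print(vectors)  # → [[1, 1, 1, 0], [1, 1, 0, 1]]
```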
TF-IDF (Term Frequency–Inverse Document Frequency) improves BoW by weighting important words more heavily.
Term Frequency (TF) measures how often a word appears in a document.
Inverse Document Frequency (IDF) measures how rare a word is across all documents.
Rare words get higher weight.
Example: the word “data” gets a high TF-IDF score in a data science article, because it is frequent there but rare across a general collection of documents.
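A from-scratch computation using the basic formulas tf = count / doc_length and idf = log(N / df) makes the weighting concrete. The two example documents are assumptions; note that libraries such as scikit-learn use smoothed variants of idf:

```python
import math

def tf_idf(docs):
    """TF-IDF scores per document: tf(w) * log(N / df(w))."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for d in tokenized for w in d})
    # df: number of documents containing each word
    df = {w: sum(w in d for d in tokenized) for w in vocab}
    scores = []
    for d in tokenized:
        tf = {w: d.count(w) / len(d) for w in vocab}
        scores.append({w: tf[w] * math.log(n / df[w]) for w in vocab})
    return scores

docs = ["data science uses data", "cats sleep all day"]
scores = tf_idf(docs)
# "data" appears only in the first document, so it gets a positive weight there
print(round(scores[0]["data"], 3))  # → 0.347
```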
Word embeddings represent words as dense numerical vectors that capture semantic meaning.
Words with similar meanings have similar vectors.
Vector arithmetic:
King – Man + Woman ≈ Queen
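This analogy can be demonstrated with cosine similarity over vectors. The 3-dimensional vectors below are hand-made toy values (real embeddings like word2vec or GloVe have hundreds of dimensions learned from corpora):

```python
import math

# toy 3-d embeddings, hand-crafted so that the analogy holds
emb = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.2, 0.1],
    "woman": [0.5, 0.2, 0.9],
    "queen": [0.9, 0.8, 0.9],
    "apple": [0.1, 0.9, 0.4],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, elementwise
v = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(v, emb[w]))
print(best)  # → queen
```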
Text classification assigns predefined categories or labels to text, for example spam vs. not spam or topic labels.
Sentiment analysis identifies the emotional tone behind text.
Advanced sentiment analysis detects mixed sentiment, sarcasm, and aspect-level opinions.
Text:
“The service was slow, but the food was excellent.”
Sentiment:
Mixed sentiment (negative about the service, positive about the food)
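A minimal lexicon-based scorer can reproduce this result. The word lists here are small assumptions, not a standard lexicon such as VADER, and real systems use trained classifiers rather than word counting:

```python
# tiny hand-picked sentiment lexicons (assumptions for illustration)
POSITIVE = {"excellent", "good", "amazing", "great"}
NEGATIVE = {"slow", "bad", "terrible", "awful"}

def sentiment(text: str) -> str:
    words = {w.strip(".,!?").lower() for w in text.split()}
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos and neg:
        return "mixed"
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(sentiment("The service was slow, but the food was excellent."))  # → mixed
```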