Thursday, September 25, 2025

CLASS X (Artificial Intelligence) - Document Vectorization and Bag of Words (BoW)

 


 

Q. What do you mean by corpus?

ANS - A corpus is the collection of all the documents taken together.

  

Q. What is term frequency?

ANS - Term frequency is the number of times a word occurs in one document.

 

Q. What is the full form of TFIDF?

ANS - Term Frequency - Inverse Document Frequency

 

Q. What is a document vector table?

ANS - A Document Vector Table is a table containing the frequency of each word of the vocabulary in each document. It is built while implementing the Bag of Words algorithm.

 

Q. Explain the concept of Bag of Words.

ANS - Bag of Words is a Natural Language Processing (NLP) model that helps extract features from text. In Bag of Words, we count the occurrences of each word and construct the vocabulary for the corpus.

 

Q. Consider the following documents

Document 1: Aman and Anil are stressed.
Document 2: Aman went to a therapist.
Document 3: Anil went to download a health chatbot.

Implement all the four steps of Bag of Words (BoW) model to create a document vector table.

 

ANS - STEP 1. Text Normalisation

After text normalisation, the text becomes:
Document 1: [aman, and, anil, are, stressed]
Document 2: [aman, went, to, a, therapist]
Document 3: [anil, went, to, download, a, health, chatbot]

 

STEP 2. Create Dictionary

Creating a dictionary means making a list of all the unique words that occur across the three documents.

Dictionary: aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot

 

STEP 3. Create document vectors

In this step, the vocabulary is written in the top row. Now, for each word in the document, if it matches with the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. And if the word does not occur in that document, put a 0 under it.

 

STEP 4. Create document vectors for all the documents

Repeat step 3 for all the documents to get vectors.

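Document vectors for all documents:

            aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1  1     1    1     1    1         0     0   0  0          0         0       0
Document 2  1     0    0     0    0         1     1   1  1          0         0       0
Document 3  0     0    1     0    0         1     1   1  0          1         1       1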

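NOTE - The table above is called the Document Vector Table. Each entry in it is a term frequency (the count of a word in a document). TF-IDF (Term Frequency - Inverse Document Frequency) is a further step that weights these counts by how rare each word is across the corpus.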
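The four steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function names (normalise, build_vocabulary, document_vector) are made up for this example, and text normalisation is assumed to mean just lowercasing and removing punctuation, as in Step 1 (no stopword removal or stemming).

```python
import string


def normalise(text):
    """Step 1: lowercase the text, strip punctuation and split into words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_vocabulary(tokenised_docs):
    """Step 2: list every unique word, in order of first appearance."""
    vocabulary = []
    for tokens in tokenised_docs:
        for word in tokens:
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def document_vector(tokens, vocabulary):
    """Step 3: count how often each vocabulary word occurs in one document."""
    return [tokens.count(word) for word in vocabulary]


corpus = [
    "Aman and Anil are stressed.",
    "Aman went to a therapist.",
    "Anil went to download a health chatbot.",
]

tokenised = [normalise(doc) for doc in corpus]
vocabulary = build_vocabulary(tokenised)

# Step 4: repeat step 3 for every document to get the full table.
table = [document_vector(tokens, vocabulary) for tokens in tokenised]

print(vocabulary)
for number, row in enumerate(table, start=1):
    print(f"Document {number}: {row}")
```

Each printed row is one document vector, with one number per vocabulary word. The same code can be tried on any other corpus by changing the corpus list.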

 

Q. Create a document vector table for the given corpus:

Document 1: We are going to Mumbai

Document 2: Mumbai is a famous place.

Document 3: We are going to a famous place.

Document 4: I am famous in Mumbai.

 

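Ans: - After text normalisation (lowercasing and removing punctuation), the vocabulary and the document vector table are:

            we  are  going  to  mumbai  is  a  famous  place  i  am  in
Document 1  1   1    1      1   1       0   0  0       0      0  0   0
Document 2  0   0    0      0   1       1   1  1       1      0  0   0
Document 3  1   1    1      1   0       0   1  1       1      0  0   0
Document 4  0   0    0      0   1       0   0  1       0      1  1   1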

CLASS X (Artificial Intelligence) - Confusion Matrix and the four methods to evaluate the model

 


Confusion Matrix - A confusion matrix is a table that compares the predictions made by the model with reality (the actual outcomes).

How to interpret a confusion matrix for a machine learning model
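As a small illustration of comparing prediction with reality, the sketch below counts the four cells of a confusion matrix from two lists. The function name confusion_counts and the sample lists are made up for this example. True stands for "yes" and False for "no", and the usual meanings are assumed: TP = predicted yes and actually yes, TN = predicted no and actually no, FP = predicted yes but actually no, FN = predicted no but actually yes.

```python
def confusion_counts(predictions, reality):
    """Count TP, TN, FP and FN by comparing each prediction with reality."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for predicted, actual in zip(predictions, reality):
        if predicted and actual:
            counts["TP"] += 1  # predicted yes, reality yes
        elif not predicted and not actual:
            counts["TN"] += 1  # predicted no, reality no
        elif predicted and not actual:
            counts["FP"] += 1  # predicted yes, reality no
        else:
            counts["FN"] += 1  # predicted no, reality yes
    return counts


# Hypothetical sample data: six predictions and the six actual outcomes.
predictions = [True, True, False, False, True, False]
reality = [True, False, False, True, True, False]

counts = confusion_counts(predictions, reality)
print(counts)
```

For the sample lists this gives 2 true positives, 2 true negatives, 1 false positive and 1 false negative.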

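There are four methods to evaluate the model:

ACCURACY = (TP + TN) / (TP + TN + FP + FN) * 100

PRECISION = TP / (TP + FP) * 100

RECALL = TP / (TP + FN)

F1 SCORE = 2 * (PRECISION * RECALL) / (PRECISION + RECALL)

Here TP = True Positive, TN = True Negative, FP = False Positive and FN = False Negative. In the F1 Score formula, precision and recall must both be written on the same scale (both as fractions or both as percentages).
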

QUESTION

Consider the scenario where the AI model is created to predict if there will be rain or not. The confusion matrix for the same is given below. Calculate precision, accuracy and recall.

TP = 70   TN = 50

FN = 50  FP = 30

NOW CALCULATE ALL FOUR (ACCURACY, PRECISION, RECALL AND F1 SCORE) USING THE GIVEN FORMULAS

ACCURACY = (70+50) / (70 + 50 + 30 + 50) * 100

ACCURACY = 120/200 * 100

ACCURACY = 60%

PRECISION = 70/(70 + 30) * 100

PRECISION = 70/100 * 100

PRECISION = 70%

 

 

RECALL = 70/(70 + 50)

RECALL = 70/120

RECALL = 0.583

 

 

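F1 SCORE = 2 * (0.70 * 0.583) / (0.70 + 0.583)

F1 SCORE = 2 * 0.408 / 1.283

F1 SCORE = 0.816 / 1.283

F1 SCORE = 0.636

(Precision is written here as 0.70 instead of 70 so that it is on the same scale as recall. The F1 score always lies between 0 and 1.)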

 

How many total tests were performed in the above scenario?

TP + TN + FN + FP

70 + 50 + 50 + 30 = 200
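The calculations for the rain prediction question can be cross-checked with a short Python sketch. The function name evaluate is made up for this example. All four values are kept as fractions between 0 and 1 here, so multiply by 100 to get percentages.

```python
def evaluate(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 score as fractions (0 to 1)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 uses precision and recall on the same scale (both fractions here).
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Values from the confusion matrix in the question.
accuracy, precision, recall, f1 = evaluate(tp=70, tn=50, fp=30, fn=50)

print(f"Accuracy  = {accuracy:.3f}")
print(f"Precision = {precision:.3f}")
print(f"Recall    = {recall:.3f}")
print(f"F1 Score  = {f1:.3f}")
```

With TP = 70, TN = 50, FP = 30 and FN = 50 this should give an accuracy of 0.600, a precision of 0.700, a recall of about 0.583 and an F1 score of about 0.636.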

 

 

 

 

HOMEWORK