Thursday, September 25, 2025

CLASS X (Artificial Intelligence) - Document Vectorization and Bag of Words (BoW)

 


 

Q. What do you mean by a corpus?

ANS - A corpus is the complete collection of documents being processed.

  

Q. What is term frequency?

ANS - Term frequency is the number of times a word occurs in a single document.
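
As a small illustration, term frequency can be found by counting the words of one document. This is only a sketch: the function name `term_frequency`, the sample sentence, and the simple lowercase-and-split tokenisation are assumptions made for this example.

```python
import re
from collections import Counter

def term_frequency(document):
    """Count how often each word occurs in one document."""
    # Assumed tokenisation: lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(tokens)

# Hypothetical sentence, chosen so that one word repeats.
tf = term_frequency("Aman went to a therapist and Aman felt better.")
print(tf["aman"])  # 2
print(tf["went"])  # 1
```

A `Counter` returns 0 for a word that never occurs, which matches the idea that an absent word has a term frequency of zero.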

 

Q. What is the full form of TFIDF?

ANS - Term Frequency and Inverse Document Frequency

 

Q. What is a document vector table?

ANS - A document vector table is the table built while implementing the Bag of Words algorithm. It contains the frequency of each word of the vocabulary in each document.

 

Q. Explain the concept of Bag of Words.

ANS - Bag of Words is a Natural Language Processing (NLP) model that helps in extracting features from text. It counts the occurrences of each word in every document and builds the vocabulary of the corpus from the unique words, without regard to word order.

 

Q. Consider the following documents:

Document 1: Aman and Anil are stressed.
Document 2: Aman went to a therapist.
Document 3: Anil went to download a health chatbot.

Implement all four steps of the Bag of Words (BoW) model to create a document vector table.

 

ANS - STEP 1. Text Normalisation

After text normalisation, the text becomes:
Document 1: [aman, and, anil, are, stressed]
Document 2: [aman, went, to, a, therapist]
Document 3: [anil, went, to, download, a, health, chatbot]
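
The normalisation shown here can be sketched in a few lines of Python. The helper name `normalise` is an assumption for this example, and it only lowercases the text and drops punctuation. Stopwords such as "and" or "a" are kept, because the answer above keeps them.

```python
import re

def normalise(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

documents = [
    "Aman and Anil are stressed.",
    "Aman went to a therapist.",
    "Anil went to download a health chatbot.",
]

normalised = [normalise(doc) for doc in documents]
for number, tokens in enumerate(normalised, start=1):
    print(f"Document {number}: {tokens}")
```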

 

STEP 2. Create Dictionary

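Creating a dictionary means making a list of all the unique words occurring in all three documents. Here the dictionary contains 12 words:

aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot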
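
One possible way to build the dictionary in code is to walk through the normalised documents and keep each word the first time it appears. The function name `build_vocabulary` is an assumption for this sketch, and the token lists are copied from Step 1.

```python
def build_vocabulary(tokenised_documents):
    """Return the unique words in order of first appearance."""
    vocabulary = []
    seen = set()
    for tokens in tokenised_documents:
        for word in tokens:
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

tokenised = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

vocabulary = build_vocabulary(tokenised)
print(vocabulary)
print(len(vocabulary))  # 12
```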

 

STEP 3. Create document vectors

In this step, the vocabulary is written in the top row. For each word in the document that matches a word in the vocabulary, put a 1 under it. If the same word appears again, increment that value by 1. If a vocabulary word does not occur in the document, put a 0 under it.
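
The counting rule described in this step can be sketched as follows. The function name `document_vector` is an assumption for this example, and the vocabulary and tokens are taken from the three documents in the question.

```python
def document_vector(tokens, vocabulary):
    """Build one row of the table: a count for every vocabulary word."""
    # Start with 0 under every vocabulary word.
    vector = [0] * len(vocabulary)
    position = {word: index for index, word in enumerate(vocabulary)}
    for word in tokens:
        if word in position:
            # First match makes the cell 1; each repeat adds 1 more.
            vector[position[word]] += 1
    return vector

vocabulary = ["aman", "and", "anil", "are", "stressed", "went",
              "to", "a", "therapist", "download", "health", "chatbot"]
document_1 = ["aman", "and", "anil", "are", "stressed"]

print(document_vector(document_1, vocabulary))
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```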

 

STEP 4. Create document vectors for all the documents

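Repeat step 3 for every document to get its vector. The document vectors for all three documents are:

             aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1:  1     1    1     1    1         0     0   0  0          0         0       0
Document 2:  1     0    0     0    0         1     1   1  1          0         0       0
Document 3:  0     0    1     0    0         1     1   1  0          1         1       1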

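NOTE - The table above is the document vector table. It holds raw word counts, that is, term frequencies. TF-IDF (Term Frequency and Inverse Document Frequency) is a further step that weights each count by how rare the word is across the corpus, so the table by itself is not TF-IDF.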
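
For comparison, TF-IDF weights can be derived from the counts in a document vector table. The sketch below assumes one common definition, term frequency multiplied by log10(number of documents / number of documents containing the word). Other sources use slightly different formulas. The function name `tfidf` is hypothetical, and the counts are those of the three documents in the question.

```python
import math

vocabulary = ["aman", "and", "anil", "are", "stressed", "went",
              "to", "a", "therapist", "download", "health", "chatbot"]

# Word counts for Documents 1 to 3, in vocabulary order.
table = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1],
]

def tfidf(table):
    """Assumed formula: count * log10(total documents / document frequency)."""
    n_docs = len(table)
    n_words = len(table[0])
    # Document frequency: in how many documents each word occurs.
    # Every vocabulary word occurs at least once, so this is never zero here.
    df = [sum(1 for row in table if row[j] > 0) for j in range(n_words)]
    return [[row[j] * math.log10(n_docs / df[j]) for j in range(n_words)]
            for row in table]

weights = tfidf(table)
for word, weight in zip(vocabulary, weights[0]):
    print(f"{word:10s} {weight:.3f}")
```

Under this assumed formula, "aman" (found in two of the three documents) gets a lower weight in Document 1 than "stressed" (found in only one), even though both have a count of 1.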

 

Q. Create a document vector table for the given corpus:

Document 1: We are going to Mumbai

Document 2: Mumbai is a famous place.

Document 3: We are going to a famous place.

Document 4: I am famous in Mumbai.

 

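Ans: - After text normalisation, the dictionary contains 12 unique words. Applying the four steps gives the following document vector table:

             we  are  going  to  mumbai  is  a  famous  place  i  am  in
Document 1:  1   1    1      1   1       0   0  0       0      0  0   0
Document 2:  0   0    0      0   1       1   1  1       1      0  0   0
Document 3:  1   1    1      1   0       0   1  1       1      0  0   0
Document 4:  0   0    0      0   1       0   0  1       0      1  1   1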
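
The table for this corpus can also be generated with a short program that runs all four steps in sequence. This is a sketch under simple assumptions: normalisation only lowercases the text and removes punctuation, and the vocabulary is kept in order of first appearance.

```python
import re

corpus = [
    "We are going to Mumbai",
    "Mumbai is a famous place.",
    "We are going to a famous place.",
    "I am famous in Mumbai.",
]

# Step 1: text normalisation (lowercase, drop punctuation, split into words).
tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in corpus]

# Step 2: dictionary of unique words, in order of first appearance.
vocabulary = list(dict.fromkeys(word for tokens in tokenised for word in tokens))

# Steps 3 and 4: one count vector per document.
table = [[tokens.count(word) for word in vocabulary] for tokens in tokenised]

print(" " * 12 + " ".join(f"{word:>6}" for word in vocabulary))
for number, row in enumerate(table, start=1):
    print(f"Document {number}: " + " ".join(f"{count:>6}" for count in row))
```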
