
    The Machine’s Challenge: Understanding Text Data

    By tundeoyeyemi2002 · July 6, 2025 · 8 min read

    It’s easy for you and me to understand a sentence on a computer screen in a second, but this is not the case for a machine. Machines cannot do this because they simply cannot understand or process textual data in its raw form.

    In practice, we need to convert text into a numeric format that machines can read and process easily, and that is one of the core ideas of what we learn in NLP.

    The concepts of Bag of Words (BOW) and TF-IDF play a role in this text understanding: both are techniques that help us convert text sentences into numeric vectors.

    So, in this blog, we will discuss Bag of Words and TF-IDF, using intuitive, everyday examples to learn more about each concept. Let’s get started.

    Introduction

    Most of us love shopping online (to varying degrees) and tend to check product reviews before finalizing a purchase. I know many of you do the same!

    So, taking this case as an example, let’s look at a few comments about a specific novel we plan to buy:

    • Comment 1: This novel is very humorous and long
    • Comment 2: This novel is not humorous, but is interesting
    • Comment 3: This novel is based on facts and interesting

    Clearly, we can draw many useful insights from these comments and use them to decide the novel’s rating. However, as mentioned above, we cannot simply hand these sentences to a machine learning model and ask it to predict whether a comment is positive or negative. Therefore, we need to perform certain text preprocessing steps, such as BOW and TF-IDF.

    Creating vectors from text

    Before going further, can you suggest some techniques we could use to turn these sentences into numbers? Yes, if the phrase “word embeddings” came to your mind, you are absolutely right. So, what does word embedding mean?

    In short, word embedding is a technique in which we represent text using vectors. The two simplest forms are BOW, which stands for Bag of Words, and TF-IDF, which stands for Term Frequency–Inverse Document Frequency. Now let’s see how the novel comments mentioned above can be represented as embeddings and prepared for machine learning models.


    Bag of Words (BOW) model

    The Bag of Words (BOW) model is the simplest form of representing text with numbers. It is a common way to turn text data into input feature vectors for ML models. As the name suggests, we represent each sentence as a vector of word counts.

    The best thing about using a bag of words in our model is that it is very flexible and easy to understand and implement.

    Do you know the practical applications of a bag of words? It is widely used in natural language processing for document classification and for retrieving information from documents.

    Recall the three novel comments we saw before:

    • Comment 1: This novel is very humorous and long
    • Comment 2: This novel is not humorous, but is interesting
    • Comment 3: This novel is based on facts and interesting

    Step 1: The comment list above is the data we collected for building the bag of words model. For this small example, we treat each of the three comments as a separate “document”. Together, they form our corpus.

    Step 2: The next step is to design our vocabulary: we list all the unique words in the three comments above. The vocabulary consists of the following 12 words: “this”, “novel”, “is”, “very”, “humorous”, “and”, “long”, “not”, “but”, “interesting”, “facts”, “based”.

    Step 3: With the vocabulary built, we now create the document vectors by scoring the occurrence of each vocabulary word in each comment. We go through each comment and record how many times each vocabulary word appears. This gives us three vectors, one per comment:

    Term        this  novel  is  very  humorous  and  long  not  but  interesting  facts  based  Length
    Comment 1   1     1      1   1     1         1    1     0    0    0            0      0      7
    Comment 2   1     1      2   0     1         0    0     1    1    1            0      0      8
    Comment 3   1     1      1   0     0         1    0     0    0    1            1      1      7

    Step 4: This gives us the final document vectors. Comment vector 1: [1 1 1 1 1 1 1 0 0 0 0 0]; Comment vector 2: [1 1 2 0 1 0 0 1 1 1 0 0]; Comment vector 3: [1 1 1 0 0 1 0 0 0 1 1 1].
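The four steps above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: the function names are my own, and note that this simple whitespace tokenizer also keeps the word “on” from Comment 3, so the vocabulary here ends up with 13 entries rather than the 12 listed above.

```python
from collections import Counter

# The three novel comments as our "documents" (the corpus),
# lowercased and with punctuation already stripped.
corpus = [
    "this novel is very humorous and long",
    "this novel is not humorous but is interesting",
    "this novel is based on facts and interesting",
]

def build_vocabulary(docs):
    """Collect every unique word across the corpus, in first-seen order."""
    vocab = []
    for doc in docs:
        for word in doc.lower().split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def bow_vector(doc, vocab):
    """Score each vocabulary word by how many times it appears in the document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

vocab = build_vocabulary(corpus)
vectors = [bow_vector(doc, vocab) for doc in corpus]
```

Counter returns 0 for words absent from a document, which is exactly what we want for the zero entries in each vector.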

    Evaluating novel comments like this is a classification problem, often referred to as sentiment analysis. A popular technique for building sentiment analysis models is to use a bag of words in the model, which converts documents into vectors where each word in the document is assigned a score.

    That is the basic idea behind the Bag of Words (BOW) model. I hope you now have a basic understanding of it.

    Disadvantages of using BOW models


    In the example above, we got vectors of length 12. However, when we add new sentences to the dataset, we start to face problems. Let’s see why:

    1. First, if a new sentence introduces new words, the vocabulary grows, and therefore the length of every vector grows with it.
    2. Furthermore, the vectors will contain many 0s, producing a sparse matrix, which we would like to avoid.
    3. Finally, we retain no information about sentence grammar or word order in the text.

    Are you a programming enthusiast, or would you like to learn how to implement a bag of words model? If so, you are on the right track: read the article on implementing a bag of words using Python.


    Term Frequency–Inverse Document Frequency (TF-IDF)

    To overcome some of these shortcomings of the bag of words, we move to the TF-IDF vector for a given document. Formally, TF-IDF is a numerical statistic designed to reflect how important a word is to a document in a collection or corpus. So, what is term frequency?

    In simple terms, the term frequency is defined as:

    TF(t, d) = n(t, d) / (number of terms in document d)

    Here, n(t, d) in the numerator is the number of times the term t appears in document d, so each document-term pair has its own TF value.

    So, what about IDF? It is a measure of how important a term is across documents. We need IDF values because calculating TF alone is not enough to understand the importance of words. Mathematically, we can represent it as:

    IDF(t) = log(number of documents / number of documents containing the term t)

    In short: term frequency scores how frequent a word is in the current document, while inverse document frequency scores how rare the word is across documents.
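With both definitions in hand, TF and IDF can be written directly as small Python functions. This is a minimal sketch using the three comments from earlier; the function and variable names are my own.

```python
import math

corpus = [
    "this novel is very humorous and long",
    "this novel is not humorous but is interesting",
    "this novel is based on facts and interesting",
]
docs = [doc.lower().split() for doc in corpus]

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log of (total docs / docs containing term)."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """TF-IDF score of a term in one document relative to the corpus."""
    return tf(term, doc) * idf(term, docs)
```

Note that a word like “this”, which appears in every document, gets IDF = log(3/3) = 0, so its TF-IDF score is 0 no matter how often it occurs: common words are down-weighted exactly as intended.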

    We will once again use the novel review vocabulary to show how to calculate the TF for Comment 2: “This novel is not humorous, but is interesting”. Here,

    Step 1: Words in Comment 2: “this”, “novel”, “is”, “not”, “humorous”, “but”, “is”, “interesting”. Number of words in Comment 2 = 8.

    Step 2: TF(“this”) = (number of times the word “this” appears in Comment 2) / (number of terms in Comment 2) = 1/8

    Step 3: In the same way, we can calculate the term frequency of every term in every comment:

    Term         Comment 1  Comment 2  Comment 3  TF (Comment 1)  TF (Comment 2)  TF (Comment 3)
    this         1          1          1          1/7             1/8             1/7
    novel        1          1          1          1/7             1/8             1/7
    is           1          2          1          1/7             1/4             1/7
    very         1          0          0          1/7             0               0
    humorous     1          1          0          1/7             1/8             0
    and          1          0          1          1/7             0               1/7
    long         1          0          0          1/7             0               0
    not          0          1          0          0               1/8             0
    but          0          1          0          0               1/8             0
    interesting  0          1          1          0               1/8             1/7
    facts        0          0          1          0               0               1/7
    based        0          0          1          0               0               1/7

    The next question is: what advantages does TF-IDF offer that made us leave the bag of words model behind? For one, the TF-IDF vectorized form is very easy to compute, and it gives us a basic metric for extracting the most descriptive terms in a document.

    Furthermore, we can easily calculate the similarity between two documents using this technique.
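As an illustration of document similarity, here is how cosine similarity could be computed between two of the count vectors built earlier. This is a sketch; the helper name is my own, and the same function works unchanged on TF-IDF vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# BOW vectors for Comment 1 and Comment 3 over the 12-word vocabulary above
v1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
v3 = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1]

similarity = cosine_similarity(v1, v3)
```

The two comments share four vocabulary words (“this”, “novel”, “is”, “and”), giving a similarity of 4/7, partway between 0 (no shared words) and 1 (identical vectors).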

    But this form of vectorization also has some disadvantages. Because TF-IDF is based on the principle of the bag of words (BOW) model, it does not capture word position, semantics, or co-occurrence across different documents.

    This is precisely why TF-IDF is only useful as a lexical-level feature. It also fails to capture semantics, in contrast to topic models and word embeddings.

    Closing words

    From the sections above, we can conclude that the bag of words is simply an algorithm that counts how many times a word appears in a document. With TF-IDF, words are weighted: TF-IDF helps measure relevance, not just frequency.

    That is, the raw word counts are replaced by TF-IDF scores across the whole dataset. There is no doubt that the bag of words is easier to interpret, but TF-IDF usually performs better in machine learning models. However, these forms of word embedding still have many limitations. This is where Word2Vec, GloVe, Continuous Bag of Words (CBOW), FastText, Skip-gram and other word embedding techniques come in.

    Contributor: Ram Tavva
