
    The Machine’s Challenge: Understanding Text Data

    By tundeoyeyemi2002 · July 6, 2025 · 8 min read

    It’s easy for you and me to understand a sentence on a computer screen in a second, but this is not the case for a machine. Machines cannot do this because they simply cannot understand or process textual data in its raw form.

    In practice, we need to convert text into a numeric format that machines can read and process easily, and that is one of the core ideas of what we learn in NLP.

    The concepts of Bag of Words (BOW) and TF-IDF play a role in this text understanding: both are techniques that help us convert text sentences into numeric vectors.

    So, in this blog, we will discuss Bag of Words and TF-IDF, using intuitive, everyday examples to learn more about each concept. Let’s get started.

    Introduction

    Most of us love shopping online (to varying degrees) and tend to check product reviews before finalizing a purchase. I know many of you do the same!

    So, taking this case as an example, let’s look at a few comments about a specific novel we plan to buy:

    • Comment 1: This novel is very humorous and long
    • Comment 2: This novel is not humorous, but is interesting
    • Comment 3: This novel is based on facts and interesting

    Clearly, we can draw many useful insights from these comments and use them to decide the novel’s rating. However, as mentioned above, we cannot simply hand these sentences to a machine learning model and ask it to predict whether a comment is positive or negative. Therefore, we need to perform certain text preprocessing steps, such as BOW and TF-IDF.

    Creating vectors from text

    Before going further, can you suggest some techniques we could use to turn these sentences into numbers? Yes, if the phrase “word embeddings” came to your mind, you are absolutely right. So, what does word embedding mean?

    In short, word embedding is a technique in which we represent text using vectors. The two simplest forms are BOW, which stands for Bag of Words, and TF-IDF, which stands for Term Frequency–Inverse Document Frequency. Now let’s see how the novel comments mentioned above can be represented as embeddings and prepared for machine learning models.


    Bag of Words (BOW) model

    The Bag of Words (BOW) model is the simplest form of representing text with numbers. It is a common way to turn text data into input feature vectors for ML models. As the name suggests, we represent each sentence as a vector of word counts.

    The best thing about using a bag of words in our model is that it is very flexible and easy to understand and implement.

    Do you know the practical applications of a bag of words? It is widely used in natural language processing for document classification and for retrieving information from documents.

    Recall the three novel comments we saw before:

    • Comment 1: This novel is very humorous and long
    • Comment 2: This novel is not humorous, but is interesting
    • Comment 3: This novel is based on facts and interesting

    Step 1: The comment list above is the data we collected for building the bag of words model. For this small example, we treat each of the three comments as a separate “document”. Together, they form our corpus.

    Step 2: The next step is to design our vocabulary: we list all the unique words in the three comments above. The vocabulary consists of the following 12 words: “this”, “novel”, “is”, “very”, “humorous”, “and”, “long”, “not”, “but”, “interesting”, “facts”, “based”.

    Step 3: With the vocabulary built, we now create the document vectors by scoring the occurrence of each vocabulary word in each comment. We go through each comment and record how many times each vocabulary word appears. This gives us three vectors, one per comment:

    Term        this  novel  is  very  humorous  and  long  not  but  interesting  facts  based  Length
    Comment 1   1     1      1   1     1         1    1     0    0    0            0      0      7
    Comment 2   1     1      2   0     1         0    0     1    1    1            0      0      8
    Comment 3   1     1      1   0     0         1    0     0    0    1            1      1      7

    Step 4: This gives us the final document vectors. Comment vector 1: [1 1 1 1 1 1 1 0 0 0 0 0]; Comment vector 2: [1 1 2 0 1 0 0 1 1 1 0 0]; Comment vector 3: [1 1 1 0 0 1 0 0 0 1 1 1].
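The four steps above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: the function names are my own, and note that this simple whitespace tokenizer also keeps the word “on” from Comment 3, so the vocabulary here ends up with 13 entries rather than the 12 listed above.

```python
from collections import Counter

# The three novel comments as our "documents" (the corpus),
# lowercased and with punctuation already stripped.
corpus = [
    "this novel is very humorous and long",
    "this novel is not humorous but is interesting",
    "this novel is based on facts and interesting",
]

def build_vocabulary(docs):
    """Collect every unique word across the corpus, in first-seen order."""
    vocab = []
    for doc in docs:
        for word in doc.lower().split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def bow_vector(doc, vocab):
    """Score each vocabulary word by how many times it appears in the document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

vocab = build_vocabulary(corpus)
vectors = [bow_vector(doc, vocab) for doc in corpus]
```

Counter returns 0 for words absent from a document, which is exactly what we want for the zero entries in each vector.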

    Evaluating novel comments like this is a classification problem, often referred to as sentiment analysis. A popular technique for building sentiment analysis models is to use a bag of words in the model, which converts documents into vectors where each word in the document is assigned a score.

    That is the basic idea behind the Bag of Words (BOW) model. I hope you now have a basic understanding of it.

    Disadvantages of using BOW models


    In the example above, we got vectors of length 12. However, when we add new sentences to the dataset, we start to face problems. Let’s see why:

    1. First, if a new sentence introduces new words, the vocabulary grows, and therefore the length of every vector grows with it.
    2. Furthermore, the vectors will contain many 0s, producing a sparse matrix, which we would like to avoid.
    3. Finally, we retain no information about sentence grammar or word order in the text.

    Are you a programming enthusiast, or would you like to learn how to implement a bag of words model? If so, you are on the right track: read the article on implementing a bag of words using Python.


    Term Frequency–Inverse Document Frequency (TF-IDF)

    To overcome some of these shortcomings of the bag of words, we move to the TF-IDF vector for a given document. Formally, TF-IDF is a numerical statistic designed to reflect how important a word is to a document in a collection or corpus. So, what is term frequency?

    In simple terms, the term frequency is defined as:

    TF(t, d) = n(t, d) / (number of terms in document d)

    Here, n(t, d) in the numerator is the number of times the term t appears in document d, so each document-term pair has its own TF value.

    So, what about IDF? It is a measure of how important a term is across documents. We need IDF values because calculating TF alone is not enough to understand the importance of words. Mathematically, we can represent it as:

    IDF(t) = log(number of documents / number of documents containing the term t)

    In short: term frequency scores how frequent a word is in the current document, while inverse document frequency scores how rare the word is across documents.
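With both definitions in hand, TF and IDF can be written directly as small Python functions. This is a minimal sketch using the three comments from earlier; the function and variable names are my own.

```python
import math

corpus = [
    "this novel is very humorous and long",
    "this novel is not humorous but is interesting",
    "this novel is based on facts and interesting",
]
docs = [doc.lower().split() for doc in corpus]

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log of (total docs / docs containing term)."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """TF-IDF score of a term in one document relative to the corpus."""
    return tf(term, doc) * idf(term, docs)
```

Note that a word like “this”, which appears in every document, gets IDF = log(3/3) = 0, so its TF-IDF score is 0 no matter how often it occurs: common words are down-weighted exactly as intended.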

    We will once again use the novel review vocabulary to show how to calculate the TF for Comment 2: “This novel is not humorous, but is interesting”. Here,

    Step 1: Words in Comment 2: “this”, “novel”, “is”, “not”, “humorous”, “but”, “is”, “interesting”. Number of words in Comment 2 = 8.

    Step 2: TF(“this”) = (number of times the word “this” appears in Comment 2) / (number of terms in Comment 2) = 1/8

    Step 3: In the same way, we can calculate the term frequency of every term in every comment:

    Term         Comment 1  Comment 2  Comment 3  TF (Comment 1)  TF (Comment 2)  TF (Comment 3)
    this         1          1          1          1/7             1/8             1/7
    novel        1          1          1          1/7             1/8             1/7
    is           1          2          1          1/7             1/4             1/7
    very         1          0          0          1/7             0               0
    humorous     1          1          0          1/7             1/8             0
    and          1          0          1          1/7             0               1/7
    long         1          0          0          1/7             0               0
    not          0          1          0          0               1/8             0
    but          0          1          0          0               1/8             0
    interesting  0          1          1          0               1/8             1/7
    facts        0          0          1          0               0               1/7
    based        0          0          1          0               0               1/7

    The next question is: what advantages does TF-IDF offer that made us leave the bag of words model behind? For one, the TF-IDF vectorized form is very easy to compute, and it gives us a basic metric for extracting the most descriptive terms in a document.

    Furthermore, we can easily calculate the similarity between two documents using this technique.
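As an illustration of document similarity, here is how cosine similarity could be computed between two of the count vectors built earlier. This is a sketch; the helper name is my own, and the same function works unchanged on TF-IDF vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# BOW vectors for Comment 1 and Comment 3 over the 12-word vocabulary above
v1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
v3 = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1]

similarity = cosine_similarity(v1, v3)
```

The two comments share four vocabulary words (“this”, “novel”, “is”, “and”), giving a similarity of 4/7, partway between 0 (no shared words) and 1 (identical vectors).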

    But this form of vectorization also has some disadvantages. Because TF-IDF is based on the principle of the bag of words (BOW) model, it does not capture word position, semantics, or co-occurrence across different documents.

    This is precisely why TF-IDF is only useful as a lexical-level feature. It also fails to capture semantics, in contrast to topic models and word embeddings.

    Closing words

    From the sections above, we can conclude that the bag of words is simply an algorithm that counts how many times a word appears in a document. With TF-IDF, words are weighted: TF-IDF helps measure relevance, not just frequency.

    That is, the raw word counts are replaced by TF-IDF scores across the whole dataset. There is no doubt that the bag of words is easier to interpret, but TF-IDF usually performs better in machine learning models. However, these forms of word embedding still have many limitations. This is where Word2Vec, GloVe, Continuous Bag of Words (CBOW), FastText, Skip-gram and other word embedding techniques come in.

    Contributor: Ram Tavva
