END TO END MODEL BUILDING FOR TWEET CLASSIFICATION AND DEPLOYMENT USING STREAMLIT
Raising concerns through social media is common practice today because of the wide reach such platforms provide. With the internet and mobile devices now ubiquitous, many people use platforms like Twitter, Facebook and Instagram to make their voice reach as many people as possible. But this reach is also misused: some people spread false news and rumours, causing even more trouble and unrest. For this very reason it is necessary to classify disaster-related tweets into those which are about real disasters and those which are not.
Problem objective & ML formulation: As described in the introduction, the objective is to identify whether a tweet is about a real disaster or not. Since the tweets are text, the first step from an ML perspective is NLP (Natural Language Processing), i.e. converting the text into a numerical form that the ML/DL model used for prediction can understand. Once the texts are processed, this becomes a binary classification problem
(Class 1: the tweet is about a real disaster, Class 0: the tweet is not about a real disaster), and the classification can be done using various ML/DL models.
The Dataset: Since this problem is hosted on Kaggle as a competition, the dataset is available on the Kaggle platform itself.
The data consists of the following features in addition to the target column.
id — a unique identifier for each tweet
text — the text of the tweet
location — the location the tweet was sent from (may be blank)
keyword — a particular keyword from the tweet (may be blank)
The target column has 1 for tweets which are about real disaster and 0 for tweets which are not.
The training data set consists of 7613 tweets.
LOAD THE DATA AND VISUALIZE
We load the data as a pandas data-frame named “df”.
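A minimal sketch of this loading step (assuming the competition's train.csv has been downloaded to the working directory):

```python
import pandas as pd

# Load the Kaggle training data into a data-frame named "df"
df = pd.read_csv("train.csv")

# Quick look at the shape and the first few rows
print(df.shape)
print(df.head())
```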
EXPLORATORY DATA ANALYSIS
1. PERCENTAGE NULL VALUES IN EACH FEATURE
Since the location feature has many null values (33.27%), this feature is dropped.
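A short sketch of how the null-value percentages can be computed and the location column dropped (column names as listed above):

```python
# Percentage of null values in each feature
null_pct = df.isnull().mean() * 100
print(null_pct.sort_values(ascending=False))

# "location" has ~33% nulls, so it is dropped
df = df.drop(columns=["location"])
```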
2. VISUALIZE CORRELATION COEFFICIENTS BETWEEN FEATURES AND TARGETS
From the correlation heat-map above we can observe that the “keyword” feature has a correlation of 0.55 with the target and only 0.8% null values, so we will retain this feature.
We can also observe that our main feature “text” is highly correlated with the target, with a correlation coefficient of 1.
Since the “id” feature does not contribute towards identifying whether a tweet is related to a disaster or not, it will be dropped.
3. VISUALIZE COUNT PLOTS FOR TARGETS AND FEATURES
As observed above, the class imbalance is not huge, so the first approach will be without any under-sampling or over-sampling technique.
We can also observe that for some keywords disaster and non-disaster tweets are roughly equiprobable, whereas for others disaster-related tweets are much more likely.
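A sketch of these count plots using seaborn (the styling and the number of keywords shown are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Count plot of the target classes (0 = not disaster, 1 = disaster)
sns.countplot(x="target", data=df)
plt.title("Class distribution")
plt.show()

# Count plot of the 20 most frequent keywords, split by target
top_keywords = df["keyword"].value_counts().head(20).index
sns.countplot(y="keyword", hue="target",
              data=df[df["keyword"].isin(top_keywords)])
plt.title("Top keywords by class")
plt.show()
```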
4. VISUALIZE WORD CLOUD FOR BOTH THE CLASSES
WordCloud is a way of visualizing the frequency of words in a corpus: more frequent words are displayed in a bigger font-size and less frequent ones in a smaller font-size, so the font-size gives a relative idea of word frequency in the given corpus. The code snippet for plotting the word cloud is as below:
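(A sketch using the wordcloud library; text_corpus stands for the preprocessed tweets of one class joined into a single string.)

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# text_corpus: all preprocessed tweets of one class joined into one string,
# e.g. " ".join(df[df["target"] == 0]["clean_text"])
wc = WordCloud(width=800, height=400,
               background_color="white").generate(text_corpus)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```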
The above code snippet plots the word cloud as below:
Similarly, the word cloud for disaster-related tweets was plotted as below:
From the above we can observe that, apart from some words like (new, one, people & dont) which have a high frequency in both classes of tweets, the other high-frequency words differ; e.g. in disaster-related tweets, words like (fire, storm, police, attack, flood) are frequent, while they are much less frequent in non-disaster tweets.
Before plotting the word clouds the tweet texts were preprocessed, so let’s take a look at the text preprocessing procedure followed.
TEXT PREPROCESSING
Steps followed for text preprocessing:
- Remove HTML tags. (Using the Beautiful Soup library, as shown in the utility function below)
- Remove URLs (Using REGEX)
- Remove @ mentions, i.e. the @ symbol along with the text following it (Using REGEX)
- Expand contracted words (Using REGEX)
- Remove punctuations , : ; “ ‘ . ? (Using REGEX)
- Remove special characters { } ( ) [] < > = + — _ # $ % ^ * / | \ & \n \t \r (Using REGEX)
- Remove all digits (Using REGEX)
- Convert entire text to lower case.
- Remove stop words (Using nltk library): Stop words are frequently occurring words in sentences like “the”, “is”, “and” etc. These words do not contribute much towards determining a context of a given sentence using ML or DL models, which is why they are removed before feeding the text to ML/DL models.
- Perform lemmatization (Using nltk library): Lemmatization is the process of converting words to their root form, called the lemma. It is similar to stemming, with the difference that the root words produced by stemming (called stems) may not be dictionary words, whereas lemmas always are. E.g. “running” is lemmatized to “run” and “sleeping” to “sleep”.
The utility function used for text preprocessing, which contains the entire preprocessing pipeline, is as below:
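A sketch of such a utility covering the steps above (the regex patterns and contraction handling are illustrative and may differ from the original; it requires nltk.download('stopwords') and nltk.download('wordnet')):

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    text = BeautifulSoup(text, "html.parser").get_text()      # remove HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)              # remove URLs
    text = re.sub(r"@\S+", " ", text)                          # remove @ mentions
    text = re.sub(r"won't", "will not", text)                  # expand a few contractions
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)                   # drop punctuation, digits, special chars
    text = text.lower()                                        # lower-case
    words = [w for w in text.split() if w not in stop_words]   # remove stop words
    words = [lemmatizer.lemmatize(w) for w in words]           # lemmatize
    return " ".join(words)
```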
TRAINING VARIOUS MODELS TO SELECT THE BEST ONE
1. VARIOUS ML MODELS WITH TF-IDF VECTORIZED TEXT
The first approach is to convert the text to vectors using TF-IDF vectorization and then train Naive-Bayes, Decision Tree and Random Forest models with these vector inputs.
TF-IDF stands for Term Frequency-Inverse Document Frequency, a method of weighting words in a document that gives us the importance of each word in the document and in the collection of documents (also called the corpus).
Term Frequency (TF) is the number of times a word occurs in a document, often divided by the total number of words in the document.
Inverse Document Frequency (IDF) is the logarithm of the total number of documents divided by the number of documents containing the word, i.e. IDF(w) = log(N / n_w). It offsets the weight of high term-frequency words like “the”, “for”, “and” etc. which occur frequently but are not important for information retrieval.
We will use Scikit-Learn’s TfidfVectorizer to convert our texts into vectors for feeding into the various ML models.
While training the TF-IDF vectorizer we can choose to look at single words in the documents (unigrams) or at groups of words (n-grams), where n is the number of words considered together.
Two approaches are considered here: one with unigrams only, and one with both unigrams and bigrams (n-grams with n = 2).
The entire dataset is split into training set and cross-validation set. The TF-IDF vectorizer is trained on training set only.
We convert both the “text” and “keyword” features into vectors using the TFIDF vectorizer.
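A sketch of this step (the clean_text column name and the split ratio are assumptions; switching ngram_range to (1, 2) gives the unigram + bigram variant discussed later):

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split into training and cross-validation sets
X_train, X_cv, y_train, y_cv = train_test_split(
    df[["clean_text", "keyword"]], df["target"],
    test_size=0.2, stratify=df["target"], random_state=42)

# Fit the vectorizers on the training set only
text_vec = TfidfVectorizer(ngram_range=(1, 1))   # unigrams only
kw_vec = TfidfVectorizer()

X_train_text = text_vec.fit_transform(X_train["clean_text"])
X_cv_text = text_vec.transform(X_cv["clean_text"])

X_train_kw = kw_vec.fit_transform(X_train["keyword"].fillna(""))
X_cv_kw = kw_vec.transform(X_cv["keyword"].fillna(""))

# Stack the "text" and "keyword" features into a single sparse matrix
X_train_tfidf = hstack([X_train_text, X_train_kw])
X_cv_tfidf = hstack([X_cv_text, X_cv_kw])
```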
Now the vectors generated above from the texts are fed to a Multinomial Naive-Bayes model for training. The Naive-Bayes model predicts classes using the rule y-hat = argmax_y [ P(y) × Π_i P(xi|y) ].
Here P(y) is the relative frequency of class y in the dataset and P(xi|y) is the probability of word xi in the text given that it belongs to class y. Whichever class y gives the highest value of this product is selected as the predicted class for the given text. The code for MNB classification is as below:
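A sketch of the MNB training and evaluation on the vectors built above (the default alpha is used here; the tuned value may differ):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, accuracy_score

mnb = MultinomialNB()
mnb.fit(X_train_tfidf, y_train)

# Evaluate on the cross-validation set
cv_proba = mnb.predict_proba(X_cv_tfidf)[:, 1]
cv_pred = mnb.predict(X_cv_tfidf)
print("ROC-AUC :", roc_auc_score(y_cv, cv_proba))
print("Accuracy:", accuracy_score(y_cv, cv_pred))
```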
Decision Tree and Random Forest classifiers are also trained using the TF-IDF vectors.
A decision tree model uses a series of nested if-else conditions to reach a final decision on which class to assign to a given query, whereas a Random Forest model trains many decision trees, each on a random subset of the data, and then uses a majority vote to assign the class. For a detailed understanding of how Decision Tree models work, refer to my blog: https://kunalbaidya.medium.com/decision-trees-in-machine-learning-46a7cc730b59
Code snippet for training the above mentioned models is as below:
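(A sketch of training both tree-based models on the same TF-IDF features; the hyperparameters shown are illustrative, not the tuned values.)

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=20, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    proba = model.predict_proba(X_cv_tfidf)[:, 1]
    pred = model.predict(X_cv_tfidf)
    print(name, "ROC-AUC:", roc_auc_score(y_cv, proba),
          "Accuracy:", accuracy_score(y_cv, pred))
```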
Just as above, we again train the same 3 types of classifier models, but this time with TF-IDF vectorizers using both unigrams and bigrams.
With bigrams also involved, the TF-IDF vectorizer considers both individual words and pairs of words while converting a given text into a vector, and hence the number of features increases to 110001, compared to 43142 in the unigram-only case.
2. VARIOUS DEEP LEARNING MODELS FOR CLASSIFICATION
(A.) USING WORD EMBEDDING WITH CNN FOR TEXT CLASSIFICATION
In the first approach, the GloVe embeddings of the words in our data vocabulary are used to initialize the Embedding layer weights. While training the CNN 1D network the Embedding layer is set as non-trainable, so the GloVe embeddings are used as-is.
In the second approach we let the Embedding layer learn its weights during training, so the layer is set as trainable.
In both approaches the architecture after the Embedding layer is the same:
2 blocks of Conv1D layers, each block having 3 Conv1D layers with 32 units whose outputs are concatenated and then passed through a max-pooling layer.
These two blocks are followed by 1 Conv1D layer → 1 Flatten layer → 1 Dropout layer → 1 Dense layer → 1 output layer with sigmoid activation for binary classification.
Adam optimizer is used for optimization and binary_crossentropy as the loss to be minimized.
For monitoring model performance ROC-AUC score and accuracy score metrics are used.
We select some prominent words from the WordCloud plotted before and check whether they are present in the GloVe corpus or not.
As observed, out of the 27 prominent words selected from the word cloud, only 4 are not present in the GloVe set, so it is reasonable to try GloVe embeddings for training the text classification model.
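A small sketch of this check, assuming the GloVe vectors have already been parsed into a dictionary glove_dict mapping each word to its 300-dimensional vector, and using only a subset of the prominent words for illustration:

```python
# A few prominent words taken from the word clouds above
prominent_words = ["fire", "storm", "police", "attack", "flood",
                   "new", "people", "one", "dont"]

missing = [w for w in prominent_words if w not in glove_dict]
print(f"{len(missing)} of {len(prominent_words)} words are missing from GloVe:", missing)
```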
For our ML models we converted the textual data to vectors using the TF-IDF vectorizer. For the CNN model we will convert the textual data to integer sequences using the Keras tokenizer, which creates a vocabulary of words from the given dataset and assigns a unique number to each word.
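A sketch of the tokenization and padding step (variable names such as max_length match the description below; the split follows the earlier train/cross-validation split):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Build the vocabulary from the training texts and map each word to an integer id
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train["clean_text"])

train_seq = tokenizer.texts_to_sequences(X_train["clean_text"])
cv_seq = tokenizer.texts_to_sequences(X_cv["clean_text"])

# Pad all sequences with zeros at the end, up to the longest training sequence
max_length = max(len(s) for s in train_seq)
train_pad = pad_sequences(train_seq, maxlen=max_length, padding="post")
cv_pad = pad_sequences(cv_seq, maxlen=max_length, padding="post")

vocab_size = len(tokenizer.word_index) + 1
```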
As seen in the above code snippet, the text sequences have also been padded, i.e. for sequences shorter than the longest sequence, zeros are appended so that their length equals max_length, the length of the longest sequence/text in the dataset.
The neural_network model summary which is trained with the above mentioned data is as follows:
The code-snippet for creating the above mentioned model architecture is given below:
In the Embedding layer of the architecture we pass a matrix emb_mat as the weight initializer. Each row of emb_mat is the 300-dimensional embedding vector, fetched from the GloVe set, of a word in the vocabulary created by the tokenizer.
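A sketch of the model-building code following the architecture described above (the kernel sizes, dropout rate and dense width are assumptions; vocab_size, max_length and emb_mat come from the earlier steps):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_cnn_model(vocab_size, max_length, emb_mat, trainable_embedding=False):
    inp = layers.Input(shape=(max_length,))
    # Embedding layer initialised with the 300-d GloVe matrix emb_mat;
    # trainable_embedding=False is the first approach, True the second
    x = layers.Embedding(input_dim=vocab_size, output_dim=300,
                         weights=[emb_mat],
                         trainable=trainable_embedding)(inp)
    # Two blocks: 3 parallel Conv1D layers with 32 units, concatenated,
    # then a max-pooling layer
    for _ in range(2):
        convs = [layers.Conv1D(32, k, padding="same", activation="relu")(x)
                 for k in (2, 3, 4)]
        x = layers.Concatenate()(convs)
        x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Conv1D(32, 3, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="roc_auc"), "accuracy"])
    return model
```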
In the second approach we set the Embedding layer as trainable and let the model learn the embedding along with training of entire network.
(B.) USING BI-DIRECTIONAL LSTM
In a Bi-directional LSTM there are two LSTM layers side by side: the input sequence is fed as-is to the first layer, and a reversed copy of the input sequence is fed to the second. This gives each word context from the words that come both before and after it, which often results in faster learning and a richer representation of the sequence.
The model architecture used is as below:
The code for building this architecture is as follows:
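(A minimal sketch of such a model, with a GloVe-initialised embedding; the unit counts are illustrative and may differ from the architecture shown above.)

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_bilstm_model(vocab_size, max_length, emb_mat):
    inp = layers.Input(shape=(max_length,))
    x = layers.Embedding(vocab_size, 300, weights=[emb_mat], trainable=False)(inp)
    # The Bidirectional wrapper reads the sequence forwards and backwards
    x = layers.Bidirectional(layers.LSTM(64))(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="roc_auc"), "accuracy"])
    return model
```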
(C.) COMBINING BOTH CNN AND BI-DIRECTIONAL MODELS
Both the CNN and Bi-directional LSTM models are combined to try to achieve an even better performance score.
A similar existing approach, described in the paper below, achieves above 0.9 accuracy in sentiment analysis:
“Bi-LSTM Model to Increase Accuracy in Text Classification: Combining Word2vec CNN and Attention Mechanism” by Beakcheol Jang, Myeonghwi Kim, Gaspard Harerimana, Sang-ug Kang and Jong Wook Kim, Department of Computer Science, Sangmyung University, Seoul, Korea.
The authors have used word2vec for word embedding whereas for this case GloVe is used for word embedding.
The main idea of this approach is that a 1D CNN is used for feature extraction: it processes the text as a one-dimensional image and captures the latent associations between neighbouring words, while an LSTM is used to extract contextual information from the features obtained from the 1D CNN layers.
The authors have also used attention mechanism along with LSTM, which assigns weights to input components that are highly correlated with classification.
For the attention mechanism, both dot scoring and Bahdanau scoring are used so that their performance can be compared.
The attention mechanism measures the association of each word in a sentence with the other words. The association is expressed as a score: the higher the score, the stronger the association between the words. The scoring can be performed in various ways; the two methods used here are dot scoring and Bahdanau scoring.
The code snippet for creating custom layer for attention is given below:
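(A sketch of a custom Keras layer implementing the dot-scoring variant; the original layer may differ in details such as masking and the Bahdanau scoring branch.)

```python
import tensorflow as tf
from tensorflow.keras import layers

class DotAttention(layers.Layer):
    """Dot-score attention over the timestep outputs of an LSTM/CNN block."""

    def build(self, input_shape):
        # A trainable context vector used to score each timestep
        self.context = self.add_weight(name="context",
                                       shape=(input_shape[-1], 1),
                                       initializer="glorot_uniform",
                                       trainable=True)
        super().build(input_shape)

    def call(self, hidden_states):
        # hidden_states: (batch, timesteps, units)
        scores = tf.matmul(hidden_states, self.context)        # (batch, timesteps, 1)
        weights = tf.nn.softmax(scores, axis=1)                 # attention weights
        # Weighted sum of hidden states -> one context vector per example
        return tf.reduce_sum(weights * hidden_states, axis=1)   # (batch, units)
```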
The architecture for CNN+Bi-directional model is as follows :
(D.) USING PRE-TRAINED BERT MODEL
In this approach the pre-trained BERT model is used to fetch the encoding of the text; the encoded text is then passed through two dense layers of 768 units each and finally through a dense layer with sigmoid activation for binary classification.
The BERT model used is bert_en_uncased_L-12_H-768_A-12/2, which is available via the tensorflow_hub library. This pretrained model has 12 encoder layers, 12 attention heads and 768 hidden units.
BERT stands for Bidirectional Encoder Representations from Transformers. Each encoder in BERT is a transformer block.
The BERT model requires 3 inputs:
i) token ids: the words converted to tokens using the BERT tokenizer (unlike the Keras tokenizer used previously).
ii) input mask: arrays with value “0” for [PAD] tokens and “1” for the rest. [PAD] tokens are used to pad texts whose length is less than the maximum text length in the dataset.
iii) segment ids: if only one sentence is given for classification, the whole segment vector is 0. If two sentences separated by a [SEP] token are given, the segment vector of the first sequence is all 0s, that of the second sequence is all 1s, and so on. Since tweets are short, we preprocess them into a single sentence by removing all punctuation, so in this case the segment id is an array of 0s with length equal to the maximum sequence length.
The utility function for preparing these inputs is given below:
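(A sketch of such a utility, assuming tokenizer is the BERT FullTokenizer built from the TF Hub module's vocabulary file and max_len is the chosen maximum sequence length.)

```python
import numpy as np

def bert_encode(texts, tokenizer, max_len):
    all_ids, all_masks, all_segments = [], [], []
    for text in texts:
        tokens = tokenizer.tokenize(text)[: max_len - 2]
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
        pad_len = max_len - len(tokens)
        ids = tokenizer.convert_tokens_to_ids(tokens) + [0] * pad_len
        mask = [1] * len(tokens) + [0] * pad_len   # 1 for real tokens, 0 for [PAD]
        segments = [0] * max_len                   # single sentence -> all zeros
        all_ids.append(ids)
        all_masks.append(mask)
        all_segments.append(segments)
    return np.array(all_ids), np.array(all_masks), np.array(all_segments)
```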
The code snippet for creating the network to extract features from text using pretrained BERT model is as below:
Final model architecture for prediction of tweet type using the extracted features from BERT pretrained model as input is as below:
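A sketch combining the BERT feature extractor and the classification head described above into one Keras model (the relu activation on the dense layers and the frozen BERT layer are assumptions; the hub URL corresponds to the module named earlier):

```python
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers, Model

BERT_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2"

def build_bert_classifier(max_len):
    # The three BERT inputs described above
    input_word_ids = layers.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = layers.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = layers.Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    # Pre-trained BERT used as a frozen feature extractor; pooled_output is
    # the 768-d encoding of the whole text
    bert_layer = hub.KerasLayer(BERT_URL, trainable=False)
    pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

    # Classification head: two 768-unit dense layers and a sigmoid output
    x = layers.Dense(768, activation="relu")(pooled_output)
    x = layers.Dense(768, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)

    model = Model([input_word_ids, input_mask, segment_ids], out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="roc_auc"), "accuracy"])
    return model
```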
The performance of all the models mentioned above on the validation data set is tabulated below, in descending order of ROC-AUC score:
From the table we select the CNN + Bi-directional LSTM model with dot-scoring attention, which has the highest ROC-AUC score of 0.854, as our best model for deployment. The highest accuracy score, 0.799, is achieved by the pretrained BERT model.
The app for predicting the tweet type is built using Streamlit; below is a video demo of how the app works:
Here is the link to the app, which readers can play with to predict whether a tweet is related to a disaster or not:
https://share.streamlit.io/kunal51290/cs2_app_deployment/main/cs2_app.py
The app deployment Python file, the saved best model, the requirements.txt file, the Procfile and the setup file are available in my GitHub repo. Link: https://github.com/kunal51290/CS2_APP_DEPLOYMENT
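For readers curious about how such a Streamlit app is roughly wired up, here is a minimal sketch with hypothetical file and helper names (the actual cs2_app.py in the repo is the reference; preprocess_text and encode_for_model stand for the preprocessing and text-to-model-input steps described earlier):

```python
import streamlit as st
from tensorflow.keras.models import load_model

st.title("Disaster Tweet Classifier")

# Hypothetical file name; custom_objects may be needed if the saved model
# contains the custom attention layer
model = load_model("best_model.h5")

tweet = st.text_area("Enter a tweet to classify")

if st.button("Predict"):
    clean = preprocess_text(tweet)        # same preprocessing pipeline as training
    features = encode_for_model(clean)    # hypothetical helper: text -> model input
    prob = float(model.predict(features)[0][0])
    if prob >= 0.5:
        st.success(f"Related to a real disaster (probability {prob:.2f})")
    else:
        st.info(f"Not related to a real disaster (probability {prob:.2f})")
```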
My linkedin profile : https://www.linkedin.com/in/kunal-dipakranjan-baidya-43b32b196/
References:
- Applied AI course by Applied Roots.
- Real or Not? NLP with Disaster Tweets, Kristopher Flint and Zachary Kellerman: https://publish.tntech.edu/index.php/PSRCI/article/view/686/225
- Natural Language Processing with Disaster Tweets using Bidirectional LSTM, Aryan Karnati, Shashank Reddy Boyapally, Dr. Supreethi K.P: https://www.xajzkjdx.cn/gallery/1-may2021.pdf
- Using word embedding with CNN for text classification: https://machinelearningmastery.com/best-practices-document-classification-deeplearning/
- Bi-LSTM Model to Increase Accuracy in Text Classification: Combining Word2vec CNN and Attention Mechanism, Beakcheol Jang, Myeonghwi Kim, Gaspard Harerimana, Sang-ug Kang and Jong Wook Kim, Sangmyung University, Seoul, Korea.