State-of-the-art methods in healthcare text classification system: AI paradigm

Send correspondence to: Jasjit S. Suri, Advanced Knowledge Engineering Centre, Global Biomedical Technologies, Inc. Roseville, CA, USA, Tel: 916-749-5628, Fax: 916-749-4942, E-mail: jsuri@comcast.net

Front. Biosci. (Landmark Ed) 2020, 25(4), 646–672; https://doi.org/10.2741/4826

Published: 1 January 2020

Abstract

Machine learning has shown its importance in delivering healthcare solutions and in revolutionizing the filtering of huge amounts of textual content. Machine intelligence can exploit semantic relations among texts to infer finer contextual information, and a language processing system can use this information for better decision support and quality-of-life care. Further, a learnt model can efficiently organize written healthcare information into knowledgeable patterns. Word–document and document–document linkages can help in gaining better contextual information. We analyzed 124 research articles in the text and healthcare domains related to the ML paradigm and showed the mechanism of intelligence used to capture hidden insights from document representations in which only a term or word is used to explain the phenomenon. In most of this research, document–word relations are identified while relations with other documents are ignored. This paper emphasizes text representations and their linkage with ML, DL, and RL approaches, which is an important marker for intelligence segregation. Furthermore, we highlight the advantages of ML and DL methods as powerful tools for automatic text classification tasks.

Keywords

Text classification

Documents

Corpus

Social Media

Input Text Characterization

Artificial Intelligence

2. INTRODUCTION

Machine learning (ML) offers intelligence that helps in filtering text into categories. The process is known as text classification or categorization (TC) (1), in which text documents are automatically assigned to predefined categories. Technology is changing with the emergence of Web 4.0, and social networks such as Google+, Facebook, and blogs have changed human life. Thus, the learning paradigm has drawn everybody’s attention. Deep learning (DL) is profoundly impacting our lives and driving industrial evolution in global businesses. Within a span of a few years, advances in applications such as autonomous driving, robots performing jobs, real estate, online advertising, photo tagging, speech recognition, machine translation, and chat bots have proven the effectiveness of DL approaches. DL approaches in text-based healthcare systems have shown potential to automate classification processes and evolve new error-free paradigms. Further, such learning paradigms can help healthcare surveillance, where healthcare-related key words can play an important role in spreading awareness. The paradigm can help patients become aware of diseases, procedures, and cures. Practitioners, in turn, can understand the symptomatic behavior of infectious diseases and their propagation, and patients’ feedback can help them improve the quality of care services. It is therefore imperative for text miners and radiologists to learn about DL and how it differs from other approaches of artificial intelligence (AI). The next generation of radiology or healthcare text mining will see a significant role for DL, which will likely serve as the basis for augmented radiology (AR) and healthcare surveillance. Better clinical judgment based on text will help in improving quality of life and in life-saving decisions while lowering healthcare costs. This review presents a comprehensive overview of DL and its implications for healthcare.

The human brain recognizes a particular object by forming a representational network of neurons from the visual cortex and the auditory cortex. This is known as a holistic process arranged in a hierarchical manner. The human brain, represented in Figure 1, consists of neuron layers arranged in a hierarchical fashion. Neurons are the basic computational units that operate on input data. Computation proceeds from lower-layer neurons to higher-layer neurons through a representational network. Basically, neurons are associated with five layers: primary visual cortex (V1), secondary visual cortex (V2), inferotemporal cortex (IT), posterior and IT-anterior layers.

Figure 1

A neural network representation of the human brain (image courtesy of Atheropoint).

3. BIOLOGICAL NEURON MODEL

A biological neuron, or nerve cell, also known as a “brain cell”, is an electrically excitable cell that uses electro-chemical signaling to process and transmit information. The brain processes information, including encoding and retrieval, using chemicals and electricity. Three components are responsible for creating the representational network: dendrites, which receive input signals from other neurons; the soma (the bulbous cell body containing the cell nucleus), which is the processing unit of the neuron; and the axon, which transmits signals to other neurons. The artificial neuron borrows its architecture from the biological neuron. Like the biological neuron, the artificial neuron has three units: input, processing unit, and output. The artificial neuron uses a combination of a summing unit and an activation function to produce its output.
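The summing unit and activation described above can be sketched in a few lines. This is a minimal illustration only: the sigmoid activation, the function name `artificial_neuron`, and the bias term are assumptions made for the example and are not specified in the text.

```python
import math


def artificial_neuron(inputs, weights, bias):
    """Weighted sum of the inputs (summing unit) passed through a sigmoid."""
    # Summing unit: combine input signals, analogous to dendrites feeding the soma.
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Activation: squash the sum into (0, 1); the sigmoid is an assumed choice.
    return 1.0 / (1.0 + math.exp(-total))


# Example: two input signals with equal (hypothetical) weights.
output = artificial_neuron([1.0, 1.0], [2.0, 2.0], bias=0.0)
```

Other activation functions could be substituted without changing the structure of the unit.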

In a text processing system, each term can act as a feature, and the best set of features can be used in a neural network model to prepare a representational network for text-based classification tasks. Further, the model can be extended to deep learning based algorithms, which are able to identify optimized feature sets for the best representational network. If we map text onto the biological neuron model, the dictionary terms are the inputs, the contextual or semantic linking of features is the processing unit, and the predicted classes are the output. Figure 2 shows the linking of biological neurons with the artificial neural network model in a textual context.

Figure 2

Biological neuron model and its semantic representation with ANN (permission pending).

Text-based learning has laid the foundation of intelligence, which requires a kind of text representation that utilizes the capabilities of information retrieval (IR) and ML. Web and mobile technologies provide people with comfort and a level of intelligence when they require suggestions regarding products and services (2), election prediction (Zolghadr, Niaki, & Niaki, 2018), ham-spam email detection (3), movie categorization (4), social media information filtering (5), and healthcare information filtering (6), (7). Traditional IR from text data requires a deeper intelligence in text classification and clustering.

Such IR-based systems rely on key word-based techniques. Key word-based IR models give inaccurate results and show poor intelligence in feature selection (FS). Therefore, research in this area has targeted ontology-based and computational knowledge modeling (8) to improve the classification task. The foremost requirement in this domain is to create a generic performance evaluation model which can easily identify the effectiveness of the classification task. Generic performance evaluation modeling is presented by Srivastava (9) for text classification. In general, a text classification process follows the generic steps shown in Figure 3. The text classification system builds the model based on the features of text documents. First, the text documents are divided into training and testing sets using cross-validation protocols such as K2, K4, K5, K10, Jack Knife, and so on. These cross-validation protocols can be used to show the effectiveness of the prepared model, which gains a higher generalization ability as the size of the training data increases. The identified training features along with the corresponding ground truths prepare a learnt model (predictive model) for generalization over unlabeled documents.

Figure 3

Architecture of machine learning model.
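The division of documents into training and testing sets under a cross-validation protocol can be sketched as follows. This is an illustrative sketch, assuming that a protocol such as K5 denotes a five-fold split and that Jack Knife corresponds to one fold per document; the helper name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(n_docs, k):
    """Yield (train_indices, test_indices) pairs for a k-fold protocol."""
    indices = list(range(n_docs))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_docs // k + (1 if i < n_docs % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Assumed reading of K5: each fold holds out one fifth of the documents.
for train, test in k_fold_splits(n_docs=10, k=5):
    print(len(train), "training documents,", len(test), "test documents")
```

Under this reading, a larger k leaves more documents for training in each round, which matches the remark that generalization improves as the training set grows.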

Nowadays, deep learning (DL) approaches are popular and dominate over ML techniques. These techniques are able to characterize input text efficiently and give better performance. Comparative DL- and ML-based research in text classification still requires a systematic overview to understand where it stands in this domain. In this review, we aim to clearly visualize the aspects of DL- and ML-based techniques that are used for the characterization of input text. We also discuss how feature selection (FS) techniques help in managing text-related representations that directly link the classification task with performance. To the best of our knowledge, there are no similar studies which show a clear depiction of the work done in text classification using ML and DL algorithms. Most studies are based only on classification approaches or feature-related aspects, while no work has tried to illustrate the idea of input representation and its linkage with ML and DL approaches. This study will help the community gain a holistic knowledge of text classification. It presents detailed data-related representations and their pros and cons with respect to ML or DL performance. ML approaches intelligently utilize information representations in terms of feature sets. These features are identified by feature extraction (FE) algorithms. Further, FS algorithms are used in the model preparation phase.

In contrast to ML approaches, the DL model helps to identify correct categories of unlabeled datasets by providing holistic modeling of an artificial neural network (ANN). The authors of (10) showed that the deep neural network based long short term memory (NN-LSTM) approach is effective in textual feature representation. They used the Word2vec method to convert all Wikipedia article (dataset) terms into feature vectors and showed that the LSTM network outperforms the simple Bag of Words (BOW) model. The Word2vec method is a representation of words in vector space (11). In text classification, one important challenge is dealing with high-dimensional feature representations, which degrade the learning performance of the classifiers during the training phase. SVM (12), Naive Bayes (NB) (13), and k-NN (14), (15) classifiers are used frequently to learn such patterns from the datasets. Several different representation schemes are proposed in (16) that construct feature vectors in a weighted (frequency) form of concrete words or groupings of words such as bigrams, n-grams (17), and phrases.

The ML paradigm is presented in Figure 4. The classification algorithms prepare learning coefficients based on offline available data, which are further utilized for online classification. The classification task follows the simple steps of ML modeling, where training features contribute to improving the learning coefficients of the model and thereby improve the model for better generalization. Here, improvement of the performance of the classification task is governed mostly by first extracting the appropriate features; then, FS helps in the model learning phase. The conventional BOW model is a filtering approach where key words are used as training features. In general, preprocessing tools such as segmentation, tokenization, part-of-speech (PoS) tagging, entity detection, and relation detection (18) are commonly used in natural language processing (NLP). The number of distinct words and entities in a corpus is very large, so such representations require dimensionality reduction. Methods such as TF-IDF, LDA, SVD, PCA, t-SNE, and so on (11), (20) are used to consider only important words for classifier generalization.

Figure 4

Machine learning paradigm for text classification task.

4. DEEP LEARNING

A convolutional neural network (CNN) is a DL model inspired by the working principle of the animal visual cortex. It is a feedforward neural network in which multilayer perceptrons are arranged in such a way as to require minimal preprocessing. Yoon Kim (19) mentioned that CNN can help in NLP tasks by identifying sequences of patterns with sizes of two, three, or five words. CNN can easily identify n-gram patterns such as “very hot” or “I hate” regardless of their positions in the sentence. Studies using DL approaches show that they have the power to improve results in computer vision (20). In the language processing domain, studies using DL mostly try to learn word vector representations via natural language models (21)–(23) and further use these learned vectors for classification (24).
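The position independence of n-gram detection can be illustrated with a single convolutional filter followed by max-over-time pooling. The sketch below is a toy under stated assumptions: the two-dimensional embeddings, the filter weights, and the function name `conv_max_pool` are hypothetical, and the bias and nonlinearity of a real CNN layer are omitted.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def conv_max_pool(sentence, embeddings, kernel):
    """Slide a filter of width len(kernel) over the sentence, then max-pool."""
    vectors = [embeddings[word] for word in sentence]
    width = len(kernel)
    # One activation per window of `width` consecutive words.
    activations = [
        sum(dot(kernel[j], vectors[i + j]) for j in range(width))
        for i in range(len(vectors) - width + 1)
    ]
    # Max-over-time pooling keeps the strongest response, wherever it occurred.
    return max(activations)


# Hypothetical embeddings: only "very" and "hot" carry signal.
EMBEDDINGS = {
    "very": [1.0, 0.0], "hot": [0.0, 1.0], "the": [0.0, 0.0],
    "soup": [0.0, 0.0], "is": [0.0, 0.0], "today": [0.0, 0.0],
}
# Hypothetical bigram filter that responds to "very" followed by "hot".
KERNEL = [[1.0, 0.0], [0.0, 1.0]]

end_of_sentence = conv_max_pool(["the", "soup", "is", "very", "hot"], EMBEDDINGS, KERNEL)
start_of_sentence = conv_max_pool(["very", "hot", "soup", "today"], EMBEDDINGS, KERNEL)
```

Both sentences produce the same pooled response because the filter fires on the bigram itself; pooling discards where in the sentence it occurred.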

LSTM (25) is a technique which has offered high accuracy in NLP tasks. Gated Recurrent Units (GRUs) (26) are simpler versions of LSTM and play a key role in larger systems to form dynamic memory networks to address complex tasks such as Question-Answer systems (27), (28) and speech recognition (29). Examples of such systems are PoS tagging and sentiment analysis (30), which are achieved by bi-directional LSTM-CRF (31) and tree-LSTMs (32).

In DL-based modeling, effective word vectors are formed from a 1-of-V encoding scheme projected onto a lower-dimensional vector space via a hidden layer. Feature extractors encode semantically similar features (words) in dense representations. Euclidean and cosine distances are used to measure similarity in the low-dimensional vector space. CNN utilizes convolving filters for local features (33). Such DL modeling has shown excellent results in sentence modeling (34), query retrieval (35), and semantic parsing (23). DL modeling is shown in Figures 5 (a) and 5 (b), and the working details are shown in Figure 6.

Figure 5a-5b

Basic convolutional neural network (CNN).

Pictorial representation of CNN.

Figure 6

Architecture of CNN model.
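The projection of a 1-of-V encoding through a hidden layer, and the use of cosine and Euclidean measures on the resulting dense vectors, can be sketched as follows. The three-word vocabulary and the weight matrix are hypothetical values chosen for illustration; in practice the weights would be learned.

```python
import math


def one_hot(index, vocab_size):
    """1-of-V encoding: a vector of zeros with a single one."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def project(one_hot_vec, weights):
    """Multiply a 1-of-V vector by a V x d matrix, which selects one row."""
    dims = len(weights[0])
    return [
        sum(one_hot_vec[i] * weights[i][j] for i in range(len(one_hot_vec)))
        for j in range(dims)
    ]


def cosine_similarity(u, v):
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


VOCAB = ["fever", "pyrexia", "invoice"]            # hypothetical vocabulary
WEIGHTS = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]     # hypothetical hidden-layer weights

fever, pyrexia, invoice = (project(one_hot(i, len(VOCAB)), WEIGHTS) for i in range(3))
print(cosine_similarity(fever, pyrexia), cosine_similarity(fever, invoice))
```

With these assumed weights, the two near-synonyms lie close together in the dense space while the unrelated term lies far away under both measures.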

To address the high dimensionality features in text classification, a study (36) showed an aggregated feature fusion approach that offers reliable results. High dimensionality is an intrinsic text classification problem which harms the classifier generalization property (37). To improve the classification performance, research in this direction has shown that supervised (38) techniques are more efficient than unsupervised dimensionality reduction techniques (39). In general, popular dimensionality reduction methods are FS and FE (40). The FE technique utilizes all dimensions of feature space; further, a condensed set of features is used to create a new transformed feature space without eliminating any of the features. The FS technique mainly performs a search to identify a subset of features among the total features based on one or more quality measures (41). Wrapper and filter approaches are popular categories of FS approaches. Both FE and FS (42) are popular for classification tasks and are linked directly with classifier performance.

Ranking-based FS approaches (43), (44) are popular filter methods in which Best Individual Features (BIFs) are ranked from high to low according to a score such as information gain. The ranking methods have several disadvantages, such as ignoring the dependency between terms, ignoring the correlation between terms, and the risk of term redundancy. Popular ranking measures include information gain and chi-square (43), (45), (46). The fusion-based technique combines two individual lists of different features obtained from different feature-ranking functions (36).

Reference (47) proposed a feature fusion model for improved classification precision. The model uses two layers: the first, called the feature layer, handles text and image information through preprocessing and classification, and its outputs are fused in the higher (second) layer, called the fusion layer, to produce the final result.

5. REINFORCEMENT LEARNING

Reinforcement learning (RL) in text classification is a popular area in which agents act in an environment to maximize a notion of reward. The paper (48) described a framework for RL where an agent learns value functions from inputs to solve a classification task. The authors modeled the classification problem using Markov decision processes, and an extension of an RL algorithm (max-min actor-critic learning automaton, ACLA) is introduced to achieve the results. The RL method is combined with a multilayer perceptron (MLP) that serves as a function approximator. The RL method outperforms the conventional MLP approach and performs as well as SVM.

Another study showed a deep reinforcement learning (DRL) approach that enables classifiers to learn accurately from a small subset of data. DRL is a general framework for representation learning. A few examples of such representation learning include deep Q-learning (49), (50), deep visuomotor policies (51), attention with recurrent networks (52), and model predictive control with embedding (53). The study proposed by Zhang (47) showed how to learn a structured representation for text classification. The proposed RL method automatically learns optimized structured representations from sentences. Two types of representations are shown, namely hierarchically structured LSTM (HS-LSTM) and information distilled LSTM (ID-LSTM), both of which yield competitive performance. ID-LSTM selects task-relevant words while HS-LSTM discovers phrase structures in a sentence.

In this paper we have covered a wide range of text classification techniques including ML and DL methods. In ML we have mainly covered supervised learning and RL. A detailed explanation related to feature reduction is also mentioned in the paper. The types of feature reduction such as FE and FS are linked with the quality measures of the feature paradigm and show a direct link with the performance. Our review shows a direct linkage between ML and DL approaches which are currently popular in text classification research. The DL approaches are more powerful in dealing with irrelevant feature sets than ML.

6. FEATURE EXTRACTION AND SELECTION IN ML

Technological advances have given us open platforms such as Twitter, Facebook, and Google plus to share our views in the form of text and images (54). This large number of text and image documents is not meaningful until they are filtered into concrete categories. The filtering process helps to identify meaningful information in the data, and FS techniques aid this task. A feature has the capability to generalize unique characteristics of data. A set of similar and dissimilar features, when assembled together, forms the feature set. These feature sets are used in the field of ML and have shown promising results in pattern recognition as the volume of data increases (55). The terms “features” and “dimensions” of data are used interchangeably in the research. These dimensions must be reduced to make an effective ML model that can further help in classification tasks. In several research contexts, FS is referred to as variable selection (56), attribute selection (57), dimensionality reduction (58), or feature subset selection (59). FS is a commonly used pre-processing technique in ML (60). It is the process of selecting the most relevant and non-redundant features during the learning phase for the purpose of model construction (61).

Text contains high-dimensional features, which can be reduced to lower dimensions with the help of FS techniques. An FS algorithm consists of a search technique (62) for proposing new feature subsets and an evaluation technique for scoring the generated feature subsets (63). Performance depends on the number of generated features and on their computation during the learning phase of the model: as the number of features increases, so does the time required to process the data and evaluate the performance. The curse of dimensionality (COD) (64) degrades the performance of the model. By COD we mean that as data dimensionality grows, the data become increasingly sparse. Such a large dataset requires a simplified model to make it less complex and more interpretable (65). FS helps in this direction and makes an effective contribution. Meaningful FS allows the model to be trained in less time, which is possible only because of the lower dimensionality of the data. Data often contain irrelevant and redundant features, especially when the number of training samples is small. To overcome these problems, feature reduction techniques are required. There are two types of feature reduction, as shown in Figure 7.

Figure 7

Feature reduction types.

1. Feature selection (FS)

2. Feature extraction (FE)

6.1. Feature reduction

Mathematically, let F = {x₁, x₂, x₃ … x_n} be a given set of features. After the FS process, a new feature set Fʹ = {xʹ₁, xʹ₂, xʹ₃ … xʹ_m} is generated, which is a subset of the initial set F. If we have n features, then the number of possible subsets is 2ⁿ. For large n it is infeasible to enumerate every subset and check how well it performs, because the search relates to an NP-hard problem. The performance of the classifier increases to a certain extent as the number of features increases. Beyond that point, classifier performance saturates or starts to deteriorate as more features are added, as shown in Figure 8. Considering the number of training samples as fixed, we can conclude that the classifier’s performance will usually degrade with a large number of features.

Figure 8

Classifier performance with increasing number of features.

The evaluation metric strongly influences the FS algorithms. According to metrics, the FS algorithms are divided into three categories as shown in Figure 9.

Figure 9

Three important types of feature selection.

6.1.1. Filter methods

Ron Kohavi and George (66) classified the FS techniques into filter and wrapper methods. The filter method acts as a preprocessing step to select the features on the basis of rank. The highly ranked features are then passed on to the predictors (55). In the wrapper method, FS is done on the basis of the performance of the predictor, which is wrapped with a search algorithm to find the best possible feature subset. In the embedded method, FS acts as part of the training process. FS is performed without splitting the data into training and testing sets (67), (68). The relevant features are selected when the model is created.

In the filter method, the ranking method is used for variable selection. The term “ranking” refers to the numerical value. This is the simplest method. A rank is assigned to the features present in the dataset and a threshold is set for the dataset according to the most suitable ranking criteria. The features whose ranks are below the threshold are removed as they are considered to be irrelevant. The selection of features is independent of ML algorithms as the filter method is applied before classification. Various statistical tests are performed on the dataset and scores are measured. These scores play a very crucial role in the FS. It is quite challenging to determine the relevancy of the feature. The contribution provided by the researchers to the mentioned problem is discussed in (66), (67). The researchers discussed the fact that if a feature is independent of the class labels then it is regarded as an irrelevant feature. The diagram in Figure 10 represents the process used in the filter method.

Figure 10

Block diagram of filter-based feature selection.

A few popular filter-based methods are highlighted in research and discussed below:

1. Principal component analysis (PCA) (69): PCA is a data analysis technique that uses orthogonal transformation to convert the correlated data into uncorrelated data called principal components. It is used to find the direction of most variation in the dataset.

2. Information gain (IG) (70): IG measures the dependency between two variables as the reduction in uncertainty about one variable when the other is known.

3. Chi-square (71): A statistical test of independence between two variables.

4. Correlation based feature selection (CFS) (72), (73): CFS identifies the correlation between two variables.

5. Fisher score (74): This technique scores a feature by how well it separates the classes, comparing between-class variation with within-class variation.

6. ANOVA (74): A method of checking for significant differences between the means of two or more groups.

7. Linear discriminant analysis (LDA) (75): LDA identifies a linear combination of features that separates the classes.

8. Pearson’s correlation (76): This method identifies Pearson’s correlation between the terms of the documents.
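The ranking-and-threshold procedure of the filter method can be sketched with information gain as the scoring criterion. The example is a minimal sketch: the four-document spam/ham data, the binary term-presence features, and the threshold value are all hypothetical, and the helper names are introduced only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after grouping documents by a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def filter_select(features, labels, threshold):
    """Rank features by score and drop those scoring below the threshold."""
    scores = {name: information_gain(values, labels) for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked if scores[name] >= threshold]


# Hypothetical data: 1 means the term occurs in the document.
LABELS = ["spam", "spam", "ham", "ham"]
FEATURES = {"free": [1, 1, 0, 0], "the": [1, 0, 1, 0]}

selected = filter_select(FEATURES, LABELS, threshold=0.5)
```

The selection never consults a classifier, which reflects the point that filter methods are applied before, and independently of, the ML algorithm.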

6.1.2. Wrapper method

A subset of features is used to train the model. The performance of the previous model is taken into consideration when adding or removing features from the subset. This method is quite expensive because it requires a lot of computation. The method uses the predictor as a black box (xyz) and the performance of the predictor as the objective function. The diagram in Figure 11 represents the process used in the wrapper method.

Figure 11

Block diagram of wrapper-based feature selection method.

Some wrapper-based methods are discussed below:

1. Forward selection: In the initial phase of the forward selection method, the model has no features. A feature is added to the subset if it improves the performance of the model. Features continue to be added as long as the performance of the model improves. The algorithm stops adding features when saturation is reached or the performance decreases.

2. Backward selection: In the initial phase of the backward selection method, the model has all the features. Features are then removed one by one if their removal improves the performance of the model. The removal continues as long as the performance of the model improves.
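Forward selection can be sketched as a greedy loop around a black-box scoring function. In the sketch below, `toy_score` is a hypothetical stand-in for the validation performance of a trained predictor, and the per-feature gains are invented for illustration; in a real wrapper, each call to the scorer would train and evaluate a model.

```python
def forward_selection(all_features, score):
    """Greedy wrapper: repeatedly add the feature that most improves score(subset)."""
    selected = []
    best_score = score(selected)
    remaining = list(all_features)
    while remaining:
        # Evaluate the predictor (treated as a black box) with each candidate added.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best_score:
            break  # saturation or a decrease in performance: stop adding
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


def toy_score(subset):
    """Hypothetical performance: useful terms help, every other term hurts slightly."""
    gains = {"fever": 0.30, "cough": 0.15}
    return 0.5 + sum(gains.get(f, -0.05) for f in subset)


selected, final_score = forward_selection(["fever", "cough", "the", "of"], toy_score)
```

Backward selection follows the same pattern in reverse, starting from the full set and removing one feature at a time while the score improves.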

6.1.3. Embedded method

The embedded method is a combination of the filter and wrapper methods. This method is used by the algorithms possessing their own built-in FS methods. The diagram in Figure 12 represents the process used in the embedded method.

Figure 12

Block diagram of embedded-based feature selection method.

Some embedded methods are discussed below:

1. Lasso regression

2. Ridge regression

3. Decision tree
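As an illustration of built-in FS, the sketch below fits a Lasso regression by coordinate descent and shows the L1 penalty setting the coefficient of an uninformative feature exactly to zero. It is a minimal sketch under stated assumptions: the objective is taken to be (1/2n)·‖y − Xw‖² + α‖w‖₁, and the four-sample dataset, the penalty value, and the function names are hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within the penalty become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 one coefficient at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Hypothetical data: the target depends on the first feature only.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.0, 4.0, 6.0, 8.0]
weights = lasso_coordinate_descent(X, y, alpha=0.5)
```

Features whose coefficients end at zero are effectively deselected during training, without a separate search step.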

FS is an optimization problem. The process consists of two most common aspects (77): first, search techniques, where a search algorithm is used to generate the most relevant feature subsets used in robust model construction, and second, the application of an evaluator, an evaluation algorithm which decides the goodness of a feature subset. It returns the information about the correctness of the search method used. The block diagram in Figure 13 shows the steps followed in the FS technique.

Figure 13

The working of feature selection techniques.

Some other FS techniques are mentioned below.

1. Exhaustive algorithm: In exhaustive search, if a dataset contains n features then the number of feature subsets is 2ⁿ, and each feature subset is tested to find the most relevant feature set with the lowest error rate. This is feasible only if the number of features is low.

2. Best-first algorithm: In the best-first search algorithm, the nodes of a graph are explored according to a specified rule. This algorithm is often used in path finding. A* and B* are examples of best-first search algorithms.

3. Simulated annealing: Simulated annealing approximates the global optimum of a given function. It is used in discrete search spaces.

4. Genetic algorithm: A genetic algorithm (78) is a search-based optimization algorithm used to find the maximum or minimum of a function. It is based on the concept of natural selection.

5. Greedy forward selection (79)–(81): This is a computationally efficient algorithm that does not over-fit the data. Errors made in the early stages of the algorithm cannot be corrected later.

6. Greedy backward elimination: This is a computationally efficient algorithm that can solve the error by looking at the complete model. In this algorithm we need to start with the data that are not over-fitted.

7. Particle swarm optimization (82): This is an optimization algorithm that solves a problem by iteratively trying to improve a candidate solution. The algorithm makes no assumptions about the problem to be solved.

8. Targeted projection pursuit: This algorithm is used to explore a complex dataset to find the features of high interest.

9. Scatter search (83), (84): This is a meta-heuristic and optimization algorithm that uses an extrapolation and interpolation strategy instead of a randomized strategy to find the best solution.

10. Variable neighborhood search (85), (86): This is also a meta-heuristic optimization algorithm. It explores distant neighborhoods of the current solution and moves to a new solution only if an improvement is made.
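The cost of the exhaustive algorithm listed first can be made concrete with a short sketch that enumerates every feature subset and keeps the one with the lowest error. The error function `toy_error` and its per-feature effects are hypothetical; a real evaluator would train and test a classifier for each subset.

```python
from itertools import combinations


def exhaustive_search(all_features, error):
    """Test every subset (2**n of them) and return the one with the lowest error."""
    best_subset, best_error, tested = None, float("inf"), 0
    for size in range(len(all_features) + 1):
        for subset in combinations(all_features, size):
            tested += 1
            current = error(subset)
            if current < best_error:
                best_subset, best_error = subset, current
    return best_subset, best_error, tested


def toy_error(subset):
    """Hypothetical error rate: informative terms lower it, a noise term raises it."""
    return (0.5
            - 0.30 * ("fever" in subset)
            - 0.15 * ("cough" in subset)
            + 0.05 * ("the" in subset))


best_subset, best_error, tested = exhaustive_search(["fever", "cough", "the"], toy_error)
```

Because the number of evaluations doubles with every added feature, the heuristic searches in the list trade the guarantee of finding the best subset for tractable running time.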

6.2. Feature extraction

FE is an attribute reduction process. Unlike FS, which ranks the features based on various techniques, FE actually transforms the attributes. The transformed features are linear combinations of the original attributes. The number of features is governed by the FEAT_NUM_FEATURES build setting for FE models. The model built after the FE process is of high quality because the data has fewer and more meaningful features. In FE, a higher-dimensional feature set is projected onto a smaller number of dimensions. It is quite useful for data visualization, since complex data can be visualized more easily once its dimensions are reduced.

The FE process has several applications such as latent semantic analysis (LSA), data compression, data decomposition and projection, and pattern recognition. FE can be used to speed up and increase the effectiveness of learning. The following are popular FE methods.

1. Term frequency

2. Inverse document frequency

3. Term frequency-inverse document frequency (TF-IDF)

4. Bag of words (BOW)

5. Sentiment analysis

6. Word embedding
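The first four methods in the list above can be sketched together. The sketch assumes pre-tokenized documents and the unsmoothed IDF variant log(N / df); other weighting variants exist, and the three-document corpus is hypothetical.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """BOW: a multiset of term counts that ignores word order."""
    return Counter(tokens)


def term_frequency(tokens):
    """TF: count of each term divided by the document length."""
    counts = bag_of_words(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def inverse_document_frequency(corpus):
    """IDF (assumed variant): log of corpus size over the term's document frequency."""
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(len(corpus) / df) for term, df in doc_freq.items()}


def tf_idf(corpus):
    """TF-IDF weight of every term in every document."""
    idf = inverse_document_frequency(corpus)
    return [{t: tf * idf[t] for t, tf in term_frequency(doc).items()} for doc in corpus]


# Hypothetical tokenized corpus.
CORPUS = [
    ["fever", "and", "cough"],
    ["fever", "and", "rash"],
    ["invoice", "and", "payment"],
]
weights = tf_idf(CORPUS)
```

A term that occurs in every document, such as "and" here, receives zero weight, while rarer terms are weighted more heavily.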

FS has several applications in the analysis of gene microarray data (67), (87)–(90). The dataset contains features which are highly correlated with the target feature. This high correlation between the features leads to irrelevant features which must be reduced. By reducing the extra dependent features, improvements in the computation task and estimators’ accuracy are achieved. An FS criterion is required to know the relevancy of the features before removing the irrelevant features.

7. DISCUSSION

The works mentioned above provide evidence of the advantages of ML approaches and their exclusive role in text classification and further establish a link with their corresponding performances. The first thing that has been noticed is the transformation of the high dimensionality of features to a concrete set of features for accurate training using ML approaches. Most of the research in text classification deals with this aspect using FE and FS methods. Some of the studies are shown to have improved performance by combining feature sets using different filtering-based FS approaches, hypothesizing that fusion of feature sets might improve the classification performances.

Some of the studies illustrated the power of DL algorithms, in which important sequences of term patterns are automatically identified by convolutional and pooling layers. The technique is efficient for automatic text classification, which is an important area of information filtering due to the emergence of Web technologies. Nowadays, social media conversations provide an open platform where people discuss almost all issues related to their personal and professional lives. These platforms are transforming human lives by giving suggestions about the quality of products and services and providing a secondary mode of suggestion that improves the quality of life. The generated social texts help in recommending products, predicting election polls and personality, identifying the spam category of emails, summarizing text into appropriate topics, and many more. The current state of the proposed text review compared with the last six years is summarized in Table 1 (C₁ to C₁₃) and Table 2 (R₁ to R₇).

Table 1 Benchmarking table

C1	C2	C3	C4	C5	C6	C7	C8	C9	C10	C11	C12	C13
SN	[Ref. #] Year	Implemented technique	Feature type	Feature selection type	Dataset	Domain of analysis (ML, DL, RL)	Classifier type	Advantages/ disadvantages	Application	Social Media Data used	Performance Type	Validation used
1.	[3] 2012	SMS text size and graphics as feature time-dependent features	Word count, image, temporal	Frequency, graphics, and time	SMS	ML	SVM, k-NN	-	SMS-spam detection	-	PRE, REC	√
2.	[82] 2014	Binary PSO +mutation	Text data	Wrapper	SMS	ML	Decision tree	Optimizing accuracy	Ham-spam detection	-	Weighted cost	X
3.	[91] 2015	Alzheimer’s disease images	-	Filter methods	Alzheimer’s disease	ML	SVM-R	-	Image analysis	-	ACC	√
4.	[56] 2016	NLP, information fusion and fine grained level, cross domain, cross lingual	Word count, sentence count	Frequency	Social media sentiment dataset	ML and DL	NB, SVM, maximum entropy, CNN, LSTM, RNN	Advantages, disadvantages, comparison discussed.	Opinion summarization	Amazon, Twitter, Yelp	ACC, PRE, REC, F-Measure	X
5.	[53] 2017	Subjective information extraction	Phrase	N-Gram	Customer Reviews	ML	NB, SVM & DT	-	Social media sentiments	Amazon, Flipkart	ACC	√
6.	[54] 2017	Important words and semantics	Frequency and word connection	TF-IDF & SVD	News filtering	ML	Bernoulli Naive Bayes, SVM	-	Automatic Indonesian news classification	-	ACC, ROC	X
7.	[55] 2018	Word, phrase, line based segmentation	Phrase, line	Frequency	Text segmentation	ML	Naive Bayes	-	Text filtering	-	AUC	X
8.	[52] 2018	BOW	Word multi-set	Frequency	E-mail	ML	Tree, Bayes, SVM, k-NN	-	Spam-ham detection	-	ROC, ACC	X
9.	[57] 2018	NER, noise processing, NLP	Word count	Frequency	EHR filtering	ML and DL	SVM and CNN	HIS performance	Patients’ healthcare records	-	ACC, AUC, PRE	√
10.	[9] 2018	BOW model	Word frequency	BOW	SMS, Reuters (R8), disease, WebKB4, TwitterA	ML	SVM-L, MLP, AdaBoost, SGD, DT	-	-	Twitter	AUC, ACC, PPV, SEN, SPE	√
11.	Proposed Review 2019	BOW, noise, word embedding	Word, sentence frequency	Both feature extraction and selection	Facebook, Amazon, Twitter & Google+	ML	SVM, NB, DT, RF, NN	Text classification	Text characterization	Twitter, Facebook, Amazon, Google+	ROC, ACC, REC, PPV, SEN, SPE, AUC	√

Symbols: √: Validation inclusion; X: Validation non-inclusion

Table 2 Comparison of previous reviews with quality indicators

Attributes	[55] (2014)	[125] (2016)	[126] (2017)	[127] (2018)	[128] (2018)	[129] (2018)	[106] (2018)	[130] (2019)	Proposed Review
R1: Diversity of datasets	X	X	X	X	X	X	X	X	√
R2: DL approaches	X	X	X	X	X	X	X	X	√
R3: RL approaches	X	X	X	X	X	X	X	X	√
R4: FS using fusion approaches	X	X	X	X	X	X	X	X	√
R5: ML algorithms of mixing classifiers with FS	X	X	X	X	X	X	X	X	√
R6: ML algorithms with mixing FS with classifiers	X	X	X	X	X	X	X	X	√
R7: FS using LR	X	X	X	X	X	X	X	√	X

Abbreviations: DL: Deep learning; RL: Reinforcement learning; FS: Feature selection; ML: Machine learning; LR: Logistic regression. Symbols: √: Attribute inclusion; X: Attribute non-inclusion

The available platforms are opening a new era of text analysis in which ML/DL approaches can be used to turn the growing data into meaningful patterns. Such generated data contain more noise because people use natural-language, context-dependent terms during conversations. In such a scenario, DL approaches handle irrelevant features with less effort and can be used for automatic text classification directly, rather than first applying feature reduction techniques to transform all the features into an equivalent compact feature set for ML algorithms. Equivalent text representations are also important for dealing with a huge amount of text data. We have discussed a few reinforcement techniques which are based on distilled information and hierarchical structural patterns. These reinforcement techniques gain improved contextual information with the help of agents using functions. In other words, supervised ML and reinforcement approaches are each effective in different scenarios.

In summary, most of the research has referred to the Bayesian model (72), (92), SVM (93), (94), NN (95), boosting methods (96)–(98), the Rocchio algorithm (99), (100), and k-NN (14), (15), (101). It is interesting to note the improvement that DL methods (10), (96), (102)–(104) are reported to achieve over ML methods.

7.1. Critical analysis of features in text classification

Considering all the features in a classifier’s training makes the process complex (92). For example, it is very expensive to train an NB classifier using the complete feature set; in such cases, the FS process selects a subset of features (72) and improves the classification task. High-dimensional data must be reduced to low-dimensional data to avoid the curse of dimensionality and to build a better ML model. The FS process eliminates noise terms and increases the performance of the classification task. By a noise term (105), we mean a term that misleads the representation of the document and increases the generalization error. When trained with a noise term, the learning method misassigns categories to documents. Incorrect generalization of this kind, learned from the training data, is known as overfitting (106). FS can be viewed as a method of replacing a complex classifier (one using all features) with a simpler one (using a subset of the features); such simpler, weaker classifiers are often used in statistical text classification. In the case of Bernoulli NB, which is very sensitive to noise features (107), some form of FS is required to improve the classification task. In 1960, Maron and Kuhns (108) described one of the first NB text classifiers. Lewis (1998) (109), (110) focuses on the history of NB classification. Bernoulli and multinomial models and their accuracy for different collections are discussed by McCallum and Nigam (1998) (111).
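To make the idea of discarding noise terms concrete, the sketch below scores each term with the chi-square statistic of its 2x2 term/class contingency table and keeps the top-scoring terms. Chi-square is used here as one common filter criterion, not as the specific criterion of the cited works; the function names and the toy spam/ham documents are assumptions for this example.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class without the term
    n00: documents outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0  # term occurs in all or no documents: no information
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def top_terms(docs, labels, positive, k):
    """Return the k terms with the highest chi-square score for one class."""
    vocab = sorted({term for doc in docs for term in doc})
    scores = {}
    for term in vocab:
        n11 = n10 = n01 = n00 = 0
        for doc, label in zip(docs, labels):
            has_term = term in doc
            in_class = label == positive
            if has_term and in_class:
                n11 += 1
            elif has_term:
                n10 += 1
            elif in_class:
                n01 += 1
            else:
                n00 += 1
        scores[term] = chi_square(n11, n10, n01, n00)
    # sorted() is stable, so ties keep alphabetical order.
    return sorted(vocab, key=lambda t: scores[t], reverse=True)[:k]


# Toy corpus: each document is a set of tokens; "call" occurs everywhere.
docs = [
    {"free", "prize", "call"},
    {"free", "offer", "call"},
    {"free", "winner", "call"},
    {"meeting", "call", "tomorrow"},
    {"lunch", "call", "tomorrow"},
    {"meeting", "notes", "call"},
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
selected_terms = top_terms(docs, labels, positive="spam", k=3)
```

The term "call" appears in every document, receives a score of zero, and is dropped, whereas "free" separates the two classes perfectly and ranks first. A classifier such as NB would then be trained on the selected terms only.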

Kibriya (2004) (112) presented additional NB models. Domingos and Pazzani (1997) (113), Friedman (1997) (114), and Hand and Yu (2001) (115) analyze why NB performs well although its probability estimates are poor. The first of these papers also discusses NB's optimality when the independence assumptions hold for the data. Pavlov (2004) (116) proposed a modified document representation that partially addresses the inappropriateness of the independence assumptions. Bennett (2000) (117) attributes the tendency of NB probability estimates to be close to either 0 or 1 to the effect of document length. Ng and Jordan (2002) (118) show that NB is sometimes (although rarely) superior to discriminative methods because it reaches its optimal error rate more quickly. The basic NB model discussed here can be tuned for better effectiveness (119), (120). The problem of concept drift and other reasons why state-of-the-art classifiers do not always excel in practice are discussed by Forman (2006) (121) and Hand (2006) (122).

The limited number of labeled points in the training sample means that ML modeling is prone to overfitting (123) and poor generalization. A model overfits when it achieves a good fit on the training data but does not generalize well to unseen data. Preventing overfitting in ML is a challenging task. Cross-validation (124) is a powerful method of guarding against overfitting. In standard k-fold cross-validation, the data are partitioned into k subsets, generally called folds, and the algorithm is then trained iteratively on (k – 1) folds while the remaining fold (the holdout set) is used as the test set. Bagging, boosting, ensembling, regularization, removing features, and early stopping criteria are among the important techniques used to deal with overfitting in the ML framework. Meanwhile, the following factors are used to handle overfitting in the DL framework.
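The k-fold procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the folds are contiguous (in practice the data would normally be shuffled first), and `fit` and `score` are a hypothetical caller-supplied interface, demonstrated here with a trivial majority-class model.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(samples, labels, k, fit, score):
    """Average a score over k folds.

    fit(train_x, train_y) -> model and score(model, test_x, test_y) -> float
    are caller-supplied callables (a hypothetical interface for this sketch).
    """
    results = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        results.append(
            score(model, [samples[i] for i in test], [labels[i] for i in test])
        )
    return sum(results) / len(results)


# Toy usage: a majority-class "model" scored by accuracy.
def fit_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)


def accuracy(model, test_x, test_y):
    return sum(1 for y in test_y if y == model) / len(test_y)


toy_labels = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
folds = list(k_fold_indices(len(toy_labels), 3))
mean_accuracy = cross_validate(list(range(10)), toy_labels, 3, fit_majority, accuracy)
```

Each sample is used for testing exactly once, so the averaged score reflects performance on data the model did not see during training. The same procedure applies whether the model is an ML classifier or a deep network; the factors specific to the DL framework are listed next.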

7.2. Reduction in network capacity

7.2.1. Applying regularization

7.2.1.1. Drop layers

The above factors show that removing layers or reducing the number of units in the hidden layers, adding a cost to the loss function for large weights, and randomly dropping certain features by setting them to zero can protect a model from overfitting. Reducing the network capacity too much causes underfitting, and the model will not be able to learn relevant patterns from the training data. Ideally, we select a model which achieves a balance between underfitting and overfitting.
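Two of these factors, the weight cost and the random zeroing of features, can be illustrated with a short sketch. The L2 form of the penalty, the "inverted" rescaling in the dropout function, and all numeric values below are assumptions chosen for this example rather than settings taken from the reviewed studies.

```python
import random


def l2_penalized_loss(data_loss, weights, lam):
    """Add an L2 cost for large weights: data_loss + lam * sum(w^2)."""
    return data_loss + lam * sum(w * w for w in weights)


def dropout(activations, rate, rng, training=True):
    """Inverted dropout on a list of activations.

    During training each unit is set to zero with probability `rate`, and
    the surviving units are rescaled by 1 / (1 - rate) so that the expected
    activation is unchanged. At inference time the input passes through.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [0.5, 1.0, 1.5, 2.0]
dropped = dropout(hidden, 0.5, rng)                      # training-time output
unchanged = dropout(hidden, 0.5, rng, training=False)    # inference-time output
penalized = l2_penalized_loss(0.40, [0.5, -2.0, 1.0], lam=0.01)
```

Raising `lam` or `rate` strengthens the regularization; as noted above, pushing either too far moves the model from overfitting toward underfitting.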

8. CONCLUSIONS

This is a state-of-the-art review of text representations and their effect on classification performance. It is one of the first studies of its kind to show the role of ML/DL in assessing input text characterization using FE and FS approaches. The paper was organized around the ML approaches and their link with classification paradigms, and around how DL approaches are strengthening the classification task. Further, the study showed the role of feature reduction in the characterization of input text while adapting the ML and DL models for the text classification task. We also covered the RL paradigm for text representations. We conclude that the ML and DL methods are very powerful for the classification task. We anticipate that rapid growth of these tools can help in developing improved classification strategies for information filtering.

Abbreviations

Abbreviation	Expansion
RL	Reinforcement learning
ML	Machine learning
TC	Text classification or categorization
DL	Deep learning

References

Front. Biosci. (Landmark Ed) Print ISSN 2768-6701 Electronic ISSN 2768-6698