Comparison of text-image fusion models for high school diploma certificate classification

Scanned document files are commonly used in this digital era. Text and image extraction from scanned documents plays an important role in acquiring information. A document may contain both text and images. Combined text-image classification has been investigated previously, but the datasets used in those works provided the text digitally. In this research, we used a dataset of high school diploma certificates in which the text must be acquired using optical character recognition (OCR). The certificates were classified under two categories, each of which has three classes. We used convolutional neural networks for both text and image classification. We then combined the two models using an adaptive fusion model and a weight fusion model to find the best fusion model. We conclude that the weight fusion model, with an accuracy of 0.927, performs better than the adaptive fusion model, with an accuracy of 0.892.


Introduction
Information extraction from high school diploma certificates is necessary for the university admission committee during new student admission. Submitting a high school diploma certificate is one of many requirements that must be fulfilled by school leavers. In this digital era, the certificate is commonly submitted electronically as a scanned document file that is later classified by a member of the university admission committee.
Considerable manual labor is needed to verify and classify the submitted documents, which raises the question of whether the process can be automated. High school diploma certificates contain both text and images. Information extraction from text and images in documents for classification is explained as follows.
Text in a scanned document file can be extracted using OCR. There is no guarantee that OCR accuracy will be 100%. OCR errors raise the question of whether the OCR results will affect the accuracy of text classification in the subsequent process. Taghva's research, using a small collection of documents with long paragraphs, found that OCR errors do not affect the accuracy of text classification [1].
However, in later work, Taghva showed that applying automatic correction to documents with OCR errors improves text classification accuracy [2]. Another researcher [3] concluded that OCR errors greatly affect text classification accuracy if the corrupted words are significant for specific classes. In [3], three document representation methods were introduced to improve the accuracy of text classification in which the text was acquired through OCR: stop word elimination, lemmatization, and character n-grams.
Another approach to improving OCR accuracy is background elimination [4]. That work compared three OCR software packages and applied background elimination, showing that background elimination can improve OCR accuracy.
Image resolution also affects OCR accuracy. Research on improving the OCR accuracy of low-resolution images has shown good results [5]. It used a three-step method, namely resizing, sharpening, and blurring.
There are many research works on text classification models. In [3], four methods were mentioned for text classification: centroid, support vector machine (SVM), k-nearest neighbor (k-NN), and Naive Bayes (NB). Some researchers agreed that the decision tree is also a feasible method for text classification [6]. Applying a term weighting matrix in SVM improved SVM performance [7]. The most recent research shows a trend toward using convolutional neural networks (CNN) in text classification, with better performance [8,9].
CNN achieves good performance not only for text classification but also for image classification. Although CNN consumes considerable computational resources and requires a long time to train, methods are available to address those problems [10]. CNN performs less well for image classification if the image is cluttered and contains many objects of varying shapes and sizes [11]. However, all images in high school diploma certificates have the same shape and size and are not cluttered; thus, CNN is still a feasible method.
Recently, combined text-image classification has emerged as a newly developed model. There are two papers on text-image classification, one by Guo Li and Na Li [12] and another by Fangyi Zhu et al. [13]. Guo Li and Na Li used an adaptive fusion model, while Fangyi Zhu et al. used a weight fusion model with a decision strategy.
Although Guo Li and Na Li stated that the proposed adaptive fusion model was compared with the weight fusion model, it was not clearly explained whether the weight fusion model applied the decision strategy. Both text-image classification models used datasets in which the text data had already been digitally provided. In our dataset of high school diploma certificates, the text must be acquired using OCR.
Our contributions are the addition of OCR pre-processing in the text classification sub-model of the text-image fusion model, and a clear comparison between the adaptive fusion model and the weight fusion model on our dataset.

The Dataset
The dataset consisted of 1555 files, split into three subsets: 870 for training, 218 for validation, and 467 for testing. Fig. 1 shows example images from each class. The acquisition of this dataset was approved by the university admission committee for research purposes only and for the development of an automated high school diploma certificate classifier for the admission system. The dataset will be kept from public access to protect the privacy of its owners.

Text Classification Model
We trained the text classification model separately from the image classification model. Text pre-processing after the OCR process included converting to lower case, removing stop words, converting numerals to letters, removing single-letter words, and removing multiple spaces. Converting numerals to letters was deemed necessary because the graduation year is written in numerals and we use word embedding vectors. This conversion ensured that all numeric information was properly embedded with vector values.
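The pre-processing steps listed above can be sketched as follows. This is a minimal illustration, not the code used in this research: the stop-word list is a hypothetical placeholder, and spelling out each digit as an English word is an assumption about how numerals were converted to letters.

```python
import re

# Hypothetical stop-word list; the text does not name the list that was used.
STOP_WORDS = {"the", "of", "and", "in"}

# Assumption: every digit is spelled out separately as an English word.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_digits(token):
    """Replace each digit with its word, padded with spaces so that
    digits glued to letters by OCR are separated from them."""
    return re.sub(r"\d", lambda m: " " + DIGIT_WORDS[int(m.group())] + " ", token)


def preprocess(ocr_text):
    """Apply the pre-processing steps in the order listed in the text."""
    text = ocr_text.lower()                                     # lower case
    tokens = [t for t in text.split() if t not in STOP_WORDS]   # stop words
    text = " ".join(spell_digits(t) for t in tokens)            # numerals to words
    tokens = [t for t in text.split() if len(t) > 1]            # one-character tokens
    return " ".join(tokens)                                     # single spaces only


print(preprocess("Graduated  in the Year 2017 a B"))
# graduated year two zero one seven
```

Numerals are converted before one-character tokens are dropped so that the spelled-out digits survive the length filter.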
As seen in Fig. 2, we made a branch layer for the graduation year category with 3 classes (2016, 2017, and 2018) and for the high school category with 3 classes (non-vocational, vocational, and religious). The model consisted of a 1-dimensional convolutional layer with kernel size 3 and a 1-dimensional max pooling layer, followed by a hidden layer, a ReLU activation layer, and an output layer with a node for each class.
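To illustrate what the first two layers compute, the toy sketch below applies a kernel-size-3 convolution and max pooling to a short sequence of scalars. It is a simplification under stated assumptions: the real input is a sequence of embedding vectors processed by many learned filters, whereas here a single hand-picked filter, no padding, and a pool size of 2 are assumed (none of these are given in the text).

```python
def conv1d_valid(seq, kernel):
    """Slide the kernel over seq without padding
    (cross-correlation, as computed by CNN layers)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]


def max_pool1d(seq, pool_size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(seq[i:i + pool_size])
            for i in range(0, len(seq) - pool_size + 1, pool_size)]


toy_sequence = [1, 0, 2, 3, 1, 0, 4, 1]   # made-up values
toy_kernel = [1, 0, -1]                   # kernel size 3, made-up weights

features = conv1d_valid(toy_sequence, toy_kernel)
pooled = max_pool1d(features)
print(features)  # [-1, -3, 1, 3, -3, -1]
print(pooled)    # [-1, 3, -1]
```

A sequence of length 8 yields 6 convolution outputs and 3 pooled values; in the model these pooled features feed the hidden layer and the two output branches.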

Image Classification Model
The input image for our image classification model was resized to 160 × 128. This resizing was necessary to reduce computation and make model training faster, while remaining adequate to ensure that no information was lost from the image. The image classification model had three 2-dimensional convolutional blocks and one fully connected block. The first convolutional block had 32 filters, an 11 × 11 kernel, 4 × 4 strides, and 2 × 2 max pooling. The two following convolutional blocks had 64 filters, a 3 × 3 kernel, a stride of 1, and 2 × 2 max pooling. The flatten block had a ReLU activation layer and an output layer with a node for each class. Fig. 3 illustrates this model.

Fusion Model
Table 1 presents the differences between the fusion models proposed by Guo Li and Na Li [12] and Fangyi Zhu et al. [13]. The adaptive fusion proposed by Guo Li and Na Li can be explained as follows. Each data item x = (m, t) consists of an image m and a text t. The image classification model has training accuracy a_img(i) and probability p_img(m, i) for class i. The text classification model has training accuracy a_text(i) and probability p_text(t, i) for class i. For data x, the combined probability p(x, i) for class i is calculated using (1), and x is assigned to the class with the largest p(x, i) value.
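The adaptive fusion rule described above can be sketched as follows. The exact form of the weights is an assumption: here the per-class weights w_img(i) and w_text(i) are taken to be the training accuracies a_img(i) and a_text(i) normalized to sum to 1, which is one plausible reading and not necessarily the formula in [12]. All numbers are made up for illustration.

```python
def adaptive_weights(acc_img, acc_text):
    """Per-class weights from training accuracies.

    Assumption: w_img(i) = a_img(i) / (a_img(i) + a_text(i)),
    and w_text(i) = 1 - w_img(i).
    """
    w_img = [ai / (ai + at) for ai, at in zip(acc_img, acc_text)]
    w_text = [1.0 - w for w in w_img]
    return w_img, w_text


def adaptive_fusion(p_img, p_text, acc_img, acc_text):
    """Combine per-class probabilities and return (fused scores, predicted class)."""
    w_img, w_text = adaptive_weights(acc_img, acc_text)
    fused = [wi * pi + wt * pt
             for wi, pi, wt, pt in zip(w_img, p_img, w_text, p_text)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


# Made-up training accuracies and class probabilities for three classes.
acc_img, acc_text = [0.9, 0.6, 1.0], [0.9, 0.9, 0.5]
p_img, p_text = [0.5, 0.3, 0.2], [0.2, 0.7, 0.1]

fused, predicted = adaptive_fusion(p_img, p_text, acc_img, acc_text)
print(predicted)  # 1
```

Under this reading, a model that was more accurate on a given class during training contributes more to the fused score for that class.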
The weight fusion model used a regularization parameter λ to control the balance between the text classification and image classification models. The value of λ was set between 0 and 1. After the probability p(x, i) of data x had been calculated for each class i, x was assigned to the class with the largest p(x, i) value. The formula for this weight fusion model is given in (2).
The final decision strategy is given in equation (4). Fig. 4 depicts the structure of the fusion model in our research. We compared the fusion models from [12] and [13]. We were not able to apply the decision strategy because our dataset differed from that of [13]: the indiscriminative words were always present in our text classification model.
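A minimal sketch of such a weight fusion is given below. Which model λ multiplies is an assumption: here λ weights the image model and (1 − λ) the text model, and the convention in [13] may be the reverse. The probabilities are made up for illustration.

```python
def weight_fusion(p_img, p_text, lam):
    """Blend per-class probabilities with a single global weight.

    Assumption: p(x, i) = lam * p_img(m, i) + (1 - lam) * p_text(t, i).
    Returns (fused scores, predicted class index).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    fused = [lam * pi + (1.0 - lam) * pt for pi, pt in zip(p_img, p_text)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


# Made-up class probabilities for three classes.
p_img, p_text = [0.5, 0.3, 0.2], [0.2, 0.7, 0.1]

print(weight_fusion(p_img, p_text, 0.0)[1])  # 1  (text model only)
print(weight_fusion(p_img, p_text, 1.0)[1])  # 0  (image model only)
```

Unlike the adaptive fusion, the same λ applies to every class, so it has to be chosen by trying several values.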

Text Classification
For the adaptive fusion model, we needed the accuracy on the training dataset for each class in order to calculate the adaptive weight. Table 2 shows the accuracy of the text classification model on the training dataset. The text classification model was neither overfitting nor underfitting, as confirmed by the learning curve in Fig. 5. The model was trained for 60 epochs. Table 3 shows the performance of the text classification model on the test dataset, with a model accuracy of 0.925.

Image Classification
Table 4 and Fig. 6 respectively show the image classification model accuracy on the training dataset for each class and the learning curve over 60 epochs. Overall, the image classification model did not perform better than the text classification model, as seen in Table 3 and Table 5. Its accuracy was 0.886.
It can be seen that the image classification model's performance for the graduation year category was far below that of the text classification model. Precision for class 2018 was 0.641, far below the text classification model's 0.874.
Although the image classification model's overall performance was below that of the text classification model, its precision for the religious high school type was 1 with a recall of 0.952, which was better than the text classification model.

Fusion Models
The implementation of the adaptive fusion model required us to calculate the adaptive weights as shown in (1). Table 2 and Table 4 were used to calculate w_img(i) and w_text(i). The results of the adaptive fusion model on the test dataset are shown in Table 6. The adaptive fusion model accuracy was 0.892.
Table 7 shows that the best accuracy of the weight fusion model was 0.927, with a λ value of 0.02. The performance of the weight fusion model with λ = 0.02 is shown in Table 8. Comparing Table 6 and Table 8 shows that the weight fusion model outperformed the adaptive fusion model. Precision for each class in the weight fusion model was better than in the adaptive fusion model. The best model for high school diploma certificate classification was therefore the weight fusion model. Paper [12] claimed that the adaptive fusion model was better than the weight fusion model, but our results on this dataset showed the opposite. The accuracy comparison of each model is shown in Table 9.
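Selecting λ for the weight fusion model amounts to trying several values and keeping the most accurate one. The sketch below illustrates this on made-up data. The grid of 0.01 steps, the convention that λ weights the image model, and the toy samples are all assumptions; the text does not state which grid or which data split was used.

```python
def fused_prediction(p_img, p_text, lam):
    """Predicted class under the assumed rule lam * p_img + (1 - lam) * p_text."""
    fused = [lam * pi + (1.0 - lam) * pt for pi, pt in zip(p_img, p_text)]
    return max(range(len(fused)), key=fused.__getitem__)


def accuracy_for_lambda(samples, lam):
    """samples is a list of (p_img, p_text, true_class) triples."""
    correct = sum(1 for p_img, p_text, label in samples
                  if fused_prediction(p_img, p_text, lam) == label)
    return correct / len(samples)


def sweep_lambda(samples, steps=100):
    """Evaluate lam = 0, 1/steps, ..., 1 and return the first best (lam, accuracy)."""
    results = [(k / steps, accuracy_for_lambda(samples, k / steps))
               for k in range(steps + 1)]
    return max(results, key=lambda r: r[1])


# Made-up two-class samples: (image probabilities, text probabilities, label).
toy_samples = [
    ([0.9, 0.1], [0.4, 0.6], 0),
    ([0.8, 0.2], [0.1, 0.9], 1),
    ([0.3, 0.7], [0.6, 0.4], 1),
]

best_lam, best_acc = sweep_lambda(toy_samples)
print(best_lam, best_acc)  # 0.34 1.0
```

In this toy case neither model alone classifies all three samples correctly, while an intermediate λ does, which is the effect the sweep is meant to find.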

Conclusion
In this paper, we applied OCR pre-processing to the dataset to acquire digital text. We used a text classification model with the digital text from OCR as input. We also used an image classification model, which was trained separately. We compared two fusion models from previous research works [12,13]. We trained the text and image classification models with fewer epochs than [13]. This research found that the weight fusion model, with an accuracy of 0.927, outperformed the adaptive fusion model, with an accuracy of 0.892. The limitation of our research is that the decision strategy from [13] could not be implemented for our dataset, because the indiscriminative words in the dataset were always present. For future research, we suggest developing a general-purpose decision strategy that does not depend on the text features of the dataset.