Spontaneous gaze interaction based on smooth pursuit eye movement using difference gaze pattern method

Human gaze is a promising input modality for natural user interfaces in touchless technology, whose appeal has grown during the Covid-19 pandemic. Spontaneous gaze interaction allows participants to interact directly with an application without any prior eye tracking calibration. Smooth pursuit eye movement is commonly used in this kind of spontaneous gaze-based interaction. Many studies have focused on various object selection techniques in smooth pursuit-based gaze interaction; however, challenges in spatial accuracy and implementation complexity remain unresolved. To address these problems, we propose an object selection approach that uses the difference patterns between the gaze and the dynamic objects' trajectories, named the Difference Gaze Pattern (DGP) method. In our experiments, the proposed method yielded a best object selection accuracy of 80.86 ± 9.57% and a success time of 5,885 ± 1,097 ms. The results also showed that object selection using difference patterns is robust to spatial inaccuracy and relatively simple to implement, suggesting that the proposed method can contribute to spontaneous gaze interaction.


Introduction
Interaction between humans and computer applications is no longer unusual. The most pervasive form of human-computer interaction is probably touching a smartphone screen to control phone applications, and humans seem to have placed a degree of trust in several of these applications [1]. Another typical interaction is using the keyboard and mouse with a personal computer. Interaction with computers through the brain-computer interface has also been studied extensively [2]. Beyond these input modalities, voice and gaze are further options worth considering. Moreover, due to the Covid-19 pandemic, more people are looking to touchless technology [3]. Voice-based touchless technology has been common for the past few years; however, interaction using voice is less discreet than gaze, so some people may prefer to interact through gaze. Gaze interaction is also promising because gaze is fast and quite intuitive: people tend to look at an object they are interested in before taking any other action [4,5].
Over the last decades, human gaze has been studied as an alternative input modality for interacting with computer applications. In the early stage, gaze was studied as a replacement for the mouse and keyboard, either for pointing or for selection, by fixating on a certain button [4-7]. Fixation describes a type of eye movement during which we hold our gaze on a certain object of interest for at least 300 ms [8]. Gaming and assistive technology are two application areas that mostly rely on fixation-based gaze interaction for pointing and selection [9,10].
Despite its current popularity, fixation-based gaze interaction depends greatly on spatial accuracy [11,12]. Prior eye tracking calibration is needed for each participant before interaction can proceed smoothly. This technique may therefore be unsuitable for spontaneous gaze interaction, as performing calibration before using an application can be quite inconvenient.
Gaze interaction based on smooth pursuit eye movement is a beneficial alternative for spontaneous interaction. Smooth pursuit refers to a slow eye movement made while pursuing a moving object at a velocity of 10-30 degrees per second, as when we watch a plane flying across the sky [8]. The most prevalent smooth pursuit-based gaze interaction technique is Pursuits [15-17], which measures the similarity between a participant's gaze and the moving objects' trajectories through their correlation value; as a consequence, the spatial accuracy of the gaze does not matter. Pursuits demonstrates that prior eye tracking calibration is not essential; hence, the smooth pursuit-based technique is more suitable for spontaneous gaze interaction.
A number of studies have focused on smooth pursuit-based interaction, particularly on techniques for object selection. The most common technique is based on the correlation value, Pearson's Product-Moment Correlation (PPMC) [3,14-22]. Other approaches include Euclidean distance, deep learning, and 2D correlation [13,23,24]. Object selection based on Euclidean distance has achieved sufficient accuracy [13] and is relatively simple to implement, as it requires no denoising method [14]; however, it is sensitive to spatial accuracy, i.e., eye tracking calibration plays an important role. In comparison, object selection based on deep learning achieves high accuracy and is insensitive to spatial accuracy [24], yet it needs a training phase with training data, so it is no longer simple to implement. Correlation-based techniques likewise achieve sufficient accuracy and are insensitive to spatial accuracy, but they generally need an additional signal denoising method [14,23]. Both deep learning and correlation-based techniques therefore lack implementation simplicity, owing to the training phase and the denoising step, respectively.
To remedy these gaps, we propose a novel object selection approach based on difference patterns, named the Difference Gaze Pattern (DGP) method, as summarized in table 1. We applied linear regression to the difference pattern between the gaze and object trajectories and compared the performance of the proposed method with PPMC in terms of accuracy and success time, as both methods require no training phase. We also evaluated the effect of calibration and of a signal denoising method to establish whether the proposed method is robust to spatial inaccuracy and simpler to implement.

Dataset
For experimental purposes, we used the same dataset as in previous studies [22,25]. The dataset was gathered from 34 participants, each of whom underwent two conditions: uncalibrated and calibrated. In the uncalibrated condition, the participant performed the task without prior eye tracking calibration; if the participant went through eye tracking calibration before performing the task, the data were categorized as the calibrated condition.
In each condition, a participant performed a task according to the stimulus presented in figure 1. The stimulus consisted of four objects moving at a speed of 142 pixels/second; the display had a resolution of 1920×1080 pixels, and each object had a size of 77×66 pixels. Object #1 moved horizontally from (250, 100) to (1600, 100) pixels, object #2 moved vertically from (100, 100) to (100, 950) pixels, object #3 moved horizontally from (1600, 950) to (250, 950) pixels, and object #4 moved vertically from (1750, 950) to (1750, 100) pixels. Every 10 seconds, one of the objects was colored orange in turn. The participants were tasked to follow the movement of the orange object with their eyes.
A Tobii EyeX Controller eye tracking sensor, with a sampling frequency of 70 Hz, recorded the on-screen gaze coordinates while participants followed the orange object. The eye tracker was mounted beneath a 22-inch LED monitor displaying the stimulus, and the participant was seated 50 cm in front of it (see figure 2). A personal computer with an Intel Core i3-6100 3.7 GHz processor, 8 GB RAM, and the Windows 10 Pro 64-bit operating system accommodated the recording process; Visual Studio Community 2017 with the C# programming language was used to display the stimulus and record the participants' gaze coordinates. The gaze coordinates were stored along with the coordinates of the four moving objects for each timestamp. The dataset also records the active button, i.e., which object was colored orange at each moment, for validation purposes. Our current study used a personal computer with an Intel Core i7-1165G7 2.8 GHz processor, 8 GB RAM, and the Windows 10 Home operating system, with Visual Studio Community 2019 as the integrated development environment; the C++ programming language was used to evaluate the performance of the object selection methods.

Object Selection Methods
We defined an object selection method as an algorithm that determines which moving object was being followed by the participant's gaze during a particular period. Object selection was performed by measuring the similarity between the gaze and object trajectories, so that the system could perform an appropriate action once it recognized the selected object. In this study, we compared a state-of-the-art object selection method [22] based on Pearson's Product-Moment Correlation (PPMC) with a new approach based on the pattern of the difference between the gaze and object trajectories.
Conceptually, a gaze trajectory is defined as a sequence of n gaze points or coordinates:

G = {g_1, g_2, ..., g_n},  (1)

where g_i = (x_i, y_i) represents the spatial position of the i-th gaze sample on the screen. There is also a set of m moving objects

O = {O_1, O_2, ..., O_m}.  (2)

A moving object trajectory

O_j = {o_(j,1), o_(j,2), ..., o_(j,n)},  (3)

with o_(j,i) = (x_(j,i), y_(j,i)) and 1 ≤ j ≤ m, is the sequence of on-screen coordinates of the j-th object. In this study, we used m = 4 as there were four moving objects in the dataset. We also empirically set n = 180 samples, which at 70 Hz corresponds to roughly 2.5 seconds, almost the same window duration as used by Khamis et al. [19].
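As a concrete illustration of the windowing above, consider the following minimal Python sketch (names and structure are hypothetical, not the authors' implementation), which keeps the most recent n = 180 gaze and object samples:

```python
from collections import deque

WINDOW = 180      # n = 180 samples, roughly 2.5 s at 70 Hz
NUM_OBJECTS = 4   # m = 4 moving objects

gaze_window = deque(maxlen=WINDOW)  # holds g_i = (x_i, y_i)
object_windows = [deque(maxlen=WINDOW) for _ in range(NUM_OBJECTS)]

def on_sample(gaze_point, object_points):
    """Append the newest sample; deques drop the oldest automatically,
    which realizes the sliding window. Returns True once the window is
    full and an object selection step can run."""
    gaze_window.append(gaze_point)
    for window, point in zip(object_windows, object_points):
        window.append(point)
    return len(gaze_window) == WINDOW
```

Each time a new gaze coordinate arrives, the oldest sample falls out of the window, matching the shifting-window behavior described for the selection methods below.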

Pearson's Product-Moment Correlation (PPMC)
PPMC measures the similarity between the gaze and object trajectories as a correlation value ranging from −1 to 1. A correlation value of 1 indicates that both trajectories vary exactly alike, whereas 0 implies no linear relationship between them. We specifically used the x-axis coordinates if the object moved horizontally and the y-axis coordinates if it moved vertically. For instance, when the object moved horizontally, we calculated the correlation between the gaze and the j-th object as follows:

r(G, O_j) = Σ_{i=1..n} (x_i − x̄)(x_(j,i) − x̄_j) / √( Σ_{i=1..n} (x_i − x̄)² · Σ_{i=1..n} (x_(j,i) − x̄_j)² ),

where x̄ and x̄_j are the mean x-coordinates of the gaze and the j-th object within the window. To decide which object a participant was following, the algorithm calculated the correlation value between each object trajectory and the gaze trajectory, giving four correlation values per instance. Objects whose correlation value exceeded a threshold were labeled as candidates; when there was more than one candidate, the algorithm selected the object with the greatest correlation value. In the experiment, we set a threshold of 0.7 as suggested by Khamis et al. [19]. As new gaze coordinates arrived, the correlation window shifted and the same procedure was executed. The algorithm decided on the selected object once at least 80 consecutive correlation values of the same object satisfied the threshold, ensuring that the participant was given sufficient time to make selection progress.
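A compact sketch of this selection rule follows (a simplified illustration, not the authors' code; the 80-consecutive-window confirmation is omitted, and a zero-variance guard is added for coordinates of objects that do not move along the compared axis):

```python
import math

def ppmc(a, b):
    """Pearson's product-moment correlation of two equal-length sequences.
    Returns 0.0 when either sequence has zero variance (e.g., the
    x-coordinates of a vertically moving object)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def select_by_ppmc(gaze, objects, threshold=0.7):
    """Label objects whose correlation with the gaze exceeds the threshold
    as candidates, then pick the candidate with the greatest correlation.
    Returns the object's index, or None when no candidate exists."""
    scores = [ppmc(gaze, obj) for obj in objects]
    best = max(range(len(objects)), key=lambda j: scores[j])
    return best if scores[best] > threshold else None
```

In the full method, this per-window decision would additionally have to hold for at least 80 consecutive windows before the selection is confirmed.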

Difference Gaze Pattern (DGP)
Unlike PPMC, our new approach calculates the difference between the gaze and object trajectories, applies linear regression, and takes the resulting slope (gradient) value; a threshold on this slope decides the selected candidate. Figure 3 illustrates the approach. The difference between each object trajectory and the gaze trajectory is calculated, producing new difference trajectories, to which we apply linear regression and compute the slope. When the gaze follows a certain moving object, the difference should be constant apart from noise, so linear regression on the difference trajectory yields a line whose slope is close to zero. Conversely, when the gaze does not follow a certain object, the slope is far from zero.
Our study used the x-axis coordinates to calculate the difference between the gaze and object trajectories when the objects moved horizontally, and the y-axis coordinates when they moved vertically. Suppose there is an object trajectory as written in (3) that moves horizontally and a gaze trajectory as stated in (1). Since they move horizontally, we calculated the difference at each i-th sample as

d_(j,i) = x_i − x_(j,i),  (4)

and then used (5) to calculate the slope of the difference pattern between the gaze and the j-th object moving along the x-axis direction:

s(G, O_j) = Σ_{i=1..n} (i − ī)(d_(j,i) − d̄_j) / Σ_{i=1..n} (i − ī)²,  (5)

where ī is the mean sample index and d̄_j is the mean of the difference values within the window. We empirically used a threshold value of 2.0: when the absolute slope |s(G, O_j)| was less than 2.0, the object was treated as a candidate. When there was more than one candidate, the algorithm chose the object with the smallest absolute slope value. The other procedures were the same as those applied in PPMC. The method implementation is summarized in figure 4.
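The two steps above amount to a least-squares slope of the difference signal regressed on the sample index. A minimal sketch (an illustration under our reading of the method, not the authors' code):

```python
def regression_slope(d):
    """Least-squares slope of sequence d regressed on the sample index
    i = 0..n-1; near zero when d is approximately constant."""
    n = len(d)
    mean_i = (n - 1) / 2.0
    mean_d = sum(d) / n
    num = sum((i - mean_i) * (v - mean_d) for i, v in enumerate(d))
    den = sum((i - mean_i) ** 2 for i in range(n))
    return num / den

def select_by_dgp(gaze, objects, threshold=2.0):
    """Compute the absolute slope of each gaze-object difference pattern
    and pick the candidate with the smallest value below the threshold.
    Returns the object's index, or None when no candidate exists."""
    slopes = [abs(regression_slope([g - o for g, o in zip(gaze, obj)]))
              for obj in objects]
    best = min(range(len(objects)), key=lambda j: slopes[j])
    return best if slopes[best] < threshold else None
```

Note that a constant offset between gaze and object (e.g., from an uncalibrated tracker) shifts the difference trajectory up or down but leaves its slope near zero, which is why this criterion is insensitive to spatial inaccuracy.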

Signal Denoising Method
In our previous study, signal denoising was required to improve object selection accuracy, as noise affected the correlation value [22]. In the current study, we investigated the effect of signal denoising on the performance of object selection using both PPMC and DGP. We evaluated a simple, classic denoising method: the first-order infinite impulse response (IIR) filter.
Our study used a first-order digital IIR filter defined as

ĝ(i) = α·g(i) + (1 − α)·ĝ(i − 1),  (6)

where g(i) denotes the current i-th gaze point, ĝ(i) and ĝ(i − 1) denote the current i-th and previous (i − 1)-th denoised gaze points, respectively, and the smoothing coefficient α is determined by the cutoff frequency ω (in rad/s) and the 70 Hz sampling period. We used the same value of ω as in the previous study [22].
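As a sketch, the recurrence can be applied per coordinate as follows. The coefficient form is an assumption here (a common discretization maps the cutoff ω and sampling period T to α = ωT / (1 + ωT)); the actual value follows [22]:

```python
def iir_lowpass(samples, alpha):
    """First-order IIR low-pass filter: each output mixes the new sample
    with the previous output, g_hat(i) = alpha*g(i) + (1-alpha)*g_hat(i-1).
    alpha in (0, 1]; smaller alpha means stronger smoothing."""
    denoised = []
    prev = samples[0]  # seed the recursion with the first sample
    for s in samples:
        prev = alpha * s + (1 - alpha) * prev
        denoised.append(prev)
    return denoised
```

With alpha = 1 the filter passes samples through unchanged; smaller values progressively suppress sample-to-sample jitter at the cost of lag.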

Performance Metrics
To evaluate the performance of our proposed method, we used two metrics namely accuracy and success time.

Accuracy
In this study, we defined a task as the procedure in which a gaze trajectory (1) and the four object trajectories (3) were processed to obtain a selected object candidate. Once a new gaze point arrived and the sliding window had moved, another task was performed. The dataset provides the active button, i.e., the object being followed by the participant's gaze at each timestamp, which acted as the ground truth for the experiment. The selected candidate of each task was compared with the ground truth; when they matched, the task was counted as successful. We measured accuracy as the percentage ratio between the number of successful tasks and the total number of tasks. The accuracy thus estimates the probability that the system correctly 'guesses' the object selected by the participant.
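The accuracy computation is then a straightforward hit ratio; a sketch with hypothetical names:

```python
def selection_accuracy(selected, ground_truth):
    """Percentage of tasks in which the selected candidate matches the
    active button recorded in the dataset. A task with no candidate
    (None) counts as unsuccessful."""
    assert len(selected) == len(ground_truth)
    hits = sum(1 for s, t in zip(selected, ground_truth) if s == t)
    return 100.0 * hits / len(selected)
```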

Success Time
Besides accuracy, another important metric for an object selection method is success time, which indicates how long a participant has to wait until the targeted object is selected. In this study, we measured the success time as the difference between the first timestamp of the gaze points in the window and the timestamp at which the system decided on the selected object.

Performance of Object Selection Methods
We evaluated the proposed method by comparing its performance with that of the previous study [22]. Since we also aimed to observe the effect of denoising, we compared their performance in two conditions, i.e., with and without a signal denoising method.

Without Signal Denoising Method
In this condition, we applied both object selection methods without any prior signal denoising. We also conducted statistical tests to evaluate the differences between conditions; since most of the performance distributions were not normally distributed, we used the Wilcoxon test.

Figure 5 presents the comparison of accuracy between the two methods, each under the two calibration conditions. In general, the accuracy of object selection using DGP was significantly higher than using PPMC, in both the calibrated (Z = −4.505, p < .05) and uncalibrated (Z = −4.915, p < .05) conditions, meaning that in a system with no denoising step, the DGP method was superior. On the other hand, there were no significant differences between the calibrated and uncalibrated conditions for either PPMC (Z = −1.599, n.s.) or DGP (Z = −1.281, n.s.). In other words, both PPMC and DGP were robust to spatial inaccuracy, so performing eye tracking calibration before selecting an object may not be necessary.

Figure 6 compares the proposed and previous methods in terms of success time. The success time of object selection using DGP was generally lower than with PPMC, although the difference was significant only in the uncalibrated condition (Z = −2.966, p < .05). Calibration also did not significantly lower the success time for either PPMC (Z = −0.077, n.s.) or DGP (Z = −1.445, n.s.). Therefore, prior eye tracking calibration may not be needed, and DGP outperformed PPMC in a system with no signal denoising process; the success time results were in line with the accuracy results.

With Signal Denoising Method
Figure 7 shows the comparison of accuracy across object selection methods and calibration conditions when the first-order IIR filter was used for denoising. Again, there were no significant differences in accuracy between the calibrated and uncalibrated conditions for either PPMC (Z = −0.043, n.s.) or DGP (Z = −1.325, n.s.). Compared with the results presented in figure 5, there was a significant improvement in the accuracy of object selection using PPMC, in both the calibrated (Z = −3.838, p < .05) and uncalibrated (Z = −4.505, p < .05) conditions. This was because PPMC demands similar trajectories between gaze and object, for which signal denoising plays an important role. On the contrary, the accuracy of object selection using DGP significantly decreased with the same filter, in both the calibrated (Z = −4.095, p < .05) and uncalibrated (Z = −4.163, p < .05) conditions. We assume that some information needed by the selection method was lost during denoising, hence the decrease in accuracy. Based on these results, PPMC needed a signal denoising method to achieve higher object selection accuracy, while DGP did not require any additional signal denoising. Unlike with PPMC, the success time with DGP increased significantly in both the calibrated (Z = −2.761, p < .05) and uncalibrated (Z = −4.522, p < .05) conditions when denoising was applied. These results imply that applying signal denoising before object selection with DGP is unnecessary and in fact worsens performance.

Discussion
Based on the evaluation results, several points are worth noting. First, both the PPMC and DGP methods are robust to spatial inaccuracy; both are therefore useful for spontaneous gaze interaction without prior eye tracking calibration. Second, DGP has some advantages over PPMC as an object selection method: it achieved accuracy comparable to PPMC and to previous studies [14,22,25] without an additional signal denoising method, thereby simplifying the overall object selection process, and it also achieved a better success time than PPMC. The proposed method may thus be worth evaluating further in real-time experiments with more complicated tasks, such as those in previous studies [3,15,18].
On the other hand, despite its advantages over PPMC, the proposed method is still inferior in terms of computational time, as shown in table 2: its computational time was significantly higher than that of PPMC (Z = −5.086, p < .05). DGP may be better in terms of accuracy and success time, yet it lags in computational time. Nonetheless, it can still be implemented in a real-time manner: the Tobii EyeX Controller has a sampling period of about 14 ms, while the proposed algorithm needed 0.5346 ms to complete a task. In a real-time situation, the proposed method can therefore be executed without lowering the sensor's sampling rate.

Conclusion
Spontaneous gaze interaction does not require prior eye tracking calibration for each participant to interact with an application, and gaze interaction based on smooth pursuit can meet this requirement. Object selection techniques for smooth pursuit-based interaction have been studied in recent years, including Pearson's Product-Moment Correlation (PPMC), Euclidean distance, deep learning, and 2D correlation; however, these techniques have drawbacks in either spatial accuracy or implementation complexity. Our proposed method based on a difference pattern achieved sufficient accuracy, was insensitive to spatial inaccuracy, and required no additional signal denoising step. The implementation of the proposed method is therefore simpler, and it is suitable for spontaneous gaze interaction.

Future Work
Our proposed method has so far been evaluated only with a linear-trajectory stimulus, whereas PPMC has been evaluated with both linear and circular trajectories [18,21,26]. Further evaluation and some adjustments to the method may be needed in the future to accommodate different movement types of dynamic objects.