
1 Introduction

Eye tracking technology is widely used in a variety of situations related to visual psychology, capturing information about the user's gaze [1]. Fixations, gaze points, and areas of interest (AOIs) play an important role because they provide semantic information about the region being inspected or potentially of interest [2]. The measured gaze information can be assigned to AOIs as the basis for many statistical [3] and visual [4] evaluation methods. For the static stimuli of a screen-based eye tracker, defining AOIs is easier than for a mobile eye tracker, because in a dynamic scene the AOIs (or objects of interest) can move, change their size and shape, and even disappear and reappear. Manually annotating the objects of interest is a time-consuming process.

However, with the continuous development of deep learning, a variety of object recognition and detection algorithms have been proposed and have gradually replaced traditional approaches. Among these algorithms, YOLO (You Only Look Once) [5] is the fastest while preserving high precision. The model uses a single neural network to predict bounding boxes and class probabilities in one evaluation, which makes it feasible to relate gaze points to the objects recognized in the video frames collected during a mobile eye tracking examination. To improve analysis efficiency with computer algorithms, Kurzhals and his group proposed a novel visual analytics approach that accomplishes the annotation process through image-based, automatic clustering of eye tracking data integrated into an interactive labeling and analysis system [6]. They demonstrated their approach with eye tracking data from a real experiment and compared it to an analysis of the same data by manual annotation of dynamic AOIs. This work offers a good starting point for using computer vision to assist in the analysis of eye tracker data.

Our main contribution in this article is a new visual analytics approach that allows the efficient comparison of data from multiple videos acquired in mobile eye tracking experiments. By including unsupervised clustering techniques in the pre-processing step and interactive image queries in the labeling step of the analysis process, we achieve annotation results comparable to current state-of-the-art techniques, but with far less human effort and a more efficient annotation process.

2 Related Works

User experience research covers the entire process of users interacting with products and has become one of the core competencies of industrial design. Many studies address user experience evaluation methods. Vermeeren compared multiple methods such as questionnaires, self-reports, think-aloud protocols, and physiological measurements [8]. Virpi Roto conducted a retrospective analysis of user experience evaluation methods (UXEM) collected from academia and industry, such as observation, eye movement, heart rate, electromyography, and questionnaires, to evaluate user perception [9]. Marchitto explored the application of cognitive psychology and ergonomics to user experience assessment [10]. Ramakrisnan used eye tracking experiments to evaluate the human-computer interaction of electronic systems [11]. Shin studied a user experience research model in a 3D virtual environment; the users' cognition and usage behavior were recorded with an experimental questionnaire, verifying the important role of user cognition and perception in the overall user experience [12]. Vaananen, Obrist, et al. discussed user experience in terms of "analysis of the behavioral value of usability", "user experience and user acceptance", "experience in product innovation design", "how to choose a user experience method", and "user research theory and its practical application" [13,14,15,16].

Eye tracking is considered the most effective method for studying visual information recognition and processing in human-computer interaction. It is also an important method for assessing user interfaces in the workplace. By observing how users visually search and locate targets in a defined task, the deep cognitive mechanisms behind eye movement behavior can be analyzed, and aspects of user interface behavior such as vision, cognition, and attention can be studied in depth [7].

Previous eye tracking research relies on eye movement behavior indicators such as the time to first fixation, the first fixation point, and the gaze map. For example, Goldberg et al. used eye movement data to explore differences in user response to different user interface materials, measuring interface quality with gaze coordinates and saccade path indicators [17]. Augustyniak et al. proposed a new interpretation method using eye tracking characteristics collected from the visual inspection behavior of expert users; through visual test tasks performed by 17 expert participants and 21 students, the characteristics of eye movement parameters in eye movement estimation were revealed [18]. Ito et al. used eye movements with an average gaze ratio to visualize the user's visual behavior [19]. Burch et al. used heat maps, visual trajectories, and areas of interest (AOIs) to describe the cognitive process of reading three different tree visualizations [20]. Liu et al. used eye tracking to measure the impact of residual text information on the screen on the user's perception of multimedia; the experiment collected indicators such as the number of fixation points, fixation duration, and average fixation time to compare the information processing strategies of participants browsing the web, and participants assessed their cognitive load in the task through a self-assessment questionnaire [21]. Zulch studied the relationship between different data presentation methods by observing the number of gaze points and the time to first fixation, revealing the different behavior strategies users employ when searching data [22]. In general, eye movement data such as areas of interest (AOIs), heat maps, time to first fixation, and fixations before first fixation are the main objects of acquisition in eye tracking research.

The studies presented in this article use mobile eye tracking to evaluate the usability of service systems. Mobile eye tracking is far less constrained than screen-based tracking, allowing participants to move freely during an experiment. It is becoming more and more popular, enabling "in-the-wild" studies that are not possible with restrictive experimental settings; developing more efficient analysis methods for these challenging datasets is therefore an important research field. Only a few studies have investigated how mobile eye tracking can be used in real-world scenarios. Schuchard et al. examined the ideal sign placement for patients with dementia in a nursing home [23]; the group of participants was restricted to persons needing special assistance, no navigation aid was used, and the study was very exploratory. Similarly, Pinelo da Silva used mobile eye tracking to examine visual cognition and wayfinding in the city of London, but no navigation aid was included [24]. Delikostidis used eye tracking to examine pedestrian navigation systems [25], conducting a field study in an outdoor area with different mobile map depictions.

In summary, eye tracking is a well-established measurement technique across a wide variety of research fields. However, as stated above, it has rarely been used to evaluate the usability of a service system (specifically, a government service department). In this article, we fill this methodological gap by measuring users' gazes and objects of interest (and the association mappings between them) during a real-world navigation experiment that examines the suitability of a specific task flow in a service system.

3 Domain Problem Characterizations and Design Process

3.1 Government Service Department Task Flow Characteristics

The local taxation authority is one of the state's public administration institutions. The quality and efficiency of the taxation services it provides directly affect the image of the government; more importantly, its skill in organizing tax revenue directly affects the government's fiscal revenue [26]. The tax service hall is the main point of contact with the public through which the local tax authorities handle business externally, so its law enforcement ability and performance directly determine its attractiveness to taxpayers. However, the public still feels that there is a big gap between the attitude and management effectiveness of government agencies and their expectations. People are concerned not only with whether the government has fulfilled its due responsibilities, but also with whether it fulfills them efficiently and conveniently, and whether they receive a fast, effective, and satisfactory service response when they visit a government department for help. In this context, our team cooperated with the Local Tax Service Hall of Yuexiu District, Guangzhou (P.R. China), using a mobile eye tracker as the experimental instrument. We designed a corresponding test plan and performed actual user testing following the service flow of the tax service hall, in order to objectively evaluate taxpayers' satisfaction with the service quality of the service system.

3.2 Implementation Process Introduction and Analysis

After a preliminary investigation, we found that the tax administration business can be classified into five categories: declaration, correction, tax registration, document acceptance, and punishment. The applicants are divided into enterprises, individually owned businesses, flexibly employed persons, and natural persons. Taxpayers with different identities are involved in different types of tax administration business, and the related tax business processes differ greatly. For example, the process for "tax withholding registration" is shown in Fig. 1 and the process for "examination and approval of VAT deduction vouchers" is shown in Fig. 2.

Fig. 1. The steps in the flow of "tax withholding registration".

Fig. 2. The steps in the flow of "examination and approval of VAT deduction vouchers".

Obviously, each type of business must be carried out according to its established process, and the processes differ considerably. This reveals the contradictory, two-sided nature of the services provided in a government service hall: on the one hand, the user is required to complete a cumbersome established process to achieve a goal; on the other hand, the user is a consumer to be served, and the service provider wishes to provide better service to increase satisfaction. In such a special scenario, we must design a reasonable evaluation method that identifies the user's pain points accurately, so as to give reasonable suggestions for improving satisfaction. Here the mobile eye tracker offers an ideal solution, recording the whole process from the user's first-person perspective.

We then had to abstract the various business processes into one unified task process, for three reasons. First, our purpose is to evaluate taxpayer satisfaction, so the experiment should take user experience as its basic dimension; the details of the business are not our concern, and we can ignore differences of detail and merge steps across processes. Second, the participants in this experiment are real users who come to the office with their own purposes and do not follow the same process; collecting enough samples of any single process would be very time-consuming and too expensive for our budget. We therefore need to summarize the task flows of different users into one abstract unified process, so that test data from different samples can be combined into objective evaluation values. Third, the test data collected by a mobile eye tracker combine a video stream with a record of gaze point positions, which is a kind of unstructured data; dividing the video stream into units by duration is an effective way to extract structured data from it. For example, the time consumed in each stage can be counted and compared, as sketched below.
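To make this concrete, the following Python sketch shows how a continuous recording can be turned into per-stage durations that are comparable across participants. The stage names and boundary timestamps are hypothetical; in our study the boundaries came from manual segmentation (see Sect. 4.5).

```python
# A minimal sketch of structuring an eye tracking recording by task stages.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    start_s: float  # stage start, seconds from recording start
    end_s: float    # stage end

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s

# Hypothetical boundaries for one participant's recording.
stages = [
    Stage("get a number", 0.0, 42.5),
    Stage("filling a form", 42.5, 310.0),
    Stage("finding counter", 310.0, 355.0),
    Stage("at counter", 355.0, 812.0),
]

# Comparing time consumed per stage turns the unstructured video stream
# into a small structured table that can be averaged across participants.
for s in stages:
    print(f"{s.name:>16}: {s.duration_s:7.1f} s")
```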

3.3 Analysis Process

As mentioned above, summarizing all service processes into a unified interactive task sequence is the foundation of the subsequent data analysis. After manually analyzing all recorded data, we summarized seven key task stages, shown in Fig. 3. With this task sequence as the statistical dimension, more in-depth analysis can be implemented; basically, we can count the execution time of each task node together with various types of eye movement analysis data to obtain a user satisfaction evaluation score.

Fig. 3. Tax service process task sequence.

4 Eye Tracker Data Processing

4.1 Experimental Design

We used a head-mounted eye tracker to record participants' eye movement data during the business process. Among the people who came to the service hall for tax administration business, we randomly selected participants of the four identity types (enterprises, individually owned businesses, flexibly employed persons, and natural persons) for testing. During the experiment, participants were asked to put on the head-mounted eye tracker at the entrance and to wear it until they completed their business. They had to complete the process of taking a number, finding the counter, and transacting at the counter without any special help. Since the selected head-mounted eye tracker is very light, participants could subjectively ignore its presence after a short time and remain relaxed during the experiment. In addition, to record how participants actually completed the business process, the test assistants did not communicate with them at all until the end.

We completed a total of 35 tests in three days. After exporting the eye tracker data, we obtained 892 min of video data at 1920 × 1080 resolution. Some screenshots are shown in Fig. 4; the red circle in each picture marks the gaze point.

Fig. 4. Screenshots from videos recorded in the experiment; the red circle marks the gaze point.

Fig. 5. Tobii Pro Glasses 2, image from the Tobii official website.

4.2 Apparatus

This experiment used the Tobii Pro Glasses 2 eye tracker (www.tobii.com), a wearable eye tracker designed to capture natural viewing behavior in real-world environments while ensuring robust and accurate eye tracking, with a sampling rate of up to 100 Hz. It can also be combined with biometric devices for deeper insights into human behavior. The head unit films what the participant sees and records the ambient sound while the participant moves around, and a pocket-sized recording unit saves the gaze data onto an SD card. Figure 6 shows the author calibrating the headset for a participant in this test project.

Fig. 6. The author calibrating the mobile eye tracker for a participant in this project.

4.3 Problems and Solution Strategy

Interpretation of Eye Tracking Measures in Usability Testing. In past eye tracking usability studies, the main statistical parameters are: the number of fixation points, the gaze ratio (time ratio) of each region of interest, the average fixation dwell time, the number of fixation points in each region of interest, the average fixation dwell time in each region of interest, and the gaze rate [27]. The "number of fixation points" is the total number of fixation points and is considered an indicator of search performance; the "average fixation dwell time" is the mean dwell time of fixations over a period of time; and the "average fixation dwell time of each interest area" is the mean dwell time on a single object. A long fixation suggests that the subject has difficulty extracting information from the display interface: if a subject fixates longer on a particular display element, the information on that element is hard to extract or interpret. A large number of fixation points indicates a low-performance search and can expose poor layout of display elements. At the same time, large values of these measures also reflect the user's psychological impatience and conflict, which means that satisfaction begins to decrease [27].
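As a brief illustration of these parameters, the Python sketch below computes the total number of fixation points, the average fixation dwell time, and the per-AOI averages from a list of (AOI label, dwell time) pairs; the labels and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical fixation records: (AOI label, dwell time in seconds).
fixations = [("display", 0.4), ("queue ticket", 0.3), ("display", 0.6),
             ("form", 1.2), ("display", 0.5)]

num_fixations = len(fixations)                            # number of fixation points
avg_dwell = sum(t for _, t in fixations) / num_fixations  # average fixation dwell time

# Per-AOI averages: a long mean dwell time on one element suggests that
# the information it carries is hard to extract or interpret.
per_aoi = defaultdict(list)
for label, t in fixations:
    per_aoi[label].append(t)

for label, times in sorted(per_aoi.items()):
    print(f"{label}: {len(times)} fixations, mean dwell {sum(times)/len(times):.1f} s")
```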

However, as described above, since the data collected by a head-mounted eye tracker are recorded while the user moves, AOIs extracted from the raw coordinates of the gaze points may have no specific meaning. If we apply hotspot maps, focus maps, or gaze maps to the gaze points from a mobile eye tracker test, the results tend to cluster at the center of the frame, which has no practical significance for further analysis. Manually annotating the association between gaze points and regions of interest would obviously consume a great deal of time. To solve this problem, this article introduces the YOLO V3 algorithm, a convolutional neural network that can quickly identify the objects in a video frame and output their coordinate regions. By comparing these coordinates with the gaze points, it is possible to automatically determine whether the user's attention rests on an object, and then to compute statistics on the objects that receive the user's attention, as sketched below.
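The core comparison is a simple point-in-rectangle test. The sketch below assumes detections in the (x, y, w, h) pixel format typical of YOLO post-processing; the object labels and coordinates are hypothetical.

```python
def gaze_on_object(gaze_x: float, gaze_y: float, box) -> bool:
    """Return True if the gaze point lies inside the (x, y, w, h) box."""
    x, y, w, h = box
    return x <= gaze_x <= x + w and y <= gaze_y <= y + h

# Example: one gaze point checked against the detections of a single frame.
detections = [("display", (860, 120, 220, 140)), ("person", (400, 300, 180, 420))]
gx, gy = 930.0, 180.0
attended = [label for label, box in detections if gaze_on_object(gx, gy, box)]
print(attended)  # -> ['display']
```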

4.4 YOLO Network Architecture

YOLO (You Only Look Once) is an end-to-end convolutional neural network commonly used in object detection. Its main features are high speed and accuracy. Redmon et al. developed the YOLO algorithm in 2016 [5]; it is a regression-based object recognition method, and by 2018 it had developed to its third generation, YOLO V3 [28]. As its name suggests, it needs only a single forward pass to detect multiple objects, which makes the YOLO family extremely fast. YOLO V3 maintains the fast detection speed of YOLO V2 [29] while greatly improving recognition accuracy, especially for small targets: it draws on the idea of residual networks, introducing multiple residual modules and multi-scale prediction to overcome YOLO V2's weakness on small targets. Combining high detection accuracy with high speed, it is one of the best algorithms for object detection. The model stacks 3 × 3 and 1 × 1 convolutional layers, with residual structures feeding the multi-scale predictions; the backbone has 53 convolutional layers in total, so the authors call it Darknet-53. YOLO V3 adopts the anchor box idea [30] from Faster R-CNN: for the COCO and VOC datasets it predicts at three scales, with three anchor boxes per scale, and the finer feature maps use smaller prediction boxes. The network structure can also be modified according to the prediction scales required.
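As an illustration, the following sketch runs YOLO V3 on one frame using OpenCV's DNN module. It assumes the standard Darknet configuration, weight, and class-name files (yolov3.cfg, yolov3.weights, coco.names); this is one possible implementation, not the exact code used in our system.

```python
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
classes = open("coco.names").read().strip().split("\n")

def detect(frame, conf_thresh=0.5, nms_thresh=0.4):
    """Return [(class label, (x, y, w, h)), ...] for one BGR frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    boxes, confidences, class_ids = [], [], []
    for output in outputs:          # one output array per prediction scale
        for det in output:          # det = [cx, cy, bw, bh, obj, class scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            conf = float(scores[class_id])
            if conf > conf_thresh:
                cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(conf)
                class_ids.append(class_id)

    # Non-maximum suppression removes duplicate boxes for the same object.
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_thresh, nms_thresh)
    return [(classes[class_ids[i]], boxes[i]) for i in np.array(keep).flatten()]
```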

4.5 Processing

Our goal is to design a method, develop the related algorithms, and build a system that automatically counts the number of gaze points, fixation duration, average fixation duration, and gazed objects. The first step is to split each video record into sub-records according to the task sequence shown in Fig. 3. We then apply the YOLO V3 algorithm to each sub-record to obtain the bounding-box coordinates of every object detected in each frame (technically, a sequence of coordinates). Finally, we develop comparison algorithms for the different situations, generating statistical data automatically as output; a sketch of this loop is given below. The complete processing steps are shown in Fig. 7, and some frames processed by YOLO V3 from the videos recorded by the mobile eye tracker are shown in Fig. 8.
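The per-stage loop might look like the following sketch, which reuses detect() and gaze_on_object() from the sketches above and assumes one gaze sample per frame; the gaze data format and frame rate are assumptions.

```python
from collections import Counter, defaultdict

def analyze_subrecord(frames, gaze_points, fps=25):
    """frames: iterable of images for one task stage;
    gaze_points: one (gaze_x, gaze_y) per frame, or None if gaze was lost."""
    frame_time = 1.0 / fps
    dwell = defaultdict(float)   # accumulated attention time per object class
    hits = Counter()             # number of gaze samples landing on each class
    for frame, gaze in zip(frames, gaze_points):
        if gaze is None:
            continue
        for label, box in detect(frame):
            if gaze_on_object(*gaze, box):
                dwell[label] += frame_time
                hits[label] += 1
    return dwell, hits
```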

Fig. 7. Our analysis process for mobile eye tracking data. (Red: video and gaze point coordinate sequence. Yellow: manual segmentation of the video according to the process nodes, with the processing content classified to select the corresponding statistical algorithm. Green: the YOLO V3 algorithm is applied and statistical data are generated: number of gaze points, fixation duration, average fixation duration, gazed objects.)

Fig. 8. Frames processed by YOLO V3. From left to right, top to bottom: getting a number, filling in a form, finding the counter, at the counter.

As shown in the bottom-left picture of Fig. 8, after the video contents are annotated by the YOLO algorithm, some extra objects are identified. If these objects were passed to the counting step they would become interference data, so we pre-process the videos (the yellow stage in Fig. 7), mark the objects that need to be counted, and hand only those to the statistics program; a sketch is given below. According to the evaluation strategy above and the actual situation of the experimental process, the data collected and the evaluation criteria are shown in Table 1. The eye tracker collects the experimental data automatically; all values are averaged after the data are generated and are rounded to one decimal place.
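A minimal sketch of this filtering and averaging step follows; the whitelist entries are examples drawn from our scenario, and the function name is hypothetical.

```python
# Only objects on a manually chosen whitelist are passed to the statistics
# stage; averaged values are rounded to one decimal place, as in Table 1.
WHITELIST = {"display", "queue ticket", "mobile phone", "form", "Ad. Material"}

def filter_and_average(per_participant_dwell):
    """per_participant_dwell: list of {object label: dwell seconds} dicts,
    one dict per participant, e.g. the output of analyze_subrecord()."""
    totals, counts = {}, {}
    for record in per_participant_dwell:
        for label, seconds in record.items():
            if label not in WHITELIST:   # drop interference detections
                continue
            totals[label] = totals.get(label, 0.0) + seconds
            counts[label] = counts.get(label, 0) + 1
    return {label: round(totals[label] / counts[label], 1) for label in totals}
```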

Table 1. Data collection and evaluation criteria

Table 2 gives the duration and average number of gaze points at each stage of the business process. The data show higher numbers of gaze points in the "filling in the form" and "waiting" stages, indicating that these two steps produce significant congestion, which is one of the factors that reduce user satisfaction.

Table 2. Duration and average number of gaze points

The user’s attention at each stage of the business process and the corresponding average gaze duration are shown in Table 3. By analyzing these data, we can get some interesting conclusions:

  1. When the user is in the "waiting" phase, the attention time for the "mobile phone" rises suddenly, while at other stages it is almost zero. We can infer that the user is "busy" at the other stages and in a state of psychological tension. Adding consulting services at the various stages to help users complete their business as soon as possible would therefore effectively improve satisfaction.

  2. Comparing the attention values of the "display" and the "queue ticket" shows that users attend to both objects at the same time; the two are correlated in attention. The reason is that in the stages before "at counter", the queue number printed on the "queue ticket" has to be compared against the called number shown on the display. Reducing the attention time users spend on both of them is another design direction for improving satisfaction.

  3. The attention time for "Ad. Material" is unreasonably low, and it receives too little attention even in the "entry" stage. The visual design or placement of the advertising material should be adjusted to give it a higher profile.

Table 3. Object of interest and average fixation duration (s)

5 Conclusion

User experience assessment with a mobile eye tracker (especially in scenarios where the user can walk around) is more objective and more thoroughly logged than traditional methods such as user interviews, questionnaires, and observation. In addition, the evaluation dimensions are richer and fewer test samples are required. However, owing to the particularities of mobile eye tracking experiments, the problem of analyzing image data containing moving objects of interest must be solved. In this paper, the YOLO V3 object recognition algorithm is used to automatically recognize the coordinates of objects and compare them with the coordinates of the gaze points; with the processing method we present, valuable statistical data can be obtained quickly and automatically.

This article proposes a new method to evaluate user satisfaction with a service system. Although it is not yet statistically rigorous, it provides a new way to analyze data recorded by a mobile eye tracker. Following this idea, we could design more automated test methods: for example, placing specific visual patterns in the space where each stage is executed, so that the object recognition algorithm can recognize these patterns and the analysis system can automatically segment each task phase; or training the YOLO algorithm to recognize more specific object classes (for example, dividing "person" into "visitor" and "staff"), so that statistics marked with specific objects can be used for subsequent, more effective analysis.