2D Sensor Based Design of a Dynamic Hand Gesture Interpretation System

Abstract. A complete 2D sensor based system for dynamic gesture interpretation is presented in this paper. A hand model is devised for this purpose, composed of the palm area and the fingertips. Multiple cues are integrated in a feature space, and segmentation is carried out in this space to output the hand model. The parameters of the hand model are estimated by mean shift mode estimation, which makes the model adaptive and robust. The model is validated in various experiments concerning difficult situations such as occlusion, varying illumination, and camouflage. Real time requirements are also met. The gesture interpretation approach addresses dynamic hand gestures. A sequence of fingertip locations is collected from the hand model. A tensor voting approach is used to smooth and reconstruct the trajectory. The final output is an encoding sequence of local trajectory directions, obtained by mean shift mode detection on the representation of the trajectory in Radon space. This module was tested and proved highly accurate.


Introduction
Human computer interfaces (HCIs) have gained steadily growing interest over the last decades. A wide range of applications make use of HCIs, for example remote control [1], virtual desktops [2], robot control [3], and many others. Hand gestures are a powerful means of nonverbal communication among humans and one of the most widely used means of interaction between humans and computing systems. There are two types of hand gesture based HCIs: one focuses on interpreting static hand poses and the other is based on dynamic hand gestures. Our proposal refers to dynamic gesture HCIs and their applications. As we will see in the following, our gesture recognition algorithm also exploits the robustness of dynamic inputs (relative to static gesture interpretation). Recently, various proposals [4,5,6] use 3D sensors for hand gesture recognition. However, there are numerous applications where the use of such sensors is not possible. Moreover, a working 2D sensor based hand gesture recognition system is a cheaper solution than one using 3D sensors. The proposed system is intended for indoor applications that use 2D cameras. Even if the extension of the approach to a 3D sensor is trivial, we intend here to prove that an efficient design is possible using only a 2D sensor (for the sake of a cheaper hardware implementation).

The first stage in an HCI system is to construct an appropriate hand model. Secondly, the gesture recognition algorithm is designed according to the hand model used. The gesture recognition stage has the task of extracting and labeling groups of features obtained from the hand model. There are two main directions concerning the design of a hand model: appearance based and model based. Appearance based methods search for a mapping between extracted features and different hand postures. Generally, appearance based methods [7] are best suited to detecting hand poses from a limited vocabulary. Model based approaches [8,9] fit a synthetic hand model to the observed hand. The main downside of this approach is its high computational cost. Our system proposes a simpler, model based design. We find that for dynamic gesture interpretation it is sufficient to have the hand modeled by the fingertips and the palm area. One can argue that detecting fingertips alone could suffice for such cases. We added the palm area to the model in order to increase the robustness by imposing a spatial constraint between the two components. Adequate features have to be extracted for the detection of the model components. A very common and robust approach involves the use of multiple cues. Stenger [9], for example, successfully uses color and shape as cues for hand detection in a model based framework. Our approach uses skin tone, shape, edges and motion information as cues for hand modeling.

As for the gesture interpretation part, our approach focuses on trajectory segmentation. The trajectory is represented by the temporal collection of extracted fingertips. Some of the previous proposals in this direction use the condensation algorithm [10] or clustering of hidden Markov states [11] to segment the trajectory. However, these approaches are not designed specifically for dynamic hand gestures, and some of the problems encountered concern pause detection. Our model gives us the ability to define a small set of static gestures which are used to treat pauses between consecutive gestural inputs. We aim to design a suitable trajectory segmentation approach, resulting in an encoding sequence. Different actions to be performed by the computer can be assigned to such a sequence. Various approaches that treat dynamic hand gestures use hidden Markov models (HMMs) [12,13], dynamic Bayesian networks [14] or the more recent motion divergence fields [15,16]. Approaches using HMMs must trade off training effort against classification rate. The proposals involving dynamic Bayesian networks are computationally inefficient when training the model. The motion divergence field approach relies on optical flow computation, which can prove not robust enough when the scene includes other moving or skin tone colored objects. Also, the method needs a predefined database for a matching stage of the processing chain. The efficiency of the approach depends on the database, and adding a new gesture to be recognized requires a substantial update of this database. We overcome these problems by designing a new model of the hand based on simple geometric features, together with an online and versatile gesture recognition algorithm that allows constructing a large set of dynamic gestures. Another novelty and advantage of our proposal is that the interpretation system is not limited to a predefined set of gestures. Any new trajectory is easily labeled and can be straightforwardly incorporated in the human-computer communication protocol.

The rest of the paper is structured as follows. The next three sections detail our design. First an overview of the architecture is presented; then the composing blocks are detailed: the hand model extraction block, followed by the gesture recognition algorithm we have designed. We continue with a section dedicated to experiments that validate the design block by block. The final section is dedicated to some concluding remarks.

System Description
Overview. The system architecture is presented in Fig. 1. The general framework is represented in Fig. 1a. A region of interest (ROI) strategy is adopted in order to reduce the computational load. The ROI parameters are estimated in the hand model detection stage. The input for the gesture recognition algorithm consists of a sparse temporal collection of fingertip locations, also obtained from the hand model.

Fig. 1b details the hand model detection stage. First the cues are extracted. Our cues are line segments extracted from binary maps in horizontal/vertical scans. We use both horizontal and vertical scans in order to achieve hand orientation invariance. Only the line segments from one of the two scans are used, namely the scan that yields the larger number of valid segments. The binary maps involved are the foreground map, the edge map and the skin tone map, thus resulting in three types of cues. An important fact is that the same chain of Fig. 1b applies to both the fingers and the palm area; the only difference is in scale. The entire set of extracted line segments forms the sparse feature space. We use relaxed detectors in order to obtain all valid line segments. The outliers are filtered out in the next stage. Here we apply mean shift mode detection [17] to estimate the length of the line segments. This estimate is used to filter the feature space of the next frame. Another filtering stage consists in imposing relative location constraints between finger and palm related line segments. For example, if the hand is vertical, only finger line segments that lie directly above palm line segments are allowed. The segmentation process searches for groups of vertically adjacent line segments, in a top to bottom scan. Having segmented the palm and fingers, the hand model is complete.

Interdisciplinary Research in Engineering: Steps towards Breakthrough Innovation for Sustainable Development

Fig. 1c presents the gesture recognition algorithm. One active finger is used to input locally linear trajectories, composed of its fingertip locations. Next, the sparse set of trajectory points is filtered by a tensor voting technique [18] to obtain a continuous trajectory. The segmentation is carried out in the Radon transformed space of the trajectory, by local maxima detection. The maxima in this space correspond to local orientations of the trajectory. Obtaining these points of local maximum implicitly gives the final encoding sequence. We code the trajectory by the succession of local orientations of the composing segments (e.g. E-NW-E). Each encoding sequence determines exactly one dynamic gesture, and a computer action can be attached to it (e.g. scroll, rotate, and so on).
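The encoding of a trajectory as a succession of local orientations (e.g. E-NW-E) can be illustrated with a minimal Python sketch. It is not the Radon-space mode detection used in the paper: it assumes the trajectory has already been reduced to a polyline of vertices and simply quantizes each displacement into one of 8 compass directions. The function names, the image-coordinate convention (y growing downwards) and the example action table are assumptions made for illustration only.

```python
import math

# Compass labels indexed by 45-degree sector, counter-clockwise from East.
DIRECTIONS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]

def quantize_direction(dx, dy):
    """Map a displacement to one of 8 compass labels.

    Assumes image coordinates (y grows downwards), so dy is negated
    to make North point up on screen.
    """
    angle = math.degrees(math.atan2(-dy, dx)) % 360.0
    sector = int((angle + 22.5) // 45) % 8
    return DIRECTIONS[sector]

def encode_trajectory(vertices):
    """Encode a polyline given as (x, y) vertices, e.g. as 'E-NW-E'."""
    labels = []
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        if (x0, y0) == (x1, y1):
            continue  # skip zero-length steps
        label = quantize_direction(x1 - x0, y1 - y0)
        if not labels or labels[-1] != label:
            labels.append(label)  # merge consecutive identical directions
    return "-".join(labels)

# Hypothetical action table: each encoding sequence maps to one action.
ACTIONS = {"E-NW-E": "scroll", "S-E": "rotate"}

code = encode_trajectory([(0, 100), (50, 100), (20, 70), (70, 70)])
print(code, "->", ACTIONS.get(code, "unassigned"))
```

Because the action table is an ordinary dictionary keyed by the encoding sequence, a new gesture can be added by registering a new key, which mirrors the claim that the system is not limited to a predefined gesture set.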

Hand Model Extraction
Cue Detection. As mentioned before, we obtain the three types of cues from binary maps. Cue detection is a simple horizontal/vertical scan that identifies the adequate line segments on the binary maps. A line segment is considered valid only if its length lies in an interval that depends on the previously estimated length; in our implementation this interval is [0.5, 1.5] × the estimated length. Handling binary maps provides an important reduction in computational time.

The binary maps involved are the foreground map, the skin tone map and the edge map. The approaches chosen for this purpose are simple, fast and relaxed. The foreground map is obtained with the approach proposed by Kim et al. [19]. The approach is proven to be real time, robust and capable of updating the background model. The latter is important for applications that run for long periods of time. The skin map is obtained with a threshold based approach [20]. Skin segmentation is conducted in the RGB space. The thresholds are trained on an extensive database of skin tone image samples, which meets our requirement of a relaxed detector. Also, using previously trained thresholds significantly increases the computational speed. Finally, for edge map extraction we use the Canny edge detector. For a more stable detection, edge extraction is confined to skin colored foreground areas within the ROI. Furthermore, static edges are cleared out by frame differencing, to avoid confusion from background clutter.
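The scan-based cue detection can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, a line segment is taken to be a maximal run of nonzero pixels along a scan line, and the same routine would be applied to each of the three binary maps.

```python
import numpy as np

def extract_runs(binary_map, est_length=None, lo=0.5, hi=1.5):
    """Return horizontal runs of foreground pixels as (row, start_col, length).

    If est_length is given, keep only runs whose length lies in
    [lo * est_length, hi * est_length].
    """
    runs = []
    for r, row in enumerate(np.asarray(binary_map, dtype=bool)):
        # Pad with zeros so every run has a rising and a falling edge.
        padded = np.concatenate(([0], row.astype(np.int8), [0]))
        diff = np.diff(padded)
        starts = np.flatnonzero(diff == 1)
        ends = np.flatnonzero(diff == -1)  # exclusive end index
        for s, e in zip(starts, ends):
            length = int(e - s)
            if est_length is None or lo * est_length <= length <= hi * est_length:
                runs.append((r, int(s), length))
    return runs

def extract_cues(binary_map, est_length=None):
    """Scan both ways and keep the scan that yields more valid runs.

    For the vertical scan the map is transposed, so its runs are
    reported as (col, start_row, length).
    """
    horizontal = extract_runs(binary_map, est_length)
    vertical = extract_runs(np.asarray(binary_map).T, est_length)
    if len(horizontal) >= len(vertical):
        return "horizontal", horizontal
    return "vertical", vertical

demo = np.array([[0, 1, 1, 1, 0, 1],
                 [0, 0, 0, 0, 0, 0],
                 [1, 1, 1, 1, 0, 0]])
print(extract_cues(demo, est_length=3))
```

With an estimated length of 3 pixels the admissible interval is [1.5, 4.5], so the single-pixel run in the first row is rejected as an outlier while the runs of length 3 and 4 are kept.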
The union of all the extracted cues (foreground, skin tone and edge line segments) forms the feature space FS.

Feature Space Filtering. An important stage is feature space filtering. Since the feature extraction uses relaxed detectors, it is necessary to eliminate the outliers. For this purpose we estimate the segment length by employing a mean shift type robust estimator [17]. The estimated length is given by the dominant mode in the feature space. We use the Epanechnikov kernel, which in its standard form reads:

$$K_E(x) = \begin{cases} c\,(1 - \|x\|^2), & \|x\| \le 1 \\ 0, & \text{otherwise} \end{cases}$$

where $c$ is a normalization constant. The corresponding kernel profile is $k_E(u) = 1 - u$ for $0 \le u \le 1$, so that $g_E(u) = -k_E'(u) = 1$ on the same interval and 0 elsewhere. The estimated length is obtained by iterating the following equation to convergence:

$$\hat{l}^{(j+1)} = \frac{\sum_i l_i\, h(l_i)\, g_E\!\left(\left|\frac{\hat{l}^{(j)} - l_i}{s_j}\right|^2\right)}{\sum_i h(l_i)\, g_E\!\left(\left|\frac{\hat{l}^{(j)} - l_i}{s_j}\right|^2\right)} \qquad (5)$$

where $h$ is the histogram of lengths of the detected line segments, $g_E$ is the derivative of the Epanechnikov kernel, and $s_j$ is the mean shift algorithm scale (usually equal to 1). Since $g_E$ is constant inside the kernel support, it follows that each iteration is a histogram-weighted mean over the current window:

$$\hat{l}^{(j+1)} = \frac{\sum_{|l_i - \hat{l}^{(j)}| \le s_j} l_i\, h(l_i)}{\sum_{|l_i - \hat{l}^{(j)}| \le s_j} h(l_i)}$$

ROI Construction. The estimated length corresponds to the finger/palm width. This parameter varies with the user's position and gesturing relative to the camera. The estimate also influences the ROI. In our implementation we consider a square ROI centered on the extracted palm center, with a side of about 3 times the palm width. The ROI is updated whenever a valid hand model is detected. When there is no user interacting with the system, the ROI is the entire image.

The set of connected primitives represents the segmented finger. The segmented primitive set is then removed from the list and the same procedure is iterated to find the remaining fingers. The algorithm ends when the list is empty or only a small number of primitives are left. Multiple primitives can be present in the list at the same location, owing to the multiple cue character of the design. The right primitive for the segmentation is chosen with respect to angle conservation relative to the previous one or to the starting marker.
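The length estimation and the ROI construction described in this section can be sketched as follows. The sketch relies on a known property of the Epanechnikov kernel: its profile derivative is constant inside the window, so each mean shift step reduces to the mean of the samples that fall inside the current window. The function names, the `bandwidth` parameter (standing in for the scale), the starting value (assumed to be the previous frame's estimate) and the clipping of the ROI to the image bounds are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def mean_shift_mode(lengths, start, bandwidth, tol=1e-3, max_iter=100):
    """Seek a local mode of 1D samples with an Epanechnikov kernel.

    Working on raw lengths is equivalent to weighting distinct
    length values by their histogram counts.
    """
    samples = np.asarray(lengths, dtype=float)
    x = float(start)
    for _ in range(max_iter):
        window = samples[np.abs(samples - x) <= bandwidth]
        if window.size == 0:
            break  # no support around x; keep the current estimate
        new_x = float(window.mean())
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

def square_roi(palm_center, palm_width, image_shape, factor=3.0):
    """Square ROI of side factor * palm_width centred on the palm.

    Returns (x0, y0, x1, y1), clipped to the image bounds.
    """
    cx, cy = palm_center
    half = factor * palm_width / 2.0
    height, width = image_shape[:2]
    x0 = max(0, int(round(cx - half)))
    y0 = max(0, int(round(cy - half)))
    x1 = min(width, int(round(cx + half)))
    y1 = min(height, int(round(cy + half)))
    return x0, y0, x1, y1

# Segment lengths: a dominant cluster near 10 plus a few outliers.
lengths = [10, 11, 10, 9, 10, 11, 30, 31, 2, 3]
width_estimate = mean_shift_mode(lengths, start=12, bandwidth=3)
roi = square_roi((100, 80), 40, (240, 320))
print(width_estimate, roi)
```

The outliers at 2, 3, 30 and 31 never enter the window, which is the robustness property that motivates using a mode estimator instead of a plain average.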

Gesture Interpretation
Using one active finger, a dynamic trajectory composed of the fingertip locations can be collected. We address the case of locally linear trajectories. The encoding sequence is represented by the local directions of the trajectory. We find that a total of 8 directions (N, S, E, W, SW, SE, NW, NE) suffices and allows the design of a large set of meaningful gestures. The number of composing trajectory segments is not limited; however, in our experiments we used a succession of up to 4 oriented segments to compose the trajectory.

Trajectory Filtering. The first step in recognizing the gesture is to obtain a smooth continuous trajectory from the sparse collection. For this purpose we use the tensor voting technique [18]. Its advantage comes from its perceptually based design, which is able to naturally reconstruct interrupted lines. Trajectory points are encoded by the following tensor, carrying information on their directional character:

$$T = (\lambda_1 - \lambda_2)\, e_1 e_1^T + \lambda_2\, (e_1 e_1^T + e_2 e_2^T)$$

where $\lambda_i$, $e_i$, $i = 1, 2$ (with $\lambda_1 \ge \lambda_2$) represent the eigenvalues and eigenvectors obtained by a principal component analysis (PCA) on the locations of the neighboring trajectory points. The first term of the tensor encodes the linear saliency of the neighborhood and the second term corresponds to the isotropic saliency. In the next step the tensor is accumulated in the neighboring trajectory points to reinforce the local directionality. Then it is propagated and accumulated in the missing locations of the trajectory. The accumulated tensor at a location is the sum of the tensor votes received from its neighbors.

Gesture Interpretation. The final output of the system is the succession of composing segment directions. For this purpose, the trajectory is transformed and represented in Radon space, which in its standard form is:

$$R(\rho, \theta) = \iint f(x, y)\, \delta(\rho - x\cos\theta - y\sin\theta)\, dx\, dy$$

where $f(x, y)$ denotes the image of the filtered trajectory.

[...] searching mode. Note that our implementation is not optimized and that the measurements were conducted on a laptop (Intel Core2 Duo CPU at 2.5 GHz, 2 GB RAM, NVIDIA GeForce 8600M GT with 1 GB VRAM) using its built-in web camera.
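The PCA-based tensor encoding of a trajectory point described under Trajectory Filtering can be illustrated with a short NumPy sketch. It assumes the usual 2D tensor voting decomposition into a linear (stick) term weighted by λ1 − λ2 and an isotropic (ball) term weighted by λ2; the function name and the choice of the neighborhood are hypothetical, and the voting and propagation steps are not covered.

```python
import numpy as np

def encode_point_tensor(neighbors):
    """Build the 2x2 tensor of a trajectory point from its neighbors via PCA.

    Returns (tensor, stick_saliency, ball_saliency, direction), where
    direction is the unit eigenvector of the largest eigenvalue.
    """
    pts = np.asarray(neighbors, dtype=float)
    centered = pts - pts.mean(axis=0)
    cov = centered.T @ centered / len(pts)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    lam2, lam1 = eigvals                    # lam1 >= lam2
    e2, e1 = eigvecs[:, 0], eigvecs[:, 1]
    stick = (lam1 - lam2) * np.outer(e1, e1)                  # linear saliency term
    ball = lam2 * (np.outer(e1, e1) + np.outer(e2, e2))       # isotropic term
    return stick + ball, lam1 - lam2, lam2, e1

# Neighbors lying exactly on the diagonal y = x.
tensor, stick_sal, ball_sal, direction = encode_point_tensor(
    [(0, 0), (1, 1), (2, 2), (3, 3)])
print(stick_sal, ball_sal, direction)
```

For perfectly collinear neighbors the isotropic saliency vanishes and the whole tensor is carried by the stick term, whereas a scattered neighborhood would yield comparable eigenvalues and a dominant ball term.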

Gesture recognition accuracy.
In order to have a fair assessment of the system's accuracy, we need an idea of the accuracy of the user's input. For this purpose, human inputs of two simple lines oriented at 0° and 45° were collected from several subjects. Using our algorithm we estimated the angle, the standard deviation and the error interval. The results are presented in Table 1. These results also motivated the use of only 8 encoding directions, as we see that human input is not that accurate. For comparison, synthetically perturbed trajectories were fed to the dynamic gesture recognition block. Fig. 7 shows a plot of the mean estimation error and the standard deviation. Note that the detection error is very low and that the human input error is higher, which indicates that the gesture interpretation block is accurate enough to be used in an HCI application.

A final validation test consisted in measuring the recognition rate. Since we had already assessed the human input accuracy, the recognition rate was computed on synthetically generated trajectories in two cases. The first case considered severe perturbations of the test trajectories, higher than the measured human input error, and the recognition rate was 96%. In the second case, we tested the recognition on mildly perturbed trajectories, within the range of human input accuracy, and obtained a 100% recognition rate. These tests show that, owing to its simplicity, the system performs at its best once human input error is eliminated. Some typical results of the gesture interpretation are presented in Fig. 8; all gestures started in the upper left corner.
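The kind of test on synthetically perturbed trajectories described above can be mimicked with a small NumPy experiment. This is a sketch under explicit assumptions: the orientation is estimated by PCA on the point set instead of the Radon-space mode detection used in the paper, the perturbation is isotropic Gaussian noise, and the function names, noise level and number of points are hypothetical. No result of the paper is reproduced here.

```python
import numpy as np

def estimate_angle(points):
    """Estimate the orientation, in degrees within [0, 180), of a point set."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    _, eigvecs = np.linalg.eigh(centered.T @ centered)
    dx, dy = eigvecs[:, -1]  # principal direction (largest eigenvalue)
    return float(np.degrees(np.arctan2(dy, dx)) % 180.0)

def perturbed_line(angle_deg, n=50, length=100.0, noise=2.0, seed=0):
    """Sample n points on a line of given orientation and add Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, length, n)
    theta = np.radians(angle_deg)
    line = np.stack([t * np.cos(theta), t * np.sin(theta)], axis=1)
    return line + rng.normal(scale=noise, size=line.shape)

def angular_error(estimate, truth):
    """Orientation error in degrees, accounting for the 180-degree wrap."""
    d = abs(estimate - truth) % 180.0
    return min(d, 180.0 - d)

for truth in (0.0, 45.0):
    errors = [angular_error(estimate_angle(perturbed_line(truth, seed=s)), truth)
              for s in range(20)]
    print(truth, np.mean(errors), np.std(errors))
```

With 8 encoding directions each label covers a sector of 45°, so an orientation error below 22.5° leaves the assigned direction unchanged; this is the margin against which both human input error and estimation error can be compared.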

Conclusion and Discussion
We have presented a new complete design for gesture recognition to be used in HCI applications.
Our system is intended mainly for dynamic gestures. It is composed of a hand model extraction block, followed by gesture interpretation. The hand model is obtained by detecting the fingertips and the palm area, which need to be spatially correlated. Each of the two model components is obtained by segmentation in a sparse feature space. The feature space is constructed from the union of multiple cues. Our approach uses foreground, skin and edge cues in the form of binary maps to reduce the computational load. The parameters are estimated by means of robust estimators, making the system very adaptive. A series of tests validates the hand model in challenging situations such as occlusions, camouflage and varying illumination. The system is shown to meet real time constraints.
As for the gesture interpretation approach, it takes as input a collection of fingertip locations obtained from the hand model. The output is a sequence that encodes the trajectory by its directions. The application refers to locally linear trajectories. First a smooth trajectory is constructed by tensor voting filtering. The succession of directions is then obtained by mode detection in Radon space. Using 8 possible directions without imposing limitations on the trajectory length makes it possible to devise a large dictionary of meaningful gestures. We should mention that pauses between consecutive gestures can be marked by a static gesture. In fact, the hand model allows a limited number of such gestures, by counting active fingers (e.g. 2 active fingers, see Fig. 5). We also note that the measured response time of the gesture interpretation block is sufficiently low (for gestures of up to 4-5 segments) not to encumber the user's interaction with the system.

Finally, one can question the relevance of this work, especially nowadays, when 3D sensors are more and more available. These sensors could spare much of the trouble that 2D sensors create. The answer is simply that a 2D sensor offers a cheaper solution and, moreover, that there are numerous applications that do not need the elaborate HCI that 3D sensors could offer. Still, the framework presented here is versatile and a 3D sensor can easily be incorporated. In that case foreground detection becomes obsolete and the related cue can be replaced with depth maps, for example.

In the definition of the feature space FS, $FGC_i$ denotes the set of foreground map cues, $SKC_i$ the set of skin map cues and $EGC_i$ the set of edge map cues. An example of these cue sets is presented in Fig. 2 (gray level).

Fig. 2. Examples of cue sets in gray level, from left to right:

Segmentation. For the segmentation we consider the morphological operation of reconstruction by geodesic dilation. The first marker is the middle point of the upmost extracted line segment. By geodesic dilation we detect and group downwards adjacent neighbors, thus obtaining a segmented finger and the fingertip location. If multiple fingers are active we eliminate from the feature space previously segmented line segments and iterate the above operations until all fingers are obtained (empty feature space). For palm area segmentation only one pass is needed and the palm center is retained. Examples of segmented fingers, palm area and the final hand model are presented in Fig. 3.
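The iterative finger segmentation can be approximated by a simple grouping of vertically adjacent line segments. The sketch below is a stand-in for morphological reconstruction by geodesic dilation, not an implementation of it: it starts from the topmost remaining segment, follows horizontally overlapping segments downwards row by row, removes the group from the list and repeats. The function name, the `min_runs` threshold and the (row, start_col, length) segment format are assumptions, and the choice among several candidate primitives is reduced to taking the first overlapping one.

```python
def segment_fingers(runs, min_runs=3):
    """Group vertically adjacent runs, top to bottom, into finger candidates.

    runs: iterable of (row, start_col, length) horizontal line segments.
    Returns a list of (fingertip, group) pairs, where fingertip is the
    (row, col) midpoint of the topmost run of the group.
    """
    remaining = sorted(runs)          # topmost (smallest row) first
    fingers = []
    while remaining:
        current = remaining.pop(0)    # marker: uppermost remaining run
        group = [current]
        while True:
            row, start, length = current
            # Runs on the next row that overlap the current one horizontally.
            below = [r for r in remaining
                     if r[0] == row + 1
                     and r[1] < start + length
                     and start < r[1] + r[2]]
            if not below:
                break
            current = below[0]
            remaining.remove(current)
            group.append(current)
        if len(group) >= min_runs:    # short groups are treated as outliers
            top_row, top_start, top_len = group[0]
            fingers.append(((top_row, top_start + top_len // 2), group))
    return fingers

runs = [(0, 2, 3), (1, 2, 3), (2, 2, 3), (3, 2, 3),       # first finger
        (1, 10, 3), (2, 10, 3), (3, 10, 3), (4, 10, 3),   # second finger
        (7, 20, 3)]                                        # isolated outlier
tips = [tip for tip, _ in segment_fingers(runs)]
print(tips)
```

Each pass consumes one finger, so the loop naturally terminates when the feature space is empty, in line with the stopping condition stated in the text.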

Fig. 3. (from left to right) Examples of segmented fingers, segmented palm, and final hand model.