An Evaluation Survey of Score Normalization in Multibiometric Systems

Multibiometric fusion has been an active research area for many years. Score normalization transforms the scores produced by different matchers into a common domain. In this paper, we survey classical score normalization techniques and recent advances in this research area. The performance of different normalization functions, namely MinMax, Tanh, Zscore, PL, LTL, RHE and FF, is evaluated on the XM2VTS benchmark. We evaluate the performance with four different measures of biometric systems: EER, AUC, GAR(FAR=0.001) and the threshold of EER. The experimental results show that there is no single normalization technique that performs best for all multibiometric recognition systems; PL and FF normalization outperform the other methods in many applications.


Introduction
Biometric recognition refers to the use of distinctive physiological or behavioral characteristics to automatically confirm the identity of a person. Multibiometrics, which combines information from several sources, is expected to improve the performance of biometric systems. Depending on the level of information that is fused, fusion schemes can be classified as sensor-level, feature-level, score-level and decision-level fusion [1]. Apart from the raw data and feature sets, the match scores contain the richest information about the input pattern, and it is relatively easy to obtain and combine the scores generated by biometric matchers. Consequently, score-level fusion is the most commonly used approach in multibiometric systems. However, scores generated by different matchers are often not homogeneous: they may not lie in the same numerical range and may follow different probability distributions. Therefore score normalization, which transforms these scores into a common domain before fusion, is needed. This paper gives an overview and comparison of score normalization methods in multimodal fusion.
The remainder of this paper is organized as follows. Section 2 introduces fusion in multimodal biometrics, including the ideal normalization function, the performance measures and the combination rules. Section 3 introduces several score normalization techniques, covering both classical methods and recent advances. To study the effectiveness of the different normalization techniques, Section 4 gives the experimental results. The last section summarizes the results of this work.

Fusion in multimodal biometrics
The Ideal Normalization Function. In this paper, a matching score coming from samples of the same individual is called a genuine score, while one coming from samples of different individuals is called an impostor score. Since scores from different recognition systems are not directly comparable, the normalization step tries to find a function that transforms the scores into a common domain and makes the scores of different matchers comparable. The ideal normalization function is the posterior probability function

ideal(s) = p(genuine|s) / (p(genuine|s) + p(impostor|s)), (1)

where p(genuine|s) and p(impostor|s) refer to the conditional densities of the matching score being that of a genuine user or an impostor. It is difficult to estimate the density of matching scores because they may not obey a known distribution model, so the ideal normalization function is not easy to implement, and different normalization techniques have been proposed in the literature to approximate it. A good normalization method should be robust and efficient [1]: robustness refers to insensitivity to the presence of outliers, and efficiency refers to the proximity of the obtained estimate to the optimal estimate when the distribution of the data is known.
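As a minimal sketch of the ideal posterior normalization, one can estimate the two class-conditional densities and form their ratio. The function and variable names below are ours, and Gaussian densities with equal class priors are an illustrative assumption; real score distributions are often non-Gaussian, which is precisely why the ideal function is hard to implement.

```python
import numpy as np

def ideal_normalization(s, genuine, impostor):
    """Approximate the ideal posterior normalization, assuming Gaussian
    score densities per class and equal class priors (both assumptions
    are illustrative, not part of the original formulation)."""
    def gauss_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    p_gen = gauss_pdf(s, np.mean(genuine), np.std(genuine))
    p_imp = gauss_pdf(s, np.mean(impostor), np.std(impostor))
    return p_gen / (p_gen + p_imp)

# Toy data: genuine scores centered at 0.8, impostor scores at 0.3.
rng = np.random.default_rng(0)
gen = rng.normal(0.8, 0.1, 1000)
imp = rng.normal(0.3, 0.1, 1000)
print(ideal_normalization(0.55, gen, imp))  # roughly 0.5 for this symmetric example
```

The output lies in [0, 1] and rises monotonically from the impostor region toward the genuine region, which is the behavior the practical normalization schemes below try to imitate.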
Performance Measures. Let t denote an acceptance threshold, so that users whose score is larger than t are assigned to the genuine class, while users whose score is not larger than t are assigned to the impostor class. The two errors, the False Rejection Rate (FRR) and the False Acceptance Rate (FAR), are defined as follows: FRR(t) is the fraction of genuine scores that do not exceed t, and FAR(t) is the fraction of impostor scores that exceed t. (2)
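These two error rates can be computed directly from empirical score sets; a minimal sketch (function name and toy data are ours):

```python
import numpy as np

def far_frr(genuine, impostor, t):
    """FRR: fraction of genuine scores that do not exceed threshold t.
       FAR: fraction of impostor scores that exceed t."""
    frr = np.mean(np.asarray(genuine) <= t)
    far = np.mean(np.asarray(impostor) > t)
    return far, frr

far, frr = far_frr([0.7, 0.8, 0.9], [0.2, 0.4, 0.75], t=0.6)
print(far, frr)  # one of three impostor scores exceeds t, no genuine rejected
```

Raising t trades FAR for FRR: a higher threshold rejects more impostors but also more genuine users.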
The Genuine Accept Rate (GAR) is the fraction of genuine scores exceeding the threshold t, so GAR = 1 - FRR. The most widely accepted method for evaluating the performance of a biometric system is the Receiver Operating Characteristic (ROC) curve, which plots the GAR (or FRR) against the FAR. The Equal Error Rate (EER) is the point of the ROC curve where the two errors, FAR and FRR, are equal; it is widely used in the biometric field to assess the performance of biometric systems. GAR at a fixed FAR (e.g. GAR(FAR=0.001)) is another measure widely used in biometric performance evaluation [1]. In ROC analysis the Area Under the Curve (AUC) [2] is also used to evaluate the performance of a two-class system, because it is a more discriminating measure than accuracy. In biometric recognition systems we therefore try to make the EER smaller and GAR(FAR=0.001) as well as the AUC larger.

Combination Rules. After normalizing the matching scores, we need to obtain a new score through a combination rule in order to make the final decision. Kittler et al. [3] proposed a general fusion framework and derived five basic fusion rules: Sum, Product, Max, Min and Median. Since the Sum rule works better in most applications [4], we use it to compute the final score in our experiments when evaluating the normalization techniques.
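The EER and AUC can both be estimated empirically from the genuine and impostor score sets. The sketch below (our own illustration, not the paper's evaluation code) sweeps every observed score as a threshold for the EER and uses the rank-statistic interpretation of the AUC:

```python
import numpy as np

def roc_points(genuine, impostor):
    """Sweep every observed score as a threshold; collect (FAR, FRR) pairs."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor > t).mean() for t in thresholds])
    frr = np.array([(genuine <= t).mean() for t in thresholds])
    return far, frr

def eer(genuine, impostor):
    """EER: point on the empirical ROC where FAR and FRR are closest."""
    far, frr = roc_points(genuine, impostor)
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

def auc(genuine, impostor):
    """AUC as the probability that a random genuine score exceeds a
    random impostor score (equivalent to the area under the ROC)."""
    g = np.asarray(genuine)[:, None]
    return (g > np.asarray(impostor)).mean()

rng = np.random.default_rng(1)
gen = rng.normal(0.7, 0.1, 500)  # toy genuine scores
imp = rng.normal(0.4, 0.1, 500)  # toy impostor scores
print(eer(gen, imp), auc(gen, imp))
```

For well-separated score distributions the EER approaches 0 and the AUC approaches 1, matching the direction of improvement stated above.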

Score normalization schemes
Several classical score normalization techniques such as MinMax, Tanh, Z-score, Median, Median/MAD and Decimal Scaling have been described in Ref. [1]. Among the classical techniques, Median/MAD and Decimal Scaling are neither robust nor efficient; we therefore choose MinMax, Tanh and Z-score for the experiments in Section 4. We then describe the progress of normalization techniques in recent years. In this section, let X, X_G and X_I denote the sets of raw matching scores, genuine scores and impostor scores of the training data, let s denote a new score from the same matcher, and let s′ denote the normalized score. Max, Min, Median, µ and σ are the maximum, minimum, median, mean and standard deviation values.

The Piecewise Linear (PL) [5] normalization technique transforms the scores into the range [0, 1]. It maps the raw scores to 0 below the overlap region of the two score distributions, to 1 above it, and linearly in between:

s′ = 0 if s ≤ min(X_G); s′ = (s − min(X_G)) / (max(X_I) − min(X_G)) if min(X_G) < s < max(X_I); s′ = 1 if s ≥ max(X_I). (3)

The Four Segment Piecewise-Linear (FSPL) [6] technique divides the regions of impostor and genuine scores into four segments and maps each segment with its own linear function. The scores between the two extremities of the overlap region are mapped by two separate linear functions, into [0, 1] to the left of a threshold t and into [1, 2] to its right:

s′ = 0 if s ≤ min(X_G); s′ = (s − min(X_G)) / (t − min(X_G)) if min(X_G) < s ≤ t; s′ = 1 + (s − t) / (max(X_I) − t) if t < s < max(X_I); s′ = 2 if s ≥ max(X_I), where min(X_G) < t < max(X_I). (4)

The Linear Tanh Linear (LTL) [6] normalization technique combines the advantages of the tanh estimator and PL normalization. It maps the non-overlapping region of impostor scores to the constant 0 and the non-overlapping region of genuine scores to the constant 1; the overlap region between min(X_G) and max(X_I) is mapped nonlinearly with the tanh estimator,

s′ = (1/2){tanh(0.01((s − µ)/σ)) + 1} for min(X_G) < s < max(X_I), (5)

where µ and σ are the mean and standard deviation estimates of the genuine scores.

RHE [7] is derived from the min-max normalization scheme. The idea behind RHE is based on the following observations: any kind of normalization causes some loss of information content, and multimodal biometric systems suffer mainly from 'low' genuine scores rather than 'high' impostor scores. The RHE normalization therefore rescales the scores using a high estimate derived from the genuine score statistics in place of the maximum used by min-max, so that low genuine scores are stretched.
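The classical schemes and PL can be sketched compactly as follows. This is a simplified illustration with our own function names and toy data: the tanh variant uses a plain mean and standard deviation in place of the Hampel estimators of the original tanh estimator, and FSPL/RHE are omitted for brevity.

```python
import numpy as np

def minmax(s, X):
    # Min-max: map the training score range [min(X), max(X)] onto [0, 1].
    return (s - X.min()) / (X.max() - X.min())

def zscore(s, X):
    # Z-score: zero mean, unit variance with respect to the training scores.
    return (s - X.mean()) / X.std()

def tanh_norm(s, X):
    # Tanh normalization, simplified: plain mean/std instead of the
    # Hampel estimators used by the original tanh estimator.
    return 0.5 * (np.tanh(0.01 * (s - X.mean()) / X.std()) + 1)

def pl(s, X_G, X_I):
    # Piecewise linear: 0 below the overlap region [min(X_G), max(X_I)],
    # 1 above it, and linear in between.
    lo, hi = X_G.min(), X_I.max()
    return np.clip((s - lo) / (hi - lo), 0.0, 1.0)

# Toy training scores: genuine higher than impostor, with some overlap.
X_G = np.array([0.5, 0.6, 0.7, 0.8, 0.9])
X_I = np.array([0.1, 0.2, 0.3, 0.55, 0.65])
X = np.concatenate([X_G, X_I])
print(minmax(0.9, X), pl(0.4, X_G, X_I), pl(0.9, X_G, X_I))  # → 1.0 0.0 1.0
```

Note that min-max and PL are sensitive to the extreme training scores they depend on, which is exactly the robustness concern raised in Section 2.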

Experimental Results
Database. The XM2VTS benchmark [9] database consists of five face matchers and three speech matchers and was partitioned into training and evaluation sets according to the Lausanne Protocol-1 (LP1). The LP1 benchmark includes two files, dev.label and eva.label; we use dev.label as training data and eva.label as test data. Our experiments are conducted on this match-score benchmark. We label the face matchers face-1, face-2, face-3, face-4 and face-5 and the speech matchers speech-1, speech-2 and speech-3, respectively.
Experimental Results. We conducted experiments to compare the 7 normalization methods: MinMax (MM), Tanh, Zscore, PL, LTL, RHE and FF. The EERs of all the matchers are given in Table 1. As shown in Table 1, among the face matchers, face-3 and face-5 give the best and worst performance respectively, and among the speech matchers the performance order is speech-1, speech-3 and speech-2. The experiments are conducted on 15 multimodal combinations. In each combination, the scores of the different matchers are first normalized and the Sum rule is used to obtain the final score; different thresholds are then set to compute the FRRs and FARs. Table 2 shows the EER of multimodal fusion for the 7 normalization methods. In order to evaluate the performance precisely, for each fusion we give each method a performance mark: the best method receives 7, followed by 6, 5, 4, 3, 2 and 1. If two methods perform equally, for example both are second best, they receive the same mark, (6+5)/2 = 5.5. Table 3 gives the performance marks of the different normalization techniques measured by EER. From Table 3 we find that the FF method shows the best performance, since its total mark is the largest, and we observe that PL and Zscore also perform well. To compare all the algorithms in multimodal biometric systems, Fig. 1 shows the EERs of the 7 normalization algorithms. The last column of Table 3 summarizes the performance from the EER aspect: FF, PL and Zscore give better performance than the other normalization methods. Fig. 2 and Fig. 3 show the corresponding comparisons for AUC and GAR(FAR=0.001).
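The performance-mark scheme described above is an average-rank assignment: 7 for the best method down to 1 for the worst, with tied methods sharing the average of the marks they span. A sketch (function name and the toy EER values are ours):

```python
def performance_marks(eers):
    """Assign marks n..1 by ascending EER (lower EER is better);
    tied methods share the average of the marks they would occupy."""
    n = len(eers)
    order = sorted(eers.items(), key=lambda kv: kv[1])
    marks, i = {}, 0
    while i < n:
        # Extend j over the run of methods tied with position i.
        j = i
        while j + 1 < n and order[j + 1][1] == order[i][1]:
            j += 1
        avg = sum(n - k for k in range(i, j + 1)) / (j - i + 1)
        for k in range(i, j + 1):
            marks[order[k][0]] = avg
        i = j + 1
    return marks

# With 4 methods the best mark is 4; PL and FF tie for best and share (4+3)/2.
print(performance_marks({"MM": 0.03, "PL": 0.01, "FF": 0.01, "Tanh": 0.05}))
```

Summing these marks over all 15 fusion combinations gives the totals reported in Table 3: the higher the total, the better the normalization technique.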

Table 4 shows the sum of the performance marks of the different normalization techniques based on EER, AUC and GAR(FAR=0.001). From the AUC aspect, the PL, RHE and Zscore techniques outperform the other normalization methods. From the GAR(FAR=0.001) aspect, the PL, LTL and FF normalization algorithms give better performance than the others. In order to verify the stability of the different normalization techniques, Fig. 4 shows the thresholds of the EERs (TE). From Fig. 4 we observe that the TE of FF normalization varies slowly and stays around 0.5; the TEs of the Tanh and LTL normalization techniques also vary slowly. Thus FF, LTL and Tanh show better performance than the other normalization methods in terms of the change of TE. In Section 3 we introduced LTL and FSPL as improvements of PL. In Ref. [6], LTL showed better performance than FSPL, and both LTL and FSPL outperformed the PL normalization method. However, our experimental results show that PL works better than LTL with respect to EER, AUC and GAR(FAR=0.001).

Conclusions
The experimental results suggest that there is no single normalization technique that performs best for all multibiometric recognition systems. Four measures, EER, AUC, GAR(FAR=0.001) and the threshold of EER, were selected to evaluate the different normalization techniques. The normalization function should be chosen according to the application: FF, PL and Zscore if EER is the performance measure; PL, LTL and FF if GAR at a fixed FAR is the performance measure; PL, Tanh and RHE if AUC is the performance measure; and FF, LTL and Tanh if we want the threshold of EER to remain stable. We conclude that PL and FF normalization work better than the other methods in many applications.