Multimedia Quality Integration Using Piecewise Function

This paper addresses the important problem of multimedia quality evaluation given the unimodal quality of the audio and the video. First, the quality integration model recommended in G.1070 is evaluated against experimental results. Theoretical analyses and empirical observations suggest that the constant coefficients used in the G.1070 model should actually be adjusted piecewise for different levels of audio and visual quality. A piecewise function is therefore proposed to perform multimedia quality integration across different levels of audio and visual quality. The performance gain observed in experimental results substantiates the effectiveness of the proposed model.


Introduction
The next-generation network (NGN) is expected to enable service providers to build high-performance networks for converged services. Rapid development of networking and multimedia technologies has led to a proliferation of multimedia services such as videophones, videoconferencing, video on demand (VoD), and IPTV. Since the end users of most multimedia applications over the NGN are human beings, the services should be provided with sufficient quality in terms of human perception. It is therefore important to properly evaluate users' perceptual quality of multimedia applications, i.e., quality of experience (QoE), in order to provide users with economical and comfortable services.
Research on understanding human perception of service quality has traditionally focused on individual modalities. Given that both audio and visual quality models are relatively mature [1]-[3], it seems logical to use them as primary inputs when the quality of multimedia services is concerned. Research on multimedia quality modeling then addresses how to integrate information from the two modalities of video and audio, how to describe the multimedia quality combination rules, and how to apply these rules to the unimodal inputs entering the multimedia quality model. A general framework of the multimedia quality model is shown in Fig. 1. In this framework, information about audio and visual quality is employed as the input of the multimedia combination function model, where specific rules apply to predict the subjective quality of the multimedia. Research has found that humans integrate multimodal information mainly following three algebraic rules [4]: adding, averaging, and multiplying. D. S. Hands further suggests that subjects integrate errors in audio and video following the multiplicative rule [5]. Using regression analysis, objective multimedia quality models have been proposed for multimedia quality assessment [5]-[10], whereas little attention has been paid to the variability of the model under different levels of the quality of the input audio and video.
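The three algebraic integration rules can be illustrated with a minimal sketch. The weights and constants below are illustrative placeholders, not values from the paper or from [4], [5]:

```python
# Sketch of the three algebraic integration rules for combining unimodal
# MOS scores. All weights/constants are hypothetical placeholders.

def integrate_adding(mos_a, mos_v, w_a=0.5, w_v=0.5):
    """Additive rule: weighted sum of the audio and video scores."""
    return w_a * mos_a + w_v * mos_v

def integrate_averaging(mos_a, mos_v):
    """Averaging rule: plain mean of the two modalities."""
    return (mos_a + mos_v) / 2.0

def integrate_multiplying(mos_a, mos_v, c0=1.0, c1=0.15):
    """Multiplicative rule: offset plus scaled product of the two scores."""
    return c0 + c1 * mos_a * mos_v
```

Note that the multiplicative rule lets a very poor modality drag down the overall score even when the other modality is excellent, which matches the intuition behind the multiplicative models discussed below.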
In this paper, we propose a model for multimedia quality integration using a piecewise function, which estimates the overall multimedia quality over different level intervals of the audio and visual quality. First, the multimedia quality model reported in G.1070 [9] is studied and analyzed. The analyses and observations show that the coefficients in G.1070 should change piecewise instead of remaining constant as in its present form. We therefore propose to perform multimedia quality integration using a piecewise function accommodating different quality levels of the input audio and video.
The paper is organized as follows. Section 2 introduces the source materials, test conditions and the subjective assessment method used for evaluation in this paper. In section 3, we employ the data obtained in the subjective experiments to analyze the multimedia quality integration function reported in G.1070. Section 4 describes the problem of the related coefficients being constant across different levels of audio and visual quality, proposes a piecewise function for multimedia quality integration based on this analysis, and verifies its performance in terms of the correlation coefficient and evaluation error between subjective and estimated qualities. The paper closes with conclusions in section 5.

Source Materials
Three video clips from the consumer digital video library [11] were used for all experiments; their content is shown in Fig. 2. Specifically, clip A shows a man giving an overview of the National Telecommunications and Information Administration (NTIA), clip B shows a woman describing her granddaughter's art work, and clip C shows a man introducing the U.S. spectrum chart. The source materials are about 15 seconds each, with 16-bit PCM mono audio sampled at 8 kHz, and video at 30 fps in QCIF frame size. For each test clip, the video quality was varied by using different quantization parameters (QP): QP = 3, 6, 9, 12, 15, 18, 21, 25, 27, and 31 were respectively used to encode the source clips, producing test sequences of different visual quality. Audio conditions were varied by bitrate and packet loss rate, as illustrated in Table 1.

Assessment Method
The absolute category rating (ACR) method [14] was used in the subjective tests. Viewers evaluated the quality using a slider device and a continuous grading scale marked "Excellent", "Good", "Fair", "Poor" and "Bad". The subjective scores are therefore quantized on a scale of 1 to 5. 20 subjects (10 female, 10 male) with normal vision and hearing participated in the tests. The monitor used in the subjective assessments was a 19" LCD screen operated at its native resolution of 1440 × 900 pixels. The viewing distance was not fixed, and was about 5 times the height of the video frame. The subjective tests consisted of one session of about 30 minutes,

including 5 minutes for rest. The tests were composed of three parts: the audiovisual quality test, the audio-only quality test, and the video-only quality test. In the audio-only test, the screen was set to gray; in the video-only test, no headphones were provided. Subjects were asked to rate the quality of the presentation in each case. The order of the clips was randomized for each subject. The subjective quality of each clip was represented as a mean opinion score (MOS) calculated by averaging the scores of the 20 subjects.
To evaluate the integration mechanism in multimedia quality perception, the perceived audio, visual and audiovisual qualities were obtained through extensive subjective evaluations, with the test conditions described in section 2. In this paper, the audiovisual (multimedia) quality is denoted MOS_AV, while the corresponding perceived audio and video qualities are denoted MOS_A and MOS_V, respectively. It has been accepted that subjects integrate audio and video following the multiplicative rule [5], which can be described as:

MOS_AV = c0 + c1 · (MOS_A · MOS_V)^ρ,    (2)

where ρ, c0 and c1 are constant parameters determined for a specific task.
Based on the above multiplicative rule, a more suitable function for multimedia quality integration was further derived through regression analysis [5] and adopted in G.1070:

MOS_AV = a1 · MOS_A + a2 · MOS_V + a3 · MOS_A · MOS_V + a4,    (3)
where the coefficients a1, a2, a3, and a4 depend on the display size of the video and the characteristics of the conversational task.
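As a minimal sketch, this G.1070-style bilinear combination can be evaluated as follows. The coefficient values used in the example call are hypothetical, since the real ones depend on the display size and task:

```python
def g1070_mos_av(mos_a, mos_v, a1, a2, a3, a4):
    """Bilinear audiovisual integration:
    MOS_AV = a1*MOS_A + a2*MOS_V + a3*MOS_A*MOS_V + a4,
    clamped to the 1-5 MOS scale."""
    mos = a1 * mos_a + a2 * mos_v + a3 * mos_a * mos_v + a4
    return max(1.0, min(5.0, mos))

# Example with placeholder coefficients (not from G.1070):
# g1070_mos_av(4.0, 4.0, 0.1, 0.15, 0.15, 0.6)  ->  4.0
```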

Empirical Analysis
In this section, the multiplicative rule and the model provided by G.1070 are evaluated using the MOSs obtained from the subjective tests, based on the models formulated in (2) and (3).

Piecewise Characteristics of Coefficients in G.1070
For clarity of analysis, the model in (3) is reformulated as:

MOS_AV = a5 · MOS_V + a6,    (4)

where a5 = a2 + a3 · MOS_A and a6 = a1 · MOS_A + a4, and as:

MOS_AV = a7 · MOS_A + a8,    (5)

where a7 = a1 + a3 · MOS_V and a8 = a2 · MOS_V + a4. The slope of the function in (5) equals a7, and the intercept equals a8.
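That (4) and (5) are mere regroupings of (3) can be checked numerically; in the sketch below the coefficient values are illustrative placeholders, not values from the paper:

```python
# Check that grouping (3) by MOS_V (eq. 4) or by MOS_A (eq. 5)
# reproduces the original bilinear model exactly.
a1, a2, a3, a4 = 0.15, 0.20, 0.10, 0.80  # hypothetical coefficients

def eq3(mos_a, mos_v):
    return a1*mos_a + a2*mos_v + a3*mos_a*mos_v + a4

def eq4(mos_a, mos_v):
    a5 = a2 + a3*mos_a          # slope w.r.t. MOS_V
    a6 = a1*mos_a + a4          # intercept w.r.t. MOS_V
    return a5*mos_v + a6

def eq5(mos_a, mos_v):
    a7 = a1 + a3*mos_v          # slope w.r.t. MOS_A
    a8 = a2*mos_v + a4          # intercept w.r.t. MOS_A
    return a7*mos_a + a8

for ma in (1.0, 2.5, 4.0):
    for mv in (1.0, 2.5, 4.0):
        assert abs(eq3(ma, mv) - eq4(ma, mv)) < 1e-12
        assert abs(eq3(ma, mv) - eq5(ma, mv)) < 1e-12
```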
In the original form of the G.1070 model, the coefficients a1, a2, a3, and a4 are fixed for a given task once the display size of the video is determined. In this case, a5 and a6 should change linearly with MOS_A, while a7 and a8 should change linearly with MOS_V. However, experimental results show that the linearity of a6 and a8 is quite questionable, as shown in Figs. 5 and 6. Generally speaking, when MOS_A < 2, the slope of a6 as a function of MOS_A is greater than when MOS_A > 2, and the intercept is comparatively smaller in Fig. 5. Accordingly, a1 should be greater when MOS_A < 2 than when MOS_A > 2, and a4 should be smaller in the same case. As shown in Fig. 6, the linearity of a8 likewise bends

when MOS_V reaches about 2. Similarly, when MOS_V < 2, the slope of the curve is greater than when MOS_V > 2, and the intercept is smaller. Therefore, a2 should be greater when MOS_V < 2 than when MOS_V > 2, while a4 should be smaller for MOS_V < 2. Based on the above analyses, the coefficients a1, a2, a3, and a4 should differ according to whether the audio or the visual quality lies in the interval (0, 2) or in the interval (2, 5). Considering the possible combinations, the coefficients in G.1070 have piecewise characteristics and should not be kept the same across different levels of audio or visual quality.
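The slope/intercept change around a breakpoint can be exposed by fitting separate regression lines below and above it. The sketch below uses synthetic placeholder samples of an intercept term versus MOS_V, not the paper's measured data:

```python
# Fit two line segments around a breakpoint at MOS_V = 2 to reveal the
# piecewise behaviour: a steeper slope and smaller intercept below the break.

def linfit(xs, ys):
    """Ordinary least-squares line fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic (MOS_V, a8) samples: steeper below the break, flatter above.
low  = [(1.0, 1.0), (1.5, 1.4), (2.0, 1.8)]
high = [(2.0, 1.8), (3.0, 2.1), (4.0, 2.4)]

s_lo, b_lo = linfit(*zip(*low))    # slope 0.8, intercept 0.2
s_hi, b_hi = linfit(*zip(*high))   # slope 0.3, intercept 1.2
assert s_lo > s_hi and b_lo < b_hi  # the pattern described in the analysis
```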

Multimedia Quality Integration Using Piecewise Function
To incorporate the piecewise characteristics of the coefficients into quality integration, a piecewise function is proposed to obtain the multimedia quality, where the coefficients are empirically obtained for different levels of the audio and the visual quality. That is,

MOS_AV = a1(k) · MOS_A + a2(k) · MOS_V + a3(k) · MOS_A · MOS_V + a4(k),    (6)

where the coefficient set k is selected according to whether MOS_A and MOS_V fall in the interval (0, 2) or (2, 5). To evaluate the performance of the proposed model, MOS_AV is estimated using the proposed method and the G.1070 model, respectively. The relationship between the estimated multimedia qualities and the actual subjectively measured MOS_AV is shown in Fig. 7. Compared with the multimedia model reported in G.1070, the proposed model provides a better fit to the actual data. Specifically, the Pearson correlation coefficient (PCC) and the root mean square error (RMSE) between the estimated qualities and the actual data are 0.99 and 0.12, respectively, whereas for the model recommended by G.1070 the PCC and RMSE are 0.98 and 0.15. Clearly, the proposed model achieves better accuracy.
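A minimal sketch of the proposed piecewise integration and of the PCC/RMSE evaluation follows. The coefficient values per region are illustrative placeholders; the paper obtains them empirically from the subjective data:

```python
import math

# One hypothetical coefficient set (a1, a2, a3, a4) per quadrant of
# (MOS_A, MOS_V), split at 2 as suggested by the analysis above.
COEFFS = {
    (False, False): (0.30, 0.35, 0.08, 0.40),  # MOS_A < 2, MOS_V < 2
    (False, True):  (0.28, 0.18, 0.10, 0.55),  # MOS_A < 2, MOS_V >= 2
    (True,  False): (0.15, 0.33, 0.10, 0.55),  # MOS_A >= 2, MOS_V < 2
    (True,  True):  (0.12, 0.15, 0.11, 0.70),  # MOS_A >= 2, MOS_V >= 2
}

def piecewise_mos_av(mos_a, mos_v):
    """Piecewise bilinear integration: pick the coefficient set by region."""
    a1, a2, a3, a4 = COEFFS[(mos_a >= 2, mos_v >= 2)]
    return a1*mos_a + a2*mos_v + a3*mos_a*mos_v + a4

def pcc(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx)**2 for x in xs) * sum((y - my)**2 for y in ys))
    return num / den

def rmse(xs, ys):
    """Root mean square error between estimated and subjective scores."""
    return math.sqrt(sum((x - y)**2 for x, y in zip(xs, ys)) / len(xs))
```

With measured MOS_AV values and the model's estimates in two lists, pcc and rmse reproduce the two accuracy figures used for the comparison with G.1070.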

Conclusions
This paper addresses how to accurately assess the perceived multimedia quality given the individual audio and visual quality. Extensive subjective tests have been performed to evaluate the model recommended in G.1070, where analyses reveal that the corresponding constant coefficients should actually be adapted to the different levels of the unimodal quality. Accordingly, we propose to formulate the multimodal quality integration using a piecewise function. The proposed model has been shown to be more accurate compared to the G.1070 model.