Session Segmentation Method Based on Naïve Bayes Model

Session segmentation contributes to deeper analysis of users' search behavior and serves as a foundation for other retrieval research built on users' complex search behaviors. This paper proposes a session boundary discrimination model that uses time interval and query likelihood on the basis of the Naïve Bayes Model. Compared with a previous study, the proposed model shows a marked improvement in experiments on three measures: recall, precision, and F value. Owing to its advantage in session boundary discrimination, the model can serve as a tool in fields such as personalized information retrieval, query suggestion, search activity analysis, and other fields related to improving search results.


Introduction
To make search engines provide users with personalized service, we need to understand users' search habits and discover their interests. Users' query words are an important way to understand their interests, but they are complicated. If the classification of query words were well understood, it would greatly help interest mining. Classifying queries requires a criterion; so far the most widely recognized criterion has been session segmentation. Previous studies define a session in different ways. We agree with Zhang Lei that a session is a group of sequences of activities related to each other not only through an evolving information need at a deeper, conceptual level but also through close proximity in time. We group these activities together and refer to them as a session. Previous documents [1,2,3,4,5,6,7,8] all use the time interval (TI) attribute to achieve session segmentation. Other attributes such as query likelihood (QL) and anchor likelihood (AL) are also used to divide sessions. Generally speaking, attribute-based session segmentation methods fall into two categories: the time interval method and the multiple attribute method.
Time interval session segmentation method. Time interval is recognized as a significant attribute in session segmentation [1,2,3,5,6,7]. Daqing He described and discussed research based on two web logs, Excite and Reuter, with a view to dividing sessions. That work used only one attribute, time interval (TI), and tried different TI values to see how the number of queries in a session changes as TI varies. The conclusion was that when TI is between 10 and 15 minutes, the number of queries in a session tends to be stable. It is therefore reasonable to conclude that 10 to 15 minutes is the critical interval for session segmentation.
Multiple attributes session segmentation method.
Documents [3,4,5,6] were written by the same group of authors and give the same definition of a session. These papers process the query log in the same way and define a session as all the interaction between a user and the search engine. They then identify topics that are contextually unrelated; this work is called Topic Shift Identification. Their emphasis is on how to divide sessions using two attributes in the query log. Their method discretizes the time interval into 7 ranks: the range from 0 to 30 minutes is divided evenly into 6 parts, and intervals equal to or greater than 30 minutes constitute the seventh rank. The relationship between the current query (Qc) and the previous query (Qp) is referred to as the search pattern (SP). SP is likewise divided evenly into 7 ranks according to the semantic likelihood between the two queries. The 7 kinds of time intervals and 7 kinds of search patterns make up 49 combinations, and any pair of successive queries belongs to one of them. To identify topics on such a data set, they use an artificial neural network [4], multiple linear regression [5], and Monte Carlo simulation [6]. A common feature of these methods is that they perform better at continuation identification than at shift identification. The three methods in [4,5,6] have serious defects in topic shift identification, as their authors have repeatedly stressed. Two reasons account for this: (1) the time interval is not used properly, since 30 minutes serves as the maximum time interval without sufficient evidence; (2) the search patterns used are not appropriate, because their granularity is too coarse.
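The 7-rank discretization of the time interval described above can be illustrated with a short Python sketch. The function names `time_interval_rank` and `combination_index` are hypothetical, and two details are assumptions that the cited papers may handle differently: each of the six equal parts is taken to be a half-open 5-minute bracket, and ranks are numbered from 1 to 7.

```python
def time_interval_rank(minutes: float) -> int:
    """Map a time interval in minutes to one of 7 ranks (1..7).

    Assumption: ranks 1-6 split [0, 30) minutes into six equal,
    half-open 5-minute brackets; rank 7 holds every interval of
    30 minutes or more.
    """
    if minutes < 0:
        raise ValueError("time interval must be non-negative")
    if minutes >= 30:
        return 7
    return int(minutes // 5) + 1


def combination_index(ti_rank: int, sp_rank: int) -> int:
    """Index (0..48) of a (TI rank, SP rank) pair among the 49 combinations."""
    if not (1 <= ti_rank <= 7 and 1 <= sp_rank <= 7):
        raise ValueError("ranks must lie between 1 and 7")
    return (ti_rank - 1) * 7 + (sp_rank - 1)
```

With this layout, every pair of successive queries falls into exactly one of the 49 cells, which is the data set on which the methods in [4,5,6] operate.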
This paper proposes to apply the Naïve Bayes Model to session segmentation and is organized as follows. Section 2 presents the session boundary discrimination model based on the Naïve Bayes Model (NBM-SBDM). Section 3 describes the experimental scheme, details, and results of the method. Finally, Section 4 summarizes the method and proposes directions for further research.

Session Boundary Discrimination Model (NBM-SBDM)
The core of session segmentation is to discriminate whether each query is a session boundary: if it is, it is labeled as class yes; otherwise, it is labeled as class no. A session is correctly divided only when every query in it is correctly discriminated. To achieve this, we adopt the Naïve Bayes Model.
Definition 3. Query Likelihood (QL): QL is an attribute that quantifies the semantic likelihood between the current query and the previous query. In its definition, #t denotes the number of times term t appears, Qc represents the current query, and Qp represents the previous query.

Fig. 1 NBM-SBDM
Session Boundary Discrimination Model. Document [8] finds that session segmentation using only time interval and query likelihood performs no worse than segmentation using all three attributes: time interval, query likelihood, and anchor likelihood. Thus, to keep the experimental process concise, we select time interval and query likelihood for session segmentation and assume that the attributes TI and QL are conditionally independent given the class variable C. Under this assumption, the session boundary discrimination model based on the Naïve Bayes Model is as shown in Fig. 1. The prior probability is acquired by manually labeling part of the Sogou query log. The statistical data are presented in Table 1. The prior probabilities P(C=c), P(TI|C), and P(QL|C) are computed by the formulas: Here (ti1, ti2) is a bracket for TI in Fig. 1, and (

Information Technology for Manufacturing Systems III
In Fig. 1, given C, the variables TI and QL are conditionally independent; C=yes means that the current query is a session boundary (the beginning of a session), and C=no means that the current query is not a session boundary. The session boundary model is expressed as follows:

$$Border(ti, ql) = P(C = c \mid TI = ti, QL = ql)$$
Abstractly, SBDM is a conditional probability model P(C|TI, QL); the class variable C depends on the two attributes TI and QL. Because a probability table built directly on this model is too complicated, we reformulate the model with Bayes' theorem to make it more tractable:

$$P(C \mid TI, QL) = \frac{P(C)\,P(TI, QL \mid C)}{P(TI, QL)}$$

In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features TI and QL are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model P(C, TI, QL), which can be rewritten as follows:

$$P(C, TI, QL) = P(C)\,P(TI \mid C)\,P(QL \mid C, TI) \qquad (9)$$

Now the "naive" conditional independence assumption comes into play: given C, QL is independent of TI, so P(QL|C, TI) = P(QL|C), and the joint model can be expressed as:

$$P(C, TI, QL) = P(C)\,P(TI \mid C)\,P(QL \mid C)$$

Under the above independence assumption, the conditional distribution over the class variable can be expressed like this:

$$P(C \mid TI, QL) = \frac{P(C)\,P(TI \mid C)\,P(QL \mid C)}{P(TI, QL)}$$

The discussion so far has derived the simplified model Border(ti, ql):

$$Border(ti, ql) = P(C = c)\,P(TI = ti \mid C = c)\,P(QL = ql \mid C = c)$$
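As an illustration, the simplified model Border(ti, ql) can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the probabilities are estimated as plain relative frequencies from labeled samples, no smoothing is applied, and a query is taken to be a boundary when the score for class yes exceeds the score for class no. The function names and the bracket labels ("short", "long", "low", "high") are hypothetical, and the toy samples are invented for illustration only.

```python
from collections import Counter

# Hypothetical labeled samples: (TI bracket, QL bracket, class label).
TOY_SAMPLES = [
    ("long", "low", "yes"),
    ("long", "low", "yes"),
    ("short", "low", "yes"),
    ("short", "high", "no"),
    ("short", "high", "no"),
    ("short", "low", "no"),
    ("long", "high", "no"),
]


def train(samples):
    """Estimate P(C), P(TI|C) and P(QL|C) by relative frequency (assumption)."""
    class_counts = Counter(label for _, _, label in samples)
    ti_counts = Counter((ti, label) for ti, _, label in samples)
    ql_counts = Counter((ql, label) for _, ql, label in samples)
    total = len(samples)
    prior = {c: n / total for c, n in class_counts.items()}
    p_ti = {(ti, c): n / class_counts[c] for (ti, c), n in ti_counts.items()}
    p_ql = {(ql, c): n / class_counts[c] for (ql, c), n in ql_counts.items()}
    return prior, p_ti, p_ql


def border_score(model, ti, ql, c):
    """P(C=c) * P(TI=ti|C=c) * P(QL=ql|C=c); unseen values count as 0."""
    prior, p_ti, p_ql = model
    return prior.get(c, 0.0) * p_ti.get((ti, c), 0.0) * p_ql.get((ql, c), 0.0)


def is_boundary(model, ti, ql):
    """Assumed decision rule: boundary if the yes score beats the no score."""
    return border_score(model, ti, ql, "yes") > border_score(model, ti, ql, "no")
```

Because the denominator P(TI, QL) is the same for both classes, comparing the two unnormalized scores gives the same decision as comparing the full conditional probabilities.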
NBM-SBDM Algorithm: (1) Data Preparation. Queries in the original data are sorted by time, and many users may issue queries within the same second, so the successive queries of each user need to be extracted. To discriminate sessions among different users' queries, we sort the data by user ID so that each user's queries are consecutive. (2) Session Boundary Discrimination. Compute the attributes TI and QL and discriminate session boundaries with the NBM-SBDM proposed in this paper.
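The data preparation step can be sketched as follows. The record layout `(user_id, timestamp_seconds, query)` and the function name `prepare` are assumptions made for illustration; the actual fields of the query log are not described here. The sketch sorts records by user ID and then by time, so that each user's queries become consecutive, and attaches to every query its predecessor and the time interval between them.

```python
from itertools import groupby
from operator import itemgetter


def prepare(records):
    """Group queries per user and compute the time interval to the previous query.

    records: iterable of (user_id, timestamp_seconds, query) tuples
             (hypothetical layout).
    Returns a list of (user_id, query, previous_query, ti_seconds); for the
    first query of each user, previous_query and ti_seconds are None.
    """
    # Sort by user ID first, then by time within each user.
    ordered = sorted(records, key=itemgetter(0, 1))
    rows = []
    for user_id, group in groupby(ordered, key=itemgetter(0)):
        prev_time = None
        prev_query = None
        for _, timestamp, query in group:
            ti = None if prev_time is None else timestamp - prev_time
            rows.append((user_id, query, prev_query, ti))
            prev_time, prev_query = timestamp, query
    return rows
```

Each resulting row carries the current query, the previous query, and TI, which is the input needed to compute QL and apply the boundary discrimination in step (2).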

Experiment
Evaluation criteria of the session segmentation. Document [8] uses two evaluation criteria: the query boundary evaluation criterion and the session evaluation criterion. To facilitate analysis and discussion, we use the common evaluation framework [9,10], namely the standard criterion for checking boundaries. In order to carry out the evaluation better, we may assume the two situations: (1)
Experimental results: We manually label the session boundaries and non-session boundaries in the training set. Table 2 compares the results of Decision Tree and the SBDM algorithm.
Table 2 Result Comparison
The experimental results show that P, R, and F are all greater than 90 percent for both session discrimination and non-session discrimination. Considering that session and non-session are complementary, the proposed model performs well in precision, recall, and F value.
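For reference, the three measures can be computed per class as in the sketch below. It assumes the standard definitions of precision and recall and takes the F value to be their harmonic mean (F1); the exact formulas used in the evaluation framework [9,10] may differ. The function name `precision_recall_f` is hypothetical.

```python
def precision_recall_f(gold, predicted, target="yes"):
    """Precision, recall and F value for one class.

    gold, predicted: equal-length sequences of "yes"/"no" labels, one per query.
    target: the class being evaluated ("yes" for session boundaries).
    Assumption: F is the harmonic mean of precision and recall.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Calling the function once with `target="yes"` and once with `target="no"` gives the two sets of P, R, and F values reported for session and non-session discrimination.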

As Table 2 shows, the F value of the SBDM algorithm is greater than that of Decision Tree for session discrimination. Compared with the results achieved by Decision Tree [8], which adopts the same evaluation criterion as ours, the method in this paper improves the F value for class yes while keeping a high F value for class no. By this side-by-side comparison, the method achieves an improvement in session boundary discrimination. Session discrimination matters much more than non-session discrimination in session segmentation, because session identification errors cause queries with different intentions to be grouped into one session, which introduces considerable noise into session analysis.

Summary
This paper outlines the current status of session boundary discrimination, describes the significance of session segmentation for understanding users' retrieval behavior in the personalized search engine field, and proposes a new approach based on the Naïve Bayes Model to discriminate session boundaries. This idea is new in the field. Experimental results show that the model performs well in session boundary discrimination, which is the goal that session segmentation work has always been striving for.
Despite the good performance of the model, there is still room for improvement. Because the model depends heavily on prior probabilities such as P(C), P(TI|C), and P(QL|C), and these rely heavily on the manually labeled sample from the training set, some discrimination errors occur on the test set. It is reasonable to expect that a larger training set will yield more stable prior probabilities. However, an oversized sample can be intractable to label manually. In future work, we will therefore study how the sample size of the training set affects the performance of session discrimination.