Development of a Neural Network Based Q Learning Algorithm for Traffic Signal Control

As a reinforcement learning method, the Q learning algorithm has already achieved many significant results in the traffic signal control area. However, when the state space of the Markov Decision Process is very large or continuous, the computation and memory loads become intractable. Therefore, this paper proposes a neural network based Q learning algorithm to address this problem, known as the "Curse of Dimensionality". Because a neural network is a very effective value function approximator, the new method generalizes the conventional Q learning algorithm to huge and continuous state spaces. An experiment has been implemented on an isolated intersection, and simulation results show that the proposed method improves traffic efficiency significantly compared with the conventional Q learning algorithm.


Introduction
Nowadays more and more attention has been paid to the application of reinforcement learning to real-time traffic flow control. This method treats learning as a trial-and-error process. Sutton proposed a learning algorithm for non-deterministic Markov decision processes [1]. Lu Shoufeng applied table-based Q-learning to dynamically control the traffic signals at an isolated intersection [2]. Wei Wu developed a coordinated urban traffic signal control approach based on multi-agent reinforcement learning [3]. Marco Wiering developed a multi-agent reinforcement learning method for traffic light control over six adjacent intersections [4].
The advantage of reinforcement learning is that no mathematical model of the controlled object is needed; the system can perceive varying conditions and self-adaptively adjust its control policy in response to traffic conditions. Its self-learning ability drives the controlled object toward the optimum. However, the problem known as the "Curse of Dimensionality", identified by Bellman in his 1961 book, still arises when we apply traditional reinforcement learning algorithms. When the state space of the Markov Decision Process is very large or continuous, the computation and memory loads become intractable. Moreover, in the traditional Q learning algorithm the Q value is updated in the form of a table record, so learning is relatively slow, which directly degrades the performance of the controller [2].
To solve the "Curse of Dimensionality" and make reinforcement learning efficient in huge and continuous state spaces, value function approximation based learning methods have been widely researched and applied in recent years. This paper proposes a neural network based Q learning algorithm for dynamic traffic flow control, together with a heuristic knowledge based Q learning algorithm.
This paper is organized as follows. Section 1 introduces the Q learning algorithm and its improvements. The experiment and simulation results are presented in Section 2. Finally, Section 3 concludes our research.

Q Learning algorithm and its improvement
Traditional Q learning algorithm

The Q learning algorithm was first proposed by C. Watkins in his 1989 thesis and is applied to the iterative computation of the value function of a Markov Decision Process. The iteration equation is:

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α [ r_t + γ max_{a'} Q(s_{t+1}, a') ]    (1)

where α > 0 is the learning factor and γ is the discount factor.
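The tabular update of Eq. (1) can be sketched as follows. The state/action encoding, reward, and hyperparameter values here are illustrative assumptions, not the paper's traffic model.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9          # learning factor and discount factor (assumed values)
Q = defaultdict(float)           # Q-table, keyed by (state, action)

def q_update(s, a, r, s_next, actions):
    """Q(s,a) <- (1-alpha)*Q(s,a) + alpha*[r + gamma * max_a' Q(s',a')]"""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)

# toy usage: one update from state 0 to state 1 with reward 1
q_update(0, 'extend_green', 1.0, 1, ['extend_green', 'switch_phase'])
```

With an all-zero table, one update moves Q(0, extend_green) to 0.1, i.e. α times the immediate reward, which is exactly the table-record form whose slow convergence the paper criticizes.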

BP Neural Network based Q learning algorithm
In this paper, we use the BP learning algorithm to realize the learning process of output layer and hidden layer weights.The structure of the neural network can be seen in Fig. 1.

Fig. 1. Structure of three layer neural network
Suppose there are n neurons in the hidden layer, the output of the ith hidden neuron is y_i, and the weight from the ith hidden neuron to the output layer is w_i. Then the estimate of the action value function is:

Q(x, a) = Σ_{i=1}^{n} w_i y_i    (2)

The transfer function of the hidden layer is the sigmoid function and the output layer is linear. Thanks to the research in [5], we know that the learning process here is not like ordinary supervised learning, where we can learn from (input, output) examples. Here we are not presenting fixed (x, Q*(x)) examples to the network; instead we learn from the estimate r + γ max_{a'} Q(x', a'), which may come from a different network, but is the maximal value of the resulting state. The weights of the network are updated to minimize the following quadratic performance error measure:

E = (1/2) [ r + γ max_{a'} Q(x', a') − Q(x, a) ]²

The update rules for the output layer weights and the hidden layer weights follow from gradient descent on E, where α_t > 0 is the learning factor of the neural network at time t.

Heuristic knowledge based Q learning algorithm
In order to improve the efficiency of the learning system, we propose a heuristic knowledge based Q learning algorithm. Assume the saturation flow rate of the traffic is s, the arriving flow rate is q, and the green time to be decided is Δt. The traffic arriving during Δt is qΔt, and the time for this arriving traffic to disperse is:

g = qΔt / s

To ensure that the traffic queue can disperse during the green time, we require g ≤ Δt. From this we calculate the delay time for the green lanes, where q_{r0} is the initial traffic in the green lanes, then the total delay time during Δt, and finally the average delay D_H for the entire intersection, given by Equation (12). Equation (12) is the heuristic function that will be added into the learning system. We can see that this function evaluates the performance of a decision, given the environment information that we know.
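The dispersal-time reasoning above can be sketched as follows. The function names and toy flow rates are hypothetical, and the feasibility check stands in for the full delay expression of Eq. (12), which uses the same quantities s, q, and Δt.

```python
def dispersal_time(q, s, dt):
    """Time g = q*dt / s needed to discharge the traffic q*dt arriving during dt."""
    return q * dt / s

def green_feasible(q, s, dt):
    """The arriving queue clears within the green time only if g <= dt."""
    return dispersal_time(q, s, dt) <= dt

# toy numbers: s = 0.5 veh/s saturation flow, q = 0.2 veh/s arrival rate, dt = 30 s
print(dispersal_time(0.2, 0.5, 30))  # 12.0 s to clear the 6 arriving vehicles
print(green_feasible(0.2, 0.5, 30))  # True: 12.0 <= 30
```

Because g ≤ Δt must hold, a candidate green time failing this check can be rejected before any Q update, which is exactly how prior model knowledge speeds up the learning system.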
Combining the above two methods, Equations (1) and (12) yield the final update rule (13), in which the heuristic term is added to the Q learning target. The improved algorithm can be summarized as follows: (1) initialize Q(s,a) and H(s,a); (2) observe the state at time t; (3) pretend to execute each action, observe each new state, and receive each reward r; (4) update the Q function according to (13); (5) choose the action that maximizes the combination of the Q value and the heuristic function. The performance of this algorithm will be tested in the next section.
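Steps (1)-(5) can be sketched as one decision step. The functions `simulate` and `H`, the weighting factor ξ on the heuristic term, and the tabular Q stand-in (in place of the neural network) are all illustrative assumptions.

```python
ALPHA, GAMMA, XI = 0.1, 0.9, 0.5   # xi weights the heuristic term (assumed value)

def nnhk_q_step(state, actions, Q, H, simulate):
    scored = {}
    for a in actions:
        # (3) pretend to execute each action; observe new state and reward
        s_next, r = simulate(state, a)
        # (4) update Q toward the target of Eq. (13)
        target = r + GAMMA * max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(state, a)] = Q.get((state, a), 0.0) + ALPHA * (target - Q.get((state, a), 0.0))
        # (5) score each action by Q plus the weighted heuristic H
        scored[a] = Q[(state, a)] + XI * H(state, a)
    return max(scored, key=scored.get)   # greedy choice over Q + xi*H

# toy usage: a trivial simulator that rewards extending the green, zero heuristic
Q = {}
act = nnhk_q_step(0, ['extend', 'switch'], Q,
                  lambda s, a: 0.0,
                  lambda s, a: (s + 1, 1.0 if a == 'extend' else 0.0))
```

With the zero heuristic the step reduces to plain Q learning and picks the rewarded action; a nonzero H would bias selection toward decisions the delay model of Eq. (12) already favors.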

Isolated intersection model
This section explains how to apply the proposed algorithm to the isolated intersection model. As shown in Fig. 2, our method is applied to a traffic intersection that consists of two intersecting roads, each with several lanes, and a set of synchronized traffic lights that manage the flow of vehicles.

Fig. 2. Sketch map of intersection
Initial traffic data is randomly generated between 0 and 10, and the arrival data fed to the neural network input follows a Poisson distribution:

P(k) = (λ^k / k!) e^{−λ}

The initial weights of the neural network are randomly generated between 0 and 0.5. When the entering flow rate is 720 veh/h, as the traffic state is updated, the input data is generated from the traffic remaining after the end of the previous action plus the randomly arriving traffic, which follows the Poisson distribution.
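The input generation described above can be sketched as follows. The 10 s decision interval is our assumption (not stated in the paper); only the 720 veh/h flow rate, the uniform [0, 10] initial queue, and the Poisson arrivals come from the text.

```python
import math
import random

RATE_VEH_PER_S = 720 / 3600.0        # 720 veh/h = 0.2 veh/s
INTERVAL_S = 10.0                    # assumed decision interval
LAM = RATE_VEH_PER_S * INTERVAL_S    # Poisson mean per interval = 2.0

def poisson_sample(lam, rng):
    """Knuth's method: multiply uniforms until the product drops below e^-lam."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(42)
initial_queue = rng.uniform(0, 10)                        # initial traffic in [0, 10]
arrivals = [poisson_sample(LAM, rng) for _ in range(5)]   # vehicles per interval
```

Each simulation step then takes the residual queue from the previous action plus one such Poisson draw as the network input, matching the state-update scheme described above.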

Simulation result
To demonstrate the performance of the algorithm in traffic control, a simulation is performed in this section. We assume that the arrival rate of traffic is known over a 5-minute horizon, and the optimal traffic signal is distributed according to the improved neural network and knowledge based Q learning algorithm (NNHK_Q algorithm). With an arriving flow rate of 720 veh/h, the simulation was run for 100 cycles and the average delay was calculated at the end of each cycle; the simulation results are shown in Table 1.

Conclusion
The traditional table-based Q learning algorithm was introduced and its disadvantage, the "Curse of Dimensionality", was discussed. To solve this problem, a BP neural network based Q learning algorithm was studied, since neural networks, as popular function approximators, have achieved many results in the supervised learning field. Then, to improve performance in traffic control, we proposed a novel NNHK_Q algorithm augmented with a heuristic knowledge based function, because the heuristic function supplies model information about the environment. Finally, a simulation of such an intersection system was carried out, along with a comparative study against the actuated control method and the neuro-fuzzy control method. The simulation results show that the proposed intersection control mechanism is encouraging and can benefit vehicle drivers by decreasing the average delay.

Table 1. Average delay performance for the whole intersection (720 veh/h)