Researchers Yuting Yang, Camillo Taylor and Daniel Lee have developed a system to turn surveillance cameras into traffic counters.    
     
Traffic information can be collected from existing, inexpensive roadside cameras, but extracting it often entails manual work or costly commercial software. Against this background, the Delaware Valley Regional Planning Commission (DVRPC) was looking for an efficient and user-friendly way to extract traffic information from video captured at road intersections.
     
We proposed a method that tracks and counts vehicles in the video, then uses the tracking information to compute a 3D model of each vehicle and visualise the 2D road scene in 3D. The 3D model can, in future research, provide feedback to the tracking and counting system.
     
The proposed method aims to solve two difficulties in tracking and counting from video. Firstly, unlike normal highway traffic, vehicles approaching an intersection can go straight ahead or turn in either direction, which makes their motion much less constrained and predictable. The second is the perspective deformation in the image, caused by the low camera position and lens distortion, which makes it difficult to recognise true relative distances on the road and to reconstruct the 3D model.
     
To deal with the variety of movement, the system was trained to predict a vehicle’s movement in the next few frames based on its current position and velocity. This system, known as a classifier, categorises each vehicle according to whether it is predicted to go straight, turn left or turn right, and relies on the assumption that vehicles use the dedicated lanes for turning or going straight. This classification makes it easier to separate adjacent vehicles heading in different directions, and helps improve the performance of tracking and counting.
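
The article does not specify what form the classifier takes, so the sketch below is only illustrative: a nearest-neighbour classifier over manually labelled (position, velocity) examples, built with scikit-learn. The coordinates, labels and training set are all hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: each row is (x, y, vx, vy) in nadir-view
# coordinates; labels are manoeuvres observed at this intersection.
X_train = np.array([
    [12.0, 5.0,  0.0, 8.0],   # centre lane, heading straight
    [ 4.0, 5.0, -1.5, 7.0],   # left-turn lane, drifting left
    [20.0, 5.0,  1.5, 7.0],   # right-turn lane, drifting right
    # ... more manually labelled examples per intersection ...
])
y_train = np.array(["straight", "left", "right"])

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)

def predict_manoeuvre(position, velocity):
    """Predict whether a vehicle will go straight, turn left or turn right."""
    features = np.concatenate([position, velocity]).reshape(1, -1)
    return clf.predict(features)[0]

print(predict_manoeuvre(np.array([4.5, 6.0]), np.array([-1.2, 7.5])))  # left
```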
     
As no camera information was provided with the video, solving the perspective deformation problem required calibrating the camera manually. This means estimating the principal point position in the x and y directions and the radial distortion parameter, then adjusting these parameters until the parallel lines in the image intersect at the same point (the vanishing point). The parallel lines are drawn according to the road markings. Once the camera was calibrated and the intersection of the parallel lines found, projecting a nadir view – the equivalent of viewing the scene from directly above – became a simple geometric problem. The nadir view in Figure 1 is not a perfect restoration of the real world, as the height of the vehicles was not taken into account, but it allows same-scale distances to be calculated across the image as a threshold for vehicle tracking. Since a vehicle’s height is low compared with the camera’s height, this approximation was good enough for the initial stage, leaving vehicle height to be considered in the 3D reconstruction.
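
The authors calibrate via vanishing points; as a simplified illustration of the nadir-view projection itself, the sketch below warps a frame to a top-down view with a planar homography, assuming four manually picked road points whose relative ground positions are known. The file names, pixel coordinates and distances are hypothetical.

```python
import cv2
import numpy as np

# Four image points on the road plane (e.g. corners of lane markings),
# picked manually from a video frame -- hypothetical pixel coordinates.
image_pts = np.float32([[420, 580], [860, 575], [980, 720], [300, 730]])

# Their corresponding ground-plane positions in metres, assuming the
# markings form a rectangle of known size (illustrative values).
ground_pts = np.float32([[0, 0], [7, 0], [7, 10], [0, 10]])

scale = 40.0                          # 40 pixels per metre in the output
H = cv2.getPerspectiveTransform(image_pts, ground_pts * scale)

frame = cv2.imread("intersection_frame.png")        # hypothetical frame
nadir = cv2.warpPerspective(frame, H, (int(7 * scale), int(10 * scale)))
cv2.imwrite("nadir_view.png", nadir)  # 7 x 10 m road patch, seen from above
```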
 
For tracking and counting, we established a feature-based tracking system that uses a Kanade-Lucas-Tomasi (KLT) tracker to follow ‘feature points’ detected in the image. Each feature point’s relative distance was then rectified according to the nadir view, and the points were grouped into individual vehicles according to properties such as proximity, colour, speed and direction. A 3D reconstruction stage is then added.
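
As a minimal sketch of the KLT stage, the following uses OpenCV’s Shi-Tomasi detector and pyramidal Lucas-Kanade optical flow; the file name and detector parameters are assumptions, and the grouping step is omitted.

```python
import cv2

cap = cv2.VideoCapture("intersection.mp4")          # hypothetical clip
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Detect corner-like 'feature points' to follow (Shi-Tomasi detector).
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                 qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok or points is None or len(points) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # KLT step: pyramidal Lucas-Kanade optical flow from the previous frame.
    new_points, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray,
                                                     points, None)

    # Keep successfully tracked points; in the full system these would
    # next be rectified into the nadir view and grouped into vehicles.
    points = new_points[status.ravel() == 1].reshape(-1, 1, 2)
    prev_gray = gray
```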
     
In effect, a vehicle’s 3D position is approximated by the 3D positions of the feature points that represent it, calculated on the assumption that the vehicle is a rigid body. This means the distance between each pair of points on the vehicle does not vary, and the orientation between the two therefore follows the direction of the vehicle. The actual data is acquired by solving an optimisation problem in which the variables are the 3D positions of the feature points, and the objective is to minimise the reprojection error between the points’ observed positions in the image and their estimated positions. In this system, the optimisation is solved directly using the MATLAB built-in function lsqnonlin.
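
The article solves this with MATLAB’s lsqnonlin; the sketch below shows a structurally similar setup in Python with scipy.optimize.least_squares, under a deliberately simplified rigid-body model (the point set keeps its shape and translates along the predicted driving direction) and with synthetic observations standing in for the tracker’s output. The intrinsics and geometry are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

# Assumed pinhole intrinsics after calibration (hypothetical values).
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])

def project(pts):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates."""
    p = pts @ K.T
    return p[:, :2] / p[:, 2:3]

# Driving direction predicted by the classifier (hypothetical).
direction = np.array([1.0, 0.0, 0.0])

# Synthetic stand-in for tracker output: three points on a vehicle
# 10 m away, translating 1.5 m per frame along the driving direction.
true_pts = np.array([[-1.0, -1.5, 10.0],
                     [ 1.0, -1.5, 10.0],
                     [ 0.0, -0.5, 11.0]])
obs = np.stack([project(true_pts + 1.5 * f * direction) for f in range(3)])

def residuals(x):
    """Reprojection error under the rigid-body model: the point set
    keeps its shape and translates along the known direction each frame."""
    base = x[:9].reshape(3, 3)               # point positions in frame 0
    shifts = np.concatenate([[0.0], x[9:]])  # per-frame translation
    err = [(project(base + s * direction) - o).ravel()
           for s, o in zip(shifts, obs)]
    return np.concatenate(err)

# Initial guess: all points 10 m ahead, 1.5 m translation per frame.
x0 = np.concatenate([np.tile([0.0, 0.0, 10.0], 3), [1.5, 3.0]])
points_3d = least_squares(residuals, x0).x[:9].reshape(3, 3)
```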
     
One issue with the structure-from-motion method is that it can only solve for the point positions up to a scale factor – it cannot tell the difference between a 1 m cube 1 m from the camera and a 2 m cube 2 m from the camera. To overcome this we needed to assign a scale factor to the set of points. By assuming that the vehicle is on the road rather than flying, this can be done by scaling the result until its lowest point’s height reaches zero (that is, the wheels are touching the ground).
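
A minimal sketch of this scale fix, assuming the reconstructed points sit in a camera-centred frame whose y axis points up and a known camera mounting height (the 6 m figure here is hypothetical):

```python
import numpy as np

def fix_scale(points_cam, camera_height):
    """Resolve the structure-from-motion scale ambiguity.

    points_cam: Nx3 points in a camera-centred frame with y pointing up,
    so a point's height above the road is camera_height + y. Assuming
    the vehicle rests on the road, choose the scale factor that brings
    the lowest point's height to exactly zero.
    """
    lowest_y = points_cam[:, 1].min()    # most negative = lowest point
    s = camera_height / -lowest_y        # assumes lowest_y < 0
    return points_cam * s

pts = np.array([[0.5, -2.0, 10.0], [1.0, -1.2, 10.5], [0.8, -1.8, 11.0]])
scaled = fix_scale(pts, camera_height=6.0)
print(scaled[:, 1].min() + 6.0)          # -> 0.0, wheels on the ground
```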
     
Since we calculate the 3D positions of only certain feature points, the vehicle is represented as a bounding box around them, moving in the calculated direction.
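
A sketch of how such a box might be derived, assuming reconstructed points with z up and a heading angle taken from the direction prediction; none of these details are specified in the article.

```python
import numpy as np

def vehicle_box(points_3d, heading):
    """Bounding box around a vehicle's reconstructed feature points,
    oriented along its direction of travel (heading in radians on the
    ground plane, z up). Returns the eight box corners in world frame."""
    c, s = np.cos(heading), np.sin(heading)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    local = points_3d @ R                 # rotate into the vehicle frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    corners = np.array([[x, y, z] for x in (lo[0], hi[0])
                                  for y in (lo[1], hi[1])
                                  for z in (lo[2], hi[2])])
    return corners @ R.T                  # rotate corners back to world

pts = np.array([[0.0, 0.0, 0.0], [4.2, 0.3, 0.0], [4.0, 1.6, 1.4]])
box = vehicle_box(pts, heading=np.arctan2(0.3, 4.2))
```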
     
This method has been tested on six different 10-second video clips covering different trajectories, and the cumulative accuracy in counting the vehicles is shown in Table 1. To differentiate the contribution of each part of the method to the overall performance, some subsystems were removed and the results compared with the original findings. Table 1 shows the results of the original method in the blue column; results in the red column are without camera calibration; and the green column shows the results where the grouping algorithm has been replaced with a strategy that simply grouped nearby points together.
     
In the original method, over-counting occurs mostly in extreme cases, such as a large truck with relatively few feature points that the system sometimes mistakenly interprets as two or more vehicles. Missed vehicles are mostly due to occlusion (blocked or overlapping vehicles).
 
The results were not badly affected by removing the camera calibration, as this is mostly needed for the 3D reconstruction. However, without calibration it is difficult to determine a distance threshold for grouping nearby feature points, as the ratio between pixel distance and real-world distance varies across the image. Because the grouping technique could already differentiate nearby vehicles travelling in different directions (straight ahead/turn left/turn right), the distance threshold was set high, since feature points some distance from each other could still belong to the same vehicle. Setting a high threshold reduces the number of over-counted vehicles but slightly increases the number of missed vehicles, because the large threshold caused the system to recognise several vehicles as one.
     
Removing the grouping technique had a far greater impact on the correct counting of vehicles. An increase in over-counted vehicles was caused by the lack of an algorithm to combine small groups of feature points. While the increase in missed vehicles was only small, the per-frame results were actually worse: without the grouping algorithm, adjacent vehicles travelling in different directions could not be differentiated until they were physically separated, which is less efficient.
     
As this technique only requires video information from the cameras, with the computing work carried out at the control centre, we believe this system could be added to the feed from other existing cameras with a minimum of manual setup. However, as the speed of the system is constrained by the 3D reconstruction process, it can only produce off-line ‘snapshots’. Manual camera calibration may not be needed when the user has the camera’s precise location and height.
     
The training process will be needed at each new intersection because, in order to predict which direction a vehicle will move, we first need to give the classifier some manually labelled ground-truth values. This enables it to ‘learn’ the correlation between a vehicle’s current speed and position and its likely trajectory.
     
Currently, the system only shows how the road scene would look in 3D; it does not yet assist the counting system. The plan is to feed the 3D vehicle reconstruction information back into the counting system, where information such as whether a vehicle has a reasonable size and position will help indicate whether the count is reliable. With a known camera position, the 3D model can also be used to predict occlusion and devise a way to detect masked vehicles.
     
Having shown that the proposed method can, after initial training, count vehicles and compute their 3D positions using video from cameras installed at road intersections, we plan to feed the 3D model back into the counting system to increase accuracy.
 
ABOUT THE AUTHORS: 
Yuting Yang is a graduate of the University of Pennsylvania; Camillo Taylor and Daniel Lee are professors at the University.
 