A Detailed Discussion on the Research Paper - 3D Multi-Object Tracking in Point Clouds
This paper proposes a new 3D multi-object tracker that more robustly tracks objects temporarily missed by detectors, and that better leverages object features for 3D Multi-Object Tracking (MOT) in point clouds. The proposed tracker is built on a novel data association scheme guided by prediction confidence, and it consists of two key parts. First, we design a new predictor that employs a constant acceleration (CA) motion model to estimate future positions and outputs a prediction confidence that guides data association with awareness of the prediction quality. Second, we introduce a new aggregated pairwise cost that exploits features of objects in point clouds for faster and more accurate data association. The proposed cost consists of geometry, appearance, and motion components.
Specifically, we formulate the geometry cost using resolutions (lengths, widths and heights), centroids, and orientations of 3D bounding boxes (BBs), the appearance cost using appearance features from the deep learning-based detector backbone network, and the motion cost by associating different motion vectors.
3D Multi-Object Tracking (MOT) is a crucial technology for extracting dynamic information from the road environment. It is widely applied in intelligent transportation systems (ITS)-related applications such as autonomous driving and traffic monitoring.
To achieve better performance in both accuracy and speed, the authors propose a new 3D tracker based on a data association scheme guided by prediction confidence, which can effectively exploit features of objects in point clouds and track temporarily missed objects. As shown in Fig. 1, the proposed tracker follows a four-step tracking-by-detection framework:
(1) detect objects from point clouds by employing a deep learning 3D object detector,
(2) estimate possible current states of tracked objects using the proposed predictor, which is based on a constant acceleration (CA) motion model and a prediction confidence model, and assign a prediction confidence to every predicted state (especially those with temporarily missed detections),
(3) associate the predicted states with the detected states by the prediction confidence and a proposed aggregated pairwise cost,
(4) update the matched pairs and set unmatched detected states as new tracked states (on which tracking will be performed in the following frames).

The main contributions of this paper are as follows:
(1) The prediction confidence output by our predictor guides the data association with awareness of the prediction quality, adaptively adjusting the implicit search range of data association in 3D MOT.
(2) We design a new aggregated pairwise cost that makes full use of the geometry, appearance, and motion features of objects in point clouds for fast and accurate data association.
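The four steps above can be sketched as a per-frame loop. This is a minimal illustration of the tracking-by-detection flow, not the paper's implementation; all function names (predict, associate, update, init) are placeholders supplied by the caller.

```python
# Sketch of one iteration of the four-step tracking-by-detection framework.
# All callables are placeholders: any concrete detector, predictor,
# association strategy, and updater could be plugged in.

def track_frame(detections, tracks, predict, associate, update, init):
    """One tracking iteration at discrete time t.

    detections: detected states from step (1).
    tracks: full tracked states from time t-1.
    Returns (new tracked states, retained unmatched predictions).
    """
    # Step (2): predict possible current states of tracked objects.
    predictions = [predict(z) for z in tracks]
    # Step (3): associate detections with predictions.
    matches, unmatched_dets, unmatched_preds = associate(detections, predictions)
    # Step (4): update matched pairs; initialize tracks from unmatched detections.
    new_tracks = [update(detections[i], predictions[j]) for i, j in matches]
    new_tracks += [init(detections[i]) for i in unmatched_dets]
    # Unmatched predictions may be retained for temporarily missed objects.
    return new_tracks, [predictions[j] for j in unmatched_preds]
```

The retained unmatched predictions feed back into the next frame, which is how temporarily missed objects keep being tracked.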
The 3D MOT problem can be formulated as follows: associate the detected states X_t with the predicted states Ẑ_t estimated from the previous full tracked states Z_{t−1} (comprising the previously tracked states Z*_{t−1} and the previously predicted states of temporarily missed detections Z̄_{t−1}), then perform state updating and initialization to obtain the new tracked states Z*_t. The detailed definitions of these notations are given as follows.

1) Detected States: Our tracker takes in the object candidates produced by a deep learning-based detector such as PointRCNN. The detection results consist of 3D BBs, corresponding point cloud features, and detection confidences. Note that the object detector directly outputs the detection confidences, while the detector backbone network extracts the point cloud features. To estimate the real object motion, all detected 3D BBs are transformed into the global coordinate system using GPS/IMU data. At discrete time t, the detected states are denoted by

X_t = {X_t^1, X_t^2, ..., X_t^{N_t^X}}, X_t^i ∈ R^{D_X},   (1)
where N_t^X is the number of detected states at discrete time t, and D_X is the dimension of a detected state. X_t^i = [x_t^i, y_t^i, z_t^i, w_t^i, h_t^i, l_t^i, α_t^i, f_t^{1,i}, ..., f_t^{k,i}]^T is the detected state of the i-th object at discrete time t: x, y, z are its global coordinates; w, h, l are the width, height, and length of the object; α is the orientation angle; and the f terms are the extracted point cloud features.

2) Previous Full Tracked States: Our tracker also requires the set of previous full tracked states Z_{t−1} as input, which consists of the tracked states Z*_{t−1} and the previously predicted states of temporarily missed detections Z̄_{t−1}; that is, Z_{t−1} = Z*_{t−1} ∪ Z̄_{t−1}. To perform more accurate motion estimation, each full tracked state includes velocities and accelerations along the coordinate axes. The set of full tracked states at discrete time t−1 is denoted by

Z_{t−1} = {Z_{t−1}^1, ..., Z_{t−1}^{N_{t−1}^Z}}, Z_{t−1}^j ∈ R^{D_Z},   (2)
where N_{t−1}^Z and D_Z are the number and dimension of the previous full tracked states, respectively. Z_{t−1}^j = [x_{t−1}^j, y_{t−1}^j, z_{t−1}^j, v_{t−1}^{x,j}, v_{t−1}^{y,j}, v_{t−1}^{z,j}, a_{t−1}^{x,j}, a_{t−1}^{y,j}, a_{t−1}^{z,j}, w_{t−1}^j, h_{t−1}^j, l_{t−1}^j, α_{t−1}^j, f_{t−1}^{1,j}, ..., f_{t−1}^{k,j}]^T, where v and a are the velocities and accelerations along the coordinate axes, is the full tracked state of the j-th object at discrete time t−1.

3) Predicted States: To perform a robust data association, we estimate a set of possible current states at time t from the previous full tracked states. The dimension of a predicted state is the same as that of a full tracked state. Hence, the set of predicted states is denoted by

Ẑ_t = {Ẑ_t^1, ..., Ẑ_t^{N_{t−1}^Z}}, Ẑ_t^j ∈ R^{D_Z},   (3)
where Ẑ_t^j refers to the predicted state for the j-th previous full tracked state at discrete time t.

B. Proposed Predictor

1) Constant Acceleration Motion Model: Most previous 3D MOT methods [2]–[4] employ a constant velocity (CV) motion model to predict the future motion state and smooth the predicted state with a filtering algorithm. However, the CV motion model ignores accelerations, which can double (or worse) the motion error when the detector misses an object over consecutive frames. Recent 3D MOT methods also explore LSTM-based predictors [1] to learn position changes, but learning-based predictors are usually several times slower than motion models. To achieve a better tradeoff between accuracy and speed, we adopt the constant acceleration (CA) motion model to predict future states. For each previous full tracked state Z_{t−1}^j, we estimate the possible current state Ẑ_t^j at discrete time t by the CA motion model given in Eq. 4. For better efficiency, we employ Kalman Filtering [29] to optimize the predicted state; the corresponding error covariance S is predicted by Eq. 5.
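As a concrete sketch of the CA motion model and the Kalman predict step (Eqs. 4 and 5), the following applies a constant-acceleration transition to the kinematic part of a single state (per-axis position, velocity, acceleration). The 9×9 block matrix and the isotropic process noise Q are illustrative assumptions; the paper's full A and Q also cover the remaining n = 4 + k elements.

```python
import numpy as np

# Sketch of the constant-acceleration (CA) Kalman predict step for the
# kinematic part of one tracked state. The block structure of A and the
# isotropic Q below are illustrative assumptions, not the paper's exact
# matrices (which also carry the box and feature elements unchanged).

def ca_predict(state, cov, delta, q=1e-2):
    """state: [x, y, z, vx, vy, vz, ax, ay, az]; cov: 9x9 error covariance;
    delta: LiDAR scanning interval. Returns predicted state and covariance."""
    E3 = np.eye(3)
    O3 = np.zeros((3, 3))
    A = np.block([
        [E3, delta * E3, 0.5 * delta**2 * E3],  # p' = p + v*delta + 0.5*a*delta^2
        [O3, E3, delta * E3],                   # v' = v + a*delta
        [O3, O3, E3],                           # a' = a (constant acceleration)
    ])
    Q = q * np.eye(9)                # process noise covariance (assumed isotropic)
    state_pred = A @ state           # state prediction (kinematic part of Eq. 4)
    cov_pred = A @ cov @ A.T + Q     # error covariance prediction (Eq. 5)
    return state_pred, cov_pred
```

Because accelerations are carried in the state, an object missed for several consecutive frames can still be extrapolated by calling `ca_predict` repeatedly on its own output.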
where A is the state transition matrix; E and O denote the identity matrix and the zero matrix, respectively; δ is the scanning interval of the LiDAR sensor; n = 4 + k is the number of the other tracked elements (orientation angle, height, length, width, and deep features); Q ∈ R^{D_Z×D_Z} is the covariance matrix of the state function; S_{t−1} ∈ R^{D_Z×D_Z} refers to the error covariance at discrete time t−1; and Ŝ_t ∈ R^{D_Z×D_Z} denotes the predicted error covariance at discrete time t. According to the state function Eq. 4, the position p and velocity v of every object can be predicted by Eq. 7 and Eq. 8, respectively:

p̂_t^j = p_{t−1}^j + v_{t−1}^{p,j} δ + (1/2) a_{t−1}^{p,j} δ²,   (7)

v̂_t^{p,j} = v_{t−1}^{p,j} + a_{t−1}^{p,j} δ,   (8)
where p ∈ {x, y, z}; p̂_t^j and v̂_t^{p,j} are the predicted position and velocity, respectively.

2) Prediction Confidence Model: Most traditional MOT methods directly perform data association on the detections and predictions to obtain the object correspondence, without considering the prediction quality. The predicted states are not always accurate, especially for objects missed by the detector over many consecutive frames; the possible range of such objects may be enlarged and should be accounted for in the data association stage. To make the data association aware of the prediction quality, we construct a prediction confidence γ̂_t^j for each predicted state as follows:
where γ ∈ (0, 1] and η ∈ [0, 1]; η is a parameter that adjusts the overall influence of the prediction confidence on data association and is determined by validation on the training dataset; γ_{t−1}^j is the prediction confidence at discrete time t−1, initialized to one and updated by Eq. 19 (details are in Sec. III-E). When the detector misses an object, its prediction confidence decreases, and the corresponding implicit search range of data association is enlarged; an illustration is shown in Fig. 3. By constructing the prediction confidence, the search range of data association can be adaptively adjusted.

C. Aggregated Pairwise Cost

Previous point cloud-based methods define the cost only from the overlap or distance of 3D BBs [4], [5], which cannot fully utilize point cloud features. Some recent methods such as [6] extract point cloud features from neural networks for MOT, but their computational expense increases significantly, which hurts speed. To make full use of point cloud features while retaining efficiency, we propose a new aggregated pairwise cost for data association that exploits the geometry, appearance, and motion features of objects in point clouds. The proposed pairwise cost C_t^{i,j} between the i-th detected state X_t^i and the j-th predicted state Ẑ_t^j is formulated as:
C_t^{i,j} = G_t^{i,j} + A_t^{i,j} + M_t^{i,j},   (10)

where G_t^{i,j}, A_t^{i,j}, and M_t^{i,j} are the geometry, appearance, and motion costs, respectively.
1) Geometry Cost: The geometry cost leverages the geometric affinity of 3D BBs. Different from IoU-based methods [1], [4], we return to the essence of BBs and define the cost from their resolution (length, width, and height), centroid (xyz coordinates), and orientation angle as:
where w_hwl, w_dis, and w_ang are importance weights; N(·) denotes the normalization function; k = [w, h, l]^T collects the width, height, and length of X_t^i; p refers to the global coordinates of X_t^i; α denotes the orientation angle of X_t^i; and k̂, p̂, and α̂ are the corresponding values of Ẑ_t^j.

2) Appearance Cost: The appearance cost is usually defined on features extracted from a newly designed network [13]. For computational efficiency, we reuse the objects' appearance features extracted from the detector backbone:
where w_app is the importance weight, and f and f̂ are the point cloud features of X_t^i and Ẑ_t^j, respectively.

3) Motion Cost: To formulate the motion cost more robustly, we introduce a new detected state X̃_t^{i,j} with velocity, obtained when the i-th detected state X_t^i matches the j-th predicted state Ẑ_t^j, as follows:
where H ∈ R^{D_Z×D_X} is a transformation matrix that zero-initializes the velocity and acceleration, and K ∈ R^{D_Z×D_Z} is a computing matrix for the velocity. Our motion cost is defined between two velocity vectors as:
where w_vdis and w_vang are importance weights, and ṽ and v̂ are the velocity vectors of X̃_t^{i,j} and Ẑ_t^j, respectively. All of the aforementioned importance weights are determined by six-fold cross-validation on the training dataset.
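Putting the three components together, a hedged sketch of the aggregated pairwise cost follows. The specific normalization N(·) (here a bounded d/(d+1) mapping), the cosine form of the velocity-angle term, and the unit weights are assumptions for illustration; the paper determines its actual weights by six-fold cross-validation.

```python
import numpy as np

# Sketch of the aggregated pairwise cost C = G + A + M between a detected
# state and a predicted state. N(.), the distance choices, and the default
# weights are illustrative assumptions, not the paper's exact definitions.

def pairwise_cost(det, pred, w_hwl=1.0, w_dis=1.0, w_ang=1.0,
                  w_app=1.0, w_vdis=1.0, w_vang=1.0):
    def norm(d, scale=1.0):
        return d / (d + scale)  # assumed N(.): maps a distance into [0, 1)

    # Geometry cost: box resolution, centroid, and orientation differences.
    g = (w_hwl * norm(np.linalg.norm(det["hwl"] - pred["hwl"]))
         + w_dis * norm(np.linalg.norm(det["xyz"] - pred["xyz"]))
         + w_ang * norm(abs(det["ang"] - pred["ang"])))
    # Appearance cost: distance between detector-backbone point cloud features.
    a = w_app * norm(np.linalg.norm(det["feat"] - pred["feat"]))
    # Motion cost: distance and angle between velocity vectors (the detected
    # velocity would come from the matched previous state via H and K).
    v_d, v_p = det["vel"], pred["vel"]
    cosang = np.dot(v_d, v_p) / (np.linalg.norm(v_d) * np.linalg.norm(v_p) + 1e-9)
    m = w_vdis * norm(np.linalg.norm(v_d - v_p)) + w_vang * (1.0 - cosang)
    return g + a + m
```

Keeping every component bounded makes the three terms commensurable, so the weights alone control their relative influence.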
D. Prediction Confidence-Guided Data Association

To obtain the object correspondences, most previous 3D MOT methods directly conduct data association on predictions and detections using some association strategy. However, the predictions are not always accurate, especially for objects missed by the detector over many consecutive frames, whose prediction errors can accumulate without state updates. The possible range of such objects may be enlarged and should be considered in the data association stage. To tackle this, we propose a prediction confidence-guided data association that takes the prediction errors into account and flexibly adjusts the search range to perform a more robust data association. An illustration is shown in Fig. 3. First, to associate the predicted states with the detected states at discrete time t, we introduce a prediction confidence-guided association matrix ψ_t, computed from the prediction confidence (Eq. 9) and the proposed pairwise cost (Eq. 10) as:
Then, we apply a greedy algorithm [16] as the association strategy to obtain the object correspondences. After data association, we obtain the matched pairs (X_t^σ, Ẑ_t^σ), i.e., the matched detected states and their corresponding predicted states. We also obtain the unmatched detected states, denoted X̄_t, which are used to initialize new tracked states (see Sec. III-E). In addition, there are some unmatched predicted states, for which we determine whether they are retained for tracking in the next frame (see Sec. III-F).

E. Tracked State Updating and Initialization

After data association, using the widely adopted Kalman Filtering (KF) [29], we perform tracked state updating on the matched pairs of detected and predicted states to obtain the updated states Z_t^σ ← KF(X_t^σ, Ẑ_t^σ). Suppose there is Gaussian noise P ∈ R^{D_X×D_X} in the detection process and that the features of the detected states are mutually independent. Each Z_t^σ ∈ Z_t^σ and its corresponding error covariance S_t^σ are updated by

J_t^σ = Ŝ_t^σ B^T (B Ŝ_t^σ B^T + P)^{−1},

Z_t^σ = Ẑ_t^σ + J_t^σ (X_t^σ − B Ẑ_t^σ),

S_t^σ = (E − J_t^σ B) Ŝ_t^σ,
where B ∈ R^{D_X×D_Z} is the measurement matrix; P ∈ R^{D_X×D_X} denotes the covariance of the detection noise; E ∈ R^{D_Z×D_Z} is the identity matrix; J_t^σ ∈ R^{D_Z×D_X} refers to the computed Kalman gain; (X_t^σ, Ẑ_t^σ) is a tuple of a matched pair; and Ŝ_t^σ is the predicted error covariance corresponding to Ẑ_t^σ. Based on the corresponding detection confidence c_t^σ, the prediction confidence γ_t^σ is updated by Eq. 19:
where c_{t−1}^σ = 0 denotes the confidence of a missed detection at discrete time t−1. We also initialize a set of new tracked states from the unmatched detected states as Z′_t ← Init(X̄_t). Specifically, for each unmatched detected state X̄_t^i ∈ X̄_t, we perform a zero initialization of the velocity and acceleration as Z′_t^i = H X̄_t^i. Simultaneously, the corresponding error covariance and prediction confidence are initialized as S′_t = E and γ′_t = 1, respectively. After tracked state updating and initialization, we obtain the set of current tracked states Z*_t = Z_t^σ ∪ Z′_t, which is also used to perform further tracking at discrete time t+1.

F. Tracking of Temporarily Missed Detections

There are two cases for the unmatched predicted states. In the first case, the object naturally disappears from the field of view. In the second case, the object is missed by the detector because it is temporarily occluded by other objects or too far from the LiDAR sensor. We set a prediction threshold N_p to distinguish these two cases. When a predicted state cannot be updated by a detected state for more than N_p frames, it is treated as a natural disappearance; such a predicted state is deleted and not tracked again. Otherwise, the predicted state is retained, because the object may be temporarily missed by the detector and reappear in future frames. The predicted states of the missed detections at discrete time t, denoted Z̄_t, are used to iteratively perform state prediction and data association at discrete time t+1.
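The confidence-guided association and the N_p miss-count rule can be sketched as follows. The scaling psi = gamma * C and the gating threshold are illustrative assumptions consistent with the described behavior: a lower prediction confidence shrinks the scaled cost and so enlarges the implicit search range.

```python
import numpy as np

# Sketch of prediction confidence-guided greedy association plus the N_p
# miss-count rule for temporarily missed detections. The exact form of the
# association matrix and the gating threshold are assumptions here.

def greedy_associate(cost, gamma, max_cost=1.0):
    """cost: (num_dets, num_preds) pairwise costs; gamma: per-prediction
    confidences in (0, 1]. Returns matches and unmatched index lists."""
    psi = cost * gamma[None, :]  # low confidence -> smaller scaled cost (wider gate)
    matches, used_d, used_p = [], set(), set()
    # Greedy strategy: take pairs in order of increasing scaled cost.
    for i, j in sorted(np.ndindex(psi.shape), key=lambda ij: psi[ij]):
        if i in used_d or j in used_p or psi[i, j] > max_cost:
            continue
        matches.append((i, j))
        used_d.add(i)
        used_p.add(j)
    unmatched_d = [i for i in range(cost.shape[0]) if i not in used_d]
    unmatched_p = [j for j in range(cost.shape[1]) if j not in used_p]
    return matches, unmatched_d, unmatched_p

def prune_missed(miss_counts, unmatched_p, n_p=3):
    """Retain an unmatched prediction only while its miss count stays <= N_p."""
    return [j for j in unmatched_p if miss_counts[j] + 1 <= n_p]
```

Retained predictions re-enter prediction and association at the next frame, so an occluded object can be re-acquired once the detector sees it again.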
Final output of the 3D MOT tracker.