MLV Prasad

POINT-RCNN RESEARCH PAPER

Updated: Dec 26, 2023

An Elaborate Discussion on the working of the POINT-RCNN Algorithm



ARXIV PAPER DISCUSSION
POINT-RCNN


In this paper, the researchers propose PointRCNN for 3D object detection from raw point clouds.


The whole framework is composed of two stages:


Stage-1 for the bottom-up 3D proposal generation and

Stage-2 for refining proposals in the canonical coordinates to obtain the final detection results.


Instead of generating proposals from RGB images or projecting the point cloud to bird’s view or voxels as previous methods do, the stage-1 sub-network directly generates a small number of high-quality 3D proposals from the point cloud in a bottom-up manner, by segmenting the point cloud of the whole scene into foreground points and background.


The stage-2 sub-network transforms the pooled points of each proposal to canonical coordinates to learn better local spatial features, which are combined with the global semantic features of each point learned in stage-1 for accurate box refinement and confidence prediction.


LET'S DIVE DEEP:


Beyond 2D scene understanding, 3D object detection is crucial and indispensable for many real-world applications, such as autonomous driving and domestic robots. While recently developed 2D detection algorithms can handle large variations in viewpoint and background clutter in images, detecting 3D objects in point clouds still faces great challenges from the irregular data format and the large search space of the 6 degrees of freedom (DoF) of a 3D object.


Considering this, the authors propose a novel two-stage 3D object detection framework, named PointRCNN, which operates directly on 3D point clouds and achieves robust and accurate 3D detection performance.


The proposed framework consists of two stages:


The first stage aims at generating 3D bounding box proposals in a bottom-up scheme. By utilizing the 3D bounding boxes to generate a ground-truth segmentation mask, the first stage segments foreground points and simultaneously generates a small number of bounding box proposals from the segmented points. Such a strategy avoids using a large number of 3D anchor boxes in the whole 3D space as previous methods do, and saves much computation.


The second stage of PointRCNN conducts canonical 3D box refinement.

After the 3D proposals are generated, a point cloud region pooling operation is adopted to pool the learned point representations from stage-1. Unlike existing 3D methods that directly estimate the global box coordinates, the pooled 3D points are transformed to canonical coordinates and combined with the pooled point features as well as the segmentation mask from stage-1 for learning relative coordinate refinement. This strategy fully utilizes the information provided by the robust stage-1 segmentation and proposal sub-network. To learn more effective coordinate refinements, the authors also propose a full bin-based 3D box regression loss for proposal generation and refinement, and their ablation experiments show that it converges faster and achieves higher recall than other 3D box regression losses.



Instead of representing the point cloud as voxels or multi-view formats, Qi et al. presented the PointNet architecture to learn point features directly from raw point clouds, which greatly increases the speed and accuracy of point cloud classification and segmentation. The follow-up work [PointNet++] further improves the extracted feature quality by considering the local structures in point clouds. The authors' work extends point-based feature extractors to 3D point cloud-based object detection, leading to a novel two-stage 3D detection framework that directly generates 3D box proposals and detection results from raw point clouds.


PointRCNN for Point Cloud 3D Detection



The PointRCNN architecture for 3D object detection from point cloud. The whole network consists of two parts: (a) for generating 3D proposals from raw point cloud in a bottom-up manner. (b) for refining the 3D proposals in canonical coordinate.


A. Bottom-up 3D proposal generation via point cloud segmentation


The authors propose an accurate and robust 3D proposal generation algorithm as the stage-1 sub-network, based on whole-scene point cloud segmentation. They observe that objects in 3D scenes are naturally separated without overlapping each other. The segmentation masks of all 3D objects can therefore be obtained directly from their 3D bounding box annotations, i.e., 3D points inside 3D boxes are considered foreground points. This motivates generating 3D proposals in a bottom-up manner. Specifically, the network learns point-wise features to segment the raw point cloud and simultaneously generate 3D proposals from the segmented foreground points. With this bottom-up strategy, the method avoids using a large set of predefined 3D boxes in 3D space and significantly constrains the search space for 3D proposal generation.


Learning point cloud representations.


To learn discriminative point-wise features for describing the raw point clouds, the authors use PointNet++ with multi-scale grouping as the backbone network.


Foreground point segmentation.


The foreground points provide rich information for predicting the locations and orientations of their associated objects. By learning to segment the foreground points, the point-cloud network is forced to capture contextual information for making accurate point-wise predictions, which also benefits 3D box generation. The bottom-up 3D proposal generation method is designed to generate 3D box proposals directly from the foreground points, i.e., the foreground segmentation and 3D box proposal generation are performed simultaneously. Given the point-wise features encoded by the backbone point cloud network, the authors append one segmentation head for estimating the foreground mask and one box regression head for generating 3D proposals. For point segmentation, the ground-truth segmentation mask is naturally provided by the 3D ground-truth boxes. In a large-scale outdoor scene, the number of foreground points is generally much smaller than the number of background points, so the focal loss is used to handle the class imbalance problem:

$$
\mathcal{L}_{\text{focal}}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t),
\qquad
p_t = \begin{cases} p & \text{for a foreground point} \\ 1 - p & \text{otherwise} \end{cases}
\tag{1}
$$

When training the point cloud segmentation, the authors keep the default settings $\alpha_t = 0.25$ and $\gamma = 2$ from the original focal loss paper.
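To make the focal loss concrete, here is a minimal NumPy sketch of a per-point binary focal loss. It is an illustration, not the authors' training code. The function name `focal_loss` and its arguments are hypothetical, and weighting foreground points by α and background points by 1 − α is an assumption taken from the usual focal-loss convention.

```python
import numpy as np

def focal_loss(p_fg, is_fg, alpha=0.25, gamma=2.0, eps=1e-12):
    """Per-point binary focal loss (illustrative sketch).

    p_fg  : predicted foreground probability per point, shape (N,)
    is_fg : ground-truth mask, True for points inside a ground-truth box, shape (N,)
    """
    p_fg = np.asarray(p_fg, dtype=np.float64)
    is_fg = np.asarray(is_fg, dtype=bool)
    # p_t is the probability the network assigns to the true class of each point.
    p_t = np.where(is_fg, p_fg, 1.0 - p_fg)
    # Assumed convention: alpha for foreground, (1 - alpha) for background.
    alpha_t = np.where(is_fg, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma shrinks the loss of points that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, eps, 1.0))

# An easy foreground point, a hard foreground point, and an easy background point.
probs = np.array([0.9, 0.1, 0.2])
labels = np.array([True, True, False])
losses = focal_loss(probs, labels)
```

The hard foreground point (probability 0.1) receives a far larger loss than the easy one (probability 0.9), which is how the loss keeps the many easy background points from dominating training.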


Bin-based 3D bounding box generation.


As mentioned above, a box regression head is also appended to generate bottom-up 3D proposals simultaneously with the foreground point segmentation. During training, the box regression head is only required to regress 3D bounding box locations from foreground points. Note that although boxes are not regressed from the background points, those points also provide supporting information for generating boxes because of the receptive field of the point-cloud network.

A 3D bounding box is represented as (x, y, z, h, w, l, θ) in the LiDAR coordinate system, where (x, y, z) is the object center location, (h, w, l) is the object size, and θ is the object orientation from the bird’s view.

To constrain the generated 3D box proposals, the authors propose bin-based regression losses for estimating the 3D bounding boxes of objects. To estimate the center location of an object, as shown in the illustration of bin-based localization, the surrounding area of each foreground point is split into a series of discrete bins along the X and Z axes. Specifically, a search range S is set for each of the X and Z axes of the current foreground point, and each 1D search range is divided into bins of uniform length δ to represent different object centers (x, z) on the X-Z plane. The authors observe that using bin-based classification with cross-entropy loss for the X and Z axes, instead of direct regression with smooth L1 loss, results in more accurate and robust center localization. The localization loss for the X or Z axis consists of two terms: one for bin classification along that axis, and one for residual regression within the classified bin. For the center location y along the vertical Y axis, smooth L1 loss is used directly for the regression, since most objects’ y values are within a very small range.


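Using the L1 loss is enough to obtain accurate y values. The localization targets can therefore be formulated as

$$
\begin{aligned}
\mathrm{bin}_x^{(p)} &= \left\lfloor \frac{x^p - x^{(p)} + S}{\delta} \right\rfloor, \qquad
\mathrm{bin}_z^{(p)} = \left\lfloor \frac{z^p - z^{(p)} + S}{\delta} \right\rfloor, \\
\mathrm{res}_u^{(p)} &= \frac{1}{C} \left( u^p - u^{(p)} + S - \left( \mathrm{bin}_u^{(p)} \cdot \delta + \frac{\delta}{2} \right) \right), \quad u \in \{x, z\}, \\
\mathrm{res}_y^{(p)} &= y^p - y^{(p)}
\end{aligned}
\tag{2}
$$
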
where $(x^{(p)}, y^{(p)}, z^{(p)})$ are the coordinates of a foreground point of interest,

$(x^p, y^p, z^p)$ are the center coordinates of its corresponding object,

$\mathrm{bin}_x^{(p)}$ and $\mathrm{bin}_z^{(p)}$ are the ground-truth bin assignments along the X and Z axes,

$\mathrm{res}_x^{(p)}$ and $\mathrm{res}_z^{(p)}$ are the ground-truth residuals for further location refinement within the assigned bin, and $C$ is the bin length for normalization. The orientation range $2\pi$ is divided into $n$ bins, and the bin classification target $\mathrm{bin}_\theta^{(p)}$ and residual regression target $\mathrm{res}_\theta^{(p)}$ are calculated in the same way as for the x or z prediction. The object size (h, w, l) is directly regressed by calculating the residuals $(\mathrm{res}_h^{(p)}, \mathrm{res}_w^{(p)}, \mathrm{res}_l^{(p)})$ w.r.t. the average object size of each class in the entire training set.
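The bin and residual targets for a single axis can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code. The function name `encode_bin_target` is hypothetical, the normalization constant C is taken equal to the bin length δ, the values S = 3.0 and δ = 0.5 are example numbers only, and clipping the bin index to the valid range is an added safeguard.

```python
import numpy as np

def encode_bin_target(center, point, search_range, bin_size):
    """Bin index and normalized residual along one axis (X or Z).

    center       : ground-truth object center coordinate on this axis
    point        : coordinate of the foreground point on this axis
    search_range : S, the half-width of the search window around the point
    bin_size     : delta, the uniform bin length
    """
    # Shift so that a center within the search window maps to [0, 2S).
    shifted = center - point + search_range
    num_bins = int(round(2 * search_range / bin_size))
    # Assumption: clip so centers outside the window fall into the edge bins.
    bin_idx = int(np.clip(np.floor(shifted / bin_size), 0, num_bins - 1))
    bin_center = bin_idx * bin_size + bin_size / 2
    # Residual relative to the bin center, normalized by C = bin_size.
    residual = (shifted - bin_center) / bin_size
    return bin_idx, residual

# A foreground point at x = 10.0 whose object center is at x = 11.3.
bin_idx, residual = encode_bin_target(center=11.3, point=10.0,
                                      search_range=3.0, bin_size=0.5)
```

With these example numbers the center lands in bin 8 of 12 with a normalized residual of about 0.1, so the network only has to classify the bin and regress a small offset inside it.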



Illustration of bin-based localization

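The surrounding area along the X and Z axes of each foreground point is split into a series of bins to locate the object center. In the inference stage, for the bin-based predicted parameters x, z, and θ, the bin center with the highest predicted confidence is chosen first, and the predicted residual is added to obtain the refined parameters. For the other directly regressed parameters, including y, h, w, and l, the predicted residual is added to their initial values. The overall 3D bounding box regression loss $\mathcal{L}_{\text{reg}}$, with different loss terms for training, can then be formulated as

$$
\begin{aligned}
\mathcal{L}_{\text{bin}}^{(p)} &= \sum_{u \in \{x, z, \theta\}} \left( \mathcal{F}_{\text{cls}}\left(\widehat{\mathrm{bin}}_u^{(p)}, \mathrm{bin}_u^{(p)}\right) + \mathcal{F}_{\text{reg}}\left(\widehat{\mathrm{res}}_u^{(p)}, \mathrm{res}_u^{(p)}\right) \right), \\
\mathcal{L}_{\text{res}}^{(p)} &= \sum_{v \in \{y, h, w, l\}} \mathcal{F}_{\text{reg}}\left(\widehat{\mathrm{res}}_v^{(p)}, \mathrm{res}_v^{(p)}\right), \\
\mathcal{L}_{\text{reg}} &= \frac{1}{N_{\text{pos}}} \sum_{p \in \text{pos}} \left( \mathcal{L}_{\text{bin}}^{(p)} + \mathcal{L}_{\text{res}}^{(p)} \right)
\end{aligned}
\tag{3}
$$
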
where $N_{\text{pos}}$ is the number of foreground points,

$\widehat{\mathrm{bin}}_u^{(p)}$ and $\widehat{\mathrm{res}}_u^{(p)}$ are the predicted bin assignments and residuals of the foreground point p,

$\mathrm{bin}_u^{(p)}$ and $\mathrm{res}_u^{(p)}$ are the ground-truth targets calculated as above,

$\mathcal{F}_{\text{cls}}$ denotes the cross-entropy classification loss, and

$\mathcal{F}_{\text{reg}}$ denotes the smooth L1 loss.
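The inference-time decoding of a bin-based parameter described earlier (choose the highest-confidence bin, then add the predicted residual) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The function name `decode_bin_prediction` is hypothetical, one residual per bin is assumed to be predicted, and residuals are assumed to be normalized by the bin length.

```python
import numpy as np

def decode_bin_prediction(bin_scores, bin_residuals, point, search_range, bin_size):
    """Recover a center coordinate on one axis from bin-based predictions.

    bin_scores    : classification scores, one per bin, shape (num_bins,)
    bin_residuals : normalized residuals, one per bin (assumed layout), shape (num_bins,)
    point         : coordinate of the foreground point on this axis
    """
    # Step 1: pick the bin with the highest predicted confidence.
    bin_idx = int(np.argmax(bin_scores))
    bin_center = bin_idx * bin_size + bin_size / 2
    # Step 2: add the de-normalized residual, then undo the +S shift.
    offset = bin_center + bin_residuals[bin_idx] * bin_size - search_range
    return point + offset

# Example: 12 bins of 0.5 m over a search range of 3.0 m (illustrative values).
scores = np.zeros(12)
scores[8] = 5.0
residuals = np.zeros(12)
residuals[8] = 0.1
x_center = decode_bin_prediction(scores, residuals, point=10.0,
                                 search_range=3.0, bin_size=0.5)
```

Here the decoded center is approximately 11.3: the bin-8 center (4.25) plus the residual (0.05) minus the shift (3.0), added to the point coordinate 10.0.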


To remove redundant proposals, non-maximum suppression (NMS) based on the oriented IoU from bird’s view is applied to obtain a small number of high-quality proposals. For training, the authors use 0.85 as the bird’s view IoU threshold and keep the top 300 proposals after NMS for training the stage-2 sub-network. For inference, oriented NMS with an IoU threshold of 0.8 is used, and only the top 100 proposals are kept for refinement by the stage-2 sub-network.
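The greedy NMS procedure with a top-k cap can be sketched as below. Note the simplification: the paper uses oriented bird’s view IoU, whereas this sketch substitutes an axis-aligned rectangle IoU to stay short. The function names `bev_iou_axis_aligned` and `greedy_nms` are hypothetical, and a real oriented-IoU function would be passed in through `iou_fn`.

```python
import numpy as np

def bev_iou_axis_aligned(a, b):
    """IoU of two bird's-view rectangles (x_min, z_min, x_max, z_max).

    Simplification: ignores box orientation, unlike the oriented IoU in the paper.
    """
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iz = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iz
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold, top_k, iou_fn=bev_iou_axis_aligned):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box too much."""
    boxes = np.asarray(boxes, dtype=np.float64)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    for idx in order:
        if all(iou_fn(boxes[idx], boxes[k]) <= iou_threshold for k in keep):
            keep.append(int(idx))
            if len(keep) == top_k:
                break
    return keep

# Two near-duplicate proposals and one distant proposal.
boxes = [(0.0, 0.0, 2.0, 4.0), (0.1, 0.0, 2.1, 4.0), (10.0, 10.0, 12.0, 14.0)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores, iou_threshold=0.8, top_k=100)
```

The first two boxes overlap with an IoU of about 0.90, which exceeds the 0.8 inference threshold, so the lower-scoring duplicate is suppressed and `kept` is `[0, 2]`.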



Point cloud region pooling



After obtaining the 3D bounding box proposals, the goal is to refine the box locations and orientations based on the previously generated proposals. To learn more specific local features of each proposal, the authors propose to pool the 3D points and their corresponding point features from stage-1 according to the location of each 3D proposal.

For each 3D box proposal $b_i = (x_i, y_i, z_i, h_i, w_i, l_i, \theta_i)$, the box is slightly enlarged to create a new 3D box $b_i^e = (x_i, y_i, z_i, h_i + \eta, w_i + \eta, l_i + \eta, \theta_i)$ that encodes additional information from its context, where $\eta$ is a constant value for enlarging the size of the box. For each point $p = (x^{(p)}, y^{(p)}, z^{(p)})$, an inside/outside test determines whether $p$ lies inside the enlarged bounding box proposal $b_i^e$. If so, the point and its features are kept for refining the box $b_i$. The features associated with an inside point $p$ include its 3D point coordinates $(x^{(p)}, y^{(p)}, z^{(p)}) \in \mathbb{R}^3$, its laser reflection intensity $r^{(p)} \in \mathbb{R}$, its predicted segmentation mask $m^{(p)} \in \{0, 1\}$ from stage-1, and the $C$-dimensional learned point feature representation $f^{(p)} \in \mathbb{R}^C$ from stage-1. The segmentation mask $m^{(p)}$ is included to differentiate the predicted foreground and background points within the enlarged box $b_i^e$. The learned point feature $f^{(p)}$ encodes valuable information learned for segmentation and proposal generation, and is therefore also included. Proposals that have no inside points are eliminated in the following stage.
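A rough NumPy sketch of the inside/outside test for the enlarged box follows. It is not the authors' pooling operator, and it rests on several assumptions that may differ from a given dataset: (x, y, z) is the geometric box center, Y is the vertical axis, l runs along the heading direction, w across it, and the heading direction in the X-Z plane is (cos θ, sin θ). The function name `pool_points_in_enlarged_box` and the default `eta=1.0` are illustrative.

```python
import numpy as np

def pool_points_in_enlarged_box(points, features, box, eta=1.0):
    """Select the points (and their features) that fall inside an enlarged proposal.

    points   : (N, 3) array of (x, y, z)
    features : (N, C) array of per-point features
    box      : (x, y, z, h, w, l, theta), center assumed to be the geometric center
    eta      : enlargement constant added to h, w and l (illustrative default)
    """
    cx, cy, cz, h, w, l, theta = box
    h, w, l = h + eta, w + eta, l + eta
    dx = points[:, 0] - cx
    dy = points[:, 1] - cy
    dz = points[:, 2] - cz
    # Assumed convention: the heading direction in the X-Z plane is (cos, sin).
    c, s = np.cos(theta), np.sin(theta)
    along = dx * c + dz * s      # offset along the heading (length axis)
    across = -dx * s + dz * c    # offset perpendicular to the heading (width axis)
    inside = (np.abs(along) <= l / 2) & (np.abs(across) <= w / 2) & (np.abs(dy) <= h / 2)
    return points[inside], features[inside], inside

points = np.array([[0.0, 0.0, 1.9], [1.9, 0.0, 0.0], [0.0, 3.0, 0.0]])
features = np.arange(6.0).reshape(3, 2)
box = (0.0, 0.0, 0.0, 2.0, 2.0, 4.0, np.pi / 2)  # heading along +Z under the assumed convention
_, _, tight_mask = pool_points_in_enlarged_box(points, features, box, eta=0.0)
pooled_pts, pooled_feats, loose_mask = pool_points_in_enlarged_box(points, features, box, eta=2.0)
```

With `eta=0.0` only the first point is inside. With `eta=2.0` the second point, which sits just outside the original width, is also pooled as context, while the third stays outside because it is too far above the box.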



Illustration of canonical transformation

The pooled points belonging to each proposal are transformed to the corresponding canonical coordinate system for better local spatial feature learning, where CCS denotes the Canonical Coordinate System.




 Canonical 3D bounding box refinement


Canonical transformation


To take advantage of the high-recall box proposals from stage-1 and to estimate only the residuals of the proposals' box parameters, the pooled points belonging to each proposal are transformed to the canonical coordinate system of the corresponding 3D proposal. As shown in the illustration above, the canonical coordinate system for one 3D proposal is defined so that (1) the origin is located at the center of the box proposal; (2) the local X' and Z' axes are approximately parallel to the ground plane, with X' pointing towards the head direction of the proposal and Z' perpendicular to X'; and (3) the Y' axis remains the same as that of the LiDAR coordinate system. The coordinates p of all pooled points of the box proposal are transformed to the canonical coordinate system as p˜ by proper rotation and translation. Using the proposed canonical coordinate system enables the box refinement stage to learn better local spatial features for each proposal.
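The rotation and translation into the canonical coordinate system can be sketched as follows. This is an illustration under an assumed convention, not the authors' code: the heading direction in the X-Z plane is taken to be (cos θ, sin θ), and the sign of the rotation may differ for a particular dataset. The function name `to_canonical` is hypothetical.

```python
import numpy as np

def to_canonical(points, box):
    """Transform (N, 3) points into the canonical frame of one proposal.

    box : (x, y, z, h, w, l, theta); only the center and theta are used.
    """
    cx, cy, cz, _, _, _, theta = box
    # Translation: move the origin to the proposal center.
    shifted = points - np.array([cx, cy, cz])
    # Rotation about the vertical Y axis so that X' points along the heading.
    c, s = np.cos(theta), np.sin(theta)
    x_c = shifted[:, 0] * c + shifted[:, 2] * s
    z_c = -shifted[:, 0] * s + shifted[:, 2] * c
    # Y' is left unchanged, matching condition (3) above.
    return np.stack([x_c, shifted[:, 1], z_c], axis=1)

# A proposal centered at (5, 1, 5) heading along +Z under the assumed convention.
box = (5.0, 1.0, 5.0, 1.5, 1.6, 4.0, np.pi / 2)
points = np.array([[5.0, 1.0, 7.0],    # 2 m ahead of the center
                   [5.0, 1.5, 5.0]])   # 0.5 m above the center
canonical = to_canonical(points, box)
```

The point 2 m ahead of the proposal center maps to roughly (2, 0, 0) in the canonical frame regardless of where the proposal sits in the scene, which is what lets the refinement network learn location-independent local features.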


Feature learning for box proposal refinement


The refinement sub-network combines the transformed local spatial points (features) p˜ with their global semantic features $f^{(p)}$ from stage-1 for further box and confidence refinement. Although the canonical transformation enables robust local spatial feature learning, it inevitably loses the depth information of each object. For instance, far-away objects generally have far fewer points than nearby objects because of the fixed angular scanning resolution of LiDAR sensors. To compensate for the lost depth information, the distance to the sensor, i.e., $d^{(p)} = \sqrt{(x^{(p)})^2 + (y^{(p)})^2 + (z^{(p)})^2}$, is included in the features of point p. For each proposal, the local spatial features p˜ of its associated points and the extra features $[r^{(p)}, m^{(p)}, d^{(p)}]$ are first concatenated and fed to several fully-connected layers to encode the local features to the same dimension as the global features $f^{(p)}$. The local features and global features are then concatenated and fed into a network following the structure of [28] to obtain a discriminative feature vector for the subsequent confidence classification and box refinement.
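As a small illustration of how the per-point local input could be assembled before the fully-connected layers, the sketch below computes the distance feature and concatenates it with the canonical coordinates, reflectance and segmentation mask. The function name `build_local_features` and the column order are assumptions made for this example.

```python
import numpy as np

def build_local_features(canonical_xyz, lidar_xyz, reflectance, seg_mask):
    """Concatenate [canonical xyz, reflectance r, mask m, distance d] per point.

    canonical_xyz : (N, 3) points in the canonical frame of the proposal
    lidar_xyz     : (N, 3) the same points in the original sensor frame
    reflectance   : (N,) laser reflection intensity
    seg_mask      : (N,) predicted foreground mask from stage-1 (0 or 1)
    """
    # Distance to the sensor, computed from the original (untransformed) coordinates.
    depth = np.linalg.norm(lidar_xyz, axis=1)
    extra = np.stack([reflectance, seg_mask.astype(np.float64), depth], axis=1)
    return np.concatenate([canonical_xyz, extra], axis=1)  # shape (N, 6)

lidar_xyz = np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 10.0]])
canonical_xyz = np.array([[0.5, 0.0, -0.2], [-1.0, 0.1, 0.3]])
local_feats = build_local_features(canonical_xyz, lidar_xyz,
                                   reflectance=np.array([0.3, 0.7]),
                                   seg_mask=np.array([1, 0]))
```

The last column holds the sensor distances (5.0 and 10.0 here), restoring the depth cue that the canonical transformation removes.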

Losses for box proposal refinement


Similar bin-based regression losses are adopted for proposal refinement. A ground-truth box is assigned to a 3D box proposal for learning box refinement if their 3D IoU is greater than 0.55. Both the 3D proposals and their corresponding 3D ground-truth boxes are transformed into the canonical coordinate systems, which means the 3D proposal $b_i = (x_i, y_i, z_i, h_i, w_i, l_i, \theta_i)$ and the 3D ground-truth box $b_i^{gt} = (x_i^{gt}, y_i^{gt}, z_i^{gt}, h_i^{gt}, w_i^{gt}, l_i^{gt}, \theta_i^{gt})$ are transformed to

$$
\begin{aligned}
\tilde{b}_i &= (0, 0, 0, h_i, w_i, l_i, 0), \\
\tilde{b}_i^{gt} &= (x_i^{gt} - x_i,\; y_i^{gt} - y_i,\; z_i^{gt} - z_i,\; h_i^{gt},\; w_i^{gt},\; l_i^{gt},\; \theta_i^{gt} - \theta_i)
\end{aligned}
\tag{4}
$$

The training targets for the i-th box proposal’s center location, $(\mathrm{bin}_{\Delta x}^i, \mathrm{bin}_{\Delta z}^i, \mathrm{res}_{\Delta x}^i, \mathrm{res}_{\Delta z}^i, \mathrm{res}_{\Delta y}^i)$, are set in the same way as in Eq. (2), except that a smaller search range S is used for refining the locations of the 3D proposals. The size residuals $(\mathrm{res}_{\Delta h}^i, \mathrm{res}_{\Delta w}^i, \mathrm{res}_{\Delta l}^i)$ are still regressed directly w.r.t. the average object size of each class in the training set, since the pooled sparse points usually cannot provide enough information about the proposal size $(h_i, w_i, l_i)$. For refining the orientation, the angular difference w.r.t. the ground-truth orientation, $\theta_i^{gt} - \theta_i$, is assumed to be within the range $[-\frac{\pi}{4}, \frac{\pi}{4}]$, based on the fact that the 3D IoU between a proposal and its ground-truth box is at least 0.55. Therefore, $\frac{\pi}{2}$ is divided into discrete bins with bin size $\omega$, and the bin-based orientation targets are predicted as

$$
\begin{aligned}
\mathrm{bin}_{\Delta\theta}^i &= \left\lfloor \frac{\theta_i^{gt} - \theta_i + \frac{\pi}{4}}{\omega} \right\rfloor, \\
\mathrm{res}_{\Delta\theta}^i &= \frac{2}{\omega} \left( \theta_i^{gt} - \theta_i + \frac{\pi}{4} - \left( \mathrm{bin}_{\Delta\theta}^i \cdot \omega + \frac{\omega}{2} \right) \right)
\end{aligned}
\tag{5}
$$
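The stage-2 targets can be sketched in plain Python as below: one helper expresses the ground-truth box relative to its proposal, and another computes the orientation bin and residual for an angular difference in [−π/4, π/4]. This is a hedged illustration. The function names are hypothetical, the bin size ω = π/18 (10 degrees) is an example value, and clamping the bin index is an added safeguard.

```python
import math

def canonical_gt_box(proposal, gt):
    """Express the ground-truth box relative to its proposal.

    Both boxes are (x, y, z, h, w, l, theta) tuples.
    """
    x, y, z, _, _, _, theta = proposal
    xg, yg, zg, hg, wg, lg, theta_g = gt
    return (xg - x, yg - y, zg - z, hg, wg, lg, theta_g - theta)

def orientation_target(delta_theta, bin_size):
    """Bin index and normalized residual for an angular difference in [-pi/4, pi/4]."""
    shifted = delta_theta + math.pi / 4            # now in [0, pi/2]
    num_bins = int(round((math.pi / 2) / bin_size))
    # Assumption: clamp so that delta_theta = +pi/4 falls into the last bin.
    bin_idx = min(max(int(math.floor(shifted / bin_size)), 0), num_bins - 1)
    bin_center = bin_idx * bin_size + bin_size / 2
    # The factor 2 / bin_size scales the residual to roughly [-1, 1].
    residual = (2.0 / bin_size) * (shifted - bin_center)
    return bin_idx, residual

omega = math.pi / 18  # illustrative bin size of 10 degrees, giving 9 bins
gt_canonical = canonical_gt_box((1.0, 2.0, 3.0, 1.5, 1.6, 4.0, 0.3),
                                (1.5, 2.1, 3.5, 1.6, 1.7, 4.2, 0.4))
bin_idx, residual = orientation_target(gt_canonical[6], omega)
```

Because the proposal becomes the origin of its own frame, the center and angle entries of `gt_canonical` are exactly the offsets the refinement head has to predict.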



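Therefore, the overall loss for the stage-2 sub-network can be formulated as

$$
\mathcal{L}_{\text{refine}} = \frac{1}{\|\mathcal{B}\|} \sum_{i \in \mathcal{B}} \mathcal{F}_{\text{cls}}(\mathrm{prob}_i, \mathrm{label}_i) + \frac{1}{\|\mathcal{B}_{\text{pos}}\|} \sum_{i \in \mathcal{B}_{\text{pos}}} \left( \tilde{\mathcal{L}}_{\text{bin}}^{(i)} + \tilde{\mathcal{L}}_{\text{res}}^{(i)} \right)
\tag{6}
$$
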
where $\mathcal{B}$ is the set of 3D proposals from stage-1 and $\mathcal{B}_{\text{pos}}$ stores the positive proposals for regression, $\mathrm{prob}_i$ is the estimated confidence of $\tilde{b}_i$ and $\mathrm{label}_i$ is the corresponding label, $\mathcal{F}_{\text{cls}}$ is the cross-entropy loss that supervises the predicted confidence, and $\tilde{\mathcal{L}}_{\text{bin}}^{(i)}$ and $\tilde{\mathcal{L}}_{\text{res}}^{(i)}$ are similar to $\mathcal{L}_{\text{bin}}^{(p)}$ and $\mathcal{L}_{\text{res}}^{(p)}$ in Eq. (3), with the new targets calculated from $\tilde{b}_i$ and $\tilde{b}_i^{gt}$ as above. Finally, oriented NMS with a bird’s view IoU threshold of 0.01 is applied to remove overlapping bounding boxes and generate the 3D bounding boxes for the detected objects.


FINAL OUTPUT PREDICTION USING POINTRCNN

ON LIDAR DATA (KITTI DATASET)




Thanks For Reading till the end .. Keep Learning..!
