for example 10 x 10 cm, and then perform a hand-crafted feature encoding on the points in each grid cell. Among other tasks, AVs need to detect and track moving objects such as vehicles, pedestrians, and cyclists in real time, and object detection from 3D lidar data has many applications, especially for autonomous driving systems. Traditionally, a lidar robotics pipeline interprets such point clouds as object detections through a bottom-up pipeline involving background subtraction, followed by spatiotemporal clustering and classification [12, 9]. Deep learning has displaced much of that: methods include PointPillars [13], which divides space into pillars, and RSNet [26], which uses a sliding block to extract features. Recently, SECOND proposed a faster, sparse version of the VoxelNet encoder for a total network runtime of 50 ms. To assess the impact of the proposed PointPillar encoding in isolation, we implemented several encoders in the official codebase of SECOND [28]; switching to TensorRT gave a further 45.5% speedup.

In this article, we will discuss PointPillars by Alex H. Lang et al. You can see a clever trick in it that avoids the 3D convolutions used in the SECOND detector. We will also look at deployment with the OpenVINO toolkit: built for the latest generations of artificial NNs, including CNNs and recurrent and attention-based networks, the toolkit extends computer vision and non-vision workloads across Intel® hardware, maximizing performance. We use its Benchmark Tool to evaluate the throughput (in FPS) and latency (in ms) of each NN model.

[14] Liu et al., Point-voxel CNN for efficient 3D deep learning, NeurIPS, 2019.
[10] Intel, "OpenVINO POT user guide." [Online].
SmallMunich code base: https://github.com/SmallMunich/nutonomy_pointpillars

(Troubleshooting note, translated: if CMake reports that you are running version 3.16.6 and aborts, the minimum version demanded in CMakeLists.txt is too high; lower it.)

Given that PointNet consumes raw point cloud data, it was necessary to develop an architecture that conformed to the unique properties of point sets. Naturally, the expressiveness of the global feature vector is tied to its dimensionality (and thus to the dimensionality of the per-point features that are input to the symmetric function).
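To make the symmetric-function idea concrete, here is a minimal PyTorch sketch (an illustration, not the paper's code): a shared MLP lifts each point to a K-dimensional feature, and a max over the point axis produces a global feature that is invariant to the input ordering.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP followed by a symmetric max-pool.
    k is the global feature dimension (PointNet uses k = 1024)."""
    def __init__(self, in_dim=3, k=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, k))

    def forward(self, points):            # points: (B, N, in_dim)
        feats = self.mlp(points)          # (B, N, k), applied per point
        return feats.max(dim=1).values    # (B, k), order-invariant

net = TinyPointNet()
x = torch.rand(2, 500, 3)
assert torch.allclose(net(x), net(x[:, torch.randperm(500)]))
```

Max pooling is the symmetric function of choice in PointNet; sums and averages are also symmetric, but max worked best empirically.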
PointPillars uses a novel encoder that learns features on pillars (vertical columns) of the point cloud to predict 3D oriented boxes for objects. PointPillars [4] introduces the pillar as a new kind of point cloud encoding: a grid-shaped feature map is learned from the raw points and then interpreted by a detection head to produce the boxes. Approaches like PointPillars and Complex-YOLO, based on the bird's-eye view and 2D CNNs, are probably the most popular in industry, because 2D CNNs are efficient, their toolchains are mature, and hardware support is good. Thanks to the recent advances in 3D object detection enabled by deep learning, track-by-detection has become the dominant paradigm in 3D MOT.

We start by reviewing recent work in applying convolutional neural networks toward object detection in general, and then focus on methods specific to object detection from lidar point clouds. PointPillars uses the Single Shot Detector (SSD) [18] setup to perform 3D object detection. Bounding box height and elevation were not used for matching; instead, given a 2D match, the height and elevation become additional regression targets. On KITTI, the samples are originally divided into 7481 training and 7518 testing samples, and the x, y, z detection range is [(0, 70.4), (-40, 40), (-3, 1)] meters, respectively. (In MATLAB, to perform object detection with a trained PointPillars network, use the detect function; here we only explain the data recorded in the training info files.)

On the deployment side, the demo provides the following components: the preprocessing step converts the base feature maps (four channels) into BEV feature maps (10 channels). As shown in Figure 3, the two NN models (RPN and PFE) are the most time-consuming components, accounting for 92% of the total latency. We get the input of the RPN model from the SmallMunich evaluation pipeline (the input is a matrix of shape [1, 64, 496, 432]).

As for the PointNet segmentation network, each of the n input points needs to be assigned to one of m segmentation classes. In the pillar encoder, use zero-padding if the number of points inside a pillar is less than the maximum, and random sampling otherwise, as in the sketch below.
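A minimal NumPy sketch of that padding/sampling rule (shapes and the max_points value are illustrative; the real logic lives in the PFE preprocessing of the SECOND/OpenPCDet code bases):

```python
import numpy as np

def fill_pillar(points, max_points=100, rng=np.random.default_rng()):
    """Return a fixed-size (max_points, D) tensor for one pillar:
    randomly subsample when the pillar is too full, zero-pad when too empty."""
    n, d = points.shape
    if n >= max_points:                      # too many points: random sampling
        idx = rng.choice(n, size=max_points, replace=False)
        return points[idx]
    out = np.zeros((max_points, d), dtype=points.dtype)  # too few: zero-pad
    out[:n] = points
    return out
```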
Compared to other lidar-only methods, PointPillars achieves better results across all classes and difficulty strata except for the easy car stratum; despite only using lidar, the full detection pipeline is competitive even with fusion methods that also use camera images. PointPillars [1] is a fast, end-to-end trainable deep network for object detection in 3D point clouds. As a beginner, I was a bit confused by the usual explanations of PointPillars, a 3D object detector on point clouds invented by the ML team from nuTonomy (later Aptiv, and then Motional), so it helps to pin down the tensor layout first: P is the number of pillars in the network, N is the number of points per pillar, and K is the feature dimension. In fact, at most K points can contribute to the global feature vector.

In order to leverage existing techniques built around (2D and 3D) convolutions, many researchers and practitioners often discretize a point cloud by taking multi-view projections onto 2D space or quantizing it to 3D voxels. For methods that also use images, the problem becomes how to fuse the two modalities so that a unified network uses both effectively. For evaluation on KITTI, the 2D detection is done in the image plane, and average orientation similarity assesses the average orientation (measured in BEV) similarity for 2D detections. (If you need labels of your own, the documentation of the Lidar Labeler app explains the labeling process.)

[16] Shi et al., PV-RCNN: Point-voxel feature set abstraction for 3D object detection, CVPR, 2019.
[20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, et al., PyTorch.

(Project notes, translated: the two official ONNX download links are no longer valid, so the ONNX files, e.g. cbgs_pp_multihead_pfe.onnx, must be generated with the OpenPCDet tool. For recent spconv versions, modify pcdet/models/backbones_3d/spconv_backbone.py and pcdet/models/backbones_3d/spconv_unet.py; see the spconv GitHub issues and the OpenPCDet code. To print Chinese characters, add the usual encoding declaration at the top of trans_pfe.py. The pointpillars_ros project offers a ROS demo that visualizes the 3D detections.)

On performance targets: the throughput requirement for transportation-infrastructure use cases (e.g., 3D point clouds generated by roadside lidars) is 10 FPS. We developed the pipeline modes below to meet different targets. In Latency Mode, each step in the pipeline starts running right after the previous step finishes, i.e., the steps execute seamlessly in sequence; as Figure 13 shows, at any given moment the pipeline uses either the CPU or the iGPU, never both.
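A minimal sketch of that Latency Mode schedule (the stage functions are hypothetical placeholders standing in for the project's preprocessing, PFE/RPN inference, scatter, and postprocessing):

```python
def preprocess(points):       # CPU: raw points -> pillar tensors
    return {"pillars": points}

def pfe_infer(x):             # iGPU: Pillar Feature Extractor inference
    return {"pillar_feats": x["pillars"]}

def scatter(x):               # CPU: pillar features -> pseudo image
    return {"pseudo_image": x["pillar_feats"]}

def rpn_infer(x):             # iGPU: Region Proposal Network inference
    return {"raw_boxes": x["pseudo_image"]}

def postprocess(x):           # CPU: decode boxes, filter, NMS
    return x["raw_boxes"]

def run_latency_mode(frames):
    """One frame at a time; each stage starts only when the previous one
    finishes, so the CPU and the iGPU are never busy simultaneously."""
    return [postprocess(rpn_infer(scatter(pfe_infer(preprocess(f)))))
            for f in frames]
```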
(Forum reply, translated: Hello, I am still exploring this myself, but: prepend CUDA_VISIBLE_DEVICES=0 to the bash script. The training configuration lives in the config .py files, and it pays to learn to read them; for example, in configs/_base_/datasets/nus-3d.py the data section contains samples_per_gpu (the batch size per GPU) and workers_per_gpu (the number of data-loading workers per GPU). Hope this helps!)

The process is as follows: the base preprocessing step converts point clouds into base feature maps. Earlier detectors either use 3D convolutions, which are extremely slow, or hand-crafted encodings. Object detection in point clouds is an intrinsically three-dimensional problem, yet the classification (and segmentation) of an object should be invariant to certain geometric transformations (e.g., rotation). In the 2D backbone, the features are upsampled, Up(Sin, Sout, F), from an initial stride Sin to a final stride Sout (both again measured with respect to the pseudo-image), and the upsampled outputs Up1, Up2 and Up3 are concatenated together to create 6C output features. A personal implementation of the classification portion of PointNet is available at github.com/luis-gonzales/pointnet_own. Likewise, the Feature Encoder (Pillar Feature Net) converts the point cloud into a sparse pseudo-image: first, the point cloud is divided into grids in the x-y coordinates, creating a set of pillars.

CenterPoint predicts the relative offset (velocity) of objects between consecutive frames, which are then linked up greedily -- so in CenterPoint, 3D object tracking simplifies to greedy closest-point matching.

[10] Shi et al., PointRCNN: 3D object proposal generation and detection from point cloud, CVPR, 2019.

For deployment, run the MO to convert pfe.onnx to the IR format (FP16), then run the MO to convert rpn.onnx to the IR format (FP16); we then have the IR files output by MO for the two NN models used in PointPillars. Note that load_network() takes a long time (usually 3-4 s on the iGPU), as it needs dynamic OpenCL compilation. The Benchmark Tool creates simulated inputs to evaluate each NN model's performance, and with it we evaluate the throughput and latency of the CPU and iGPU in the Intel® Core™ i7-1165G7 processor. Based on the evaluation results shown in Table 3 and Table 4, we further accelerate the inference by using the POT [10] to convert the RPN model from FP16 to INT8 resolution (work on the PFE model is still in progress). In the pipelined schedule, at T4, after the scattering is done, the main thread starts two jobs; at T6, once notified of the completion of the RPN inference for the previous frame, it collects the outputs. As shown in Table 10, in comparison to the PyTorch* original models there is no penalty in accuracy from using the IR models and the static input shape, and Table 5 shows that quantization significantly reduces the file size of the RPN model weights, from 9.2 MB (FP16) to 5.2 MB (INT8).

Two KITTI-specific details: due to an artifact of the ground truth annotations, only lidar points which project into the front image are utilized, which is only ~10% of the entire point cloud; and at inference time we apply axis-aligned non-maximum suppression (NMS) with an overlap threshold of 0.5 IoU, as sketched below.
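A minimal NumPy sketch of axis-aligned NMS at that threshold (illustrative; production code uses the projects' optimized kernels):

```python
import numpy as np

def nms_axis_aligned(boxes, scores, iou_thresh=0.5):
    """boxes: (M, 4) as [x1, y1, x2, y2] in BEV; returns kept indices."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]   # drop overlapping boxes
    return keep
```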
Since we use SmallMunich to generate the ONNX models, we need to migrate not only the NN models (PFE and RPN) but also the non-DL processing from the SmallMunich code base to the OpenPCDet code base; about 90% of the source code of the two projects is the same, the major differences being the anchor generation, the bounding box generation, the box filtering in post-processing, and the NMS. OpenPCDet is used as the demo application. For the PFE model, the input shape is variable for each point cloud frame, as the number of pillars per frame varies. The inference with the PFE and RPN models runs on separate threads created automatically by the IE via async_infer(), and these threads run on the iGPU. (Configuration note, translated: adapt the data paths in bootstrap.yaml by changing InputFile and OutputFile.)

Among models that only take lidar as input, PointPillars [16, 27] and PIXOR [28] represent two variants of architectures: models based on PointPillars apply a shallow PointNet [20] in their first layer, while models based on PIXOR discretize the height dimension [35, 29, 32]. In PointNet, finally, a three-layer fully-connected network is used to map the global feature vector to the k output classification scores.

[1] C. Qi et al., "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation", 2017.
[3] M. Jaderberg et al., "Spatial Transformer Networks", 2015.
[11] Shi et al., From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network, 2020.

Related MATLAB topics: Train PointPillars Object Detector and Perform Object Detection; Lidar 3-D Object Detection Using PointPillars Deep Learning; Code Generation for Lidar Object Detection Using PointPillars Deep Learning; Unorganized to Organized Conversion of Point Clouds Using Spherical Projection; Getting Started with Point Clouds Using Deep Learning.

Back to the architecture: after the encoder, the network has a 2-D CNN backbone that consists of encoder-decoder blocks, and it concatenates the output features at the end of each decoder block. Conversely, if a sample or pillar has too little data to populate the tensor, zero padding is applied. The scattered tensor is not a real image, which is why it is called a pseudo-image; in other words, the z axis is not discretized.
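A minimal sketch of the scatter step that assembles this pseudo-image (the 496 x 432 grid matches the RPN input shape quoted earlier; names are illustrative):

```python
import numpy as np

def scatter_to_pseudo_image(pillar_features, coords, H=496, W=432):
    """Scatter per-pillar feature vectors back onto the BEV grid.
    pillar_features: (P, C) learned features, one row per non-empty pillar.
    coords: (P, 2) integer (row, col) grid indices of each pillar.
    Returns a dense (C, H, W) pseudo-image consumed by the 2D backbone."""
    C = pillar_features.shape[1]
    canvas = np.zeros((C, H, W), dtype=pillar_features.dtype)
    canvas[:, coords[:, 0], coords[:, 1]] = pillar_features.T
    return canvas
```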
To motivate the grid generator and sampler of Spatial Transformer Networks further, imagine that the output of a localization net corresponds to rotating a handwritten "7" by an angle θ; in order to create a new image with the proper rotation, the original image needs to undergo appropriate sampling. As can be seen, the ST provides pose normalization to an otherwise rotated input. Invariance matters for point sets too: given N data points, there are N! possible input orderings, and the network's output must not depend on which one it sees.

(Troubleshooting notes, translated: if CMake fails with "CMake Error at test/CMakeLists.txt:1 (add_subdirectory): The source directory ... does not exist", the test/gtest directory is a git submodule, and the error means the project was not fully downloaded. A related forum question: "I'm trying to draw the bounding boxes from the output of lidar_point_pillars in the camera frame.")

On pillar sizing: smaller pillars allow finer localization and lead to more features, while larger pillars are faster due to fewer non-empty pillars (speeding up the encoder) and a smaller pseudo-image (speeding up the CNN backbone). On the deployment side, we need to tell the IE the input blob information and load the network to the device; as shown in Figure 12, the latencies for both PFE and RPN inferences have been significantly reduced compared to their original implementation. Before calling infer(), we need to reshape the input parameters for each frame of the point cloud, and call load_network() again if the input shape has changed, as in the sketch below.
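A sketch of that reshape-and-reload pattern using the pre-2022 OpenVINO Inference Engine Python API (the file names and the wrapper class are assumptions, not the project's actual code):

```python
from openvino.inference_engine import IECore

class PFERunner:
    """Reshape-and-recompile wrapper for a model whose input shape
    changes per frame (pre-2022 Inference Engine API)."""
    def __init__(self, xml="pfe.xml", weights="pfe.bin", device="GPU"):
        self.ie = IECore()
        self.net = self.ie.read_network(model=xml, weights=weights)
        self.input_name = next(iter(self.net.input_info))
        self.device = device
        self.shape = None
        self.exec_net = None

    def infer(self, blob):
        if blob.shape != self.shape:
            # load_network() is the expensive call (3-4 s on the iGPU),
            # so recompile only when the pillar count actually changes.
            self.net.reshape({self.input_name: blob.shape})
            self.exec_net = self.ie.load_network(self.net, device_name=self.device)
            self.shape = blob.shape
        request = self.exec_net.requests[0]
        request.async_infer({self.input_name: blob})  # non-blocking start
        request.wait()                                # block until complete
        return request.output_blobs
```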
(Forum question on the SECOND configs: what do the file-name suffixes fhd, largea, lowa, mhead and mida mean?)

One nice property of learning the encoder from raw points is flexibility: for example, it can easily incorporate multiple lidar scans, or even radar point clouds. PointNet takes raw point cloud data as input, which is typically collected from either a lidar or radar sensor. A pillar is a vertical column that can extend infinitely up and down. A typical automotive lidar spins at 10 frames per second, capturing approximately 100K points per frame, and the network generates 3-D bounding boxes for different object classes such as cars, trucks, and cyclists. If a sample or pillar holds too much data to fit in its tensor, the data is randomly sampled. One of the key applications is to leverage long-range and high-precision data sets to achieve 3D object detection for perception, mapping, and localization algorithms. Most recent methods improve the runtime by projecting the 3D point cloud either onto the ground plane [11, 2] or the image plane [14]. While all our experiments were performed in PyTorch [20], the final GPU kernels for the encoder, backbone and detection head were built using NVIDIA TensorRT, which is a library for optimized GPU inference.

On the Intel side: given the benchmark results, and the fact that the quantization of the PFE model is still in progress, we decided to use the PFE (FP16) and RPN (INT8) models in the processing pipeline for the PointPillars network. In comparison with the existing solutions we are aware of, our solution achieves a throughput of 11.1 FPS and a latency of 154.7 ms on Intel® Core™ processors. Before running on Intel® architecture processors, the NN models can be optimized by the MO [7], and the network can be further refined and compressed with NNCF in the OpenVINO™ toolkit.

[15] Chen et al., Fast Point R-CNN, ICCV, 2019.
[12] KITTI, "KITTI 3D detection dataset." [Online].
OpenVINO IE integration guide: https://docs.openvinotoolkit.org/latest/openvino_docs_IE_DG_Integrate_with_customer_application_new_API.html

(Troubleshooting note, translated: an spconv version change causes "AttributeError: module 'spconv' has no attribute 'SparseModule'"; a compatibility shim is sketched below.)
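A common fix, under the assumption that spconv 2.x is installed (where the PyTorch modules moved into the spconv.pytorch namespace):

```python
try:
    # spconv 2.x layout: PyTorch modules live under spconv.pytorch
    import spconv.pytorch as spconv
    from spconv.pytorch import SparseModule
except ImportError:
    # spconv 1.x layout: everything sits at the top level
    import spconv
    from spconv import SparseModule
```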
PFE FP16, RPN INT8*: refer to the section "Quantization to INT8". In comparison with the Latency Mode, the main idea of the Throughput Mode is to maximize the parallelization of the PFE and RPN inferences on the iGPU to achieve the maximal throughput. In this part of the experiment setup, we explain the detailed process of the experiment. (A few SmallMunich helpers are unused by OpenPCDet and need to be removed in the migration.)

A key feature of PointPillars, again: the z axis is not discretized, so there is no need for a hyper-parameter to control the binning in the z dimension. So, what was the state of the art for lidar-only object detection when we started our research? The fusion step used in MV3D and AVOD forces them to use two-stage detection pipelines, while PIXOR and Complex YOLO use single-stage pipelines. The backbone has two sub-networks, and a convolutional neural network (CNN) head produces the network predictions, which the pipeline then decodes into boxes. We provide qualitative results in Figures 3 and 4, and Table 9 shows the car AP results on the KITTI test 3D detection benchmark. During training, each ground-truth box is rotated (uniformly drawn from [-π/20, π/20]) and translated (x, y, and z independently drawn from N(0, 0.25)) to further enrich the training set.

(Translated from a Chinese discussion thread:) With 3D operators it is genuinely hard to say what INT8 quantization will do. As for raw accuracy, you can always pile on tricks: modified losses, data augmentation, more data; brute force works wonders. The recent range-image methods such as RangeRCNN are also worth watching, although RangeRCNN's structure is quite complex for the sake of SOTA accuracy; a simplified variant might hit a good trade-off. A personal gripe: academic 3D detection still mostly chases AP on KITTI-style leaderboards and largely ignores deployment. PointRCNN, for example, is accurate, but shipping it is daunting. KITTI AP has been pushed so high that both the ideas and the headroom feel nearly exhausted, while the cars on the road remain far from smart; academia and industry seem to be drifting apart. Most deployed stacks are based on the PointPillars algorithm, with some recent range-view-based ones.

(The survey below is translated from the Chinese original.) Many survey articles roughly divide LiDAR point-cloud detection algorithms into four categories: multi-view methods, voxel methods, point methods, and point-voxel hybrids. Such taxonomy-style surveys read like an algorithm library, convenient for keyword lookup and archiving, and better suited to people already working in the field; readers without that background can get lost in the forest of algorithms from the start. So, unlike a typical survey, this one follows a timeline, roughly dividing the development of LiDAR point-cloud detection into four periods: infancy, early growth, rapid development, and deployment. I will introduce the algorithms from the angle of technical evolution, together with some lessons from my own research, selecting only works of milestone significance rather than covering everything; there will surely be omissions, but if you finish with a basic sense of the field's trajectory, the goal is met.

Some basics before the details. A LiDAR outputs a 3D point cloud; besides its X, Y, Z coordinates, each point carries a reflectance intensity R, similar to the RCS in millimeter-wave radar. The goal of 3D detection is to find all objects of interest in the scene: in autonomous driving, vehicles, pedestrians, static obstacles, and so on. Taking vehicles as an example, the detector outputs several 3D rectangular boxes (3D bounding boxes, 3D BBoxes), one per object. A 3D BBox can be parameterized in several ways; the most common uses the 3D center coordinates, the length, width and height, and a 3D rotation (often simplified to the in-plane yaw θ). A predicted BBox is compared against the human annotation via the 3D IoU (Intersection over Union); above a chosen threshold it counts as a successful detection, otherwise the object is missed (a false negative), and a BBox in an empty region is a false positive. Evaluation combines both aspects into a single score such as AP (Average Precision); since this is not the focus here, the details are omitted, but the box overlap test itself is sketched below.
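A minimal sketch of the in-plane (BEV) IoU between two such boxes, using shapely for the rotated-polygon overlap (the [x, y, w, l, yaw] parameterization is an illustrative choice):

```python
import numpy as np
from shapely.geometry import Polygon

def bev_iou(box_a, box_b):
    """BEV IoU of two boxes given as [x, y, w, l, yaw]: center, size, heading."""
    def corners(box):
        x, y, w, l, yaw = box
        dx = np.array([l, l, -l, -l]) / 2
        dy = np.array([w, -w, -w, w]) / 2
        c, s = np.cos(yaw), np.sin(yaw)
        return np.stack([x + c * dx - s * dy, y + s * dx + c * dy], axis=1)
    pa, pb = Polygon(corners(box_a)), Polygon(corners(box_b))
    return pa.intersection(pb).area / pa.union(pb).area
```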
With that groundwork, the algorithm journey begins. Object detection took off in computer vision: since deep learning arrived in 2012, image and video detectors have improved enormously, with a stream of classics from the early R-CNN through Faster R-CNN and YOLO to the recent CenterNet, and the problem is extremely well studied. Naturally, people tried to carry that success over to point clouds. VeloFCN [2] is the representative early method: it converts the 3D point cloud into a camera-like front view, obtaining a "point-cloud pseudo-image" similar enough to real images in format and character that image detectors can be applied nearly unchanged. The defects are just as obvious: several points can map to the same image position, losing information, and, more importantly, projecting 3D points onto a 2D plane buries the depth in the pixel values, making 3D information hard to recover. Hence the idea of mapping the point cloud to the top-down bird's-eye view (BEV) instead: intuitively, drop the height coordinate (or treat it as a point feature) and you get a 2D representation. MV3D [3] maps the point cloud to both the front view and the BEV and fuses them with 2D imagery; as for the back-end detector, this period generally used R-CNN or similar methods.

Then came 2017, which produced two milestone works in point-cloud object detection: VoxelNet [4] and PointNet++ [5], representing the two basic directions of point-cloud processing. VoxelNet quantizes the cloud into grid data, while PointNet++ processes the unstructured points directly. Both deserve a closer look, because nearly every later detector builds on concepts from them. VoxelNet was proposed by two researchers at Apple and published at CVPR 2018, a top venue in computer vision and pattern recognition.

The idea is not complicated. First the point cloud is quantized into a uniform 3D grid ("grouping" in the original figure). A fixed number of points is randomly sampled inside each voxel (with repetition if there are too few), and each point is represented by a 7-dimensional feature: its X, Y, Z coordinates, the reflectance R, and the offsets ΔX, ΔY, ΔZ to the centroid (the mean position of the points in the voxel). Fully connected layers extract per-point features, and each point's feature is then concatenated with the mean feature of all points in the voxel, giving new point features whose advantage is capturing both the individual point and its small local region (the voxel). This feature-extraction step can be repeated to strengthen the descriptors (the stacked Voxel Feature Encoding), and finally a max pooling over the voxel's points yields a fixed-length feature vector.

These steps form the feature-learning network, whose output is a 4D tensor (X, Y, Z coordinates plus features). That differs from ordinary image data (a 3D tensor with only X, Y and features), so an image detector cannot be attached directly; VoxelNet compresses the Z dimension with 3D convolutions (e.g. stride 2). Say the 4D tensor is HxWxDxC: after several 3D convolutions the Z size is reduced to 2 (HxWx2xC'), and Z is then merged into the channel dimension to give a 3D tensor (HxWx2C'). This is the standard image-like format, so an image detection network (e.g. a Region Proposal Network, RPN) can be attached to generate the detection boxes, here 3D ones. The VoxelNet framework is thus very clean, and it is the first point-cloud detector that can truly be trained end to end; experiments showed that this end-to-end scheme learns usable information from the cloud automatically and beats hand-crafted features.

PointNet++ descends from PointNet [6], published by Stanford researchers in 2017 and itself a foundational work in point-cloud processing. PointNet handles point-cloud classification by operating on the raw points directly: apart from some geometric transformations, it has two main operations, an MLP (several fully connected layers) extracting per-point features, and a MaxPooling producing a global feature. The classification network consumes the global feature, while the segmentation variant consumes both the global and the per-point features. Roughly, you can regard the PointNet classifier as a classifier in the traditional sense, say an SVM; for object detection you would additionally need something like a sliding-window mechanism, applying PointNet at every location to separate object from background, which is very inefficient on sparse point clouds. The same authors therefore proposed the upgraded PointNet++ the same year. Its core idea is to generate candidate regions (point sets) by clustering and to run PointNet inside each region; this repeats hierarchically, with the point sets output by each round treated as an abstracted point cloud for the next (Set Abstraction, SA), so the resulting point features have a large receptive field and rich local context. Finally, PointNet classification on the points output by several SA layers separates objects from background; the method also supports segmentation. Compared with VoxelNet, PointNet++ loses no information to quantization and needs no quantization hyper-parameters, and it skips empty regions entirely. The drawbacks are equally clear: it cannot reuse the mature spatial-convolution-based 2D detectors, and although it avoids wasted computation, GPUs process point sets far less efficiently than grids, so in practice it runs even slower.

After VoxelNet and PointNet++, 3D detection entered a period of rapid development, with many algorithms proposed to fix their shortcomings. VoxelNet's main problems are the inefficient data representation and the heavy 3D convolutions in the middle layers, which limit it to roughly 2 FPS (frames per second), far below real time, so much follow-up work targeted efficiency. SECOND [7] adopts sparse convolutions to skip the useless computation in empty regions, lifting the speed to 26 FPS while also cutting memory use. PIXOR [8] compresses the 3D voxels into 2D pixels with hand-designed features; this avoids 3D convolutions but discards height information and loses considerable accuracy. PointPillars [9] also follows the 3D-to-2D route, quantizing the cloud onto a 2D grid in the XY plane, but unlike PIXOR's hand-crafted features it simply stacks the points that fall into each cell, vividly called a pillar, learns features with a PointNet-like network, and maps the learned vectors back onto the grid coordinates to obtain image-like data. This avoids both VoxelNet's 3D convolutions and the wasted computation on empty space (62 FPS), while dodging the information loss and poor adaptability of hand-designed features: a very elegant idea. The downside is that feature learning is confined within each cell, so neighborhood context is hard to exploit.

On the point side, PointNet++'s clustering-based hierarchical feature extraction and candidate generation is inefficient and hard to parallelize, which is exactly where conventional 2D convolutional detectors excel, so later work gradually imported ideas from 2D detection. Point-RCNN [10] explored this first and counts as another milestone. As the name suggests, it marries point-cloud processing with Faster R-CNN, the seminal work of 2D detection: PointNet++ extracts point features, which are used for foreground segmentation, distinguishing object points from background points, and every foreground point also emits a 3D candidate BBox; the points inside each candidate then undergo further feature extraction to output the object class and refine the box position and size. Readers familiar with 2D detection will recognize a classic two-stage model, with the difference that candidates are generated only at foreground points, avoiding the huge cost of dense candidate boxes in 3D space. Even so, as a two-stage detector on top of PointNet++'s heavy computation, Point-RCNN runs at only about 13 FPS. It was later extended to Part-A2 [11], with gains in both speed and accuracy.

3D-SSD [12] analyzed the modules of earlier point-based methods and concluded that the Feature Propagation (FP) layers and the refinement stage are the speed bottlenecks. The FP layers map the features of the abstracted points back onto the original cloud (the "point cloud decoder" of Point-RCNN above); this step is necessary because the points output by SA do not cover all objects well, which would lose much information. 3D-SSD proposes a new clustering that considers point-to-point similarity in both geometric and feature space, so that the SA output can generate object proposals directly and the costly FP layers disappear. To also avoid region pooling in the refinement stage, 3D-SSD works directly on the representative points from SA, finds their neighborhoods with the improved clustering, and predicts classes and 3D BBoxes with a simple MLP. As an anchor-free single-stage detector, in line with the general trend in detection, it reaches 25 FPS.

For unstructured point clouds, a graph is another natural representation, but graph neural networks are comparatively complex, and despite their rapid progress there are few detection works. Point-GNN [13] is a representative attempt, in three steps: build a graph using a preset distance threshold (as sketched below); update each vertex with information from its neighbors to detect the object class and location; and finally merge the 3D boxes output by multiple vertices into the final result. The graph approach is refreshingly novel, but training and inference are extremely heavy; the paper reports that one training run takes nearly a week, which greatly limits its practicality.
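A minimal NumPy sketch of that fixed-radius graph construction (brute-force pairwise distances; the radius value is illustrative):

```python
import numpy as np

def build_radius_graph(points, radius=4.0):
    """Edges between all point pairs closer than `radius` (Point-GNN-style
    fixed-threshold graph construction; O(N^2), illustrative only)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    src, dst = np.nonzero((d < radius) & (d > 0))
    return np.stack([src, dst], axis=1)   # (E, 2) directed edge list
```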
The above reviewed the improvements along the two main directions, voxel-based and point-based. In truth, researchers had already begun, deliberately or not, to fuse the two strategies to combine their strengths. PointPillars is itself an example: the cloud is quantized voxel-style, but the features are learned PointNet-style. Its accuracy was not the best, yet the idea is well worth pondering, and a series of good works followed this direction.

Before introducing more of them, it is worth comparing the representative methods so far on detection rate and speed, using KITTI, the field's most popular benchmark (the larger and more autonomous-driving-oriented nuScenes and Waymo datasets are discussed later). KITTI data is collected with a Velodyne lidar on urban roads, and the 3D recognition task covers vehicles, pedestrians and cyclists; because some early methods report accuracy only for vehicles, the comparison uses vehicle AP on the moderate-difficulty test set, with FPS for speed, both higher-is-better. The comparison shows that voxel-based methods are fast but less accurate, while point-based methods are clearly slower but more accurate; a successful algorithm must weigh both and find the best balance between them.

Recall the main problems of each side. Voxel methods depend heavily on the quantization parameter: large cells lose much information, small cells explode the computation and memory. Point methods struggle to extract neighborhood context, and their memory access is irregular (roughly 80% of the runtime is spent on structuring the data rather than on actual feature extraction). The basic fusion recipe is therefore: use relatively low-resolution voxels to extract context features (as in PV-CNN [14]) or to generate object candidates (Fast Point R-CNN [15]), or both (PV-RCNN [16], SA-SSD [17]), and then combine them with the original points, so that per-point detail and the spatial relations between points are preserved as well.

In PV-CNN, one branch extracts neighborhood-aware features from low-resolution voxels and interpolates them back onto each point; the other branch applies an MLP directly to the raw points, which lacks neighborhood information but extracts precise per-point features; finally the two branches are concatenated as input to the next stage. Similarly, one branch of PV-RCNN quantizes the cloud into voxels at several resolutions to extract context features and generate 3D proposals, while another performs Set-Abstraction-style point feature extraction, with the twist that a point's neighbors are taken from the voxels rather than the raw cloud; since the voxels carry multi-resolution context, the point features capture both the individual point and its neighborhood, much as in PV-CNN. Note that PV-RCNN and Fast Point R-CNN are both two-stage methods with an ROI-pooling step, which hurts speed (only 12.5 FPS for PV-RCNN and 16.7 FPS for Fast Point R-CNN). SA-SSD instead uses auxiliary tasks, foreground segmentation and object-center estimation, to guide the voxel branch to learn point features and exploit inter-point relations, while avoiding 3D proposals and ROI pooling altogether; as a single-stage detector it reaches 25 FPS with accuracy only slightly below PV-RCNN (79.79% vs. 81.43%).

With the strategies thoroughly explored and much experience accumulated during the rapid-development period, the natural next step was to settle on the best recipes and connect the algorithms to real applications, so the focus shifted toward practicality and deployability. For autonomous driving, lidar-based 3D detection is a core source of perception signals, so real-time performance and accuracy must be balanced carefully. Lidar also often serves as an auxiliary sensor for offline annotation: millimeter-wave radar point clouds, for instance, are so sparse and unintuitive that precise labeling is nearly impossible without lidar or camera support, and automatic detection is combined with manual labeling to raise efficiency; in that use case accuracy matters most and speed much less. Current development therefore has two threads: pursuing the speed-accuracy balance, and maximizing accuracy under an acceptable speed. The former generally means voxels plus a single-stage detector; the latter usually fuses voxels and points, possibly with a two-stage detector for finer boxes. A few of the newest (2021) works illustrate this.

SIENet [18] is a two-stage voxel-point fusion method with a fusion strategy similar to PV-RCNN's. To handle the relative sparsity of distant objects, it adds an auxiliary branch that treats the voxel grid as extra points to complete far-away shapes. It reaches 81.71% AP on KITTI vehicles but only about 12.5 FPS, essentially on par with PV-RCNN. Voxel R-CNN [19] is also two-stage but uses only voxels for feature extraction, giving a leaner design; a specially designed Voxel ROI Pooling module further improves precision, and the rest closely resembles the usual voxel-based methods. It matches SIENet on KITTI vehicle AP (81.62%) at double the speed (25.2 FPS). CIA-SSD [20] is a single-stage voxel method: feature extraction resembles SECOND's sparse 3D convolutions, except that the initial feature of each voxel is simply the mean of its points (no multi-stage MLP as in VoxelNet), the spatial resolution is reduced progressively to cut computation, and the Z-direction features are finally concatenated into a 2D feature map (as in VoxelNet). As a single-stage detector it borrows tricks from image detection: a structure similar to a Feature Pyramid Network (FPN), though with more intricate details, to extract spatial and semantic features, and an IoU-prediction branch to correct the classification confidence and assist NMS, mitigating the usual mismatch between confidence and localization quality in single-stage detectors. With these strategies CIA-SSD reaches 80.28% vehicle AP on KITTI at 33 FPS; its extension SE-SSD [21] keeps the speed and lifts AP to 82.54%, already surpassing the two-stage voxel-point fusion detectors.

CenterPoint [22] evolved from CenterNet, currently popular in image detection, but unlike CenterNet's single stage it uses two. In the first stage, the feature-extraction backbone can be VoxelNet- or PointPillars-style, yielding an HxWxC image-plus-channels structure; the detection head closely follows CenterNet, with the Gaussian parameters adjusted to increase the number of positive samples, given how sparse BEV data is. In the second stage, starting from the predicted BBoxes, the point features at the box edges are collected and an MLP refines the confidence and the box parameters; this resembles a classic second stage except that ROI pooling happens only at a few sparse points, keeping it efficient. The whole pipeline is remarkably clean, yet its accuracy reaches the state of the art on both nuScenes and Waymo, which says much about the industry's current pursuit of simplicity and efficiency.

All of the accuracy and speed figures above refer to KITTI. In the last two years, to evaluate 3D detectors better and closer to real autonomous-driving scenes, industry has built two far larger datasets, the Waymo Open Dataset and nuScenes, each about two orders of magnitude bigger than KITTI. Both run 3D detection challenges that give researchers and engineers a clear view of the most practical techniques; notably, the 2021 Waymo challenge added a runtime requirement, further stressing deployability.
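Looping back to the CenterPoint tracking idea quoted earlier (objects linked across frames by predicted velocity), a minimal greedy closest-point matcher might look like this (thresholds and shapes are illustrative):

```python
import numpy as np

def greedy_track_match(prev_centers, curr_centers, velocities, dt, max_dist=2.0):
    """Match current detections to previous-frame tracks, CenterPoint-style:
    back-project each current center by its predicted velocity, then greedily
    pair it with the nearest unclaimed previous center within max_dist."""
    predicted = curr_centers - velocities * dt            # BEV position at t-1
    dists = np.linalg.norm(predicted[:, None] - prev_centers[None], axis=2)
    matches, used = {}, set()
    for i in np.argsort(dists.min(axis=1)):               # best candidates first
        j = int(np.argmin(dists[i]))
        if dists[i, j] < max_dist and j not in used:
            matches[int(i)] = j                           # detection i -> track j
            used.add(j)
    return matches
```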
Judging from the winning entries of the last two of those competitions, voxel-based single-stage methods have become the mainstream, matching the broader trend in image object detection. Many of the techniques described above, such as lightweight voxel feature extraction, sparse 3D convolutions, FPN-style necks and IoU-prediction branches, all appear in the winning solutions. This reflects, from another angle, the current state of the art, and it points the way for the further development of 3D object detection.