Monocular 3D Detection Notes


This post organizes related work on monocular 3D object detection.

0. Introduction to 3D Object Detection

Recovering 3D structure from RGB images without depth information is an ill-posed problem, but with suitable prior information the task can still achieve good results.
In terms of concrete approaches, the methods can be roughly grouped into four lines of work:

  1. Proposal-based methods: first generate 3D proposals, then rank them according to a designed scoring scheme to obtain the final result.
  2. RGB-only methods: decouple the components of the 3D box, regress each component directly from the RGB image, and then assemble them into a 3D box.
  3. Perspective-transformation methods: convert the input image into a different representation (excluding methods that recover 3D structure with the help of depth).
  4. Pseudo-LiDAR methods: essentially estimate depth on the RGB image to recover 3D structure, and then run 3D detection on the resulting visual point cloud.

Below, some representative papers are roughly grouped into the four categories above and briefly reviewed along the timeline.

1. Proposal-based

2016—CVPR—Mono3D

[Paper] [Code]

Mono3D first samples candidates based on the ground prior and scores them with semantic/instance segmentation, contextual information, object shape, and location priors. (Summary quoted from the MonoFlex related work.)

It generates proposals in 3D space, projects them back to 2D, scores them with segmentation, shape, and location priors, and keeps the top-scoring proposals as the final result.

2019—ICIP—Shift R-CNN

[Paper] [Code]

Shift R-CNN avoids dense proposal sampling by "actively" regressing the offset from the Deep3DBox proposal. All of the known 2D and 3D bbox information is fed into a fast and simple fully connected network called ShiftNet, which refines the 3D location.

2019—ICCV—MonoDIS

[Paper] [Code]

Instead of directly supervising each component of the 2D and 3D bbox output, MonoDIS takes a holistic view of bbox regression and uses a 2D (signed) IoU loss and a 3D corner loss. These losses are hard to train in general, so it proposes a disentanglement technique: fix all elements except one group (consisting of one or more elements) to ground truth and compute the loss, essentially training only the parameters in that group.
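
A minimal sketch of the disentanglement trick, assuming the box is split into hypothetical location / size / yaw groups; `decode_corners` is a toy corner decoder for illustration, not the paper's implementation:

```python
import torch

def decode_corners(loc, size, yaw):
    """Toy decoder: 8 corners of a 3D box from loc (3,), size (3,) = (l, h, w), yaw (scalar)."""
    l, h, w = size[0], size[1], size[2]
    x = torch.stack([l, l, l, l, -l, -l, -l, -l]) / 2
    y = torch.stack([h, h, -h, -h, h, h, -h, -h]) / 2
    z = torch.stack([w, -w, w, -w, w, -w, w, -w]) / 2
    corners = torch.stack([x, y, z], dim=1)                       # (8, 3)
    c, s, o, i = torch.cos(yaw), torch.sin(yaw), torch.zeros_like(yaw), torch.ones_like(yaw)
    rot = torch.stack([torch.stack([c, o, s]),
                       torch.stack([o, i, o]),
                       torch.stack([-s, o, c])])                  # rotation around the y axis
    return corners @ rot.T + loc

def disentangled_corner_loss(pred, gt):
    """pred / gt: dicts with keys 'loc', 'size', 'yaw'.
    For every parameter group, all other groups are replaced by ground truth before
    decoding corners, so each loss term only trains that one group."""
    gt_corners = decode_corners(**gt)
    total = 0.0
    for group in pred:
        mixed = {k: (pred[k] if k == group else gt[k]) for k in pred}
        total = total + (decode_corners(**mixed) - gt_corners).abs().mean()
    return total
```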

2019—CVPR—MonoPSR

MonoPSR first generates 3D proposals and then reconstructs a local point cloud for each dynamic object.
The reconstruction branch regresses a local point cloud of the object and compares it with the ground truth in both the point-cloud space and the camera view (after projection).

2020—CVPR—D4LCN

D4LCN takes the idea of depth-aware convolution from M3D-RPN even further by introducing a dynamic filter prediction branch. This additional branch takes the depth prediction as input and generates a filter feature volume, which yields a different filter (in both weights and dilation rate) for each spatial location.

2. RGB Image Only

Methods that obtain 3D boxes directly from RGB images usually decouple the 3D box with the help of keypoints, object shape, and 2D-3D geometric consistency constraints, and then assemble the final result.
The objects of interest mostly have similar physical sizes, and this prior is very helpful for estimating the distance to the object.
Many methods follow 2D approaches to predict keypoints.

2017—CVPR—Deep3DBox

The 3D box regression is split into three branches. One branch predicts the deviation of the box size from the per-class mean, and the other two predict the orientation. Orientation is regressed with the MultiBin method: a confidence for each bin plus the sine and cosine of the residual angle. After obtaining the 2D and 3D outputs, the box location is solved with least squares by minimizing the reprojection error.
The drawback is a strong dependence on the accuracy of the 2D box.
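
A small sketch of how the size branch and the two orientation branches are typically decoded at inference time (the bin layout and names here are assumptions for illustration, not the paper's exact code):

```python
import numpy as np

def decode_size(class_mean_size, size_offset):
    # the network regresses the deviation from the per-class mean dimensions
    return class_mean_size + size_offset

def decode_multibin_angle(bin_conf, bin_sincos, num_bins=2):
    """bin_conf: (num_bins,) confidences; bin_sincos: (num_bins, 2) sin/cos of the
    residual angle within each bin. Pick the most confident bin and add its residual."""
    bin_centers = np.arange(num_bins) * 2.0 * np.pi / num_bins    # assumed bin layout
    b = int(np.argmax(bin_conf))
    residual = np.arctan2(bin_sincos[b, 0], bin_sincos[b, 1])
    alpha = bin_centers[b] + residual
    return (alpha + np.pi) % (2 * np.pi) - np.pi                  # wrap to [-pi, pi)
```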

2017—CVPR—Deep MANTA

Deep MANTA is a pioneer of this line of work. It first uses a cascaded Faster R-CNN architecture to regress the 2D box, the class, and template similarities. It then performs 2D/3D matching against pre-selected 3D CAD models using the EPnP algorithm.
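
The 2D/3D matching step can be reproduced with OpenCV's EPnP solver once the part keypoints and the matched CAD template are available; a rough sketch under that assumption (not Deep MANTA's actual pipeline):

```python
import numpy as np
import cv2

def match_template_epnp(template_pts_3d, keypoints_2d, K):
    """Recover the 6-DoF object pose from detected 2D keypoints and the corresponding
    3D keypoints of the selected CAD template (needs at least 4 correspondences)."""
    ok, rvec, tvec = cv2.solvePnP(
        template_pts_3d.astype(np.float32),   # (N, 3) keypoints in the object frame
        keypoints_2d.astype(np.float32),      # (N, 2) detected image keypoints
        K.astype(np.float32),                 # (3, 3) camera intrinsics
        None,                                 # no distortion coefficients
        flags=cv2.SOLVEPNP_EPNP,
    )
    R, _ = cv2.Rodrigues(rvec)                # object rotation and translation in the camera frame
    return R, tvec
```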

2018—IROS—The Earth ain’t Flat

This paper targets monocular reconstruction of vehicles on steep and graded roads. A key component is estimating the 3D shape and 6-DoF pose from a monocular image. It follows Deep MANTA for 3D model matching, but instead of picking the single best template out of all possible 3D shape templates, it captures the vehicle shape with basis vectors and deformation coefficients.

2019—CVPR—RoI-10D

10D=6DoF pose + 3DoF size + 1D shape space

2019—AAAI—MonoGRNet

[Paper] [Code]
MonoGRNet is a unified network composed of several subnetworks: 2D object detection, instance depth estimation, 3D localization, and local corner regression.
Pixel-wise depth estimation minimizes the error over all pixels of the whole image, yielding an estimate that is only optimal on average, so objects occupying a small area tend to be ignored.
The core idea is to decouple the 3D localization problem into several progressive subtasks.
The network regresses the projection of the 3D center and a coarse instance depth, and uses the two to estimate a coarse 3D location. It emphasizes that the 2D box center is different from the projection of the 3D box center onto the image.
Instead of directly regressing the (easier to observe) orientation angle, it regresses the offsets of the eight corners relative to the 3D center.
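
Recovering the coarse 3D location from the projected 3D center and the instance depth is just pinhole back-projection; a minimal sketch (names are illustrative):

```python
import numpy as np

def lift_center(u, v, depth, K):
    """Back-project the projected 3D center (u, v) with instance depth to a point
    in the camera frame using the pinhole model."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx,
                     (v - cy) * depth / fy,
                     depth])
```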

2019—ICCV—MVRA

MVRA introduces a 3D reconstruction layer to lift 2D to 3D instead of solving an over-constrained equation, with two losses in two different spaces: 1) an IoU loss in the perspective view between the reprojected 3D bbox and the 2D bbox, and 2) an L2 loss in BEV between the estimated distance and the ground-truth distance.
It also recognizes that Deep3DBox does not handle truncated boxes well, since not all four sides of the 2D bounding box still correspond to the real physical extent of the vehicle.

2019—arXiv—CenterNet—Objects as Points

[Paper] [Code]
CenterNet argues that enumerating all potential object boxes is computationally redundant and inefficient. It models each object as a single point and uses keypoint estimation to regress the center of each box together with its other properties, such as size, 3D location, orientation, and even pose.
In the head design, object locations are obtained from a heatmap, so no NMS-like post-processing or box grouping is needed. Compared with anchor-based methods, CenterNet is much simpler.
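
The NMS-free behavior comes from keeping only local maxima on the center heatmap, usually implemented with a max-pooling trick; a simplified sketch of that decoding step:

```python
import torch
import torch.nn.functional as F

def heatmap_peaks(heatmap, k=100):
    """heatmap: (C, H, W) per-class center heatmap after sigmoid.
    Keep only 3x3 local maxima and return the top-k peaks (score, class, y, x)."""
    pooled = F.max_pool2d(heatmap.unsqueeze(0), kernel_size=3, stride=1, padding=1)[0]
    peaks = heatmap * (pooled == heatmap).float()      # suppress non-maxima instead of NMS
    scores, idx = peaks.flatten().topk(k)
    c, h, w = heatmap.shape
    cls = idx // (h * w)
    ys = (idx % (h * w)) // w
    xs = idx % w
    return scores, cls, ys, xs
```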

GPP—Ground Plane Polling

GPP generates virtual 2D keypoints from the 3D box annotations.

2020—ECCV—RTM3D—Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving

RTM3D uses virtual keypoints and a CenterNet-like structure to directly detect the 2D projections of all 8 cuboid vertices plus the cuboid center. It also directly regresses distance, orientation, and size. Instead of forming cuboids directly from these values, they are used as initial values (priors) for an offline optimizer that generates the final 3D bboxes.

2020—CVPR—MonoPair—Monocular 3D Object Detection Using Pairwise Spatial Relationships

MonoPair considers the pairwise relationships between neighboring objects, which are used as spatial constraints to optimize the detection results.

It explicitly handles occluded objects.
The paper models the spatial relationship between pairs of objects and also incorporates uncertainty into the regression.
Homography Loss for Monocular 3D Object Detection is a derivative of this work; both use a graph-style formulation that captures the spatial constraints among multiple objects.
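
The uncertainty mentioned above is the usual aleatoric formulation: the network predicts a scale alongside each regression target and the loss is attenuated accordingly. A hedged sketch of such a Laplacian uncertainty-aware loss (a common form, not necessarily MonoPair's exact implementation):

```python
import math
import torch

def laplacian_uncertainty_loss(pred, target, log_sigma):
    """L1 regression loss attenuated by a predicted log-uncertainty: confident but wrong
    predictions are penalized heavily, while the +log_sigma term stops the network from
    simply claiming high uncertainty everywhere."""
    return (math.sqrt(2.0) * torch.exp(-log_sigma) * (pred - target).abs() + log_sigma).mean()
```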

2020—CVPRW—SMOKE

SMOKE removes 2D bbox regression entirely and directly predicts the 3D bbox.
It significantly reduces the distance error within 60 meters.

2020—AAAI—Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation

This is the first work to clearly state that the estimation of the 2D projections of the 3D vertices is totally decoupled from the depth estimation.
It uses a method similar to RTM3D to regress the eight projected points of the cuboid, and then uses the vertical edge height as a strong prior to guide the distance estimation, which yields a coarse 3D cuboid.
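
Using the vertical edge height as a depth prior is just the pinhole relation z = f_y * H_3D / h_2D (physical height over apparent pixel height); a tiny worked sketch:

```python
def depth_from_height(focal_y, object_height_m, pixel_height):
    """Estimate depth from a vertical edge: z = f_y * H_3D / h_2D (pinhole model)."""
    return focal_y * object_height_m / pixel_height

# e.g. a 1.5 m tall edge spanning 30 px under f_y ~ 720 px gives roughly 36 m
print(depth_from_height(720.0, 1.5, 30.0))   # 36.0
```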

2021—ICCVW—FCOS3D

3. Pseudo-LiDAR and BEV

Under the perspective view, pure-vision methods face challenges from occlusion and scale variation. Some methods address this by changing the representation of the input image.
Pseudo-LiDAR-based and BEV-based methods both essentially re-represent the input image, and many recent mainstream BEV methods also build on the Pseudo-LiDAR idea before moving to the BEV representation, so the two are merged into one category here.

BEV without Pseudo LiDAR

In the BEV representation, different vehicles do not overlap each other. IPM (inverse perspective mapping) has traditionally been used to obtain BEV images, but it assumes that all pixels lie on the ground plane and requires accurate camera intrinsics and extrinsics. For a vehicle driving on the road, the extrinsics change in real time and their accuracy may not meet the requirements of IPM.
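
A minimal IPM sketch, assuming the ground is the plane Y = 0 in a road-aligned world frame and that accurate intrinsics K and extrinsics (R, t, world to camera) are given; the BEV layout and parameter names are illustrative:

```python
import numpy as np
import cv2

def ipm_bev(image, K, R, t, bev_size=(400, 600), metres_per_px=0.1):
    """Warp a camera image to a bird's-eye view, assuming every pixel lies on the
    ground plane Y = 0 and that (R, t) map road coordinates to the camera frame."""
    bev_w, bev_h = bev_size
    # homography mapping a ground point (X, Z, 1) to homogeneous image pixels: K [r1 r3 t]
    H_ground2img = K @ np.column_stack([R[:, 0], R[:, 2], t])
    # map a BEV pixel (u, v) to ground coords: X centred, Z growing away from the camera
    S = np.array([[metres_per_px, 0.0, -bev_w / 2 * metres_per_px],
                  [0.0, -metres_per_px, bev_h * metres_per_px],
                  [0.0, 0.0, 1.0]])
    return cv2.warpPerspective(image, np.linalg.inv(H_ground2img @ S), bev_size)
```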

2019—IV—Deep Learning based Vehicle Position and Orientation Estimation via Inverse Perspective Mapping Image

This paper uses IMU data to calibrate the extrinsics online, obtaining more accurate IPM images, and then performs object detection on top of them.

2019—BMVC—OFT—Orthographic Feature Transform for Monocular 3D Object Detection

OFT is another way of converting the perspective view into BEV. The idea is to use an orthographic feature transform to map image features from the perspective view into an orthographic BEV. Voxel-based features are obtained by accumulating image features over each voxel's projected area, and the voxel features are then collapsed along the vertical dimension to produce features on the orthographic ground plane.
The idea of OFT is very simple, and it works well.

The review author notes that the back-projection step could have been improved by using some heuristics for a better initialization of the voxel-based features, rather than naively doing the back-projection. (This is the review author's view; I still need to read the paper more carefully to appreciate it.)
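
A rough sketch of the lifting step described above, with two simplifications: each voxel samples the feature at its projected centre instead of accumulating over the full projected area (which OFT does with integral images), and the vertical axis is collapsed with a mean instead of a learned reduction:

```python
import torch
import torch.nn.functional as F

def oft_style_lift(feat, voxel_centers, K, image_size):
    """feat: (C, Hf, Wf) image feature map; voxel_centers: (X, Y, Z, 3) voxel grid in
    camera coordinates; K: (3, 3) intrinsics scaled to the feature map; image_size: (w, h).
    Returns a BEV feature map of shape (C, X, Z)."""
    Xv, Yv, Zv, _ = voxel_centers.shape
    pts = voxel_centers.reshape(-1, 3) @ K.T               # project voxel centres
    uv = pts[:, :2] / pts[:, 2:3]
    w, h = image_size
    grid = torch.stack([uv[:, 0] / w * 2 - 1,              # normalize to [-1, 1] for grid_sample
                        uv[:, 1] / h * 2 - 1], dim=-1).view(1, 1, -1, 2)
    sampled = F.grid_sample(feat.unsqueeze(0), grid, align_corners=False)   # (1, C, 1, N)
    voxel_feat = sampled.view(feat.shape[0], Xv, Yv, Zv)
    return voxel_feat.mean(dim=2)                          # collapse the vertical (Y) axis
```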

Pseudo LiDAR

The essence of pseudo-LiDAR methods is to estimate depth from the 2D image and use monocular depth estimation to generate a point cloud.
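
The depth-to-point-cloud step that these methods share is plain back-projection with the camera intrinsics; a minimal sketch:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Back-project a dense depth map (H, W), in metres, into an (H*W, 3) pseudo-LiDAR
    point cloud in the camera frame."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```

In practice the resulting points are usually also transformed from the camera frame into the LiDAR frame before being fed to a LiDAR-based detector.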

2017—CVPR—MonoDepth—Unsupervised Monocular Depth Estimation with Left-Right Consistency

2018—CVPR—MLF—Multi-Level Fusion based 3D Object Detection from Monocular Images

Strictly speaking, MLF is the first work to propose lifting estimated depth to 3D. It uses the estimated depth to project each pixel of the RGB image into 3D space, and then fuses the resulting point-cloud features with image features to obtain the 3D bounding box.

2019—CVPR—Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving

After obtaining the pseudo-LiDAR representation, the paper directly applies state-of-the-art LiDAR-based 3D detectors.

The authors argue that representation matters: convolution on a depth map does not make sense, as neighboring pixels in the depth image may be physically far apart in 3D space. (Summary quoted from the review listed at the end.)

2018—CVPR—Frustum PointNets for 3D Object Detection from RGB-D Data


2019—ICCV—Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud

This paper points out two drawbacks of the pseudo-LiDAR approach: local misalignment caused by inaccurate depth estimation, and the long tail caused by depth artifacts around the object periphery. It proposes to use instance segmentation masks and introduces a 2D-3D box consistency loss.

ForeSeE

ForeSeE also notices these drawbacks and argues that not all pixels are equally important in depth estimation. It therefore trains two separate depth estimators, one for the foreground and one for the background, and adaptively fuses the depth maps at inference time.

2019—ICCV—AM3D—Accurate Monocular 3D Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving

AM3D proposes a multi-modal fusion module to enhance the pseudo-LiDAR points with color information.

2020—ECCV—PatchNet—Rethinking Pseudo-LiDAR Representation

PatchNet organizes the pseudo-LiDAR points into an image representation and utilizes powerful 2D CNNs to boost detection performance. (Summary quoted from the MonoFlex related work.)

2021—ICCV—Are we Missing Confidence in Pseudo-LiDAR Methods for Monocular 3D Object Detection?

This paper argues that although PL-based methods report better metrics than RGB-only methods, it does not mean that PL-based methods are strictly superior; part of the gap comes from problems in the KITTI dataset itself.

2021—CVPR—Objects are Different: Flexible Monocular 3D Object Detection

MonoFlex designs a separate way of decoupling the 3D bounding box for truncated objects.

DD3D

2022—CVPR—Homography Loss for Monocular 3D Object Detection


Related Review

XPeng—Patrick Langechuan Liu—2020
Monocular 3D Object Detection in Autonomous Driving — A Review

