On the Feasibility and Opportunity of
Autoregressive 3D Object Detection

1The Ohio State University, 2Cornell University, 3Boston University, 4Stanford University

CVPR 2026 Findings

Abstract

LiDAR-based 3D object detectors typically rely on proposal heads with hand-crafted components like anchor assignment and non-maximum suppression (NMS), complicating training and limiting extensibility. We present AutoReg3D, an autoregressive 3D detector that casts detection as sequence generation. Given point-cloud features, AutoReg3D emits objects in a range-causal (near-to-far) order and encodes each object as a short, discrete-token sequence consisting of its center, size, orientation, velocity, and class.

This near-to-far ordering mirrors LiDAR geometry — near objects occlude far ones but not vice versa — enabling straightforward teacher forcing during training and autoregressive decoding at test time. AutoReg3D is compatible across diverse point-cloud backbones and attains competitive nuScenes performance without anchors or NMS. Beyond parity, the sequential formulation unlocks language-model advances for 3D perception, including GRPO-style reinforcement learning for task-aligned objectives. These results position autoregressive decoding as a viable, flexible alternative for LiDAR-based detection and open a path to importing modern sequence-modeling tools into 3D perception.


How It Works

1. Point Cloud Encoder: Pillar, Voxel, Transformer, or Mamba backbone
2. Tokenize Objects: Each 3D box → 10 discrete tokens
3. Autoregressive Decoder: Near-to-far autoregressive generation

[Model architecture diagram]
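As an illustrative sketch of step 2, a box can be quantized into 10 discrete tokens with the class token first. The bin count and value ranges below are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

NUM_BINS = 256  # assumed vocabulary size per attribute (illustrative)

def quantize(value, lo, hi, num_bins=NUM_BINS):
    """Map a continuous value in [lo, hi] to a discrete bin index."""
    t = (np.clip(value, lo, hi) - lo) / (hi - lo)
    return int(round(t * (num_bins - 1)))

def box_to_tokens(cls_id, center, size, yaw, velocity):
    """3D box -> 10 tokens: class, center (x, y, z), size (w, l, h),
    yaw, velocity (vx, vy). Ranges are illustrative assumptions."""
    tokens = [cls_id]
    tokens += [quantize(c, -54.0, 54.0) for c in center]    # center x, y, z
    tokens += [quantize(s, 0.0, 20.0) for s in size]        # box dimensions
    tokens += [quantize(yaw, -np.pi, np.pi)]                # orientation
    tokens += [quantize(v, -10.0, 10.0) for v in velocity]  # BEV velocity
    return tokens

tokens = box_to_tokens(2, (5.0, -3.2, 0.8), (1.9, 4.5, 1.6), 0.3, (1.2, 0.0))
```

Decoding reverses the mapping by taking each bin's center value, so the sequence fully specifies a box up to quantization error.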

What AutoReg3D Eliminates

Anchor assignment & configuration
Hand-crafted per-attribute loss weighting
Proposal matching (e.g., Hungarian)
Confidence score thresholding
Non-Maximum Suppression (NMS)
Replaced by a single unified cross-entropy loss over tokenized sequences with autoregressive decoding.
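As a rough sketch of that unified objective (shapes and the random "decoder outputs" are placeholders, not the paper's implementation), every box attribute is supervised by one cross-entropy loss over the flattened token sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len = 512, 30                        # e.g. 3 objects x 10 tokens
logits = rng.normal(size=(seq_len, vocab))      # stand-in decoder outputs
targets = rng.integers(0, vocab, size=seq_len)  # teacher-forced token labels

# Numerically stable log-softmax, then negative log-likelihood of the
# target token at each position -- one loss for center, size, yaw,
# velocity, and class alike.
shifted = logits - logits.max(axis=1, keepdims=True)
logp = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -logp[np.arange(seq_len), targets].mean()
```

Because every attribute shares the same token vocabulary and loss, no per-attribute weighting or matching step is needed.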

Key Contributions

  1. First fully autoregressive 3D detector that directly generates object sequences from point clouds, performing on par with leading proposal-based and query-based detectors.
  2. Detailed ablation of design factors—including object-level tokenization, sequence ordering, and decoding methodologies—that are critical for effective autoregressive 3D detection.
  3. New capabilities enabled by the autoregressive formulation, including RL fine-tuning and promptable decoding.

Results on nuScenes

We evaluate AutoReg3D across four different point-cloud encoder families. For fair comparison, confidence thresholds for baseline detectors are selected to maximize F1 on the training set.

Method Encoder Det. Head Prec. Rec. F1
PointPillars Pillar Conv. Anchor-based 58.3 50.0 53.1
CenterPoint Pillar Conv. Center-based 67.9 53.3 59.5
AutoReg3D (Ours) Pillar Conv. Autoregressive 69.6 52.4 59.2
SECOND Voxel Conv. Anchor-based 63.5 55.6 59.1
CenterPoint Voxel Conv. Center-based 72.8 60.3 65.8
AutoReg3D (Ours) Voxel Conv. Autoregressive 74.9 59.4 65.8
DSVT Transformer Non-Autoregressive 79.1 66.3 71.6
AutoReg3D (Ours) Transformer Autoregressive 77.0 64.1 69.5
LION Mamba Non-Autoregressive 78.6 68.3 72.5
AutoReg3D (Ours) Mamba Autoregressive 77.5 65.2 70.4

nuScenes Validation Detection Performance. AutoReg3D achieves competitive performance across all encoder types. Notably, it attains higher precision than regression-based counterparts on pillar and voxel backbones — a benefit of autoregressive generation where each box is conditioned on previous outputs.


Reinforcement Learning Fine-Tuning

While teacher forcing optimizes token likelihood, it does not explicitly optimize the set-level detection objective. The autoregressive formulation uniquely enables RL-based optimization with task-aligned rewards, improving global consistency. Using GRPO with an IoU-based reward, we improve the voxel-based model's F1 score.

GRPO fine-tuning boosts F1 through improved recall: the task-aligned reward encourages more complete detection sequences and recovers objects that were previously missed.

Model            Prec.  Rec.  F1
Teacher Forcing  74.9   59.4  65.8
+ GRPO           74.5   60.9  66.7
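A minimal sketch of the group-relative advantage used in GRPO-style fine-tuning, assuming a set-level IoU reward per sampled detection sequence. The reward values below are illustrative stand-ins, not measured numbers.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each sample's reward by the
    mean and std of its group, so better-than-average sequences get a
    positive weight in the policy-gradient update."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. mean IoU against ground truth for 4 sampled detection sequences
group_rewards = [0.62, 0.55, 0.71, 0.48]
adv = grpo_advantages(group_rewards)
# Samples with positive advantage upweight their token log-probabilities
# in the update; negative-advantage samples are pushed down.
```

The key property is that no learned value function is needed: the group itself serves as the baseline.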

Ablation Studies

Object Ordering

Order         F1
Random        56.3
Point Number  61.8
Near-to-far   65.8

Near-to-far ordering significantly outperforms alternatives by exploiting natural inter-object dependencies.
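The near-to-far serialization amounts to sorting objects by their BEV range from the ego vehicle before tokenization; a minimal sketch (the `boxes` schema is an assumption for illustration):

```python
import math

def near_to_far(boxes):
    """Sort boxes (dicts with 'center' = (x, y, z)) by BEV distance
    from the ego origin, so less-occluded near objects come first."""
    return sorted(boxes, key=lambda b: math.hypot(b["center"][0],
                                                  b["center"][1]))

scene = [{"id": "far", "center": (30.0, 40.0, 0.0)},
         {"id": "near", "center": (3.0, 4.0, 0.0)}]
ordered = near_to_far(scene)
```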

Token Ordering

Class Position  F1
Last            64.9
Middle          65.2
First           65.8

Predicting class first provides useful context for subsequent attribute tokens.

Decoding Strategy

Method       F1
Nucleus      61.9
Greedy       65.8
Beam Search  66.1

Beam search trades inference time for accuracy; greedy is competitive with minimal overhead.
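The greedy strategy can be sketched as a simple decode loop. Here `toy_logits` is a deterministic stand-in for the real decoder and `EOS_TOKEN` is an assumed end-of-sequence symbol, both purely illustrative:

```python
import numpy as np

EOS_TOKEN, VOCAB = 0, 16  # assumed special token and toy vocabulary size

def toy_logits(prefix):
    """Placeholder decoder: deterministic logits keyed on prefix length."""
    rs = np.random.default_rng(len(prefix))
    return rs.normal(size=VOCAB)

def greedy_decode(max_len=20):
    """Pick the argmax token at each step until EOS or max_len."""
    tokens = []
    for _ in range(max_len):
        nxt = int(np.argmax(toy_logits(tokens)))
        if nxt == EOS_TOKEN:
            break
        tokens.append(nxt)
    return tokens

seq = greedy_decode()
```

Beam search replaces the single argmax with the top-k sequence hypotheses at each step, at k times the decoding cost.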

Robustness Under Occlusion

Visibility  CenterPoint  Ours  Δ
0–40%       28.9         30.0  +4.1%
40–60%      30.5         31.1  +1.8%
60–80%      43.2         42.7  −1.3%
80–100%     67.0         67.6  +0.8%

Autoregressive modeling particularly helps under heavy occlusion, where inter-object dependencies are most informative.

Cascading Refinement

Model               Prec.  Rec.  F1
Prior only          74.9   59.4  65.8
Completion only     68.9   49.9  56.3
Prior → Completion  74.7   60.2  66.2

Conditioning a random-order model on the prior's outputs recovers missed detections, improving overall F1.
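The cascade can be sketched as two decoding passes, with the second conditioned on the first's outputs as a prefix. The helper names and toy stand-ins below are assumptions for illustration, not the paper's API:

```python
def cascade(prior_decode, completion_decode, features):
    """Run the prior model, then let the completion model generate
    additional boxes conditioned on the prior's detections."""
    prior_boxes = prior_decode(features)
    extra_boxes = completion_decode(features, prefix=prior_boxes)
    return prior_boxes + extra_boxes

# Toy stand-ins: the completion model only adds a box when given a prefix.
prior = lambda feats: ["box_a", "box_b"]
completion = lambda feats, prefix: ["box_c"] if prefix else []
result = cascade(prior, completion, None)
```

Because the prior's boxes are kept verbatim, precision is preserved while the completion pass raises recall.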


Qualitative Results

Visualization of AutoReg3D detections on nuScenes validation scenes across four encoder backbones. Each image shows predicted 3D bounding boxes overlaid on the BEV point cloud.

Boxes are generated in order from first to last; ground-truth boxes are shown in gray.

Pillar Conv.

Voxel Conv.

Transformer

Mamba


Cascading Refinement

Comparison of detections before and after cascading refinement. The completion model conditioned on the prior's outputs recovers missed detections while preserving precision.

Boxes are generated in order from first to last; ground-truth boxes are shown in gray.


BibTeX

@inproceedings{huang2026autoreg3d,
  title     = {On the Feasibility and Opportunity of Autoregressive 3D Object Detection},
  author    = {Huang, Zanming and Yoo, Jinsu and Jeon, Sooyoung and Liu, Zhenzhen and Campbell, Mark
               and Weinberger, Kilian Q. and Hariharan, Bharath and Chao, Wei-Lun and Luo, Katie Z.},
  booktitle = {CVPR Findings},
  year      = {2026}
}