On the Feasibility and Opportunity of
Autoregressive 3D Object Detection

1The Ohio State University, 2Cornell University, 3Boston University, 4Stanford University

CVPR 2026 Findings

Abstract

LiDAR-based 3D object detectors typically rely on proposal heads with hand-crafted components like anchor assignment and non-maximum suppression (NMS), complicating training and limiting extensibility. We present AutoReg3D, an autoregressive 3D detector that casts detection as sequence generation. Given point-cloud features, AutoReg3D emits objects in a range-causal (near-to-far) order and encodes each object as a short, discrete-token sequence consisting of its center, size, orientation, velocity, and class.

This near-to-far ordering mirrors LiDAR geometry — near objects occlude far ones but not vice versa — enabling straightforward teacher forcing during training and autoregressive decoding at test time. AutoReg3D is compatible across diverse point-cloud backbones and attains competitive nuScenes performance without anchors or NMS. Beyond parity, the sequential formulation unlocks language-model advances for 3D perception, including GRPO-style reinforcement learning for task-aligned objectives. These results position autoregressive decoding as a viable, flexible alternative for LiDAR-based detection and open a path to importing modern sequence-modeling tools into 3D perception.


How It Works

1. Point Cloud Encoder: Pillar, Voxel, Transformer, or Mamba backbone
2. Tokenize Objects: Each 3D box → 10 discrete tokens
3. Autoregressive Decoder: Near-to-far autoregressive generation

[Model architecture diagram]
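As an illustrative sketch of step 2, a box can be quantized into 10 discrete tokens with the class token first. The bin count and value ranges below are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

NUM_BINS = 256  # assumed vocabulary size per attribute (illustrative)

def quantize(value, lo, hi, num_bins=NUM_BINS):
    """Map a continuous value in [lo, hi] to a discrete bin index."""
    t = (np.clip(value, lo, hi) - lo) / (hi - lo)
    return int(round(t * (num_bins - 1)))

def box_to_tokens(cls_id, center, size, yaw, velocity):
    """3D box -> 10 tokens: class, center (x, y, z), size (w, l, h),
    yaw, velocity (vx, vy). Ranges are illustrative assumptions."""
    tokens = [cls_id]
    tokens += [quantize(c, -54.0, 54.0) for c in center]    # center x, y, z
    tokens += [quantize(s, 0.0, 20.0) for s in size]        # box dimensions
    tokens += [quantize(yaw, -np.pi, np.pi)]                # orientation
    tokens += [quantize(v, -10.0, 10.0) for v in velocity]  # BEV velocity
    return tokens

tokens = box_to_tokens(2, (5.0, -3.2, 0.8), (1.9, 4.5, 1.6), 0.3, (1.2, 0.0))
```

Decoding reverses the mapping by taking each bin's center value, so the sequence fully specifies a box up to quantization error.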

What AutoReg3D Eliminates

Anchor assignment & configuration
Hand-crafted per-attribute loss weighting
Proposal matching (e.g., Hungarian)
Confidence score thresholding
Non-Maximum Suppression (NMS)
Replaced by a single unified cross-entropy loss over tokenized sequences with autoregressive decoding.
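As a rough sketch of that unified objective (shapes and the random "decoder outputs" are placeholders, not the paper's implementation), every box attribute is supervised by one cross-entropy loss over the flattened token sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len = 512, 30                        # e.g. 3 objects x 10 tokens
logits = rng.normal(size=(seq_len, vocab))      # stand-in decoder outputs
targets = rng.integers(0, vocab, size=seq_len)  # teacher-forced token labels

# Numerically stable log-softmax, then negative log-likelihood of the
# target token at each position -- one loss for center, size, yaw,
# velocity, and class alike.
shifted = logits - logits.max(axis=1, keepdims=True)
logp = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -logp[np.arange(seq_len), targets].mean()
```

Because every attribute shares the same token vocabulary and loss, no per-attribute weighting or matching step is needed.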

Key Contributions

  1. First fully autoregressive 3D detector that directly generates object sequences from point clouds, performing on par with leading proposal-based and query-based detectors.
  2. Detailed ablation of design factors—including object-level tokenization, sequence ordering, and decoding methodologies—that are critical for effective autoregressive 3D detection.
  3. New capabilities enabled by the autoregressive formulation, including RL fine-tuning and promptable decoding.

Results on nuScenes

We evaluate AutoReg3D across four different point-cloud encoder families. For fair comparison, confidence thresholds for baseline detectors are selected to maximize F1 on the training set.

Method Encoder Det. Head Prec. Rec. F1
PointPillars Pillar Conv. Anchor-based 58.3 50.0 53.1
CenterPoint Pillar Conv. Center-based 67.9 53.3 59.5
AutoReg3D (Ours) Pillar Conv. Autoregressive 69.6 52.4 59.2
SECOND Voxel Conv. Anchor-based 63.5 55.6 59.1
CenterPoint Voxel Conv. Center-based 72.8 60.3 65.8
AutoReg3D (Ours) Voxel Conv. Autoregressive 74.9 59.4 65.8
DSVT Transformer Non-Autoregressive 79.1 66.3 71.6
AutoReg3D (Ours) Transformer Autoregressive 77.0 64.1 69.5
LION Mamba Non-Autoregressive 78.6 68.3 72.5
AutoReg3D (Ours) Mamba Autoregressive 77.5 65.2 70.4

nuScenes Validation Detection Performance. AutoReg3D achieves competitive performance across all encoder types. Notably, it attains higher precision than regression-based counterparts on pillar and voxel backbones — a benefit of autoregressive generation where each box is conditioned on previous outputs.


Reinforcement Learning Fine-Tuning

While teacher forcing optimizes token likelihood, it does not explicitly optimize the set-level detection objective. The autoregressive formulation uniquely enables RL-based optimization with task-aligned rewards, improving global consistency. Using GRPO with an IoU-based reward, we improve the voxel-based model's F1 score.

GRPO fine-tuning boosts F1 through improved recall: the task-aligned reward encourages more complete detection sequences and recovers objects that were previously missed.

Model            Prec.  Rec.  F1
Teacher Forcing  74.9   59.4  65.8
+ GRPO           74.5   60.9  66.7
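A minimal sketch of the group-relative advantage used in GRPO-style fine-tuning, assuming a set-level IoU reward per sampled detection sequence. The reward values below are illustrative stand-ins, not measured numbers.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each sample's reward by the
    mean and std of its group, so better-than-average sequences get a
    positive weight in the policy-gradient update."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. mean IoU against ground truth for 4 sampled detection sequences
group_rewards = [0.62, 0.55, 0.71, 0.48]
adv = grpo_advantages(group_rewards)
# Samples with positive advantage upweight their token log-probabilities
# in the update; negative-advantage samples are pushed down.
```

The key property is that no learned value function is needed: the group itself serves as the baseline.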

Ablation Studies

Object Ordering

Order         F1
Random        56.3
Point Number  61.8
Near-to-far   65.8

Near-to-far ordering significantly outperforms alternatives by exploiting natural inter-object dependencies.
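The near-to-far serialization amounts to sorting objects by their BEV range from the ego vehicle before tokenization; a minimal sketch (the `boxes` schema is an assumption for illustration):

```python
import math

def near_to_far(boxes):
    """Sort boxes (dicts with 'center' = (x, y, z)) by BEV distance
    from the ego origin, so less-occluded near objects come first."""
    return sorted(boxes, key=lambda b: math.hypot(b["center"][0],
                                                  b["center"][1]))

scene = [{"id": "far", "center": (30.0, 40.0, 0.0)},
         {"id": "near", "center": (3.0, 4.0, 0.0)}]
ordered = near_to_far(scene)
```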

Token Ordering

Class Position  F1
Last            64.9
Middle          65.2
First           65.8

Predicting class first provides useful context for subsequent attribute tokens.

Decoding Strategy

Method       F1
Nucleus      61.9
Greedy       65.8
Beam Search  66.1

Beam search trades inference time for accuracy; greedy is competitive with minimal overhead.
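The greedy strategy can be sketched as a simple decode loop. Here `toy_logits` is a deterministic stand-in for the real decoder and `EOS_TOKEN` is an assumed end-of-sequence symbol, both purely illustrative:

```python
import numpy as np

EOS_TOKEN, VOCAB = 0, 16  # assumed special token and toy vocabulary size

def toy_logits(prefix):
    """Placeholder decoder: deterministic logits keyed on prefix length."""
    rs = np.random.default_rng(len(prefix))
    return rs.normal(size=VOCAB)

def greedy_decode(max_len=20):
    """Pick the argmax token at each step until EOS or max_len."""
    tokens = []
    for _ in range(max_len):
        nxt = int(np.argmax(toy_logits(tokens)))
        if nxt == EOS_TOKEN:
            break
        tokens.append(nxt)
    return tokens

seq = greedy_decode()
```

Beam search replaces the single argmax with the top-k sequence hypotheses at each step, at k times the decoding cost.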

Robustness Under Occlusion

Visibility  CenterPoint  Ours  Δ
0–40%       28.9         30.0  +4.1%
40–60%      30.5         31.1  +1.8%
60–80%      43.2         42.7  −1.3%
80–100%     67.0         67.6  +0.8%

Autoregressive modeling particularly helps under heavy occlusion, where inter-object dependencies are most informative.

Cascading Refinement

Model               Prec.  Rec.  F1
Prior only          74.9   59.4  65.8
Completion only     68.9   49.9  56.3
Prior → Completion  74.7   60.2  66.2

Conditioning a random-order model on the prior's outputs recovers missed detections, improving overall F1.
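The cascade can be sketched as two decoding passes, with the second conditioned on the first's outputs as a prefix. The helper names and toy stand-ins below are assumptions for illustration, not the paper's API:

```python
def cascade(prior_decode, completion_decode, features):
    """Run the prior model, then let the completion model generate
    additional boxes conditioned on the prior's detections."""
    prior_boxes = prior_decode(features)
    extra_boxes = completion_decode(features, prefix=prior_boxes)
    return prior_boxes + extra_boxes

# Toy stand-ins: the completion model only adds a box when given a prefix.
prior = lambda feats: ["box_a", "box_b"]
completion = lambda feats, prefix: ["box_c"] if prefix else []
result = cascade(prior, completion, None)
```

Because the prior's boxes are kept verbatim, precision is preserved while the completion pass raises recall.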


Qualitative Results

Visualization of AutoReg3D detections on nuScenes validation scenes across four encoder backbones. Each image shows predicted 3D bounding boxes overlaid on the BEV point cloud.

Boxes are generated in order from first to last; ground-truth boxes are shown in gray.

Pillar Conv.

Voxel Conv.

Transformer

Mamba


Cascading Refinement

Comparison of detections before and after cascading refinement. The completion model conditioned on the prior's outputs recovers missed detections while preserving precision.

Boxes are generated in order from first to last; ground-truth boxes are shown in gray.


BibTeX

@inproceedings{huang2026autoreg3d,
  title     = {On the Feasibility and Opportunity of Autoregressive 3D Object Detection},
  author    = {Huang, Zanming and Yoo, Jinsu and Jeon, Sooyoung and Liu, Zhenzhen and Campbell, Mark
               and Weinberger, Kilian Q. and Hariharan, Bharath and Chao, Wei-Lun and Luo, Katie Z.},
  booktitle = {CVPR Findings},
  year      = {2026}
}