
Power Control for mmWave: From Optimization to Reinforcement Learning

By Apala Pramanik


Millimeter-wave (mmWave) communication promises multi-gigabit data rates and massive bandwidth, but these benefits come at a cost: significantly higher power consumption compared to sub-6 GHz systems. Wide bandwidths, dense deployments, and the need for high transmit power to compensate for severe path loss all contribute to growing energy demands. As 5G matures and 6G research accelerates, energy-efficient power control has become a central design challenge. This post surveys how the research community has approached transmit power control for mmWave systems, tracing a clear evolution from classical optimization to modern reinforcement learning methods.

Classical Optimization: Where It All Started

Early work on mmWave power control relied on deterministic optimization, treating power levels, beam patterns, and beamwidth as decision variables in a static, model-based problem. In uplink and downlink NOMA systems, for instance, analog beamforming was constrained by constant-modulus hardware, and joint power-beam optimization was formulated to maximize the achievable sum rate. These problems were typically decomposed into smaller subproblems, such as beam-gain allocation or analog beamformer search, and solved using closed-form or semi-closed-form updates.
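
To make the structure concrete, here is a representative and purely illustrative formulation of the joint power and analog beamforming problem for a single beam serving K users (NOMA decoding order omitted for brevity); h_k denotes user k's channel, w the analog beamformer, p_k the per-user power, and P_max the power budget:

```latex
\begin{aligned}
\max_{\{p_k\},\,\mathbf{w}} \quad & \sum_{k=1}^{K} \log_2\!\left(1 + \frac{p_k\,|\mathbf{h}_k^{H}\mathbf{w}|^2}{\sum_{j\neq k} p_j\,|\mathbf{h}_k^{H}\mathbf{w}|^2 + \sigma^2}\right) \\
\text{s.t.} \quad & \sum_{k=1}^{K} p_k \le P_{\max}, \qquad p_k \ge 0, \\
& |w_n| = \tfrac{1}{\sqrt{N}}, \quad n = 1,\dots,N \quad \text{(constant-modulus analog weights)}.
\end{aligned}
```

The per-entry modulus constraint on w is what makes the problem non-convex and motivates decomposing it into the beam-gain allocation and beamformer-search subproblems mentioned above.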

While elegant, these approaches depend on strong assumptions: dominant-path alignment, fixed user grouping, or simplified multipath models. They work well in controlled settings but struggle to scale in environments where channels vary quickly, such as vehicular scenarios or dense urban deployments.

Energy Efficiency as a Design Metric

As mmWave deployments evolved, energy efficiency moved to the foreground. Research on train-ground systems, for example, introduced long-range mmWave links with strong Doppler effects and strict quality-of-service constraints. Energy efficiency was defined as delivered data per unit energy, and techniques like fractional programming with Lagrangian updates were used to compute transmit-power profiles along the track. However, these solutions remained tied to deterministic propagation assumptions, fixed trajectories, and single-link analysis, limiting their applicability to dynamic, multi-user settings.
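
As a sketch of the fractional-programming machinery (illustrative notation, not taken from any specific paper): with achievable rate R(p), circuit power P_c, and transmit power p, the energy-efficiency objective and its Dinkelbach-style reformulation are

```latex
\max_{0 \le p \le P_{\max}} \; \eta(p) = \frac{R(p)}{P_c + p}
\quad \Longrightarrow \quad
p_t^{\star} = \arg\max_{0 \le p \le P_{\max}} \; R(p) - \lambda_t\,(P_c + p),
\qquad \lambda_{t+1} = \frac{R(p_t^{\star})}{P_c + p_t^{\star}}.
```

Each inner problem is concave whenever R(p) is, so it admits Lagrangian or closed-form updates, and iterating on λ converges to the energy-efficiency optimum; repeating this at each position along the track yields a transmit-power profile.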

Enter Reinforcement Learning

Reinforcement learning (RL) emerged as a response to the limited adaptability of optimization-based designs. The appeal is straightforward: instead of relying on explicit channel models and mathematical decompositions, an RL agent can learn power control policies through direct interaction with the wireless environment.

Early RL work applied Q-learning to adjust downlink power levels in massive MIMO systems under jamming. The agent explored state-action pairs without an explicit channel model, but it depended on predefined state features, offline training samples, and a discrete action set, which restricted its ability to handle continuous control variables.
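
A minimal sketch of that style of tabular Q-learning loop, with hypothetical discrete power levels and a placeholder environment (the state quantization, power levels, and `env.step` interface are all assumptions for illustration, not details of the original work):

```python
import numpy as np

# Hypothetical discrete action set: downlink transmit power levels in dBm.
POWER_LEVELS_DBM = [10, 15, 20, 25, 30]
N_STATES = 64            # assumed coarse quantization of (SINR, jammer-activity) features
N_ACTIONS = len(POWER_LEVELS_DBM)

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((N_STATES, N_ACTIONS))

def q_learning_step(env, state):
    """One epsilon-greedy Q-learning update; `env` is a placeholder whose
    step() returns (next_state, reward) for the chosen power level."""
    if np.random.rand() < epsilon:
        action = np.random.randint(N_ACTIONS)      # explore
    else:
        action = int(np.argmax(Q[state]))          # exploit
    next_state, reward = env.step(POWER_LEVELS_DBM[action])
    # Temporal-difference update toward the bootstrapped target.
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return next_state
```

The discrete table is exactly where the limitation shows up: finer power resolution or richer state features blow up the table, which is what pushed later work toward function approximation.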

Subsequent studies adopted Deep Q-Network (DQN) frameworks to jointly determine beamforming directions, power levels, and interference-coordination actions in multi-cell environments. States encoded channel and interference information, and rewards tracked sum-rate improvements. However, centralized control introduced signaling overhead, and function approximation raised convergence and stability concerns.
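
The reward in these multi-cell formulations is typically the network sum rate computed from per-user SINRs. A minimal sketch of such a reward function, assuming effective (beamforming-inclusive) channel gains are available and that user i is served by base station i (both simplifications for illustration):

```python
import numpy as np

def sum_rate_reward(gains, powers, noise_power=1e-12):
    """Sum rate (bits/s/Hz) over users, usable as a DQN reward.

    gains[i, j]: effective channel gain from base station j to user i,
    with beamforming gain folded in; powers[j]: transmit power of BS j.
    """
    gains = np.asarray(gains, dtype=float)
    powers = np.asarray(powers, dtype=float)
    signal = np.diag(gains) * powers           # each user's serving link
    interference = gains @ powers - signal     # everything received from other cells
    sinr = signal / (interference + noise_power)
    return float(np.sum(np.log2(1.0 + sinr)))
```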

Joint Beamwidth and Power Control

A natural extension was to treat beamwidth and power control jointly. Deep RL was used to map discrete beamwidth-power pairs to expected rate gains, with DQN architectures taking CSI-derived states and using instantaneous rate as the reward. The challenge here is that beamwidth and transmit power are inherently continuous parameters, and discretizing them affects both policy granularity and convergence speed.
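
The discretization issue is easy to see: a DQN outputs one Q-value per action, so continuous beamwidth and power ranges must be quantized into a joint grid, and the grid resolution trades policy granularity against the size of the network's output layer. A sketch with arbitrary, illustrative ranges:

```python
import itertools
import numpy as np

# Hypothetical continuous ranges, quantized for a DQN's discrete action head.
beamwidths_deg = np.linspace(3.0, 30.0, num=6)   # 6 candidate beamwidths
powers_dbm = np.linspace(10.0, 30.0, num=5)      # 5 candidate transmit powers

# Joint action space: every (beamwidth, power) pair gets one action index.
ACTIONS = list(itertools.product(beamwidths_deg, powers_dbm))  # 30 discrete actions

def decode_action(action_index):
    """Map a DQN output index back to the physical control pair."""
    beamwidth, power = ACTIONS[action_index]
    return beamwidth, power

# Doubling either resolution doubles the number of joint actions and hence
# the Q-network's output dimension: the granularity/convergence trade-off.
```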

Other RL-based beam-training models optimized energy efficiency or spectral efficiency by attaching reward penalties to training overhead. These formulations pushed the field forward but often assumed equal power distribution over paths and relied heavily on simulation-driven training.
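
The overhead penalty is usually folded directly into the reward; schematically (illustrative notation, with β a tunable weight):

```latex
r_t \;=\; \underbrace{\eta_t}_{\text{spectral or energy efficiency}} \;-\; \beta\,\underbrace{\tau_{\text{train}}}_{\text{beam-training overhead}},
```

so that longer training sweeps are only worthwhile when they buy a correspondingly better beam.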

Game Theory and Coverage Analysis

Parallel to the RL track, optimization and game-theoretic approaches continued to evolve. Generalized Nash games modeled power control and beamforming as interdependent choices, where best-response updates converge to equilibrium under specific feasibility conditions. Coverage analysis under beam-misalignment errors linked fractional power control to SNR degradation using statistical misalignment models. Secure mmWave-NOMA systems used clustering and successive convex approximation to allocate power under eavesdropping constraints, handling non-convexity through iterative difference-of-convex (DC) transformations.
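
To illustrate the best-response dynamic in its simplest form, here is a sketch of iterated best responses for a target-SINR power-control game (a Foschini-Miljanic-style update, used only to show the equilibrium-seeking mechanism, not the exact generalized Nash game in the literature above):

```python
import numpy as np

def best_response_power_control(gains, targets, noise_power=1e-12,
                                p_max=1.0, n_iters=100, tol=1e-8):
    """Iterated best responses for a target-SINR power game.

    gains[i, j]: channel gain from transmitter j to receiver i;
    targets[i]: SINR target of link i. Each player rescales its power so
    its own SINR meets its target given the others' current powers; under
    feasibility this converges to the game's unique fixed point.
    """
    gains = np.asarray(gains, dtype=float)
    targets = np.asarray(targets, dtype=float)
    p = np.full(len(targets), 0.1 * p_max)
    for _ in range(n_iters):
        desired = np.diag(gains) * p
        interference = gains @ p - desired + noise_power
        sinr = desired / interference
        p_new = np.clip(p * targets / sinr, 0.0, p_max)  # best response + power cap
        converged = np.max(np.abs(p_new - p)) < tol
        p = p_new
        if converged:
            break
    return p

# Example: two mutually interfering links with equal SINR targets.
# p = best_response_power_control(np.array([[1.0, 0.1], [0.2, 0.8]]),
#                                 np.array([3.0, 3.0]))
```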

Multi-user coexistence between NR-U and WiGig was formulated as a mixed-integer nonlinear program, solved via penalty dual decomposition with the concave-convex procedure (CCCP) for joint user grouping, beam coordination, and power allocation. While these methods are mathematically rigorous, their computational complexity often limits real-time applicability.

Addressing Mobility and Temporal Dynamics

More recent work has begun tackling mobility, multi-hop networks, and temporal variations. Location-aided beamwidth control using genetic algorithms maximized average rate under beamwidth and SNR constraints, though the method assumed a single line-of-sight path. Bootstrapped and Bayesian DQN techniques improved exploration by combining bootstrap sampling and Thompson sampling for joint beam and power decisions, but bootstrap diversity shrinks with accumulated data and Bayesian components increase running time.
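
The exploration mechanism behind the bootstrapped variant is conceptually simple: keep K Q-value heads trained on different bootstrap masks of the replay data, sample one head per episode, and act greedily with respect to it, which gives Thompson-sampling-flavoured deep exploration. A minimal sketch of the head-selection logic, with the Q-heads left abstract (the mask-based training is omitted):

```python
import numpy as np

class BootstrappedHeadSelector:
    """Per-episode head sampling for bootstrapped-DQN-style exploration.

    q_heads: list of K callables, each mapping a state to a vector of
    Q-values (e.g. separate output heads of one network). Training each
    head on its own bootstrap mask of the replay buffer is what creates
    the ensemble diversity that shrinks as data accumulates.
    """
    def __init__(self, q_heads, rng=None):
        self.q_heads = q_heads
        self.rng = rng or np.random.default_rng()
        self.active_head = 0

    def start_episode(self):
        # Thompson-style: commit to one randomly sampled head per episode.
        self.active_head = int(self.rng.integers(len(self.q_heads)))

    def act(self, state):
        q_values = self.q_heads[self.active_head](state)
        return int(np.argmax(q_values))  # greedy w.r.t. the sampled head
```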

Double-RIS-assisted multi-user uplink systems introduced collaborative beamforming and user power control solved through alternating optimization, inheriting the usual sensitivity to initialization and slow convergence that plagues alternating methods.

The Latest: Cooperative and Secure Vehicular Links

The most recent RL contributions target cooperative and secure vehicular mmWave links. Dueling Double Deep Recurrent Q-Networks integrate convolutional and LSTM layers to handle high-dimensional states and temporal correlation for joint beamforming, relay selection, and power allocation under secrecy constraints. These architectures represent a significant step in complexity, but the decision space remains discrete, limiting direct control of continuous transmit-power variables.
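
Architecturally, these agents stack a convolutional front end for the high-dimensional state, a recurrent layer for temporal correlation, and dueling value/advantage heads over the discrete joint action set. A minimal PyTorch sketch of that structure (layer sizes, input shape, and action-set size are placeholders, not the published architecture; the double/target-network training logic is omitted):

```python
import torch
import torch.nn as nn

class DuelingRecurrentQNet(nn.Module):
    """Conv -> LSTM -> dueling heads over a discrete joint action set
    (e.g. beam index x relay choice x power level), as a structural sketch."""

    def __init__(self, in_channels=2, state_dim=64, hidden=128, n_actions=60):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=32 * state_dim, hidden_size=hidden,
                            batch_first=True)
        self.value = nn.Linear(hidden, 1)              # state value V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantages A(s, a)

    def forward(self, x, hidden_state=None):
        # x: (batch, time, in_channels, state_dim), a short history of CSI-like features.
        b, t, c, d = x.shape
        feats = self.conv(x.reshape(b * t, c, d)).reshape(b, t, -1)
        out, hidden_state = self.lstm(feats, hidden_state)
        h = out[:, -1]                                  # summary of the sequence
        v, a = self.value(h), self.advantage(h)
        q = v + a - a.mean(dim=1, keepdim=True)         # dueling combination
        return q, hidden_state
```

Even with the recurrent state, the output is still one Q-value per discrete action, which is why continuous transmit-power control remains out of reach for this family.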

In multi-flow multi-hop networks, joint beam and power control has been solved using block coordinate descent with successive convex approximation for power and particle swarm optimization for 3D beam steering. While these hybrid methods converge to local solutions, their computational cost limits real-time deployment in dynamic channels.

The Gaps That Remain

Across this evolution, a clear pattern emerges: the field is moving from deterministic, decomposition-based optimization toward reinforcement learning and hybrid methods, motivated by the need to handle rapidly changing mmWave channels, multi-hop topologies, energy constraints, and mobility. Yet significant gaps remain: most RL formulations still rely on discrete action sets that cannot directly control continuous transmit power, training is largely simulation-driven and performed offline, many optimization-based designs lean on deterministic or simplified channel assumptions, and the computational cost of the more rigorous hybrid schemes keeps them out of real-time operation.

These gaps highlight the need for approaches that operate directly on hardware, adapt online, and manage coupled parameters such as gain settings and beam directions without heavy computational overhead. Bridging the divide between the theoretical elegance of optimization and the adaptability of RL remains one of the most active frontiers in mmWave system design.