How AI Uses Electricity
A detailed guide to how AI models consume electricity, from GPU computation to data centre overhead. Understand why different models use different amounts of energy.
What happens when you send an AI query
When you type a prompt into ChatGPT, Gemini, or Claude, your text travels across the internet to a data centre, often hundreds or thousands of miles away. There, it arrives at a cluster of Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs) — specialised chips designed for the matrix multiplications that power neural networks.
The computation happens in a pipeline. First, your text is tokenised — broken into sub-word units that the model understands. These tokens are converted into numerical vectors and fed through the model's layers. A large language model like GPT-4 has hundreds of billions of parameters, each of which participates in every forward pass. The model processes all input tokens simultaneously, then generates output tokens one at a time, each requiring a full pass through the network.
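As a rough sketch of that loop (illustrative pseudocode, not any provider's actual serving stack; tokenize, detokenize, and model are stand-ins):

    # Illustrative sketch of autoregressive generation; not production serving code.
    def generate(prompt, model, tokenize, detokenize, max_new_tokens=256):
        tokens = tokenize(prompt)              # split the text into sub-word units
        for _ in range(max_new_tokens):
            logits = model.forward(tokens)     # full pass through every layer
            next_token = int(logits.argmax())  # pick the most likely next token
            tokens.append(next_token)          # the sequence grows by one token
            if next_token == model.eos_token:  # stop at end-of-sequence
                break
        return detokenize(tokens)

Each iteration of that loop is a complete forward pass, which is why long responses cost proportionally more energy than short ones.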
Each GPU in the cluster draws between 300 and 700 watts — comparable to a microwave oven. A single query to a large model might occupy 8 or more GPUs for several seconds. Multiply those watts by those seconds, and you get the energy cost of your query, measured in watt-hours (Wh).
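The arithmetic is straightforward. A minimal back-of-envelope estimate, using assumed rather than measured figures:

    # Back-of-envelope energy estimate with illustrative, assumed figures.
    num_gpus = 8          # GPUs occupied by the query
    watts_per_gpu = 700   # power draw of each GPU under load
    seconds = 5           # time the query occupies them

    energy_wh = num_gpus * watts_per_gpu * seconds / 3600
    print(f"{energy_wh:.2f} Wh")   # about 7.8 Wh, before data centre overhead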
Why different models use different amounts
Not all AI queries are created equal. The energy a model consumes depends on three main factors: the number of parameters, the model architecture, and the nature of the task.
Parameter count
Parameters are the learned values inside a neural network — its accumulated knowledge. More parameters mean more computation per token. A 1-billion-parameter model like Llama 3.2 1B can run on a single GPU and uses roughly 0.2 Wh per query. A 405-billion-parameter model like Llama 3.1 405B requires 8 or more GPUs and can use over 20 Wh per query — a 100x increase.
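A common rule of thumb for dense models is roughly two floating-point operations per parameter per generated token, which is why parameter count dominates the compute bill. A sketch with illustrative figures:

    # Rule of thumb for a dense model: ~2 FLOPs per parameter per token.
    def forward_flops(num_params, num_tokens):
        return 2 * num_params * num_tokens

    small = forward_flops(1e9, 500)    # 1B parameters, 500 tokens
    large = forward_flops(405e9, 500)  # 405B parameters, same 500 tokens
    print(f"{large / small:.0f}x more compute")  # 405x

Raw compute scales roughly linearly with parameter count; measured energy per query also depends on hardware, batching, and serving efficiency, which is why the observed gap above is closer to 100x than 405x.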
Architecture
Mixture-of-Experts (MoE) models like DeepSeek-V3 have many total parameters but only activate a fraction of them per token. DeepSeek-V3 has 671 billion parameters but only activates roughly 37 billion per token, achieving efficiency closer to a model one-tenth its nominal size. Dense models, by contrast, use every parameter for every token.
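A minimal sketch of top-k expert routing, the mechanism behind that saving (illustrative only; real MoE layers route every token inside every layer and add load-balancing machinery):

    import numpy as np

    # Minimal sketch of a Mixture-of-Experts layer with top-2 routing.
    def moe_layer(x, experts, router_weights, k=2):
        scores = x @ router_weights          # one routing score per expert
        top_k = np.argsort(scores)[-k:]      # only the k best-scoring experts run
        gates = np.exp(scores[top_k])
        gates = gates / gates.sum()          # normalise the gate weights
        # The other experts are skipped entirely, so their parameters cost nothing.
        return sum(g * experts[i](x) for g, i in zip(gates, top_k))

    dim = 16
    experts = [lambda x, W=np.random.randn(dim, dim): x @ W for _ in range(8)]
    router = np.random.randn(dim, 8)
    output = moe_layer(np.random.randn(dim), experts, router, k=2)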
Task type
Text generation is the most efficient category. Image generation requires diffusion processes that iterate over the entire image multiple times — typically 20 to 50 denoising steps. Video generation multiplies this across hundreds of frames. Music generation falls somewhere in between. The result is an energy spectrum that spans several orders of magnitude.
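A highly simplified sketch of why diffusion-based tasks cost more: the sampler re-processes the entire image (or video latent) at every denoising step, so the work is multiplied by the step count and, for video, by the number of frames. The denoiser below is a stand-in, not a real model:

    # Highly simplified diffusion sampling loop; `denoiser` is a stand-in model.
    def diffusion_sample(denoiser, noise, num_steps=30):
        x = noise
        for t in reversed(range(num_steps)):
            x = denoiser(x, t)   # one full pass over the whole image per step
        return x
    # Text: ~1 pass per generated token. Image: 20-50 passes over the full image.
    # Video: those passes repeated across hundreds of frames.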
The scale: from 0.01 Wh to 2,400 Wh
Across the 152 models in our database, energy per query ranges from 0.01 Wh for the most efficient text models to 2,400 Wh (2.4 kWh) for the most demanding video generators. That is a 240,000x difference.
To put this in perspective: a standard Google search uses about 0.3 Wh. A typical ChatGPT query uses about 0.43 Wh — roughly 40% more. But generating a single AI video can use as much electricity as running a laptop for several hours.
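To make those comparisons concrete, here is the arithmetic with assumed figures (the 50 W laptop and the 200 Wh video are illustrative values, not measurements):

    # Illustrative comparison; the laptop and video figures are assumptions.
    google_search_wh = 0.3
    chatgpt_query_wh = 0.43
    video_wh = 200          # an assumed figure for one short generated clip
    laptop_watts = 50       # typical laptop power draw

    print(f"ChatGPT query vs search: {chatgpt_query_wh / google_search_wh:.2f}x")  # ~1.43x
    print(f"One video ~ {video_wh / laptop_watts:.0f} hours of laptop use")        # ~4 hours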
Here is how a representative sample of models compares:
[Chart: energy per query in watt-hours. Values reflect a single typical interaction as defined by each model's benchmark context. See our methodology for details.]
How data centres add overhead
The energy figures above reflect only the computation itself. In reality, data centres consume additional electricity for cooling systems, power distribution, lighting, networking equipment, and storage. This overhead is captured by a metric called Power Usage Effectiveness (PUE).
PUE is defined as the ratio of total facility energy to IT equipment energy. A PUE of 1.0 would mean zero overhead (impossible in practice). The industry average is approximately 1.58, according to the Uptime Institute. The most efficient hyperscale data centres — those operated by Google, Meta, and Microsoft — achieve PUEs between 1.1 and 1.2, meaning cooling and other overhead add only 10-20% on top of the energy used for computation.
Google reported a fleet-wide PUE of 1.10 in 2023. Meta achieved 1.08 across its facilities. These figures represent the state of the art. Smaller or older data centres can have PUEs of 1.5 or higher, effectively increasing the energy cost of every query by 50% or more compared to the computation alone.
Our estimates apply a PUE factor appropriate to each provider's infrastructure. For open-source models that can be self-hosted anywhere, we use a conservative global average. See our methodology for the specific values used.
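Applying PUE is a single multiplication. A minimal sketch with illustrative values:

    # PUE = total facility energy / IT equipment energy, so:
    def facility_energy_wh(compute_wh, pue):
        return compute_wh * pue

    compute_wh = 1.0                             # illustrative compute-only figure
    print(facility_energy_wh(compute_wh, 1.10))  # best hyperscale sites: 1.1 Wh
    print(facility_energy_wh(compute_wh, 1.58))  # industry average: ~1.6 Wh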
What's improving
Even as demand grows, several strong trends are driving efficiency improvements in AI inference.
Hardware efficiency
Each generation of GPU and TPU hardware delivers substantially more computation per watt. NVIDIA's H100 GPU is roughly 3x more energy-efficient for inference than its predecessor, the A100. Google's TPU v5e is purpose-built for transformer inference and achieves even better efficiency for supported model architectures. As data centres upgrade their hardware, the energy cost per query falls — even as models grow larger.
Model distillation
Distillation trains a smaller "student" model to replicate the behaviour of a larger "teacher" model. The result is a model that retains most of the original's capability at a fraction of the compute cost. OpenAI's GPT-4o Mini, for example, delivers performance competitive with GPT-4 on most tasks while using roughly a third of the energy. Google achieved a 33x reduction in Gemini's energy per query between May 2024 and May 2025 through a combination of distillation and infrastructure improvements.
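A minimal sketch of the classic distillation objective (soft targets from the teacher's output distribution); this illustrates the idea only and is not any provider's actual training recipe:

    import numpy as np

    # Classic distillation loss: the student matches the teacher's softened outputs.
    def softmax(logits, temperature=1.0):
        z = logits / temperature
        e = np.exp(z - z.max())
        return e / e.sum()

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        p_teacher = softmax(teacher_logits, temperature)
        p_student = softmax(student_logits, temperature)
        # KL divergence between the teacher and student distributions.
        return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))

    loss = distillation_loss(np.random.randn(50_000), np.random.randn(50_000))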
Mixture of Experts
MoE architectures activate only a subset of parameters per token, dramatically reducing compute requirements without sacrificing model capacity. This technique is now used by most frontier models, including DeepSeek-V3, Gemini, and Mistral's Mixtral. The efficiency gains are substantial: an MoE model with 200 billion total parameters might activate only 20 billion per token, achieving the knowledge of a large dense model at roughly the compute cost of a much smaller one.
Quantisation
Reducing the numerical precision of model weights from 16-bit to 8-bit or even 4-bit cuts the memory and computation required by a factor of two or four, with minimal impact on output quality. This is especially impactful for open-source models, where users can choose the quantisation level that suits their quality-efficiency trade-off.
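The memory saving follows directly from the bit widths; the 70-billion-parameter figure below is just an example:

    # Weight memory at different precisions (illustrative 70B-parameter model).
    def weight_memory_gb(num_params, bits_per_weight):
        return num_params * bits_per_weight / 8 / 1e9

    for bits in (16, 8, 4):
        print(f"{bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
    # 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB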