For business leaders in Singapore and the Philippines, the energy cost of an AI query is no longer an abstract sustainability talking point. It now affects cloud operating budgets, data center planning, procurement decisions, and the architecture choices behind customer service, analytics, software development, and search experiences. In 2023, most AI teams still treated inference efficiency as a secondary concern behind model quality and deployment speed. By 2026, the cost of serving a single query has become a core performance metric, because every token generated, every GPU cycle consumed, and every millisecond of idle compute translates into direct financial and environmental impact.
The key shift is not that AI suddenly became expensive. It is that adoption scaled fast enough for the unit economics to matter. Enterprises across Singapore’s financial services, logistics, and public sector, and across the Philippines’ outsourcing, fintech, and retail sectors, are now asking the same question: how much energy does one AI query consume, and how much has that changed since 2023? The answer depends on model size, context length, batching, hardware generation, quantization, and serving stack design. But the directional pattern is clear. 2026 systems are materially more efficient than 2023 systems for a comparable task, even as demand for longer contexts and more complex responses pushes total consumption upward.
What “one AI query” actually measures
A single AI query is not a fixed unit of work. It can mean a short prompt to a language model, a retrieval-augmented question that triggers vector search plus generation, or a multimodal request that processes text, images, audio, or video. From an energy perspective, the query has two major parts: prompt processing and token generation. Prompt processing is the model reading the input context. Token generation is the model producing output one token at a time, which usually dominates latency and, for long responses, can also dominate energy use.
In 2023, many benchmark discussions focused on model parameter counts and raw throughput. That was useful, but it missed the full picture. Real-world enterprise serving efficiency depends on the entire inference stack: the model architecture, the GPU or accelerator used, memory bandwidth, kernel optimization, request batching, caching, and how much context is retained between turns. A 200-token question and a 2,000-token knowledge work request are not comparable, even if they use the same model endpoint. Any serious comparison between 2026 and 2023 has to define the workload precisely.
Energy per query depends on tokens, not just prompts
Token count is the most practical proxy for energy cost in text-based AI. More input tokens mean more attention calculations. More output tokens mean more decoding steps. Longer context windows also increase the cost of each generated token because the model has to attend to more stored information. That means a customer support agent using a short classification prompt may consume far less energy than an internal analyst asking for a multi-step report with citations, tables, and follow-up reasoning.
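To make the token-based view concrete, here is a minimal Python sketch of a linear estimate. The function name and both per-token coefficients are hypothetical placeholders, not measured values. Real coefficients have to come from instrumenting your own serving stack, and a linear model deliberately ignores the fact that each generated token gets more expensive as context grows.

```python
def estimate_query_energy_joules(
    input_tokens: int,
    output_tokens: int,
    joules_per_input_token: float,
    joules_per_output_token: float,
) -> float:
    """Rough linear estimate of energy for one text query.

    Both coefficients are assumptions supplied by the caller; they are
    not constants of any real model or accelerator.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        input_tokens * joules_per_input_token
        + output_tokens * joules_per_output_token
    )


# Placeholder coefficients, chosen only to illustrate the calculation.
PREFILL_J_PER_TOKEN = 0.5
DECODE_J_PER_TOKEN = 2.0

# A short classification prompt versus a long analyst report.
short_classification = estimate_query_energy_joules(
    200, 10, PREFILL_J_PER_TOKEN, DECODE_J_PER_TOKEN
)
long_report = estimate_query_energy_joules(
    2000, 800, PREFILL_J_PER_TOKEN, DECODE_J_PER_TOKEN
)
print(short_classification, long_report)
```

Under these made-up coefficients the long report costs many times more than the short classification, which is the point of the comparison: the ratio matters more than the absolute figures.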
For Singapore and Philippine enterprises, this distinction matters because many production use cases are now agentic. Instead of one query, a user request may trigger multiple tool calls, database lookups, RAG retrieval passes, and response synthesis steps. Energy accounting must therefore move from a simplistic “per prompt” metric to a “per completed business task” metric.
How 2026 compares with 2023 at the hardware and model layer
Between 2023 and 2026, the biggest efficiency gains come from three areas: newer accelerators, optimized inference frameworks, and more efficient model architectures. In 2023, many deployments relied on early generations of large language models served on high-end GPUs with relatively limited optimization. Teams often underutilized hardware because requests were processed one by one, long contexts were handled inefficiently, and memory pressure constrained throughput.
By 2026, serving stacks have improved substantially. Modern accelerators deliver higher tokens per second per watt, and software layers are better at batching requests dynamically, quantizing weights, fusing kernels, and reducing memory movement. That matters because moving data in and out of memory can consume a significant portion of total inference energy. In practical terms, the same quality of output can often be produced with fewer floating-point operations, lower precision arithmetic, and less wasted idle time.
Why quantization changed the economics
Quantization reduces model precision from higher-precision formats to lower-precision ones such as 8-bit or 4-bit representations, depending on the architecture and accuracy target. This can sharply reduce memory footprint and increase throughput. In 2023, quantization was already known, but production readiness was inconsistent for many teams. By 2026, it is a standard optimization for a large share of enterprise inference workloads, especially where slight quality tradeoffs are acceptable.
The result is straightforward: less memory traffic, fewer accelerator cycles, and lower energy per token. For business decision-makers, that can translate into lower cost per 1,000 queries and lower thermal load in on-premises or colocation environments. For teams running hybrid deployments in Singapore or Manila, the benefit also includes easier capacity planning, because smaller footprints reduce pressure on constrained rack space and cooling budgets.
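The footprint effect can be shown with simple arithmetic. The sketch below estimates memory for model weights only, at different bit widths. The 7-billion-parameter size is a hypothetical example, and the calculation leaves out activations, the KV cache, and any per-block scaling metadata that real quantization formats add.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    if bits_per_weight <= 0:
        raise ValueError("bits_per_weight must be positive")
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 7e9  # hypothetical 7-billion-parameter model

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(PARAMS, bits):.1f} GB")
# 16-bit weights: ~14.0 GB
# 8-bit weights: ~7.0 GB
# 4-bit weights: ~3.5 GB
```

Halving the bit width halves the bytes that must be stored and moved per weight, which is where the bandwidth and capacity savings come from. Whether accuracy stays acceptable at a given width is a separate question that has to be tested per workload.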
Model specialization reduces waste
Another major change is the move away from using one giant model for every task. In 2023, organizations often routed many use cases through a single general-purpose model because orchestration was still immature. In 2026, production systems more often use cascades: a lightweight classifier handles simple intents, a smaller domain model answers routine queries, and a larger model is reserved for edge cases or high-value tasks. This tiered design reduces energy per query by avoiding overprovisioning.
That is especially relevant in customer service, lead qualification, and internal knowledge support. A small model can answer a straightforward policy question, while only complex cases escalate to a large model with retrieval and reasoning. The savings do not come from one magic optimization. They come from routing the right request to the right compute class.
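A tiered cascade of this kind can be sketched in a few lines of Python. Everything here is illustrative: the tier names, the `Request` fields, and the confidence threshold are assumptions, and the upstream classifier that would populate those fields is not shown.

```python
from dataclasses import dataclass


@dataclass
class Request:
    text: str
    is_simple_intent: bool     # e.g. greeting or order-status lookup (assumed upstream signal)
    is_routine: bool           # e.g. matches a known policy or FAQ topic (assumed upstream signal)
    classifier_confidence: float


def choose_tier(req: Request, min_confidence: float = 0.9) -> str:
    """Pick the lightest compute tier that is likely to handle the request."""
    # Low classifier confidence means we cannot trust the cheap path: escalate.
    if req.classifier_confidence < min_confidence:
        return "large_model"
    if req.is_simple_intent:
        return "lightweight_classifier"
    if req.is_routine:
        return "small_domain_model"
    return "large_model"


examples = [
    Request("Where is my order?", True, True, 0.97),
    Request("What is the refund policy for annual plans?", False, True, 0.95),
    Request("Compare clause 4 across these three contracts", False, False, 0.92),
    Request("asdf refund?? maybe", True, True, 0.40),
]
tiers = [choose_tier(r) for r in examples]
print(tiers)
```

The design choice worth noting is the order of checks: uncertainty escalates first, so the cheap tiers only handle requests the classifier is confident about.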
What has changed in practical energy terms
It is tempting to ask for a single universal number, but that is not how inference systems work. The better approach is to compare the energy intensity profile of a representative query class. A short, text-only query on a heavily optimized 2026 stack typically consumes less energy than the same class of query on a 2023 stack because newer hardware and software reduce wasted compute. At the same time, the average enterprise query in 2026 often includes longer prompts, higher context retention, and more tool invocation, which offsets some of the per-token gains.
So the core pattern is this: energy per token has generally improved, but energy per business workflow is not guaranteed to fall unless teams actively optimize orchestration. This is why some organizations report lower unit costs while total AI spending still rises. The system is more efficient, but usage is growing faster than efficiency gains.
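A short worked example shows how both statements can be true at once. The three factors below are hypothetical multipliers relative to a 2023 baseline, chosen only to illustrate the arithmetic; they are not measurements.

```python
def relative_energy(
    per_token_factor: float,
    tokens_per_workflow_factor: float,
    workflow_volume_factor: float,
) -> tuple[float, float]:
    """Return (energy per workflow, total energy), both relative to a baseline of 1.0."""
    per_workflow = per_token_factor * tokens_per_workflow_factor
    total = per_workflow * workflow_volume_factor
    return per_workflow, total


# Hypothetical scenario: energy per token halves, each workflow uses
# 1.5x the tokens, and the number of workflows triples.
per_workflow, total = relative_energy(0.5, 1.5, 3.0)
print(per_workflow, total)  # 0.75 2.25
```

In this made-up scenario each workflow uses 25% less energy than before, yet total consumption more than doubles because volume grew faster than efficiency.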
Short prompts versus long-context retrieval workflows
A short prompt that asks for classification or extraction can be served efficiently, especially with batching and caching. A long-context workflow, however, can become much more expensive: with standard self-attention, the attention work during prompt processing grows roughly with the square of the context length, and retrieval pipelines introduce their own compute overhead. In 2023, long-context use was less common. In 2026, it is standard in legal review, financial analysis, software engineering, and enterprise search.
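As a rough illustration of why context length matters, the sketch below assumes standard self-attention, where the score computation during prompt processing compares every token with every other token. It covers only that attention term. Feed-forward layers scale linearly with length, and optimized or sparse attention variants change the picture, so treat the result as an upper-end intuition rather than a prediction.

```python
def relative_attention_work(context_tokens: int, baseline_tokens: int) -> float:
    """Ratio of prefill attention-score work versus a baseline context length.

    Assumes standard (quadratic) self-attention; this is a simplification.
    """
    if context_tokens <= 0 or baseline_tokens <= 0:
        raise ValueError("token counts must be positive")
    return (context_tokens / baseline_tokens) ** 2


# A 2,000-token context versus a 200-token context.
ratio = relative_attention_work(2000, 200)
print(ratio)  # 100.0
```

Ten times the context means roughly a hundred times the attention-score work under this assumption, which is why retrieval discipline and prompt trimming pay off disproportionately on long inputs.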
For organizations in Singapore’s regulated sectors, long-context handling also requires more logging, auditability, and security controls, which can add indirect infrastructure overhead. In the Philippines, where BPO and contact center operations are scaling AI-assisted workflows rapidly, response consistency and fast turn times often matter more than maximum model size. That creates a strong case for smaller models, better prompt engineering, and retrieval discipline rather than indiscriminate model scaling.
Batching and caching drive measurable savings
Dynamic batching allows multiple user requests to share GPU cycles more efficiently. KV caching stores intermediate attention states so the model does not recompute everything from scratch on every turn. These techniques were available in 2023, but by 2026 they are more mature and more widely used. Their importance grows as systems move from isolated chat demos to high-volume production traffic.
From an operational perspective, batching improves throughput and lowers energy per query, but it can increase latency if misconfigured. That creates an engineering tradeoff. B2B teams should tune batch size based on SLA class, not on abstract efficiency goals alone. A sales enablement assistant with loose latency requirements can batch more aggressively than a real-time support bot handling live conversations.
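One way to express that tradeoff is a per-SLA batching policy. The sketch below is a minimal illustration: the class names, policy values, and the `should_flush` helper are hypothetical, and production serving frameworks expose their own batching controls that would replace this logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchPolicy:
    max_batch_size: int  # flush as soon as this many requests are queued
    max_wait_ms: int     # or as soon as the oldest request has waited this long


# Illustrative values only; tune against your own latency SLAs.
POLICIES = {
    "realtime_support": BatchPolicy(max_batch_size=4, max_wait_ms=10),
    "sales_enablement": BatchPolicy(max_batch_size=32, max_wait_ms=200),
}


def should_flush(queue_len: int, oldest_wait_ms: float, policy: BatchPolicy) -> bool:
    """Decide whether the queued requests should be sent to the accelerator now."""
    if queue_len == 0:
        return False
    return queue_len >= policy.max_batch_size or oldest_wait_ms >= policy.max_wait_ms


# Four queued requests that have waited 50 ms:
print(should_flush(4, 50, POLICIES["realtime_support"]))  # True
print(should_flush(4, 50, POLICIES["sales_enablement"]))  # False
```

The same queue state triggers an immediate flush for the real-time bot and continued accumulation for the sales assistant, so the latency-tolerant workload gets the larger, more efficient batches.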
Regional implications for Singapore and the Philippines
Singapore’s digital infrastructure strategy emphasizes efficiency, resilience, and sustainable data center growth. That makes AI energy consumption a board-level issue, not just an engineering concern. Enterprises operating in Singapore face both physical constraints and policy pressure to justify compute-intensive deployments. Lower energy per query helps align AI adoption with broader sustainability and capacity objectives, especially for financial institutions, telcos, and government-facing platforms.
In the Philippines, demand is being driven by BPO transformation, multilingual support, e-commerce, fintech, and operations automation. Many organizations are adopting AI to improve agent productivity, reduce handle time, and enable 24/7 service. Energy efficiency still matters, but the immediate business case often centers on workload throughput and service quality. Even so, lower energy per query can reduce cloud bills, improve deployment feasibility for distributed teams, and support scaling without a proportional increase in infrastructure spend.
Cloud region selection changes the carbon and cost profile
The same AI workload can have different energy implications depending on where it runs. Cloud regions differ in grid mix, cooling efficiency, and hardware refresh cadence. Singapore-based deployments often prioritize proximity, low latency, and compliance, while Philippine teams may favor regional availability and cost optimization. If an enterprise can choose among regions, it should consider both service latency and the power profile of the serving region.
This is where procurement teams and technical teams need to work together. Lowest unit price does not always equal lowest total cost. A slightly cheaper region may increase latency, reduce reliability, or complicate data governance. The best decision balances user experience, compliance, and energy intensity.
How to measure and reduce query energy in production
Organizations should move from anecdotal estimates to instrumented measurement. The right process starts with workload segmentation. Separate short-form classification, retrieval-augmented answer generation, summarization, coding assistance, and agentic workflows. Then measure token counts, latency, accelerator utilization, batching efficiency, and memory pressure for each workload class. Without that segmentation, any “average energy per query” metric is too blunt to drive action.
Best practice also requires tracing the full request lifecycle. A query may look simple at the user interface but still trigger embedding generation, retrieval, reranking, safety checks, and post-processing. Each stage consumes energy. Enterprises that only measure the main model endpoint will undercount the true cost.
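The undercounting risk can be made visible with a simple per-stage ledger. This is a sketch under stated assumptions: the `EnergyLedger` class and the stage names are hypothetical, and the joule figures are placeholders standing in for whatever telemetry or estimates a team actually has.

```python
from collections import defaultdict


class EnergyLedger:
    """Accumulates estimated energy per pipeline stage for one request."""

    def __init__(self) -> None:
        self._by_stage: dict[str, float] = defaultdict(float)

    def record(self, stage: str, joules: float) -> None:
        if joules < 0:
            raise ValueError("joules must be non-negative")
        self._by_stage[stage] += joules

    def total(self) -> float:
        return sum(self._by_stage.values())

    def share_missed_if_only(self, stage: str) -> float:
        """Fraction of total energy that is invisible if only `stage` is measured."""
        total = self.total()
        if total == 0:
            return 0.0
        return 1.0 - self._by_stage.get(stage, 0.0) / total


ledger = EnergyLedger()
# Placeholder figures for a single retrieval-augmented request.
for stage, joules in [
    ("embedding", 2.0),
    ("retrieval", 3.0),
    ("reranking", 5.0),
    ("generation", 80.0),
    ("safety_checks", 6.0),
    ("post_processing", 4.0),
]:
    ledger.record(stage, joules)

missed = ledger.share_missed_if_only("generation")
print(f"total: {ledger.total():.0f} J, missed if only generation is measured: {missed:.0%}")
```

With these placeholder numbers, measuring only the generation endpoint would miss about a fifth of the request's energy. The real share depends entirely on the pipeline.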
Technical levers that reduce energy without degrading service
- Use smaller models for high-volume, low-complexity tasks and reserve larger models for exceptions.
- Trim prompt length by removing redundant instructions, repeated policy text, and stale conversation history.
- Apply semantic caching to reuse answers for repeated questions with low variance.
- Use retrieval filters and rerankers to reduce unnecessary context injection.
- Quantize models where accuracy tolerances allow it.
- Enable dynamic batching for asynchronous use cases.
- Set token budgets for output generation to prevent runaway responses.
- Continuously benchmark latency, throughput, and power consumption together, not separately.
These controls are especially relevant for enterprises building internal copilots, customer service automation, and document intelligence systems. They also support governance by making resource usage visible. Teams can then make informed tradeoffs between response quality, latency, and infrastructure cost instead of treating AI usage as a black box.
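As one example of these levers, the prompt-trimming item can be sketched as a function that keeps the system prompt plus only the most recent conversation turns that fit an input token budget. The function names are hypothetical, and the whitespace-based token count is a crude stand-in for the model's real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Crude whitespace proxy; a real system would use the model's tokenizer.
    return len(text.split())


def trim_history(system_prompt: str, turns: list[str], max_input_tokens: int) -> list[str]:
    """Keep the system prompt and the newest turns that fit within the budget."""
    budget = max_input_tokens - count_tokens(system_prompt)
    kept: list[str] = []
    # Walk backwards from the newest turn and stop at the first one that does not fit.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return [system_prompt] + list(reversed(kept))


history = ["a b c d", "e f", "g h i"]  # oldest to newest
trimmed = trim_history("You are helpful", history, max_input_tokens=9)
print(trimmed)  # ['You are helpful', 'e f', 'g h i']
```

Dropping stale turns from the oldest end preserves the most recent context, which is usually what the next response depends on. Summarizing older turns instead of dropping them is a common alternative when long-range context matters.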
Implementation checklist for enterprise teams
Start with a measurement baseline. Instrument GPU utilization, request volume, average input and output tokens, and model selection by use case. Add power telemetry where the platform supports it, or use infrastructure-level estimates tied to accelerator type and utilization. Then map business workflows to compute tiers so that each request category uses the lightest model that can meet the SLA and accuracy target.
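Where direct power telemetry is unavailable, an infrastructure-level estimate can be derived from accelerator type and utilization. The sketch below assumes power scales linearly between an idle and a maximum draw, which is a common simplification rather than a property of any specific device. All input values, including the facility overhead multiplier (`pue`), are placeholders.

```python
def estimate_wh_per_query(
    idle_watts: float,
    max_watts: float,
    utilization: float,
    hours: float,
    num_accelerators: int,
    queries: int,
    pue: float = 1.0,
) -> float:
    """Estimate watt-hours per query from utilization over a time window.

    Assumes a linear power model between idle and maximum draw.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if queries <= 0:
        raise ValueError("queries must be positive")
    avg_watts = idle_watts + (max_watts - idle_watts) * utilization
    total_wh = avg_watts * hours * num_accelerators * pue
    return total_wh / queries


# Placeholder inputs: 8 accelerators at 50% utilization for one hour,
# serving 100,000 queries, with a facility overhead factor of 1.25.
wh_per_query = estimate_wh_per_query(
    idle_watts=100.0,
    max_watts=700.0,
    utilization=0.5,
    hours=1.0,
    num_accelerators=8,
    queries=100_000,
    pue=1.25,
)
print(f"{wh_per_query:.3f} Wh per query")  # 0.040 Wh per query
```

An estimate like this is only useful per workload class. Averaging across short classification calls and long agentic workflows produces a number that describes neither.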
Next, establish an optimization backlog. Prioritize prompt compression, caching, retrieval tuning, and model routing before scaling to larger models or more expensive serving clusters. For Singapore-based organizations, include sustainability and compliance stakeholders early in architecture reviews. For Philippines-based organizations, include operations leaders and customer experience owners so efficiency gains do not disrupt service quality. Reassess quarterly, because model families, accelerator generations, and serving frameworks evolve quickly, and the energy cost of one AI query in 2026 can shift again as soon as the stack changes.
For implementation teams, the most reliable way to manage AI energy cost is to treat it as an engineering metric, a procurement metric, and a governance metric at the same time.

I am Tricia Huang Mei, an Advertising Partner in Sotavento Medios with over two decades of experience in the Singapore advertising and business sectors. My career is defined by a commitment to driving high-impact marketing campaigns and fostering sustainable growth for the diverse business portfolios I manage.








