PowerInfer: Fast Large Language Model Serving with a Consumer-Grade GPU


Thanks to their outstanding content generation capabilities, generative large language models are now at the forefront of the AI revolution, with continuous efforts underway to improve their generative abilities.

However, despite rapid developments, these models require significant computing power and resources. This is largely because they consist of hundreds of billions of parameters.

Moreover, thousands of GPUs are needed for generative AI models to run smoothly, which leads to significant operational costs.

High operational demands are the main reason why generative AI models have not yet been effectively implemented on personal-grade devices.

In this article, we will discuss PowerInfer, a high-speed LLM inference engine designed for standard computers powered by a single consumer-grade GPU.

The PowerInfer framework aims to exploit the high locality inherent in LLM inference, characterized by a power law distribution in neuron activations.

This means that at any given time, a small subset of ‘hot’ neurons are consistently active across inputs, while the remainder, called ‘cold’ neurons, are activated according to specific inputs or requirements.

This approach enables the PowerInfer framework to reduce the computing power required for generative AI to produce desired outputs.
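To make the hot/cold distinction concrete, here is a minimal sketch (function names and the `hot_fraction` threshold are hypothetical, not taken from the paper) of profiling activation traces and marking the most frequently firing neurons as hot:

```python
from collections import Counter

def classify_neurons(activation_traces, hot_fraction=0.2):
    """Split neuron ids into 'hot' and 'cold' sets by activation frequency.

    activation_traces: iterable of sets, each holding the neuron ids that
    fired for one profiled input. hot_fraction is an illustrative knob:
    the share of neurons treated as hot.
    """
    counts = Counter()
    for active in activation_traces:
        counts.update(active)
    ranked = [n for n, _ in counts.most_common()]
    n_hot = max(1, int(len(ranked) * hot_fraction))
    hot = set(ranked[:n_hot])
    cold = set(ranked[n_hot:])
    return hot, cold

# Toy example: neuron 0 fires in every trace, so it lands in the hot set.
traces = [{0, 1}, {0, 2}, {0, 3}, {0, 1}]
hot, cold = classify_neurons(traces, hot_fraction=0.25)
```

In practice the profiling would run over a representative request dataset, but the power-law intuition is the same: a small, stable subset dominates the activation counts.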

We will examine the PowerInfer framework in detail, covering its methodology, pipeline, and practical results. Let’s start.

PowerInfer: Fast Large Language Model Inference with a Consumer-Grade GPU

Generative models such as ChatGPT and DALL-E handle advanced content generation and natural language processing tasks.

 Due to their high computational requirements, these models are often used in data centers with advanced GPUs.

The need for such high computing power confines their deployment to data centers and highlights the need to bring large language models to more accessible local platforms such as personal computers.

Increasing the accessibility of large language models can reduce inference and context generation costs, improve data privacy, and allow model customization.

Additionally, data center deployments may prioritize high throughput, while local LLM deployments may focus on low latency due to smaller batch sizes.

However, deploying these models to local devices poses significant challenges due to significant memory requirements.

Large language models that act as autoregressive transformers generate text on a token-by-token basis, with each token requiring access to the entire model consisting of hundreds of billions of parameters.

This requires a large number of high-end GPUs for low-latency output generation.
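The token-by-token generation described above can be sketched as a simple loop (the `model` callable here is a hypothetical stand-in for a real LLM forward pass, not PowerInfer's API):

```python
def generate(model, prompt_tokens, max_new_tokens=8, eos_token=None):
    """Minimal autoregressive decoding loop: each new token requires a
    full forward pass, touching the model's parameters once per token.
    `model` is any callable mapping a token sequence to the next token id.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model(tokens)  # full parameter access per step
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens

# Toy "model" that just echoes the last token plus one.
toy_model = lambda toks: toks[-1] + 1
out = generate(toy_model, [1, 2], max_new_tokens=3)
```

The key cost driver is that the loop body cannot be batched across output tokens: each iteration depends on the previous one, which is why single-request local inference is latency-bound rather than throughput-bound.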

Additionally, local deployments often process individual requests sequentially, limiting the potential for parallel processing.

To meet the complex memory requirements of a generative AI framework, existing solutions use methods such as model offloading and compression.

Techniques such as distillation, pruning, and quantization reduce the model size, but the resulting models are still too large for standard-grade GPUs in personal computers.

Model offloading, which splits the model between CPU and GPU at the Transformer-layer level, enables distributed layer processing across CPU and GPU memories.

However, this method is limited by the slow PCIe interconnect and limited computational capabilities of CPUs, resulting in high inference latency.

The PowerInfer framework argues that the mismatch between LLM inference characteristics and hardware architecture is the primary cause of memory issues in LLM inference.

Ideally, frequently accessed data should be stored in high-bandwidth, limited-capacity GPUs, while less frequently accessed data should be stored in low-bandwidth, high-capacity CPUs.

However, the large parameter volume of each LLM inference iteration makes the working set too large for a single GPU, resulting in inefficient use of locality.

The inference process in large language models exhibits high locality, where each iteration activates a limited number of neurons.

The PowerInfer framework aims to exploit this locality by managing a small number of hot neurons on the GPU while the CPU manages the cold neurons.

It preselects and preloads hot neurons onto the GPU, and predicts which neurons are activated at runtime.

This approach minimizes costly PCIe data transfers, allowing GPUs and CPUs to independently process the neurons assigned to them.

But deploying LLMs to local devices faces hurdles. Online predictors, which are vital for identifying active neurons, consume significant amounts of GPU memory.

 The PowerInfer framework uses an adaptive method to generate small estimators for layers with higher activation skewness and sparsity, preserving accuracy while reducing size.

In addition, LLM frameworks require specialized sparse operators. The PowerInfer framework eliminates the need for specific sparse format conversions by using neuron-aware sparse operators that work directly with neurons.

Finally, it is difficult to optimally place activated neurons between the CPU and GPU. The PowerInfer framework uses an offline phase to generate a neuron placement policy, measuring the impact of each neuron on the LLM inference results and framing placement as an integer linear programming problem.

Architecture and Methodology

The figure below details the architecture of the PowerInfer framework, which consists of offline and online components in the pipeline.

Because different regions of a large language model exhibit different locality properties, the offline component profiles activation sparsity, allowing the framework to distinguish between hot and cold neurons.

In the online phase, both types of neurons are loaded onto the CPU and GPU by the inference engine, which then serves LLM requests with low latency during runtime.

Offline Phase: Policy Solver and LLM Profiler

In the offline phase, an LLM profiler component uses requests derived from the public dataset to collect activation data from the inference process.

 In the first step, it monitors the activation of neurons in all layers in the framework and proceeds to use a policy solver component to categorize neurons as hot or cold.

The main goal of the policy solver is to allocate more frequently activated neurons to GPU layers while allocating the rest to CPU layers.

In the second step, the policy solver component uses neuron impact measurements and hardware specifications to balance the workload between layers, maximizing the total impact of GPU-resident neurons using integer linear programming.
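The placement problem can be viewed as a knapsack-style optimization: maximize the total impact of neurons placed on the GPU subject to its memory capacity. The brute-force sketch below is an illustrative stand-in for the paper's integer-linear-programming formulation (all names, scores, and units here are hypothetical):

```python
from itertools import combinations

def place_neurons(impacts, sizes, gpu_capacity):
    """Choose the subset of neurons for the GPU that maximizes total
    impact (activation-frequency benefit) without exceeding GPU memory.
    Brute force over subsets; a real solver would use ILP.

    impacts: {neuron_id: impact score}; sizes: {neuron_id: memory cost}.
    Returns (gpu_set, cpu_set).
    """
    ids = list(impacts)
    best, best_score = frozenset(), 0.0
    for r in range(len(ids) + 1):
        for subset in combinations(ids, r):
            if sum(sizes[n] for n in subset) <= gpu_capacity:
                score = sum(impacts[n] for n in subset)
                if score > best_score:
                    best, best_score = frozenset(subset), score
    cpu = set(ids) - set(best)
    return set(best), cpu

# Toy instance: neurons "a" and "c" fit together and dominate the score.
gpu, cpu = place_neurons({"a": 5, "b": 3, "c": 4},
                         {"a": 2, "b": 2, "c": 1}, gpu_capacity=3)
```

The real solver additionally accounts for cross-device communication overhead and per-unit compute capability, which this sketch omits for clarity.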

Online Phase: Neuron-Aware LLM Inference Engine

Once the offline phase executes successfully, the framework continues executing the online phase. In the third step of the process, the online engine assigns hot and cold neurons to the relevant processing units before processing user requests based on the output of the offline policy solver.

At runtime, in step 4, the online engine coordinates GPU-CPU computation by creating GPU and CPU executors, which are threads launched on the CPU side.

The engine then predicts the activated neurons and skips the inactive ones.

The activated neurons are then processed on the GPU, while the CPU computes its assigned neurons and transfers the results to the GPU for integration.

 Because the online engine uses sparse neuron-aware operators on CPUs as well as GPUs, it is able to focus on individual rows and columns of neurons within matrices.

Adaptive Sparsity Estimators

The key concept behind reducing computational loads with the online inference engine in the PowerInfer framework is that it only processes neurons that it predicts will be activated.

Typically, in each Transformer layer the framework uses two different predictors to forecast neuron activation in the MLP and self-attention blocks; as a result, inference computation is limited to the neurons predicted to be active.

However, designing effective predictors for local deployment is difficult, because the limited resources make it hard to balance predictor size against prediction accuracy.

Since these predictors are invoked frequently to predict active neurons, they need to reside on the GPU for fast access. However, frameworks often use a large number of predictors, which consume a significant amount of GPU memory on top of that required to store the LLM parameters.

Moreover, the size of the predictors is generally determined by two factors: Internal Skewness and Sparsity of the LLM layers.

To optimize for these factors, the PowerInfer framework uses an iterative training method with non-fixed predictor sizes for each Transformer layer.

In the first step of this training method, a base predictor size is established based on the layer's sparsity profile; the size is then iteratively adjusted, taking internal activation skewness into account, to maintain accuracy.

Neuron Placement and Management

As mentioned before, the offline policy solver component determines the neuron placement policy, while the online inference engine component loads the model into GPU and CPU memory according to the generated policy.

For each layer, whether it holds one or multiple weight matrices, the PowerInfer framework assigns each neuron to the CPU or GPU depending on whether the neuron is frequently activated at runtime.

For accurate results, it is important that the split neurons are computed in their original order.

 To overcome this, the PowerInfer framework creates two neuron tables: one located on the GPU and the other in CPU memory; each table relates individual neurons to their original positions in the matrix.
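The idea behind the two tables can be sketched as follows (this is an illustration of the concept, not PowerInfer's actual data layout): each device keeps a list of the original row indices of its neurons, so partial results can be merged back in the matrix's original order.

```python
def build_neuron_tables(placement, num_neurons):
    """Build per-device lookup tables mapping each device's neurons back
    to their original row positions in the weight matrix.

    placement: dict mapping neuron index -> "gpu" or "cpu".
    Returns (gpu_table, cpu_table), each a sorted list of original indices.
    """
    gpu_table = [i for i in range(num_neurons) if placement[i] == "gpu"]
    cpu_table = [i for i in range(num_neurons) if placement[i] == "cpu"]
    return gpu_table, cpu_table

# Toy placement over four neurons of one layer.
placement = {0: "gpu", 1: "cpu", 2: "gpu", 3: "cpu"}
gpu_table, cpu_table = build_neuron_tables(placement, 4)
```

Because both tables preserve the original ordering, a result computed on either device can be scattered back into the full output vector at the correct offsets.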

Neuron Aware Operator

Given the sparsity of activation observed in large language models, inactive neurons and their weights can be skipped by matrix multiplication operations, thus necessitating the use of sparse operators.

Instead of using sparse operators with various limitations, the PowerInfer framework uses neuron-aware operators that compute activated neurons and their weights directly on the GPU and CPU, without requiring dense conversion at runtime.

Neuron-aware operators differ from traditional sparse operators in that they focus on individual row and column vectors within a single matrix rather than focusing on the entire matrix.
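As a minimal sketch of this row-level approach (illustrative code, not PowerInfer's actual kernels), a neuron-aware matrix-vector product computes only the output rows for neurons predicted active and skips the rest entirely, indexing the dense matrix directly with no sparse-format conversion:

```python
def neuron_aware_matvec(weights, x, active_rows):
    """Compute y = W @ x, but only for rows in active_rows; inactive
    outputs are left at zero, mirroring neurons predicted inactive.

    weights: list of row vectors; x: input vector;
    active_rows: iterable of row indices predicted active.
    """
    out = [0.0] * len(weights)
    for r in active_rows:
        row = weights[r]  # direct row access, no format conversion
        out[r] = sum(w * v for w, v in zip(row, x))
    return out

# Toy 3x2 weight matrix; neuron 1 is predicted inactive and skipped.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = neuron_aware_matvec(W, [1.0, 1.0], active_rows=[0, 2])
```

The savings scale with sparsity: if only a small fraction of neurons is active, most rows of the matrix are never touched at all.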

Neuron Placement Policy

To leverage the computational capabilities of CPUs and GPUs, the offline component in the PowerInfer framework creates a placement policy that guides the framework when allocating neurons to CPU or GPU layers.

 The policy solver creates this policy and checks the neuron placement in each layer; this helps determine the computational workload for individual processing units.

 When generating the placement policy, the policy solver component takes into account different factors, including the activation frequency for each neuron, the communication overhead, and the computational capabilities of each processing unit, such as bandwidths and memory size.

Results and Application

To demonstrate the generalization capabilities of the PowerInfer framework across devices with different hardware configurations, the experiments were performed on two different personal computers: one equipped with an Intel i9-13900K processor, NVIDIA RTX 4090 GPU, and 192 GB of host memory, while the other runs on an Intel i7-12700K processor, NVIDIA RTX 2080Ti GPU and 64 GB of host memory.

End-to-end performance of the PowerInfer framework was compared against llama.cpp with a batch size of 1 and default deployment settings.

Prompts were sampled from the ChatGPT and Alpaca datasets to reflect the length variability of real-world dialogue inputs and outputs. The figure below shows generation speeds for different models.

As can be seen, the PowerInfer framework generates an average of 8.32 tokens per second, reaching up to 16 tokens per second, outperforming the llama.cpp framework by a significant margin. Additionally, as the number of output tokens increases, the performance advantage of the PowerInfer framework grows, since the generation phase dominates overall inference time.

Additionally, as can be seen in the image above, the PowerInfer framework outperforms the llama.cpp framework on low-end computers, with a peak token generation rate of 7 tokens per second and an average token generation rate of 5 tokens per second.

The image above shows the distribution of neuron loads between GPU and CPU for the two frameworks. As can be seen, the PowerInfer framework significantly increases the GPU's share of the neuron load, from 20% to 70%.

The image above compares the performance of the two frameworks on two computers with different specifications.

As can be seen, the PowerInfer framework delivers consistently higher token generation throughput than the llama.cpp framework.

Our Final Thoughts

In this article, we discussed PowerInfer, a high-speed LLM inference engine for a standard computer powered by a single consumer-grade GPU.

At its core, the PowerInfer framework leverages the high locality inherent in LLM inference, characterized by a power-law distribution of neuron activations.

The PowerInfer framework is a fast inference system designed for large language models that uses adaptive predictors and neuron-aware operators to exploit activation and computational sparsity.

