


A deep dive into models, data, and frameworks: an exhaustive 54-page review of efficient large language models
Large language models (LLMs) have demonstrated compelling capabilities in many important tasks, including natural language understanding, language generation, and complex reasoning, and have had a profound impact on society. However, these outstanding capabilities come at the cost of substantial training resources (shown in the left panel of the figure) and long inference times (shown in the right panel). Researchers therefore need to develop effective techniques to address these efficiency problems.
In addition, as the right side of the figure shows, efficient LLMs such as Mistral-7B have already been used successfully in the design and deployment of LLMs. They greatly reduce inference memory usage and inference latency while maintaining accuracy comparable to LLaMA-1-33B, demonstrating that feasible and efficient methods have already been applied successfully to the design and use of LLMs.
In this survey, researchers from The Ohio State University, Imperial College London, Michigan State University, the University of Michigan, Amazon, Google, Boson AI, and Microsoft Research Asia provide a systematic and comprehensive review of research on efficient LLMs. They divide existing techniques for improving LLM efficiency into three categories, namely model-centric, data-centric, and framework-centric, and summarize and discuss the most relevant state-of-the-art techniques in each.
- Paper: https://arxiv.org/abs/2312.03863
- GitHub: https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey
To organize the papers covered in the survey and keep them up to date, the researchers created a GitHub repository and actively maintain it. They hope the repository will help researchers and practitioners systematically understand the research and development of efficient LLMs and inspire them to contribute to this important and exciting field.
The repository is available at https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey. It collects research papers, code, and documentation related to efficient LLMs, and readers interested in this area can find more information by visiting it.
Model-Centric
A model-centric approach focuses on efficient techniques at both the algorithm level and the system level, with the model itself as the focus. Because LLMs have billions or even trillions of parameters and exhibit unique characteristics such as emergent abilities that smaller models lack, new techniques must be developed to optimize their efficiency. This article discusses five categories of model-centric methods in detail: model compression, efficient pre-training, efficient fine-tuning, efficient inference, and efficient model architecture design.
1. Model compression
In the field of machine learning, model size is often an important consideration. Larger models require more storage space and computing resources, and may run into hard limits when deployed on mobile devices. Model compression is therefore a commonly used family of techniques for reducing model size.
Model compression techniques fall into four main categories: quantization, parameter pruning, low-rank approximation, and knowledge distillation (see the figure below). Quantization compresses the model's weights or activation values from high precision to low precision; parameter pruning searches for and removes the more redundant parts of the model's weights; low-rank approximation factorizes the weight matrices into products of several small low-rank matrices; and knowledge distillation uses a large model directly to train a small model, so that the small model can stand in for the large one on certain tasks.
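To make quantization concrete, below is a minimal sketch of symmetric 8-bit weight quantization in PyTorch. It is purely illustrative: the tensor, per-tensor scale, and function names are assumptions for this example, and practical LLM quantizers (e.g., GPTQ or AWQ) are considerably more sophisticated.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Map a float weight tensor to int8 plus a per-tensor scale."""
    scale = w.abs().max() / 127.0                        # largest magnitude maps to 127
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    """Recover an approximate float tensor for computation."""
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)                              # a hypothetical weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs reconstruction error:", (w - w_hat).abs().max().item())
```

Storing `q` and `scale` instead of `w` cuts weight memory roughly 4x relative to fp32, at the cost of a small reconstruction error.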
2. Efficient pre-training
The cost of pre-training LLMs is very high. Efficient pre-training aims to improve efficiency and reduce the cost of the LLM pre-training process. It can be divided into mixed-precision acceleration, model scaling, initialization techniques, optimization strategies, and system-level acceleration.
Mixed-precision acceleration improves pre-training efficiency by computing gradients, weights, and activations in low precision, then converting them back to high precision when applying updates to the original weights. Model scaling accelerates pre-training convergence and reduces training cost by using the parameters of small models to scale up to large models. Initialization techniques speed up model convergence by carefully designing the model's initial values. Optimization strategies focus on designing lightweight optimizers to reduce memory consumption during training. System-level acceleration uses distributed training and related techniques to accelerate pre-training at the system level.
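As a concrete illustration of mixed-precision training, the sketch below uses PyTorch's automatic mixed precision (AMP). It assumes a CUDA device, and the model and data are placeholders rather than anything from the survey.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()         # scales the loss to avoid fp16 gradient underflow

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with torch.cuda.amp.autocast():          # forward pass runs in low precision where safe
        loss = model(x).pow(2).mean()
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()            # backward pass on the scaled loss
    scaler.step(optimizer)                   # unscale gradients, then update fp32 master weights
    scaler.update()
```

The forward and backward passes run largely in low precision, while the optimizer still updates full-precision master weights, which is exactly the high/low-precision split described above.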
3. Efficient fine-tuning
Efficient fine-tuning aims to improve the efficiency of the LLM fine-tuning process. Common efficient fine-tuning techniques fall into two categories: parameter-efficient fine-tuning and memory-efficient fine-tuning.
The goal of parameter-efficient fine-tuning (PEFT) is to adapt an LLM to downstream tasks by freezing the entire LLM backbone and updating only a small set of additional parameters. The paper further divides PEFT into adapter-based fine-tuning, low-rank adaptation, prefix tuning, and prompt tuning.
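Low-rank adaptation is the easiest of these to sketch in a few lines. The toy layer below freezes a base linear weight and trains only the small matrices A and B; the class name, rank, and scaling are illustrative assumptions, not the survey's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)           # freeze the backbone weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # y = x W^T + scaling * (x A^T) B^T -- only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(1024, 1024, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} / {total}")    # a small fraction of the layer
```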
Memory-efficient fine-tuning focuses on reducing memory consumption throughout the LLM fine-tuning process, for example the memory consumed by optimizer states and activation values.
4. Efficient inference
Efficient inference aims to improve the efficiency of the LLM inference process. The researchers divide common efficient-inference techniques into two categories: algorithm-level inference acceleration and system-level inference acceleration.
Algorithm-level inference acceleration can be divided into two categories: speculative decoding and KV-cache optimization. Speculative decoding speeds up the sampling process by using a smaller draft model to propose candidate tokens in parallel, which the larger target model then verifies. KV-cache optimization refers to optimizing the repeated computation of key-value (KV) pairs during LLM inference.
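The toy single-head attention below shows why a KV cache matters for autoregressive decoding: keys and values for past tokens are stored and reused, so each step only computes projections for the newest token. All tensor shapes and names here are illustrative assumptions.

```python
import torch

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x_t: torch.Tensor) -> torch.Tensor:
    """x_t: embedding of the newest token, shape (d,)."""
    q = x_t @ Wq
    k_cache.append(x_t @ Wk)                  # append one new key instead of recomputing all
    v_cache.append(x_t @ Wv)                  # append one new value instead of recomputing all
    K, V = torch.stack(k_cache), torch.stack(v_cache)
    attn = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V                           # attention output for the new token only

for _ in range(5):                            # five decoding steps; the cache grows by one entry each
    out = decode_step(torch.randn(d))
print("cached steps:", len(k_cache))
```

Without the cache, every step would recompute K and V for the entire prefix; with it, per-step cost grows only with the attention itself, which is what KV-cache optimizations then try to shrink further.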
System-level inference acceleration accelerates LLM inference by optimizing the number of memory accesses on specific hardware, increasing the degree of parallelism in the algorithm, and similar techniques.
5. Efficient model architecture design
Efficient architecture design for LLMs refers to strategically optimizing the model structure and computation to improve performance and scalability while minimizing resource consumption. The survey divides efficient model architecture design into four major categories by model type: efficient attention modules, mixture-of-experts (MoE) models, long-context LLMs, and architectures that can replace the Transformer.
Efficient attention modules aim to reduce the heavy computation and memory usage of the attention module. Mixture-of-experts (MoE) models replace some modules of the LLM with multiple small expert models and route each input to only a few of them, achieving overall sparsity. Long-context LLMs are models specially designed to process very long text efficiently. Transformer-alternative architectures redesign the model architecture to lower its complexity while achieving inference capabilities comparable to the Transformer.
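The routing idea behind MoE can be sketched as follows: a small router scores the experts for each token and only the top-k experts are evaluated. The layer below is a minimal, assumed illustration in which a single linear layer stands in for each expert network; it is not tied to any specific MoE model.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                                 # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = torch.softmax(weights, dim=-1)          # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                        # dispatch each token to its k chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

y = TopKMoE()(torch.randn(16, 512))
print(y.shape)                                            # torch.Size([16, 512])
```

Only k of the experts run for each token, so the layer's parameter count can grow without a proportional increase in per-token compute.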
Data-Centric
A data-centric approach focuses on the role of data quality and structure in improving LLM efficiency. In this article, the researchers discuss two types of data-centric methods in detail: data selection and prompt engineering.
1. Data selection
Data selection for LLMs cleans and selects the pre-training or fine-tuning data, for example by removing redundant and invalid samples, to speed up the training process.
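One of the simplest data-selection steps is removing duplicate documents before training. The sketch below deduplicates a toy corpus by hashing normalized text; real pipelines typically use MinHash/LSH or embedding similarity, so treat this as an assumed, minimal illustration.

```python
import hashlib

def normalize(doc: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(doc.lower().split())

def deduplicate(corpus):
    seen, kept = set(), []
    for doc in corpus:
        digest = hashlib.md5(normalize(doc).encode()).hexdigest()
        if digest not in seen:                # keep only the first copy of each document
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = ["Hello world.", "hello   WORLD.", "Efficient LLMs are useful."]
print(deduplicate(corpus))                    # the near-duplicate second entry is dropped
```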
2. Prompt engineering
Prompt engineering guides LLMs to generate the desired output efficiently by designing effective inputs (prompts); a well-designed prompt can achieve model performance comparable to tedious fine-tuning. The researchers divide common prompt-engineering techniques into three major categories: few-shot prompting, prompt compression, and prompt generation.
Few-shot prompting provides the LLM with a limited set of examples to guide its understanding of the task it needs to perform. Prompt compression accelerates the LLM's processing of inputs by compressing lengthy prompts or by learning and using compact prompt representations. Prompt generation aims to automatically create effective prompts that guide the model to produce specific and relevant responses, rather than relying on manually annotated data.
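Few-shot prompting is easy to illustrate without any model at all: a handful of labeled examples are prepended to the query so the model can infer the task format. The template, examples, and task below are illustrative assumptions.

```python
# Build a few-shot sentiment-classification prompt from a couple of labeled examples.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I regretted buying a ticket.", "negative"),
]

def build_prompt(query: str) -> str:
    shots = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
    return f"{shots}\nReview: {query}\nSentiment:"

print(build_prompt("The plot dragged, but the acting was superb."))
```

The resulting string would be sent to the LLM as-is; prompt-compression methods then try to shorten exactly this kind of input without losing the task signal.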
Framework-Centric
The researchers surveyed recently popular efficient-LLM frameworks and list the efficiency-oriented tasks each framework can optimize, including pre-training, fine-tuning, and inference (as shown in the figure below).
Summary
In this survey, the researchers provide a systematic review of efficient LLMs, an important research area dedicated to democratizing LLMs. They begin by explaining why efficient LLMs are needed. Within an organized framework, the paper then surveys algorithm-level and system-level efficiency techniques for LLMs from the model-centric, data-centric, and framework-centric perspectives respectively.
The researchers believe that efficiency will play an increasingly important role in LLMs and LLM-oriented systems. They hope this survey will help researchers and practitioners quickly enter the field and serve as a catalyst for new research on efficient LLMs.