
Zero-barrier ChatGPT cloning solution upgraded: a fully reproduced open-source model, with an online demo that requires no registration

Apr 14, 2023 pm 10:58 PM

AI applications and large models such as ChatGPT and GPT-4 have become popular worldwide. They are widely regarded as the start of a new technological and industrial revolution and as a new starting point for AGI (artificial general intelligence). Technology giants are racing to launch competing products, and many leading AI figures in academia and industry are joining the wave of start-ups. Generative AI now iterates on a timescale of days, and the momentum keeps building.

However, OpenAI has not open-sourced these models. What are the technical details behind them? How can one quickly follow, catch up with, and take part in this wave of technology? How can the high cost of building and applying large AI models be reduced? And how can core data and intellectual property be protected from leaks when using third-party large-model APIs?

As the most popular open-source solution for large AI models, Colossal-AI is the first to provide a complete RLHF pipeline: supervised data set collection -> supervised fine-tuning -> reward model training -> reinforcement learning fine-tuning. Building on the LLaMA pre-trained model, it has launched ColossalChat, which is currently the practical open-source project closest to ChatGPT's original technical approach.

Open source address: https://github.com/hpcaitech/ColossalAI

The release includes the following:

1. Demo: try the model online directly, with no registration or waiting list

2. Training code: the complete RLHF training code is open source, covering the 7B and 13B models

3. Data set: an open-source bilingual (Chinese and English) data set of 104K examples

4. Inference deployment: with 4-bit quantization, inference for the 7-billion-parameter model needs only 4GB of GPU memory

5. Model weights: a single server with a small amount of computing power is enough to reproduce them quickly

6. Larger models, more data sets, and further optimizations will continue to be added at a rapid pace

Affordable models, powerful capabilities

ColossalChat needs fewer than 10 billion parameters. By applying RLHF fine-tuning on top of a large language model, it acquires bilingual Chinese and English capabilities and achieves results similar to ChatGPT and GPT-3.5.

For example, common-sense question answering:

[Screenshot: common-sense question answering]

Answering in Chinese:

[Screenshot: answering in Chinese]

Writing an email:

[Screenshot: writing an email]

Writing an algorithm:

[Screenshot: writing an algorithm]

Complete ChatGPT cloning solution

Although GPT series models such as ChatGPT and GPT-4 are very powerful, they are unlikely to be fully open source. Fortunately, the open source community continues to work hard.

For example, Meta has open-sourced the LLaMA model, whose parameter counts range from 7 billion to 65 billion. The 13-billion-parameter version can outperform the 175-billion-parameter GPT-3 on most benchmarks. However, because it has not been instruction-tuned, its actual generation quality is not ideal.

Stanford's Alpaca generates training data in a self-instruct manner by calling the OpenAI API, so that a lightweight model with only 7 billion parameters can be fine-tuned at very low cost to reach dialogue quality comparable to that of very large language models with hundreds of billions of parameters, such as GPT-3.5.

However, existing open-source solutions are essentially supervised fine-tuned models that complete only the first step of reinforcement learning from human feedback (RLHF); no subsequent alignment or fine-tuning has been performed. In addition, Alpaca's training data set is small and its corpus is English-only, which also limits the model's performance to some extent.

The striking results of ChatGPT and GPT-4 come from introducing RLHF into the training process, which makes the generated content more consistent with human values.

[Figure: The three stages of RLHF]

Based on the LLaMA model, Colossal-AI has released ColossalChat, the first open-source ChatGPT-like reproduction that includes a complete RLHF process. It is currently the practical open-source project closest to ChatGPT's original technical route.

Open-source training data set

ColossalChat has open-sourced a bilingual Chinese and English data set containing about 100,000 question-answer pairs. Real questions that people ask on social platforms were collected and cleaned to form a seed data set, which was then expanded with self-instruct; annotation cost about $900. Compared with data sets produced by other self-instruct methods, the seed data is more realistic and varied, and the generated data covers more topics. The data can be used for both fine-tuning and RLHF training. With this high-quality data, ColossalChat supports better conversational interaction and also supports Chinese.

[Figure: ColossalChat data set collection process]

RLHF algorithm reproduction

RLHF-Stage1 is supervised fine-tuning (SFT): the model is fine-tuned on the data set described above.

RLHF-Stage2 trains the reward model. Human annotators rank different outputs for the same prompt to obtain the corresponding scores, which then supervise the training of the reward model.
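One common way to turn human rankings into a training signal is a pairwise ranking loss: for each pair of responses to the same prompt, the reward model is pushed to score the preferred response higher. The sketch below illustrates that idea in plain Python. It is an assumption about the general technique, not a copy of ColossalChat's implementation, and the function names are hypothetical.

```python
import math

def pairwise_ranking_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Return -log(sigmoid(chosen - rejected)), computed in a stable way."""
    diff = chosen_reward - rejected_reward
    # -log(sigmoid(diff)) equals softplus(-diff) = log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, rewrite to avoid overflow in exp(-diff).
    return -diff + math.log1p(math.exp(diff))

def batch_ranking_loss(pairs):
    """Average the pairwise loss over (chosen_reward, rejected_reward) pairs."""
    return sum(pairwise_ranking_loss(c, r) for c, r in pairs) / len(pairs)

if __name__ == "__main__":
    # Hypothetical reward-model scores for two ranked pairs.
    scores = [(1.5, 0.2), (0.1, 0.4)]
    print(f"mean ranking loss: {batch_ranking_loss(scores):.4f}")
```

The loss is small when the preferred response already scores well above the rejected one, and grows roughly linearly when the order is reversed.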

RLHF-Stage3 uses a reinforcement learning algorithm, which is the most complex part of the training process:

[Figure: RLHF-Stage3 algorithm flow chart]

The PPO part of ColossalChat has two stages. The first is Make Experience: the SFT, Actor, RM, and Critic models are used to compute Experience, which is stored in a buffer. The second is the parameter update: the Experience is used to compute the policy loss and the value loss.
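The two losses in the update stage can be illustrated with the standard PPO formulation: a clipped surrogate objective for the policy and a squared error for the value estimate. The following is a minimal plain-Python sketch under that assumption. The buffer records, field names, and the clipping value of 0.2 are hypothetical and are not taken from the ColossalChat code.

```python
import math

def ppo_policy_loss(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """Clipped surrogate policy loss for one sampled action."""
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # PPO maximizes the smaller of the two terms, so the loss is its negative.
    return -min(ratio * advantage, clipped_ratio * advantage)

def value_loss(predicted_value, target_return):
    """Squared-error loss for the Critic's value estimate."""
    return 0.5 * (predicted_value - target_return) ** 2

# Stage 1 (Make Experience): hypothetical records stored in the buffer.
buffer = [
    {"old_logprob": -1.2, "advantage": 0.5, "target_return": 0.6},
    {"old_logprob": -0.7, "advantage": -0.3, "target_return": 0.1},
]

# Stage 2 (parameter update): the current Actor and Critic re-score each record.
current = [
    {"new_logprob": -1.0, "predicted_value": 0.4},
    {"new_logprob": -0.9, "predicted_value": 0.2},
]

policy = sum(
    ppo_policy_loss(c["new_logprob"], e["old_logprob"], e["advantage"])
    for e, c in zip(buffer, current)
) / len(buffer)
value = sum(
    value_loss(c["predicted_value"], e["target_return"])
    for e, c in zip(buffer, current)
) / len(buffer)

if __name__ == "__main__":
    print(f"policy loss: {policy:.4f}, value loss: {value:.4f}")
```

Clipping the probability ratio keeps a single update from moving the Actor too far from the policy that generated the stored Experience.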

In the PTX part, ColossalChat computes the cross-entropy loss between the Actor's output response and the answer portion of the input corpus. This adds pre-training gradients to the PPO gradients so that the model retains its original language-model performance and avoids forgetting. Finally, the policy loss, value loss, and PTX loss are summed for backpropagation and the parameter update.
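As a rough illustration of how the three terms combine, the sketch below computes a token-level cross-entropy (the PTX term) and adds it to the policy and value losses. The weighting coefficients are an assumption added for illustration; with both set to 1.0 the result is the plain sum described above. All names are hypothetical.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy of one token: log-sum-exp of the logits minus the target logit."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]

def ptx_loss(token_logits, target_tokens):
    """Mean cross-entropy over the answer tokens of a pre-training sample."""
    losses = [cross_entropy(l, t) for l, t in zip(token_logits, target_tokens)]
    return sum(losses) / len(losses)

def total_loss(policy_loss, value_loss, ptx, vf_coef=1.0, ptx_coef=1.0):
    """Combine the three terms; coefficients of 1.0 give a plain sum."""
    return policy_loss + vf_coef * value_loss + ptx_coef * ptx

if __name__ == "__main__":
    # Hypothetical logits over a 3-token vocabulary for a 2-token answer.
    logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    targets = [0, 2]
    ptx = ptx_loss(logits, targets)
    print(f"ptx loss: {ptx:.4f}, total: {total_loss(-0.1, 0.02, ptx):.4f}")
```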

Get started quickly

ColossalChat has open sourced the complete code for reproducing the three stages of training ChatGPT based on the LLaMA model.

The first stage, train the SFT model:

# Training on a 4-GPU server
colossalai run --nproc_per_node=4 train_sft.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path /path/to/Coati-7B \
    --dataset /path/to/data.json \
    --batch_size 4 \
    --accimulation_steps 8 \
    --lr 2e-5
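The batch size, accumulation steps, and GPU count in the command together determine how many samples contribute to each optimizer step. The short calculation below shows the arithmetic, assuming that `--batch_size` is the per-GPU batch size (the article does not say so explicitly). The function name is hypothetical.

```python
def effective_batch_size(per_gpu_batch: int, accumulation_steps: int, num_gpus: int) -> int:
    """Samples consumed per optimizer step under data parallelism."""
    return per_gpu_batch * accumulation_steps * num_gpus

if __name__ == "__main__":
    # Values from the SFT command: 4 per GPU, 8 accumulation steps, 4 GPUs.
    print(effective_batch_size(4, 8, 4))
```

Under that assumption, the SFT command above updates the model once per 128 samples.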

The second stage, train the reward model:

# Training on a 4-GPU server
colossalai run --nproc_per_node=4 train_reward_model.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --dataset /path/to/datasets

The third stage, train with reinforcement learning:

# Training on an 8-GPU server
colossalai run --nproc_per_node=8 train_prompts.py prompts.csv \
    --strategy colossalai_zero2 \
    --pretrain "/path/to/Coati-7B" \
    --model 'llama' \
    --pretrain_dataset /path/to/dataset

After obtaining the final model weights, you can also use quantization to lower the hardware cost of inference and start an online inference service. A single GPU with about 4GB of memory is enough to deploy the inference service for the 7-billion-parameter model.

python server.py /path/to/pretrained --quant 4bit --gptq_checkpoint /path/to/coati-7b-4bit-128g.pt --gptq_group_size 128
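The 4GB figure is consistent with a simple estimate of weight storage. The sketch below computes the memory taken by the weights alone at a given precision. It is a back-of-the-envelope calculation that ignores activations, the KV cache, and the per-group quantization metadata, which is why real usage is somewhat higher than the weights-only number. The function name is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

if __name__ == "__main__":
    params = 7e9  # 7-billion-parameter model
    fp16 = weight_memory_gib(params, 16)
    int4 = weight_memory_gib(params, 4)
    print(f"FP16 weights: {fp16:.2f} GiB")
    print(f"4-bit weights: {int4:.2f} GiB")
    print(f"reduction: {1 - int4 / fp16:.0%}")
```

Going from 16 bits to 4 bits per weight cuts weight storage by three quarters, leaving roughly 3.3 GiB for the weights of a 7B model.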

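System performance optimization and development acceleration

ColossalChat was able to quickly reproduce the complete ChatGPT RLHF pipeline thanks to the underlying support of Colossal-AI, an infrastructure for large AI models, and its related optimization techniques. Under the same conditions, training is about three times faster than with the FSDP (Fully Sharded Data Parallel) approach used by Alpaca.

System infrastructure: Colossal-AI

Colossal-AI, a development system for large AI models, provides the foundation for this solution. Built on PyTorch, it enables efficient and rapid deployment of large-model training and inference, which lowers the cost of applying large AI models. Colossal-AI is developed under the leadership of James Demmel, Distinguished Professor at the University of California, Berkeley, and You Yang, Presidential Young Professor at the National University of Singapore. Since being open-sourced, Colossal-AI has ranked first worldwide on the GitHub trending list several times, has earned about 20,000 GitHub stars, and has been accepted as an official tutorial at top international AI and HPC conferences including SC, AAAI, PPoPP, CVPR, and ISC.

ZeRO + Gemini to reduce memory redundancy

Colossal-AI supports the Zero Redundancy Optimizer (ZeRO), which improves memory efficiency and accommodates larger models at low cost without affecting computation granularity or communication efficiency. An automatic chunk mechanism further improves ZeRO's performance by raising memory efficiency, reducing the number of communication operations, and avoiding memory fragmentation. The heterogeneous memory manager Gemini can offload optimizer states from GPU memory to CPU memory or disk, which breaks through the GPU memory limit, scales up the size of trainable models, and lowers the cost of applying large AI models.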
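The memory saved by ZeRO can be estimated with the per-parameter accounting commonly used for mixed-precision Adam training: 2 bytes of FP16 weights, 2 bytes of FP16 gradients, and 12 bytes of FP32 optimizer state per parameter. The sketch below applies only that simplified model. It ignores activations, temporary buffers, and fragmentation, and the function name and the 7B / 4-GPU example are illustrative assumptions rather than measurements from Colossal-AI.

```python
def zero_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Estimated model-state bytes per GPU for ZeRO stages 0 to 3."""
    weights = 2 * num_params   # FP16 parameters
    grads = 2 * num_params     # FP16 gradients
    optim = 12 * num_params    # FP32 master weights, momentum, variance
    if stage >= 1:
        optim /= num_gpus      # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus      # stage 2 also partitions gradients
    if stage >= 3:
        weights /= num_gpus    # stage 3 also partitions parameters
    return weights + grads + optim

if __name__ == "__main__":
    for stage in range(4):
        gb = zero_bytes_per_gpu(7e9, 4, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Stage 2 corresponds to the `colossalai_zero2` strategy name used in the training commands, assuming that name refers to ZeRO stage 2.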

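Low-cost fine-tuning with LoRA

Colossal-AI supports low-rank adaptation (LoRA) for low-cost fine-tuning of large AI models. LoRA assumes that large language models are over-parameterized and that the change in parameters during fine-tuning is a low-rank matrix, which can therefore be factored into the product of two smaller matrices. During fine-tuning, the parameters of the large model are frozen and only the low-rank matrices are updated, which significantly reduces the number of trainable parameters and lowers cost.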
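The low-rank idea can be made concrete with a small sketch: the frozen weight matrix W stays untouched, and the update is expressed as the product of two small matrices B and A. The code below uses plain Python lists and hypothetical function names. It illustrates the general LoRA formulation and is not Colossal-AI's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x), where W is frozen and A, B are trainable."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

def param_counts(d, k, r):
    """Trainable parameters for full fine-tuning versus LoRA with rank r."""
    return d * k, r * (d + k)

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
    A = [[1.0, 1.0]]               # trainable, shape r x k with r = 1
    B = [[1.0], [2.0]]             # trainable, shape d x r
    print(lora_forward(W, A, B, [1.0, 2.0]))
    full, lora = param_counts(4096, 4096, 8)
    print(f"full: {full:,} trainable, LoRA: {lora:,} trainable")
```

For a 4096 x 4096 layer with rank 8, the trainable parameter count drops from about 16.8 million to 65,536.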

Low-cost quantized inference

[Figure: GPTQ quantization]

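To reduce the cost of inference deployment, Colossal-AI uses GPTQ 4-bit quantized inference. On GPT/OPT/BLOOM-style models, it achieves better perplexity than the traditional RTN (round-to-nearest) quantization technique. Compared with common FP16 inference, it reduces GPU memory consumption by 75% while sacrificing only a small amount of throughput and perplexity.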
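For context, the RTN baseline mentioned above can be sketched in a few lines: each group of weights is mapped onto a small integer grid by rounding to the nearest level. The code below shows asymmetric min-max RTN for one group, with hypothetical function names. GPTQ itself is more involved (it uses approximate second-order information to compensate for rounding error) and is not shown here.

```python
def rtn_quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one weight group to `bits` bits."""
    qmax = 2**bits - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to a scale of 1.0 when all weights in the group are equal.
    scale = (w_max - w_min) / qmax or 1.0
    codes = [min(qmax, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min

def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + zero for c in codes]

if __name__ == "__main__":
    group = [-1.0, -0.5, 0.0, 0.3, 1.0]  # hypothetical weights
    codes, scale, zero = rtn_quantize_group(group)
    restored = dequantize_group(codes, scale, zero)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print(codes, f"max error: {worst:.4f}")
```

With a group size of 128, as in the `--gptq_group_size 128` option, each block of 128 weights would get its own scale and offset.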

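Taking ColossalChat-7B as an example, with 4-bit quantized inference the 7-billion-parameter model needs only about 4GB of GPU memory for short-sequence inference (generation length of 128). This can run on an ordinary consumer GPU (for example, an RTX 3060 Laptop) and takes only one line of code.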

if args.quant == '4bit':
    model = load_quant(args.pretrained, args.gptq_checkpoint, 4, args.gptq_group_size)

With efficient asynchronous offloading, the GPU memory requirement can be reduced further, allowing larger models to run inference on lower-cost hardware.

Differences between ColossalChat and Alpaca

1. ColossalChat open-sources the first complete RLHF pipeline. Stanford Alpaca does not perform RLHF, that is, it skips Stage 2 and Stage 3.

2. ColossalChat uses more instruction data of higher quality and broader coverage, and uses reinforcement learning for alignment so that its answers are closer to human responses.

3. The ColossalChat training process integrates many of Colossal-AI's system optimizations. For the same data set and model size, training can be about three times faster than Alpaca, allowing researchers and small and medium-sized enterprises to independently train and deploy their own conversational systems.

4. The ColossalChat team collected more data themselves. Training used about 24M English tokens and about 30M Chinese tokens, for a total of about 54M tokens. Of these, ColossalChat's self-collected data accounts for 6M English tokens and 18M Chinese tokens.

The following examples compare ColossalChat and Alpaca in dialogue (ColossalChat above, Alpaca below).

Write Quicksort in Python:

[Screenshot: ColossalChat and Alpaca responses to the Quicksort prompt]

Write an email to the professor to request a letter of recommendation:

[Screenshot: ColossalChat and Alpaca responses to the email prompt]

Open collaboration

Although RLHF has now been introduced, limited computing power and data sets mean that actual performance still has room for improvement in some scenarios.

Fortunately, unlike in the past, when large AI models and cutting-edge technologies were monopolized by a few technology giants, open-source communities and start-ups such as PyTorch, Hugging Face, and OpenAI are also playing a key role in this wave. Drawing on the successful experience of the open-source community, Colossal-AI welcomes all parties to take part in building the project together and to embrace the era of large models.

You can contact or participate through the following methods:

1. Post an issue on GitHub or submit a pull request (PR)

2. Join the Colossal-AI user WeChat or Slack group to communicate

3. Send a formal cooperation proposal to the email youy@comp.nus.edu.sg

Open source address: https://github.com/hpcaitech/ColossalAI
