How to Build RAG Systems and AI Agents with Qwen3
Qwen just released 8 new models as part of its latest family, Qwen3, and they show promising capabilities. The flagship model, Qwen3-235B-A22B, outperformed most other models, including DeepSeek-R1, OpenAI's o1 and o3-mini, Grok 3, and Gemini 2.5 Pro, on standard benchmarks. Meanwhile, the small Qwen3-30B-A3B outperformed QwQ-32B, which has approximately 10 times as many activated parameters as the new model. With such capabilities, these models are a strong choice for a wide range of applications. In this article, we will explore the features of the Qwen3 models and learn how to use them to build RAG systems and AI agents.
Table of Contents
- What is Qwen3?
- Key Features of Qwen3
- How to Access Qwen3 Models via API
- Using Qwen3 to Power Your AI Solutions
- Prerequisites
- Building an AI Agent using Qwen3
- Building a RAG System using Qwen3
- Applications of Qwen3
- Conclusion
- Frequently Asked Questions
What is Qwen3?
Qwen3 is the latest series of large language models (LLMs) in the Qwen family, consisting of 8 different models: Qwen3-235B-A22B, Qwen3-30B-A3B, Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B. All of these models are released under the Apache 2.0 license, making them freely available to individuals, developers, and enterprises.
While 6 of these models are dense, meaning they use all of their parameters during inference and training, the other 2 are Mixture-of-Experts (MoE) models that activate only a fraction of their parameters per token:
- Qwen3-235B-A22B: A large MoE model with 235 billion total parameters, of which 22 billion are activated.
- Qwen3-30B-A3B: A smaller MoE with 30 billion total parameters and 3 billion activated parameters.
Here’s a detailed comparison of all 8 Qwen3 models:
| Models | Layers | Heads (Q/KV) | Tie Embedding | Context Length |
|---|---|---|---|---|
| Qwen3-0.6B | 28 | 16/8 | Yes | 32K |
| Qwen3-1.7B | 28 | 16/8 | Yes | 32K |
| Qwen3-4B | 36 | 32/8 | Yes | 32K |
| Qwen3-8B | 36 | 32/8 | No | 128K |
| Qwen3-14B | 40 | 40/8 | No | 128K |
| Qwen3-32B | 64 | 64/8 | No | 128K |
| Qwen3-30B-A3B | 48 | 32/4 | No | 128K |
| Qwen3-235B-A22B | 94 | 64/4 | No | 128K |
Here’s what the table tells us:
- Layers: The number of transformer blocks stacked sequentially in the model. Each block includes a multi-head self-attention mechanism, a feed-forward network, layer normalization, and residual connections, along with positional encoding. So when we say Qwen3-30B-A3B has 48 layers, it means the model stacks 48 such transformer blocks.
- Heads (Q/KV): Transformers use multi-head attention, which splits the attention mechanism into several heads, each learning a different aspect of the data. Here, Q/KV represents:
  - Q (Query heads): The total number of attention heads used for generating queries.
  - KV (Key/Value heads): The number of key/value heads per attention block. When KV is smaller than Q, several query heads share each key/value head (grouped-query attention), as the sketch below illustrates.
- Tie Embedding: Whether the input embedding matrix and the output projection share their weights, a common way to save parameters in smaller models.
Note: These attention head counts are distinct from the key, query, and value vectors that each head computes during self-attention.
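To make the Q/KV split concrete, here is a minimal grouped-query attention sketch in plain PyTorch. It is illustrative only (shapes and head counts, not Qwen3's actual implementation); the 32 query heads and 4 key/value heads match the Qwen3-30B-A3B row above, and the head dimension of 128 is an assumption for the example:

```python
import torch

# Illustrative grouped-query attention shapes (not Qwen3's real code).
# Qwen3-30B-A3B: 32 query heads share 4 key/value heads (8 queries per KV head).
n_q_heads, n_kv_heads, head_dim, seq_len = 32, 4, 128, 16

q = torch.randn(seq_len, n_q_heads, head_dim)   # one query projection per head
k = torch.randn(seq_len, n_kv_heads, head_dim)  # only 4 distinct key heads
v = torch.randn(seq_len, n_kv_heads, head_dim)  # only 4 distinct value heads

# Repeat each KV head 8 times so every query head has a matching K/V.
group = n_q_heads // n_kv_heads                 # 32 / 4 = 8
k = k.repeat_interleave(group, dim=1)           # -> (seq_len, 32, head_dim)
v = v.repeat_interleave(group, dim=1)

scores = torch.einsum("qhd,khd->hqk", q, k) / head_dim ** 0.5
attn = torch.softmax(scores, dim=-1)
out = torch.einsum("hqk,khd->qhd", attn, v)     # (seq_len, 32, head_dim)
print(out.shape)
```

Sharing KV heads this way shrinks the key/value cache by 8x at inference time, which is one reason the MoE models can serve long contexts efficiently.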
Also Read: Qwen3 Models: How to Access, Performance, Features, and Applications
Key Features of Qwen3
Here are some of the key features of the Qwen3 models:
- Pre-training: The pre-training process consisted of three stages:
  - In the first stage, the model was pretrained on over 30 trillion tokens with a context length of 4K tokens. This taught the model basic language skills and general knowledge.
  - In the second stage, the data quality was improved by increasing the proportion of knowledge-intensive data such as STEM, coding, and reasoning tasks. The model was then trained on an additional 5 trillion tokens.
  - In the final stage, high-quality long-context data was used and the context length was increased to 32K tokens, ensuring the model can handle longer inputs effectively.
- Post-training: To develop a hybrid model capable of both step-by-step reasoning and rapid responses, a four-stage training pipeline was implemented, consisting of:
  - Long chain-of-thought (CoT) cold start
  - Reasoning-based reinforcement learning (RL)
  - Thinking mode fusion
  - General RL
- Hybrid Thinking Modes: Qwen3 models take a hybrid approach to problem solving, featuring two modes (see the sketch after this list):
  - Thinking Mode: The model takes its time, breaking a complex problem statement into small, procedural steps before answering.
  - Non-Thinking Mode: The model gives quick results, which is mostly suitable for simpler questions.
- Multilingual Support: Qwen3 models support 119 languages and dialects, helping users from all around the world benefit from them.
- Improved Agentic Capabilities: Qwen has optimized the Qwen3 models for better coding and agentic capabilities, with support for the Model Context Protocol (MCP) as well.
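Here is a minimal sketch of how the mode switch surfaces when running the weights locally, based on the Qwen3 model cards on Hugging Face: the chat template accepts an enable_thinking flag, and "/think" / "/no_think" tags in a user message act as per-turn soft switches. Exact behavior may vary across serving providers:

```python
from transformers import AutoTokenizer

# Per the Qwen3 model cards, enable_thinking is a hard switch in the chat
# template; "/think" and "/no_think" inside a user message are soft switches.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Why is the sky blue?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # skip the <think>...</think> reasoning block
)
print(prompt)
```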
How to Access Qwen3 Models via API
To use the Qwen3 models, we will access them via the OpenRouter API. Here’s how to do it:
- Create an account on OpenRouter and use the model search bar to find the model you need.
- Select the model of your choice and click on ‘Create API key’ on its page to generate a new API key.
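Once you have a key, a quick way to sanity-check access is the OpenAI-compatible client, since OpenRouter exposes an OpenAI-style endpoint. A minimal sketch, assuming your key is stored in the OPENROUTER_API_KEY environment variable:

```python
import os
from openai import OpenAI

# OpenRouter is OpenAI-compatible, so the standard client works as-is.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

reply = client.chat.completions.create(
    model="qwen/qwen3-235b-a22b:free",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(reply.choices[0].message.content)
```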
Using Qwen3 to Power Your AI Solutions
In this section, we’ll go through the process of building AI applications using Qwen3. We will first create an AI-powered travel planner agent using the model, and then a Q/A RAG bot using LangChain.
Prerequisites
Before building real-world AI solutions with Qwen3, we first need to cover a few basic prerequisites:
- Familiarity with the command prompt or terminal and the ability to run commands in it.
- Ability to set up environment variables.
- Python must be installed: https://www.python.org/downloads/
- Knowledge of the basics of LangChain: https://www.langchain.com/
Building an AI Agent using Qwen3
In this section, we’ll use Qwen3 to create an AI-powered travel agent that suggests the major tourist spots for the city or place you are visiting. We will also enable the agent to search the internet for up-to-date information, and add a tool for currency conversion.
Step 1: Setting up Libraries and Tools
First, we will be installing and importing the necessary libraries and tools required to build the agent.
```python
!pip install langchain langchain-community openai duckduckgo-search

from langchain.chat_models import ChatOpenAI
from langchain.agents import Tool, initialize_agent, AgentType
from langchain.tools import DuckDuckGoSearchRun

llm = ChatOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your_api_key",
    model="qwen/qwen3-235b-a22b:free"
)

# Web search tool
search = DuckDuckGoSearchRun()

# Tool for destination recommendations
def get_destinations(destination):
    return search.run(f"Top 3 tourist spots in {destination}")

DestinationTool = Tool(
    name="Destination Recommender",
    func=get_destinations,
    description="Finds top places to visit in a city"
)

# Tool for currency conversion
def convert_usd_to_inr(query):
    amount = [float(s) for s in query.split() if s.replace('.', '', 1).isdigit()]
    if amount:
        return f"{amount[0]} USD = {amount[0] * 83.2:.2f} INR"
    return "Couldn't parse amount."

CurrencyTool = Tool(
    name="Currency Converter",
    func=convert_usd_to_inr,
    description="Converts USD to INR based on a static rate"
)
```
- search: DuckDuckGoSearchRun() enables the agent to use web search to get real-time information about popular tourist spots.
- DestinationTool: Wraps the get_destinations() function, which uses the search tool to get the top 3 tourist spots in any given city.
- CurrencyTool: Wraps the convert_usd_to_inr() function to convert prices from USD to INR. You can change ‘INR’ in the function to convert to a currency of your choice.
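Before wiring these into an agent, you can call the tools directly to verify they behave as expected. A quick check (the destination snippet depends on live DuckDuckGo results, and 83.2 is the static rate hard-coded above):

```python
# Direct tool calls for a quick sanity check
print(CurrencyTool.run("Convert 100 USD to INR"))  # -> "100.0 USD = 8320.00 INR"
print(DestinationTool.run("Jaipur")[:200])         # live search snippet, truncated
```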
Also Read: Build a Travel Assistant Chatbot with HuggingFace, LangChain, and MistralAI
Step 2: Creating the Agent
Now that we have initialized all the tools, let’s create an agent that will use them to give us a plan for the trip.
```python
tools = [DestinationTool, CurrencyTool]

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

def trip_planner(city, usd_budget):
    dest = get_destinations(city)
    inr_budget = convert_usd_to_inr(f"{usd_budget} USD to INR")
    return f"""Here is your travel plan:

*Top spots in {city}*:
{dest}

*Budget*: {inr_budget}

Enjoy your day trip!"""
```
- initialize_agent: This function creates a LangChain agent that reasons with a zero-shot ReAct approach, choosing tools based on their descriptions.
- AgentType.ZERO_SHOT_REACT_DESCRIPTION: This agent type lets the LLM decide which tool to use in a given situation without prior examples, relying only on the tool descriptions and the input.
- verbose: Enables logging of the agent’s thought process, so we can monitor each decision the agent makes, including all the interactions and tools invoked.
- trip_planner: This is a Python function that calls the tools directly instead of relying on the agent, which is useful when you already know which tool fits the problem. You can call it as shown below.
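Since trip_planner bypasses the agent loop entirely, you can invoke it directly as a quick sanity check; the destination section of the output will vary with live search results:

```python
# Deterministic path: no agent reasoning, just direct tool calls
print(trip_planner("Jaipur", 250))
```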
Step 3: Initializing the Agent
In this section, we’ll run the agent and observe its response.
```python
# Inputs for the planner
city = "Delhi"
usd_budget = 8500

# Run the agent planner
response = agent.run(f"Plan a day trip to {city} with a budget of {usd_budget} USD")

from IPython.display import Markdown, display
display(Markdown(response))
```
- Invocation of the agent: agent.run() passes the user’s intent via the prompt, and the agent plans the trip.
Output
Building a RAG System using Qwen3
In this section, we’ll create a RAG bot that answers any query from the relevant input document in the knowledge base, generating informative responses with qwen/qwen3-235b-a22b. The system also uses LangChain to produce accurate, context-aware responses.
Step 1: Setting up the Libraries and Tools
First, we will be installing and importing the necessary libraries and tools required to build the RAG system.
```python
!pip install langchain langchain-community langchain-core openai tiktoken chromadb sentence-transformers duckduckgo-search

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Load your document
loader = TextLoader("/content/my_docs.txt")
docs = loader.load()
```
- Loading Documents: LangChain’s TextLoader class loads a plain-text file for Q/A retrieval (other LangChain loaders handle formats like PDF and Word). Here, I’ve uploaded my_docs.txt.
- Selecting the Vector Store: I have used ChromaDB to store and search the embeddings in our vector database for the Q/A process.
Step 2: Creating the Embeddings
Now that we’ve loaded our document, let’s create embeddings from it, which will make the retrieval process efficient.
```python
# Split into chunks
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Embed with a HuggingFace model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(chunks, embedding=embeddings)

# Set up the Qwen LLM from OpenRouter
llm = ChatOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_API_KEY",
    model="qwen/qwen3-235b-a22b:free"
)

# Create the RAG chain
retriever = db.as_retriever(search_kwargs={"k": 2})
rag_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
```
- Document Splitting: CharacterTextSplitter() splits the text into smaller chunks, which helps in two ways: it makes retrieval more precise, and the chunk_overlap of 50 characters retains context across neighboring chunks. Note that chunk_size=300 means each chunk holds roughly 300 characters; it does not set the embedding size.
- Embedding Documents: The embedding model converts each chunk into a fixed-size vector capturing its contextual meaning; all-MiniLM-L6-v2 produces 384-dimensional vectors regardless of chunk length (see the check below).
- RAG Chain: The RAG chain combines the ChromaDB retriever with the LLM to form the RAG pipeline, letting us get contextually grounded answers drawn from the document as well as the model’s own knowledge.
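To confirm that the vector size comes from the embedding model rather than from chunk_size, you can embed a sample string and inspect its length:

```python
# The embedding dimension is fixed by the model, not by chunk_size
vec = embeddings.embed_query("What is Qwen3?")
print(len(vec))  # 384 for all-MiniLM-L6-v2
```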
Step 3: Initializing the RAG System
```python
# Ask a question
response = rag_chain.invoke({"query": "How can I use Qwen with MCP? Please give me a stepwise guide along with the necessary code snippets"})
display(Markdown(response['result']))
```
- Query Execution: The rag_chain.invoke() method sends the user’s query to the RAG system, which retrieves the relevant chunks from the document store (vector DB) and generates a context-aware answer.
Output
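If you also want to see which chunks the retriever pulled in for a given answer, RetrievalQA can return them alongside the result. A small variation on the chain built earlier:

```python
# Rebuild the chain so it also returns the retrieved chunks
rag_chain_src = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)
result = rag_chain_src.invoke({"query": "How can I use Qwen with MCP?"})
print(result["result"][:300])
for doc in result["source_documents"]:
    print("---", doc.page_content[:120])
```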
You can find the complete code here.
Applications of Qwen3
Here are some more applications of Qwen3 across industries:
- Automated Coding: Qwen3 can generate, debug, and document code, helping developers resolve errors without manual effort. Its flagship model, with 22 billion activated parameters, excels at coding, with performance comparable to models like DeepSeek-R1, Gemini 2.5 Pro, and OpenAI’s o3-mini.
- Education and Research: Qwen3 achieves high accuracy in math, physics, and logical reasoning problems. It rivals Gemini 2.5 Pro while outperforming models such as OpenAI’s o1, o3-mini, DeepSeek-R1, and Grok 3 Beta.
- Agent-Based Tool Integration: Qwen3 also leads in AI agent tasks by allowing the use of external tools, APIs, and MCPs for multi-step and multi-agentic workflows with its tool-calling template, which further simplifies the agentic interaction.
- Advanced Reasoning Tasks: Qwen3 uses an extensive thinking capability to deliver optimal and accurate responses. The model uses chain-of-thought reasoning for complex tasks and a non-thinking mode for optimized speed.
Conclusion
In this article, we learned how to build Qwen3-powered agentic AI and RAG systems. Qwen3’s high performance, multilingual support, and advanced reasoning capabilities make it a strong choice for knowledge retrieval and agent-based tasks. By integrating Qwen3 into RAG and agentic pipelines, we can get accurate, context-aware responses, making it a strong contender for real-world, AI-powered applications.
Frequently Asked Questions
Q1. How does Qwen3 differ from other LLMs for RAG?
A. Qwen3 has a hybrid reasoning capability that lets it adapt its responses dynamically, optimizing RAG workflows for both fast retrieval and complex analysis.
Q2. What are the tools needed to build a RAG system?
A. The main components are a vector database, an embedding model, a LangChain workflow, and an API to access the model.
Q3. Can Qwen3 do multi-step tool chaining in agent workflows?
A. Yes. With Qwen-Agent’s built-in tool-calling templates, it can parse and run sequential tool operations like web searching, data analysis, and report generation.
Q4. How can I reduce latency in Qwen3 agent responses?
A. Latency can be reduced in several ways, for example:
1. Using MoE models like Qwen3-30B-A3B, which activates only 3 billion parameters.
2. Using GPU-optimized inference.

Q5. What are some common errors when building agents with Qwen3?
A. Common errors include:
1. MCP server initialization failures, such as JSON formatting and INIT issues.
2. Tool response pairing errors.
3. Context window overflow.