


How to build an AI assistant with OpenAI, Vercel AI SDK, and Ollama with Next.js
In today’s blog post, we’ll build an AI Assistant using three different AI models: Whisper and TTS from OpenAI and Llama 3.1 from Meta.
While exploring AI, I wanted to try different things and create an AI assistant that works by voice. This curiosity led me to combine OpenAI’s Whisper and TTS models with Meta’s Llama 3.1 to build a voice-activated assistant.
Here’s how these models will work together:
- First, we’ll send our audio to the Whisper model, which will convert it from speech to text.
- Next, we’ll pass that text to the Llama 3.1 model. Llama will understand the text and generate a response.
- Finally, we’ll take Llama’s response and send it to the TTS model, turning the text back into speech. We’ll then stream that audio back to the client.
Let’s dive in and start building our AI assistant!
Getting started
We will use different tools to build our assistant. To build our client side, we will use Next.js. However, you could choose whichever framework you prefer.
To use the OpenAI models, we will use their TypeScript/JavaScript SDK. This API requires the following environment variable: OPENAI_API_KEY.
To get this key, we need to log in to the OpenAI dashboard and find the API keys section. Here, we can generate a new key.
Awesome. Now, to use our Llama 3.1 model, we will use Ollama and the Vercel AI SDK, utilizing a provider called ollama-ai-provider.
Ollama will allow us to download our preferred model (we could even use a different one, like Phi) and run it locally. The Vercel SDK will facilitate its use in our Next.js project.
To use Ollama, we just need to download it and choose our preferred model. For this blog post, we are going to select Llama 3.1. After installing Ollama, we can verify if it is working by opening our terminal and writing the following command:
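ollama run llama3.1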
Notice that I wrote “llama3.1” because that’s my chosen model, but you should use the one you downloaded.
Kicking things off
It's time to kick things off by setting up our Next.js app. Let's start with this command:
npx create-next-app@latest
After running the command, you’ll see a few prompts to set the app's details. Let's go step by step:
- Name your app.
- Enable app router.
The other steps are optional and entirely up to you. In my case, I also chose to use TypeScript and Tailwind CSS.
Now that’s done, let’s go into our project and install the dependencies that we need to run our models:
npm i ai ollama-ai-provider openai
Building our client logic
Now, our goal is to record our voice, send it to the backend, and then receive a voice response from it.
To record our audio, we need to use client-side functions, which means we need to use client components. In our case, we don’t want to transform our whole page to use client capabilities and have the whole tree in the client bundle; instead, we would prefer to use Server components and import our client components to progressively enhance our application.
So, let’s create a separate component that will handle the client-side logic.
Inside our app folder, let's create a components folder, and here, we will be creating our component:
app
 ↳ components
    ↳ audio-recorder.tsx
Let’s go ahead and initialize our component. I went ahead and added a button with some styles in it:
// app/components/audio-recorder.tsx
'use client'

export default function AudioRecorder() {
  function handleClick() {
    console.log('click')
  }

  return (
    <section>
      <button
        onClick={handleClick}
        className={`bg-blue-500 text-white px-4 py-2 rounded shadow-md hover:bg-blue-400 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 focus:ring-offset-white transition duration-300 ease-in-out absolute top-1/2 left-1/2 -translate-x-1/2 -translate-y-1/2`}
      >
        Record voice
      </button>
    </section>
  )
}
And then import it into our Page Server component:
// app/page.tsx
import AudioRecorder from '@/app/components/audio-recorder';

export default function Home() {
  return (
    <AudioRecorder />
  );
}
Now, if we run our app, we should see a page with our “Record voice” button centered on the screen.
Awesome! Our button doesn’t do anything yet, but our goal is to record our audio and send it somewhere. For that, let’s create a hook that will contain our logic:
app
 ↳ hooks
    ↳ useRecordVoice.ts

// app/hooks/useRecordVoice.ts
import { useEffect, useRef, useState } from 'react';

export function useRecordVoice() {
  return {};
}
We will use two APIs to record our voice: navigator and MediaRecorder. The navigator API gives us access to the user’s media devices (such as the microphone), and the MediaRecorder API helps us record audio from them. This is how they’re going to play out together:
// app/hooks/useRecordVoice.ts
import { useEffect, useRef, useState } from 'react';

export function useRecordVoice() {
  const [isRecording, setIsRecording] = useState(false);
  const [mediaRecorder, setMediaRecorder] = useState<MediaRecorder | null>(null);

  const startRecording = async () => {
    if (!navigator?.mediaDevices) {
      console.error('Media devices not supported');
      return;
    }

    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const mediaRecorder = new MediaRecorder(stream);

    setIsRecording(true);
    setMediaRecorder(mediaRecorder);
    mediaRecorder.start(0);
  };

  const stopRecording = () => {
    if (mediaRecorder) {
      setIsRecording(false);
      mediaRecorder.stop();
    }
  };

  return {
    isRecording,
    startRecording,
    stopRecording,
  };
}
Let’s explain this code step by step. First, we create two new states. The first one is for keeping track of when we are recording, and the second one stores the instance of our MediaRecorder.
const [isRecording, setIsRecording] = useState(false);
const [mediaRecorder, setMediaRecorder] = useState<MediaRecorder | null>(null);
Then, we’ll create our first method, startRecording. Here, we are going to have the logic to start recording our audio.
We first check whether the user has media devices available, thanks to the navigator API, which gives us information about the user’s browser environment.
If there are no media devices to record our audio, we simply return. If there are, we create a stream using the user’s audio media device:
// check if they have media devices
if (!navigator?.mediaDevices) {
  console.error('Media devices not supported');
  return;
}

// create a stream using the audio media device
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
Finally, we go ahead and create an instance of a MediaRecorder to record this audio:
// create an instance, passing in the stream as a parameter
const mediaRecorder = new MediaRecorder(stream);

// set this state to true to show that we are recording
setIsRecording(true);

// store the instance in the state
setMediaRecorder(mediaRecorder);

// start recording immediately
mediaRecorder.start(0);
Then we need a method to stop our recording, which will be our stopRecording. Here, we simply stop the recording if a MediaRecorder instance exists:
const stopRecording = () => {
  if (mediaRecorder) {
    setIsRecording(false);
    mediaRecorder.stop();
  }
};
We are recording our audio, but we are not storing it anywhere. Let’s add a new useEffect and ref to accomplish this.
We would need a new ref, and this is where our chunks of audio data will be stored.
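Something like this, where the name chunks is just illustrative:

// app/hooks/useRecordVoice.ts
// holds the recorded chunks of audio data
const chunks = useRef<Blob[]>([]);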
In our useEffect, we are going to do two main things: store those chunks in our ref and, when the recording stops, create a new Blob of type audio/mp3.
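A minimal sketch of that effect, using MediaRecorder’s standard ondataavailable and onstop events (the exact wiring is an assumption):

// app/hooks/useRecordVoice.ts
useEffect(() => {
  if (!mediaRecorder) return;

  // store each chunk of audio data in our ref as it arrives
  mediaRecorder.ondataavailable = (event) => {
    chunks.current.push(event.data);
  };

  // when the recording stops, build a Blob of type audio/mp3
  mediaRecorder.onstop = () => {
    const audioBlob = new Blob(chunks.current, { type: 'audio/mp3' });
    chunks.current = [];
    // next step: send audioBlob to our backend (see sendAudio later on)
  };
}, [mediaRecorder]);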
It is time to wire this hook with our AudioRecorder component:
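A sketch of the wired-up component; toggling between start and stop on a single button is an assumption:

// app/components/audio-recorder.tsx
'use client'

import { useRecordVoice } from '@/app/hooks/useRecordVoice';

export default function AudioRecorder() {
  const { isRecording, startRecording, stopRecording } = useRecordVoice();

  // toggle recording on click
  function handleClick() {
    if (isRecording) {
      stopRecording();
    } else {
      startRecording();
    }
  }

  return (
    <section>
      <button
        onClick={handleClick}
        className={`bg-blue-500 text-white px-4 py-2 rounded shadow-md hover:bg-blue-400 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 focus:ring-offset-white transition duration-300 ease-in-out absolute top-1/2 left-1/2 -translate-x-1/2 -translate-y-1/2`}
      >
        {isRecording ? 'Stop recording' : 'Record voice'}
      </button>
    </section>
  );
}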
Let’s go to the other side of the coin, the backend!
Setting up our Server side
We want to use our models on the server to keep things safe and run faster. Let’s create a new route and add a handler for it using route handlers from Next.js. In our app folder, let’s make an api folder with the following route in it:
app
 ↳ api
    ↳ chat
       ↳ route.ts
Our route is called ‘chat’. In the route.ts file, we’ll set up our handler. Let’s start by setting up our OpenAI SDK.
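A minimal setup that reads the key from our OPENAI_API_KEY environment variable:

// app/api/chat/route.ts
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});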
In this route, we’ll send the audio from the front end as a base64 string. Then, we’ll receive it and turn it into a Buffer object.
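A sketch of the handler’s first step; the audio field name in the request body is an assumption:

// app/api/chat/route.ts
export async function POST(req: Request) {
  // the client sends the recorded audio as a base64 string
  const { audio } = await req.json();

  // turn the base64 string into a Buffer object
  const audioBuffer = Buffer.from(audio, 'base64');

  // ... the rest of the handler goes here
}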
It’s time to use our first model. We want to turn this audio into text and use OpenAI’s Whisper Speech-To-Text model. Whisper needs an audio file to create the text. Since we have a Buffer instead of a file, we’ll use their ‘toFile’ method to convert our audio Buffer into an audio file like this:
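With the toFile helper from the openai package, that looks something like this (the file name is illustrative):

import { toFile } from 'openai';

// convert the audio Buffer into a file-like object that Whisper accepts
const audioFile = await toFile(audioBuffer, 'audio.mp3');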
Notice that we specified “mp3”. This is one of the many extensions that the Whisper model can use. You can see the full list of supported extensions here: https://platform.openai.com/docs/api-reference/audio/createTranscription#audio-createtranscription-file
Now that our file is ready, let’s pass it to Whisper! Using our OpenAI instance, this is how we will invoke our model:
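A sketch of the transcription call with the whisper-1 model:

// transcribe the audio file with Whisper
const transcription = await openai.audio.transcriptions.create({
  file: audioFile,
  model: 'whisper-1',
});

const text = transcription.text;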
That’s it! Now, we can move on to the next step: using Llama 3.1 to interpret this text and give us an answer. We’ll use two methods for this. First, we’ll use ‘ollama’ from the ‘ollama-ai-provider’ package, which lets us use this model with our locally running Ollama. Then, we’ll use ‘generateText’ from the Vercel AI SDK to generate the text.
Side note: To make our Ollama run locally, we need to write the following command in the terminal:
ollama serve
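Put together, that step could look like this (passing the transcription straight in as the prompt is an assumption):

import { generateText } from 'ai';
import { ollama } from 'ollama-ai-provider';

// send the transcribed text to Llama 3.1 running locally through Ollama
const { text: answer } = await generateText({
  model: ollama('llama3.1'),
  prompt: text,
});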
Finally, we have our last model: TTS from OpenAI. We want to reply to our user with audio, so this model will be really helpful. It will turn our text into speech:
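A sketch of that call; the tts-1 model and the alloy voice are choices, not requirements:

// turn Llama's answer into speech
const speech = await openai.audio.speech.create({
  model: 'tts-1',
  voice: 'alloy',
  input: answer,
});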
The TTS model will turn our response into an audio file. We want to stream this audio back to the user like this:
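One way to do that is to return the model’s response body directly (a sketch):

// stream the generated audio back to the client
return new Response(speech.body, {
  headers: { 'Content-Type': 'audio/mpeg' },
});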
And that’s the whole backend! Now, back to the frontend to finish wiring everything up.
Putting It All Together
In our useRecordVoice.ts hook, let's create a new method that will call our API endpoint. This method will also take the response and play the streamed audio back to the user.
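A sketch of that method; the names sendAudio, blobToBase64, and playAudio are illustrative:

// app/hooks/useRecordVoice.ts
const blobToBase64 = (blob: Blob) =>
  new Promise<string>((resolve, reject) => {
    const reader = new FileReader();
    // readAsDataURL yields "data:audio/mp3;base64,<data>"; keep only <data>
    reader.onloadend = () => resolve((reader.result as string).split(',')[1]);
    reader.onerror = reject;
    reader.readAsDataURL(blob);
  });

const sendAudio = async (audioBlob: Blob) => {
  const audio = await blobToBase64(audioBlob);

  // call our route handler with the base64 audio
  const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ audio }),
  });

  // read the streamed audio and hand it to the playback helper below
  const audioData = await response.arrayBuffer();
  await playAudio(audioData);
};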
Great! Now that we’re getting our streamed response, we need to handle it and play the audio back to the user. We’ll use the AudioContext API for this. This API allows us to store the audio, decode it and play it to the user once it’s ready:
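A sketch of that playback helper:

// app/hooks/useRecordVoice.ts
const playAudio = async (audioData: ArrayBuffer) => {
  // create an AudioContext, decode the audio, and play it once it's ready
  const audioContext = new AudioContext();
  const audioBuffer = await audioContext.decodeAudioData(audioData);

  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);
  source.start();
};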
And that's it! Now the user should hear the audio response on their device. To wrap things up, let's make our app a bit nicer by adding a little loading indicator:
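One possible approach, assuming the hook also exposes an isLoading flag that is set to true while the request to /api/chat is in flight:

// app/components/audio-recorder.tsx (same button styles as before, omitted here)
const { isRecording, isLoading, startRecording, stopRecording } = useRecordVoice();

return (
  <section>
    <button onClick={handleClick} disabled={isLoading}>
      {isLoading ? 'Thinking…' : isRecording ? 'Stop recording' : 'Record voice'}
    </button>
  </section>
);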
Conclusion
In this blog post, we saw how combining multiple AI models can help us achieve our goals. We learned to run AI models like Llama 3.1 locally and use them in our Next.js app. We also discovered how to send audio to these models and stream back a response, playing the audio back to the user.
This is just one of many ways you can use AI—the possibilities are endless. AI models are amazing tools that let us create things that were once hard to achieve with such quality. Thanks for reading; now, it’s your turn to build something amazing with AI!
You can find the complete demo on GitHub: AI Assistant with Whisper TTS and Ollama using Next.js