


Exploring Text-Embedding-3-Large: A Comprehensive Guide to the new OpenAI Embeddings
OpenAI's latest text embedding models, text-embedding-3-large and text-embedding-3-small, outperform the earlier text-embedding-ada-002 on standard retrieval benchmarks. This article explores their capabilities, applications, and practical usage.
Embeddings translate human language into machine-readable formats, crucial for AI tasks. OpenAI's new models significantly improve this process for developers and data scientists. We'll cover their core functions, applications, and effective implementation.
Understanding Text Embeddings
Text embeddings are numerical representations capturing the semantic meaning of text. They are essential for various NLP tasks, including sentiment analysis and text classification. Our guide, "Introduction to Text Embeddings with the OpenAI API," provides a comprehensive overview of using the OpenAI API for embedding creation.
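To make "capturing semantic meaning" concrete, here is a minimal sketch of how two embeddings are usually compared, using cosine similarity. The three-dimensional vectors are hypothetical stand-ins chosen for illustration; real models return vectors with 1536 or 3072 values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related texts should score higher than unrelated ones
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # prints True
```

Scores close to 1 indicate vectors pointing in nearly the same direction, which is how semantically related texts are expected to appear in the embedding space.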
Text Embeddings Illustration
Newcomers to embeddings should consult our "Introduction to Embeddings with the OpenAI API" course.
OpenAI's New Embedding Models
Released January 25, 2024, these models represent text in high-dimensional space for improved understanding. text-embedding-3-small prioritizes speed and storage, while text-embedding-3-large offers superior accuracy. The dimensions parameter lets you shorten the output vector, for example reducing text-embedding-3-large from its native 3072 dimensions to 1536, without significant performance loss.
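With the API, a shorter vector is requested by passing dimensions to the embeddings call, for example client.embeddings.create(input=[text], model="text-embedding-3-large", dimensions=1536). The sketch below is only a local illustration of the idea, assuming that shortening amounts to keeping the leading values and rescaling to unit length; the helper name shorten_embedding and the 4-value vector are hypothetical.

```python
import math

def shorten_embedding(embedding, target_dim):
    """Keep the first target_dim values and rescale the result to unit length."""
    truncated = embedding[:target_dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        return truncated  # nothing to rescale
    return [x / norm for x in truncated]

# Hypothetical unit-length vector standing in for a 3072-value embedding
full = [0.5, 0.5, 0.5, 0.5]
short = shorten_embedding(full, 2)

print(len(short))  # prints 2
```

Rescaling matters because cosine and dot-product comparisons assume vectors of comparable length; a truncated vector is no longer unit length until it is normalized again.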
Benchmarking
text-embedding-3-large surpasses previous models (including text-embedding-ada-002) on MIRACL and MTEB benchmarks. The table below summarizes the comparison:
| Model | Dimension | Max token | Knowledge cutoff | Pricing ($/1k tokens) | MIRACL average | MTEB average |
|---|---|---|---|---|---|---|
| ada v2 | 1536 | 8191 | September 2021 | 0.0001 | 31.4 | 61.0 |
| text-embedding-3-small | 1536 | 8191 | September 2021 | 0.00002 | 44.0 | 62.3 |
| text-embedding-3-large | 3072 | 8191 | September 2021 | 0.00013 | 54.9 | 64.6 |
Higher dimensions in text-embedding-3-large (3072 vs. 1536) enhance performance but increase cost. Model selection depends on task requirements (multilingual needs, text complexity, budget). text-embedding-3-large excels in complex, multilingual scenarios, while text-embedding-3-small suits budget-conscious applications.
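The cost side of this trade-off is simple arithmetic on the per-1k-token prices in the benchmark table. The following sketch estimates the embedding bill for a corpus; the corpus size (1000 documents of 4000 tokens each) is a hypothetical figure chosen for illustration, and current prices should be checked before relying on the result.

```python
# Prices in $ per 1k tokens, copied from the benchmark table
PRICE_PER_1K_TOKENS = {
    "text-embedding-ada-002": 0.0001,
    "text-embedding-3-small": 0.00002,
    "text-embedding-3-large": 0.00013,
}

def estimate_cost(num_tokens, model):
    """Estimated cost in dollars of embedding num_tokens tokens with model."""
    return num_tokens / 1000 * PRICE_PER_1K_TOKENS[model]

# Hypothetical corpus: 1000 documents of 4000 tokens each
total_tokens = 1000 * 4000

for model in PRICE_PER_1K_TOKENS:
    print(f"{model}: ${estimate_cost(total_tokens, model):.2f}")
```

At these prices text-embedding-3-large costs 6.5 times as much per token as text-embedding-3-small, so the accuracy gain has to justify that ratio for the task at hand.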
Applications
Both models find diverse applications:
text-embedding-3-large applications:
Applications of text-embedding-3-large (images generated using GPT-4)
- Multilingual customer support automation (18 languages)
- Advanced semantic search engines
- Cross-lingual content recommendation systems
text-embedding-3-small applications:
Applications of text-embedding-3-small (Image generated using GPT-4)
- Cost-effective sentiment analysis
- Scalable content categorization
- Efficient language learning tools
Step-by-Step Guide: Document Similarity
This guide uses the CORD-19 dataset (available on Kaggle) to demonstrate document similarity using all three models. Install necessary libraries:
    pip -q install tiktoken openai pandas numpy scikit-learn pyarrow
Import libraries:
    import os
    import tiktoken
    import numpy as np
    import pandas as pd
    from openai import OpenAI
    from sklearn.metrics.pairwise import cosine_similarity
Load and preprocess data (a 1000-document sample is used for brevity):
    # Load the sampled CORD-19 documents
    scientific_docs = pd.read_parquet("./data/cord19_df_sample.parquet")

    def concatenate_columns_with_null_handling(df, body_text_column, abstract_column, title_column, new_col_name):
        # Replace missing values with empty strings so the concatenation never yields NaN
        df[new_col_name] = df[body_text_column].fillna('') + df[abstract_column].fillna('') + df[title_column].fillna('')
        return df

    new_scientific_docs = concatenate_columns_with_null_handling(scientific_docs, "body_text", "abstract", "title", "concatenated_text")

    def num_tokens_from_text(text: str, encoding_name="cl100k_base"):
        # Count tokens with the given tiktoken encoding
        encoding = tiktoken.get_encoding(encoding_name)
        num_tokens = len(encoding.encode(text))
        return num_tokens

    new_scientific_docs['num_tokens'] = new_scientific_docs["concatenated_text"].apply(lambda x: num_tokens_from_text(x))

    # Keep only documents that fit within the models' 8191-token input limit
    smaller_tokens_docs = new_scientific_docs[new_scientific_docs['num_tokens'] <= 8191]
    smaller_tokens_docs_reset = smaller_tokens_docs.reset_index(drop=True)
Set OpenAI API key and create client:
    os.environ["OPENAI_API_KEY"] = "YOUR KEY"
    client = OpenAI()
Generate embeddings:
    def get_embedding(text_to_embed, model_ID):
        # Replace newlines with spaces, then embed the cleaned text
        text = text_to_embed.replace("\n", " ")
        return client.embeddings.create(input=[text], model=model_ID).data[0].embedding

    smaller_tokens_docs_reset['text-embedding-3-small'] = smaller_tokens_docs_reset["concatenated_text"].apply(lambda x: get_embedding(x, "text-embedding-3-small"))
    smaller_tokens_docs_reset['text-embedding-3-large'] = smaller_tokens_docs_reset["concatenated_text"].apply(lambda x: get_embedding(x, "text-embedding-3-large"))
    smaller_tokens_docs_reset['text-embedding-ada-002'] = smaller_tokens_docs_reset["concatenated_text"].apply(lambda x: get_embedding(x, "text-embedding-ada-002"))
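Calling the API once per document means one network round trip per row. Because the embeddings endpoint accepts a list of inputs, several documents can be sent per request. The following is a minimal sketch of that idea, not part of the original walkthrough: the helper names batched and get_embeddings_batched and the default batch size of 100 are hypothetical, and client is assumed to be an OpenAI client like the one created earlier.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def get_embeddings_batched(client, texts, model_ID, batch_size=100):
    """Embed a list of texts, sending batch_size texts per API request."""
    embeddings = []
    for batch in batched(texts, batch_size):
        cleaned = [text.replace("\n", " ") for text in batch]
        response = client.embeddings.create(input=cleaned, model=model_ID)
        # Sort by the returned index so results line up with the inputs
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

It could be used as get_embeddings_batched(client, smaller_tokens_docs_reset["concatenated_text"].tolist(), "text-embedding-3-small"), assigning the returned list to the corresponding DataFrame column. Since documents here can be close to 8191 tokens each, a much smaller batch_size may be needed to stay within per-request limits.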
Find similar documents using cosine similarity:
    def find_top_N_similar_documents(df, chosen_index, embedding_column_name, top_N=3):
        # Embedding of the reference document as a 1 x d row vector
        chosen_document_embedding = np.array(df.iloc[chosen_index][embedding_column_name]).reshape(1, -1)
        # Stack every document embedding into an n x d matrix
        embedding_matrix = np.vstack(df[embedding_column_name])
        similarity_scores = cosine_similarity(chosen_document_embedding, embedding_matrix)[0]

        df_temp = df.copy()
        df_temp['similarity_to_chosen'] = similarity_scores
        # Drop the reference document itself, then rank the rest by similarity
        similar_documents = df_temp.drop(index=chosen_index).sort_values(by='similarity_to_chosen', ascending=False)
        top_N_similar = similar_documents.head(top_N)
        return top_N_similar[["concatenated_text", 'similarity_to_chosen']]

    chosen_index = 0
    top_3_similar_3_small = find_top_N_similar_documents(smaller_tokens_docs_reset, chosen_index, "text-embedding-3-small")
    top_3_similar_3_large = find_top_N_similar_documents(smaller_tokens_docs_reset, chosen_index, "text-embedding-3-large")
    top_3_similar_ada_002 = find_top_N_similar_documents(smaller_tokens_docs_reset, chosen_index, "text-embedding-ada-002")

    print("Top 3 Similar Documents with:")
    print("--> text-embedding-3-small")
    print(top_3_similar_3_small)
    print("\n")
    print("--> text-embedding-3-large")
    print(top_3_similar_3_large)
    print("\n")
    print("--> text-embedding-ada-002")
    print(top_3_similar_ada_002)
    print("\n")
Conclusion
OpenAI's new embedding models offer substantial improvements over text-embedding-ada-002 on the MIRACL and MTEB benchmarks. The choice between text-embedding-3-large and text-embedding-3-small depends on the specific application's needs, balancing accuracy and cost. This guide walked through the steps needed to apply these models to a document similarity task. Further resources on the OpenAI API and fine-tuning are available.