
Long context cannot kill RAG: SQL + vectors drive a new paradigm for large models and big data as the MyScale AI database goes open source

Apr 12, 2024 am 08:04 AM

Combining large models with AI databases has become a powerful way to cut costs and raise efficiency for large models, and to make big data truly intelligent.


The wave of large language models (LLMs) has been surging for more than a year. Models such as GPT-4, Gemini-1.5 and Claude-3 have taken the stage one after another and become the well-deserved center of attention. On the LLM track, some research focuses on increasing model parameters, and some is devoted to multi-modality. Among these directions, the context length an LLM can process has become an important indicator for evaluating models, since a longer context is taken to mean stronger retrieval performance. For example, the ability of some models to process up to 1 million tokens in one go has led many researchers to ask whether the RAG (Retrieval-Augmented Generation) method is still necessary.

Some people think that RAG will be killed by long-context models, but this view has been refuted by many researchers and architects. On the one hand, they argue, data structures are complex and change regularly, and much data has an important time dimension, all of which may be too complex for an LLM. On the other hand, it is unrealistic to put all the massive heterogeneous data of enterprises and industries into the context window. Combining large models with AI databases injects professional, accurate and real-time information into the generative AI system, greatly reducing hallucinations and improving the practicality of the system. At the same time, the data-centric LLM approach can use the massive data management and query capabilities of AI databases to significantly reduce the cost of training and fine-tuning large models, and to support few-shot tuning in the system's different scenarios. In summary, the combination of large models and AI databases not only reduces costs and increases efficiency for large models, but also makes big data truly intelligent.

After several years of development and iteration, MyScaleDB is finally open source

The emergence of RAG lets LLMs accurately extract information from large-scale knowledge bases and generate real-time, professional and insightful answers. Along with this, the vector database, the core component of a RAG system, has also developed rapidly. By design concept, vector databases can be roughly divided into three categories: dedicated vector databases, retrieval systems that combine keywords and vectors, and SQL vector databases.

  • Dedicated vector databases, represented by Pinecone, Weaviate and Milvus, were designed and built for vector retrieval from the beginning. Their vector retrieval performance is excellent, but they are not general-purpose and their data management functions are weak.
  • Keyword and vector retrieval systems, represented by Elasticsearch and OpenSearch, are widely used in production because of their complete keyword retrieval functions. However, they consume a lot of system resources, and the accuracy and performance of their joint keyword and vector queries are not satisfactory.
  • SQL vector databases, represented by pgvector (a vector search plug-in for PostgreSQL) and the MyScale AI database, are based on SQL and have powerful data management functions. However, due to the disadvantages of PostgreSQL's row storage and the limitations of its vector algorithms, pgvector has low accuracy in complex vector queries.

The MyScale AI database (MyScaleDB) is built on a high-performance SQL column-store database. It adds a self-developed vector index algorithm with high performance and high data density, and its retrieval and storage engines have been deeply reworked and optimized for joint SQL and vector queries. It is billed as the world's first SQL vector database product whose overall performance and cost-effectiveness greatly exceed those of dedicated vector databases.

Thanks to the long-term polishing of SQL databases in massive structured data scenarios, MyScaleDB supports efficient storage and querying of both massive vector data and structured data, including multiple data types such as strings, JSON, spatial data and time series. It will also launch powerful inverted index and keyword retrieval functions in the near future to further improve the accuracy of RAG systems and replace systems such as Elasticsearch.


After nearly 6 years of development and several version iterations, MyScaleDB has recently been open sourced. All developers and enterprise users are welcome to star the project on GitHub and start using SQL to build production-grade AI applications.

Project address: https://github.com/myscale/myscaledb

Fully compatible with SQL, with improved accuracy and reduced cost

With complete SQL data management capabilities and powerful, efficient storage and querying of structured, vector and heterogeneous data, MyScaleDB is expected to become the first AI database that is truly oriented to large models and big data.

Native compatibility with SQL and vectors

In the half century since the birth of SQL, and despite waves such as NoSQL and big data, the ever-evolving SQL database still holds a major share of the data management market. Even retrieval and big data systems such as Elasticsearch and Spark have successively added SQL interfaces. Dedicated vector databases have been designed and optimized for vectors, but their query interfaces usually lack standardization and have no high-level query language, which leaves the interfaces with weak generality. For example, Pinecone's query interface does not even allow specifying the fields to be retrieved, let alone common database functions such as paging and aggregation.

The weak generality of the interface also means that it changes frequently, which increases the learning cost. The MyScale team believes that a systematically optimized SQL and vector system can maintain complete SQL support while ensuring high vector retrieval performance, and the results of their open source evaluation have demonstrated this.

In real, complex AI application scenarios, combining SQL and vectors greatly increases the flexibility of data modeling and simplifies the development process. For example, in the Science Navigator project, a collaboration between the MyScale team and the Beijing Institute of Scientific Intelligence, MyScaleDB is used to retrieve massive scientific literature data and perform intelligent question answering. There are more than 10 main SQL tables, many of which carry vector and inverted indexes and are associated with one another through primary and foreign keys. In actual queries, the system also runs joint queries over structured, vector and keyword data, as well as join queries across several tables. Such modeling and associations are difficult to achieve in a dedicated vector database, which in turn leads to slow iteration of the final system, inefficient querying and difficult maintenance.
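
The kind of multi-table modeling described above can be sketched in a few lines. The example below is only an illustration, not the Science Navigator schema: it uses Python's built-in `sqlite3` as a stand-in SQL engine, and the `papers` and `chunks` tables, their columns, the sample rows and the `l2_distance` helper are all hypothetical. It shows a structured filter, a primary/foreign key join and a vector ranking expressed in one SQL statement.

```python
import json
import math
import sqlite3


def l2_distance(a_json: str, b_json: str) -> float:
    """Euclidean distance between two vectors stored as JSON arrays."""
    a, b = json.loads(a_json), json.loads(b_json)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


conn = sqlite3.connect(":memory:")
# Register the Python helper so it can be called from SQL (stand-in for a
# native vector distance function in a real SQL vector database).
conn.create_function("l2_distance", 2, l2_distance)

# Hypothetical schema: one row per paper, many text chunks per paper.
conn.executescript("""
    CREATE TABLE papers (
        paper_id INTEGER PRIMARY KEY,
        title    TEXT,
        year     INTEGER
    );
    CREATE TABLE chunks (
        chunk_id  INTEGER PRIMARY KEY,
        paper_id  INTEGER REFERENCES papers(paper_id),
        body      TEXT,
        embedding TEXT  -- JSON array used here in place of a vector column
    );
""")
conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", [
    (1, "Protein folding survey", 2021),
    (2, "Battery materials review", 2023),
])
conn.executemany("INSERT INTO chunks VALUES (?, ?, ?, ?)", [
    (10, 1, "folding pathways", json.dumps([0.9, 0.1])),
    (11, 2, "solid electrolytes", json.dumps([0.1, 0.9])),
    (12, 2, "cathode degradation", json.dumps([0.2, 0.8])),
])

query_vec = json.dumps([0.0, 1.0])  # made-up query embedding
rows = conn.execute("""
    SELECT p.title, c.body, l2_distance(c.embedding, ?) AS dist
    FROM chunks AS c
    JOIN papers AS p ON p.paper_id = c.paper_id   -- foreign key association
    WHERE p.year >= 2022                          -- structured filter
    ORDER BY dist                                 -- vector ranking
    LIMIT 2
""", (query_vec,)).fetchall()

for title, body, dist in rows:
    print(f"{title} | {body} | {dist:.3f}")
```

In a dedicated vector store without joins, the same result would typically need the paper metadata copied onto every chunk or a second lookup in application code.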

Science Navigator main table structure diagram (bold columns have vector indexes or inverted indexes)

Support joint query of structured, vector and keyword data

In real RAG systems, retrieval accuracy and quality are the main bottlenecks restricting adoption. This requires the AI database to efficiently support joint queries over structured, vector and keyword data to comprehensively improve retrieval accuracy.

For example, in a financial scenario, a user asks the document library "What is the revenue of a certain company's global businesses in 2023?" Structured meta-information such as "a certain company" and "2023" cannot be captured well by vectors, and may not even appear directly in the corresponding paragraphs. Performing vector retrieval directly over the entire database returns a large amount of noise and reduces the final accuracy of the system. On the other hand, the company name, year and similar fields can usually be obtained as document metadata. We can use WHERE year=2023 AND company ILIKE "%%" as the filter condition of the vector query to locate the relevant information precisely, which greatly improves the reliability of the system. In finance, manufacturing, scientific research and other scenarios, the MyScale team has observed the power of heterogeneous data modeling and join queries; in many scenarios the accuracy improvement even reaches 60% to 90%.
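
To make the effect of the metadata filter concrete, here is a minimal pure-Python sketch. It is not MyScaleDB code: the `Chunk` class, the `search` function, the company names and the three-dimensional embeddings are all invented for illustration. The filter mirrors the WHERE year and company condition above and is applied before ranking by vector distance.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    company: str
    year: int
    text: str
    embedding: list[float]


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def search(chunks, query_vec, k, company=None, year=None):
    """Top-k by cosine distance, optionally restricted by metadata first."""
    candidates = [
        c for c in chunks
        # Case-insensitive substring match, similar in spirit to ILIKE '%name%'.
        if (company is None or company.lower() in c.company.lower())
        and (year is None or c.year == year)
    ]
    return sorted(candidates, key=lambda c: cosine_distance(c.embedding, query_vec))[:k]


# Hypothetical document chunks; the embeddings are hand-made so that all
# "revenue" passages sit close together regardless of company or year.
CHUNKS = [
    Chunk("Acme Corp", 2023, "Global revenue rose across all segments.", [0.9, 0.1, 0.0]),
    Chunk("Acme Corp", 2022, "Global revenue was flat year over year.", [0.95, 0.05, 0.0]),
    Chunk("Globex Inc", 2023, "Global revenue grew in every region.", [0.92, 0.08, 0.0]),
    Chunk("Acme Corp", 2023, "The board approved a new office lease.", [0.1, 0.2, 0.9]),
]
QUERY = [1.0, 0.0, 0.0]  # stand-in embedding for "global revenue"

unfiltered = search(CHUNKS, QUERY, k=2)
filtered = search(CHUNKS, QUERY, k=2, company="acme", year=2023)

print("unfiltered:", [(c.company, c.year) for c in unfiltered])
print("filtered:  ", [(c.company, c.year, c.text) for c in filtered])
```

With this toy data the unfiltered search returns passages from the wrong year and the wrong company, because the embeddings encode the topic but not the metadata; the filtered search ranks the 2023 passage for the intended company first.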

Traditional database products have gradually recognized the importance of vector queries in the AI era and have begun to add vector capabilities, but the accuracy of their joint queries still has significant problems. For example, in filtered query scenarios, the QPS of Elasticsearch drops to only about 5 when the filter ratio is 0.1, while the retrieval accuracy of PostgreSQL (using the pgvector plug-in) is only about 50% when the filter ratio is 0.01. Such unstable query accuracy and performance greatly restrict their application scenarios. MyScale, by contrast, achieves high-performance, high-precision queries across scenarios with different filter ratios at only 36% of the cost of pgvector and 12% of the cost of Elasticsearch.


MyScale achieves high-precision, high-performance queries across scenarios with different filter ratios

The balance between performance and cost in real scenarios

Because vector retrieval is so important to large model applications and attracts so much attention, more and more teams have entered the vector database track. The initial focus was on improving QPS in pure vector search scenarios, but pure vector search is far from enough! In real-world scenarios, data modeling, query flexibility and accuracy, and the balance among data density, query performance and cost are more important issues. In RAG scenarios, the norm for today's dedicated vector databases is a 10x excess of pure vector query performance, vectors that occupy huge resources, missing joint query functions, and poor joint query performance and accuracy.
MyScaleDB is committed to improving the overall performance of AI databases in real, massive-data scenarios. Its MyScale Vector Database Benchmark is also the industry's first open source evaluation system to compare the overall performance and cost-effectiveness of mainstream vector database systems at a scale of five million vectors and across different query scenarios. Everyone is welcome to follow the project and raise issues. The MyScale team said that AI databases still have a lot of room for optimization in real application scenarios, and that they hope to keep polishing the product and improving the evaluation system in practice.

MyScale Vector Database Benchmark project address: https://github.com/myscale/vector-db-benchmark

Outlook: a large model and big data Agent platform supported by the AI database

Machine learning plus big data drove the success of the Internet and the previous generation of information systems. In the era of large models, the MyScale team is likewise committed to proposing a new generation of large model plus big data solutions. With a high-performance SQL vector database as a solid foundation, MyScaleDB provides the key capabilities of large-scale data processing, knowledge query, observability, data analysis and few-shot learning, builds a closed loop between AI and data, and becomes the key base of the next-generation large model and big data Agent platform. The MyScale team has already explored the implementation of this solution in scientific research, finance, industry, medical and other fields.

With the rapid development of technology, some form of artificial general intelligence (AGI) is expected to appear in the next 5-10 years. This raises a question: do we need a large model that is static, virtual and in competition with humans, or is there a more comprehensive solution? Data is undoubtedly an important link between large models, the world and users. The MyScale team's vision is to organically combine large models and big data to create an AI system that is more professional, more real-time and more efficient in collaboration, while also being full of human warmth and value.
