
Let the AI model become a GTA five-star player: the vision-based programmable intelligent agent Octopus is here



Video games have become a simulated stage for the real world, offering endless possibilities. Take Grand Theft Auto (GTA) as an example: players experience the colorful life of the virtual city of Los Santos from a first-person perspective. If human players can roam Los Santos and complete missions, could an AI vision model also control a GTA character and become a "player" that carries out tasks? Could such an AI player act as a five-star good citizen who obeys traffic rules, helps the police catch criminals, or even lends a hand as a passerby, helping a homeless person find suitable housing?

Current vision-language models (VLMs) have made substantial progress in multimodal perception and reasoning, but they are usually evaluated on relatively simple visual question answering (VQA) or image captioning tasks. These task settings clearly do not allow a VLM to actually complete tasks in the real world: real tasks require not only understanding visual information, but also the ability to plan, reason, and adjust based on continuously updated environmental information. Moreover, the generated plans must be able to manipulate entities in the environment to actually get the task done.

Although existing large language models (LLMs) can perform task planning based on the information they are given, they cannot understand visual input. This greatly limits their applicability to concrete real-world tasks, especially embodied-intelligence tasks: text-based descriptions of such scenes are often too complex or too cumbersome to spell out, making it hard for a language model to extract the information it needs to complete the task. Language models have been explored for program generation, but generating structured, executable, and robust code from visual input remains largely unexplored.

To address the problem of giving large models embodied intelligence, that is, building autonomous, situation-aware systems that can accurately make plans and execute commands, researchers from Nanyang Technological University in Singapore, Tsinghua University, and other institutions proposed Octopus. Octopus is a vision-based programmable agent that learns from visual input, understands the environment, and completes a wide range of practical tasks by generating executable code. By training on large numbers of pairs of visual input and executable code, Octopus learns how to control video game characters to finish game missions or to carry out complex household activities.
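To make the idea of "visual input paired with executable code" concrete, here is a minimal, hypothetical illustration of what such a training pair could look like. The field names and action primitives (move_to, open, grasp, place) are illustrative assumptions, not the paper's exact schema:

```python
# Hypothetical (visual input, plan, executable code) pair of the kind Octopus is trained on.
example_pair = {
    "instruction": "cook a bacon",
    "visual_input": ["view_front.png", "view_left.png", "bev_0.png"],  # image files from the simulator
    "plan": [
        "walk to the fridge",
        "open the fridge and take out the bacon",
        "put the bacon on the pan",
    ],
    "executable_code": (
        "move_to(fridge)\n"
        "open(fridge)\n"
        "grasp(bacon)\n"
        "move_to(pan)\n"
        "place(bacon, pan)\n"
    ),
}
print(example_pair["executable_code"])
```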


  • Paper link: https://arxiv.org/abs/2310.08588

  • Project web page: https://choiszt.github.io/Octopus/

  • Open source code link: https://github.com/dongyh20/Octopus

Data collection and training

To train a vision-language model that can complete embodied-intelligence tasks, the researchers built OctoVerse, which contains two simulation systems that provide training data and test environments for Octopus. The two environments supply training and evaluation scenarios for embodied VLMs and place high demands on the model's reasoning and task-planning abilities. The details are as follows:

1. OctoGibson: built on OmniGibson, developed at Stanford University, it includes 476 household activities drawn from real life. The simulation environment covers 16 categories of home scenes with 155 instances of real household layouts, and the model can manipulate the large number of interactive objects they contain to complete the final task.

2. OctoGTA: built on the game Grand Theft Auto (GTA), it contains 20 tasks generalized across five different scenarios. Pre-written scripts place the player at a fixed location and provide the items and NPCs needed for the mission, ensuring that each task can proceed smoothly.

The figure below shows the task classification of OctoGibson and some statistical results of OctoGibson and OctoGTA.

[Figure: task categories of OctoGibson and statistics for OctoGibson and OctoGTA]

To collect training data efficiently in the two simulation environments, the researchers built a complete data-collection pipeline. GPT-4 is introduced as the task executor: pre-implemented functions convert the visual input obtained from the simulator into textual information, which is provided to GPT-4. GPT-4 returns the task plan and executable code for the current step; the code is executed in the simulator, and the system checks whether the current sub-task has been completed. If it succeeds, visual input for the next step is collected; if it fails, the agent returns to the starting state of the previous step and the data for that step is collected again.
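The collection loop described above can be sketched roughly as follows. The env.* methods and query_gpt4 are placeholders standing in for the simulator interface and the GPT-4 call, so this only illustrates the control flow rather than the authors' actual implementation:

```python
def collect_episode(env, task, query_gpt4, max_steps=10):
    """Rough sketch of the GPT-4-driven data collection loop (placeholder APIs)."""
    records = []
    for _ in range(max_steps):
        checkpoint = env.save_state()              # remember where this step started
        images = env.render_views()                # first-person and bird's-eye views
        scene_text = env.describe_scene()          # scene converted to text for GPT-4
        plan, code = query_gpt4(task, scene_text)  # plan + executable code for this step
        success = env.execute(code)                # run the generated code in the simulator
        records.append({"images": images, "plan": plan,
                        "code": code, "subtask_success": success})
        if env.task_completed():
            return records, True
        if not success:
            env.load_state(checkpoint)             # roll back and retry from the previous state
    return records, False                          # step budget exhausted: task counted as failed
```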

[Figure: the data collection process, illustrated with the Cook a Bacon task in OctoGibson]

The figure above uses the Cook a Bacon task in the OctoGibson environment as an example to walk through the full data-collection process. Note that while collecting data, the researchers recorded not only the visual information during task execution and the executable code returned by GPT-4, but also whether each sub-task succeeded; this success signal later serves as the basis for introducing reinforcement learning to build a more capable VLM. Although GPT-4 is powerful, it is not infallible: errors appear in various forms, including syntax errors and physical constraints in the simulator. For example, as shown in Figure 3, between states #5 and #6 the action "put bacon on the pan" failed because the bacon held by the agent was too far from the pan. Such setbacks reset the task to its previous state. If a task is not completed within 10 steps, it is considered unsuccessful and is terminated for budget reasons, and the data pairs of all its sub-tasks are marked as failed.
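The labelling rule for failed tasks can be expressed as a small helper continuing the sketch above (again a hypothetical illustration, not the authors' code):

```python
def label_task_records(records, task_completed):
    """If the whole task fails (e.g. the 10-step budget runs out), mark every
    sub-task data pair of that task as failed, as described above."""
    if not task_completed:
        for record in records:
            record["subtask_success"] = False
    return records

# usage with the collection sketch:
#   records, done = collect_episode(env, task, query_gpt4)
#   records = label_task_records(records, done)
```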

[Figure: the complete data collection and training pipeline]

After collecting training data at sufficient scale, the researchers used it to train the vision-language model Octopus. The figure above shows the complete data collection and training pipeline. In the first stage, supervised fine-tuning on the collected data produces a VLM that takes visual information as input and produces output in a fixed format, mapping visual input to a task plan and executable code. In the second stage, the researchers introduce RLEF, which uses reinforcement learning with environmental feedback: the recorded success of each sub-task serves as a reward signal to further strengthen the VLM's task-planning ability and raise the overall task success rate.
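As a rough illustration of how environmental feedback can serve as a reward signal, here is a REINFORCE-style sketch in PyTorch. The paper's RLEF procedure may use a different policy-optimization algorithm, and vlm.log_prob is an assumed interface:

```python
import torch

def rlef_step(vlm, optimizer, batch):
    """One schematic policy-gradient update: the binary success of each
    collected sub-task weights the log-probability of the generated code."""
    losses = []
    for sample in batch:
        # log-probability of the plan/code the model generated for this visual input
        logp = vlm.log_prob(sample["images"], sample["instruction"], sample["code"])
        reward = 1.0 if sample["subtask_success"] else 0.0   # environmental feedback
        losses.append(-reward * logp)                        # REINFORCE-style term
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```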

Experimental results

The researchers evaluated current mainstream VLMs and LLMs in the OctoGibson environment; the table below presents the main results. For each model tested, Vision Model lists the visual model it uses. For the LLMs, visual information is converted into text and provided as input: O means providing information about the interactive objects in the scene, R means additionally providing the relative relationships between objects, and GT means using ground-truth information directly, without an extra visual model for detection.
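A hedged sketch of how such scene information might be flattened into text for the LLM baselines is shown below; the exact prompt format used in the paper may differ, and the objects and relations are made up for illustration:

```python
def scene_to_text(objects, relations, use_objects=True, use_relations=True):
    """Serialize a scene into text: 'O' adds the interactive objects,
    'R' additionally adds their pairwise relations."""
    lines = []
    if use_objects:
        lines.append("Objects in the scene: " + ", ".join(objects))
    if use_relations:
        lines += [f"{a} is {rel} {b}" for a, rel, b in relations]
    return "\n".join(lines)

print(scene_to_text(
    objects=["fridge", "bacon", "pan", "stove"],
    relations=[("bacon", "inside", "fridge"), ("pan", "on top of", "stove")],
))
```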

For all test tasks, the researchers report the success rate on the complete test set and further break it down into four categories: completing new tasks in scenes that appear in the training set, generalizing to new tasks in scenes not seen during training, simple follow-type tasks, and complex reasoning tasks. For each category they report two metrics: the task completion rate, which measures how often the model successfully completes an embodied-intelligence task, and the task-planning accuracy, which reflects the model's ability to plan tasks.
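The two metrics can be summarized as follows, under the assumption that each test episode records a task-level success flag and per-step planning correctness; how planning accuracy is aggregated here is an illustrative assumption:

```python
def completion_rate(episodes):
    """Fraction of test tasks the model completed successfully."""
    return sum(e["task_success"] for e in episodes) / len(episodes)

def planning_accuracy(episodes):
    """Fraction of planned steps judged correct across all test episodes."""
    steps = [ok for e in episodes for ok in e["plan_correct_per_step"]]
    return sum(steps) / len(steps)

episodes = [
    {"task_success": True,  "plan_correct_per_step": [True, True, True]},
    {"task_success": False, "plan_correct_per_step": [True, False]},
]
print(completion_rate(episodes), planning_accuracy(episodes))
```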

[Table: main experimental results in the OctoGibson environment]

The researchers also show how different models respond to visual data collected in the OctoGibson simulator. The figure below compares the outputs of TAPA + CodeLLaMA, Octopus, and GPT-4V given visual input from OctoGibson. Compared with TAPA + CodeLLaMA and with an Octopus model trained only with supervised fine-tuning, the Octopus model trained with RLEF produces more reasonable task plans: even for the rather vague instruction "find a large bottle", it gives a more complete plan. These results further illustrate the effectiveness of the RLEF training strategy in improving the model's task-planning and reasoning abilities.

[Figure: example responses of TAPA + CodeLLaMA, Octopus, and GPT-4V to visual input from OctoGibson]

Overall, existing models still leave considerable room for improvement in both task completion and task planning in the simulation environment. The researchers summarized several key findings:

1. CodeLLaMA improves the model's code-generation ability, but not its task-planning ability.

The experimental results show that CodeLLaMA significantly improves a model's code generation: compared with a traditional LLM, using CodeLLaMA yields better, more executable code. However, even for models that use CodeLLaMA, the overall task success rate is still limited by task-planning ability; models with weak planning produce more executable code yet still achieve a lower final success rate. Octopus, by contrast, does not use CodeLLaMA and its code executability is slightly lower, but thanks to its strong task-planning ability its overall task success rate is still higher than that of the other models.

2. LLMs have difficulty processing large amounts of text input.

During testing, the researchers compared the results of TAPA and CodeLLaMA and concluded that language models struggle with long text input. Following the TAPA approach, they used ground-truth object information for task planning, while CodeLLaMA received the objects plus the relative positional relationships between them in order to provide more complete information. However, the experiments showed that in more complex environments the amount of redundant text grows sharply, and the LLM has difficulty extracting valuable clues from it, which lowers the task success rate. This reflects a limitation of LLMs: representing complex scenes with text produces a large amount of redundant, low-value input.

3. Octopus shows good task generalization ability.

The experimental results show that Octopus generalizes well across tasks: in new scenes that do not appear in the training set, it outperforms existing models in both task completion rate and task-planning success rate. This also suggests that vision-language models hold an inherent advantage on this class of tasks and generalize better than traditional LLMs.

4. RLEF can enhance the task-planning ability of the model.

The researchers compare two models: one trained only with first-stage supervised fine-tuning, and one further trained with RLEF. The results show that after RLEF training, the model's overall success rate and planning ability improve significantly on tasks that demand strong reasoning and planning, and RLEF is more efficient than existing VLM training strategies. The example plots show that an RLEF-trained model plans better: when facing complex tasks it learns to explore the environment, and its plans better respect the practical constraints of the simulator (for example, moving next to an object before trying to interact with it), which reduces the risk of planning failures.

Discussion

Ablation study

After evaluating the model's capabilities, the researchers further explored factors that may affect its performance. As shown in the figure below, they ran experiments along three axes.

1. Proportion of trained parameters

The researchers compared three configurations: training only the connection layer between the visual and language models, training the connection layer plus the language model, and training the full model. The results show that performance improves steadily as more parameters are trained, indicating that the number of trained parameters is critical to whether the model can complete tasks in these fixed scenarios.
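As an illustration of the three configurations, a PyTorch-style freezing helper might look like this; the submodule names (connector, language_model, vision_encoder) are assumptions rather than the actual attribute names in the Octopus code:

```python
def set_trainable(model, config):
    """Freeze everything, then unfreeze the parts named by `config`:
    'connector', 'connector+lm', or 'full' (hypothetical module names)."""
    for p in model.parameters():
        p.requires_grad = False
    if config in ("connector", "connector+lm", "full"):
        for p in model.connector.parameters():        # visual-language connection layer
            p.requires_grad = True
    if config in ("connector+lm", "full"):
        for p in model.language_model.parameters():   # language-model backbone
            p.requires_grad = True
    if config == "full":
        for p in model.vision_encoder.parameters():   # visual encoder as well
            p.requires_grad = True
    return model
```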

2. Model size

The researchers compared a smaller 3B-parameter model with the baseline 7B model across the two training stages. The comparison shows that a larger overall parameter count brings a clear performance gain. How to choose the model size so that it is capable enough for the task while remaining lightweight and fast at inference will be a key question for future VLM research.

3. Coherence of visual input

To study how different visual inputs affect VLM performance, the researchers ran the following experiment. During testing, the agent rotates in place in the simulator to collect first-person views plus two bird's-eye views, and these images are fed to the VLM in a fixed order. When the researchers randomly shuffled the order of the images before feeding them to the VLM, performance dropped substantially. On the one hand, this shows how important complete and structured visual information is to a VLM; on the other hand, it shows that the VLM relies on the intrinsic relationships between the images, and once that structure is broken, performance suffers greatly.
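A toy sketch of the ordered-versus-shuffled comparison is given below; the image names are placeholders, and in practice the views come from the simulator:

```python
import random

def build_visual_sequence(first_person_views, bev_views, shuffle=False):
    """Assemble the visual input in its structured order, or shuffle it to
    mimic the ablation that hurts performance."""
    sequence = list(first_person_views) + list(bev_views)
    if shuffle:
        random.shuffle(sequence)
    return sequence

ego_views = [f"ego_view_{i}.png" for i in range(8)]   # collected while rotating in place
bev_views = ["bev_0.png", "bev_1.png"]                # two bird's-eye views
print(build_visual_sequence(ego_views, bev_views)[:3])
print(build_visual_sequence(ego_views, bev_views, shuffle=True)[:3])
```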

[Figure: ablation results]

GPT-4 and GPT-4V

In addition, the researchers tested GPT-4 and GPT-4V in the simulation environment and analyzed their performance.

1. GPT-4

For GPT-4, the researchers provided exactly the same text input during testing that was used when collecting training data with it. On the test tasks, GPT-4 completed half of them. On the one hand, this shows that existing VLMs still have substantial room for improvement relative to language models such as GPT-4; on the other hand, it shows that even a language model as strong as GPT-4 still needs better task planning and task execution when facing embodied-intelligence tasks.

2. GPT-4V

Because a directly callable GPT-4V API had only just been released, the researchers had not yet had time to evaluate it systematically, but they manually tested a number of examples to gauge its performance. From these examples, they conclude that GPT-4V has strong zero-shot generalization for tasks in the simulation environment and can generate corresponding executable code from visual input, but its task planning is slightly weaker than that of a model fine-tuned on data collected in the simulator.

Summary

The researchers pointed out some limitations of the current work:

1. The current Octopus model does not handle complex tasks well: when facing a complex task, it often produces incorrect plans and has to rely heavily on feedback from the environment, which makes it difficult to complete the entire task.

2. The Octopus model is trained only in simulation, and transferring it to the real world will raise a series of problems. For example, in real environments it is much harder for the model to obtain accurate relative position information for objects, and building an object-level understanding of the scene becomes more difficult.

3. Octopus currently takes discrete static images as visual input; handling continuous video will be a challenge for future work. Continuous video could further improve task performance, but efficiently processing and understanding continuous visual input will be key to improving VLM performance.
