Table of Contents
Background
Motivation
Difficulty
Ideas
Method
Verification
Possible directions for future expansion

Introduced for the first time! Using causal inference to do partially observable reinforcement learning

Apr 15, 2023, 11:07 AM

This article "Fast Counterfactual Inference for History-Based Reinforcement Learning" proposes a fast causal inference algorithm that greatly reduces the computational complexity of causal inference - to a level that can be combined with online reinforcement learning.

The theoretical contribution of this article has two main points:

1. It proposes the concept of the time-averaged causal effect;

2. It extends the famous backdoor criterion from univariate intervention effect estimation to multivariate intervention effect estimation; this extension is called the step-backdoor criterion.

Background

This section assumes basic knowledge of partially observable reinforcement learning and causal inference. I won't introduce them at length here; instead, here are a few pointers:

Partially Observable Reinforcement Learning:

POMDP explanation: https://www.zhihu.com/zvideo/1326278888684187648

Causal inference:

Causal inference in deep neural networks: https://zhuanlan.zhihu.com/p/425331915

Motivation

Extracting/encoding features from historical information is a basic means of solving partially observable reinforcement learning. The mainstream approach is to use a sequence-to-sequence (seq2seq) model to encode the history; for example, the LSTM/GRU/NTM/Transformer reinforcement learning methods popular in the field fall into this category. What these methods have in common is that the history is encoded based on the correlation between historical information and the learning signal (environmental reward): the stronger the correlation of a piece of historical information, the higher the weight assigned to it.

However, these methods cannot eliminate confounding correlations caused by sampling. Consider the example of picking up a key to open a door, as shown in the figure below:

[Figure 1: the key-to-open-the-door example]

Here, whether the agent can open the door depends only on whether it has obtained the key in the past, not on any other state in the history. However, if the agent's sampling policy prefers certain paths, the states along those preferred paths become highly correlated. For example, after the agent gets the key, it tends to take the upper path to the door rather than the lower path, which makes opening the door highly correlated with the TV. Such non-causal but highly correlated states are given relatively high weights by seq2seq, making the encoded history very redundant. In this example, when we estimate the correlation between the TV and the door opening, the presence of the key induces a confounding high correlation between the two. To estimate the true effect of the TV on opening the door, this confounding correlation must be removed.

This confounding correlation can be removed through do-calculus in causal inference [1]:

The backdoor adjustment stratifies on the potentially confounding backdoor variables (the key and the ball), cutting off the statistical correlation between the backdoor variables and the TV, and then integrates the conditional probability p(Open | TV, key/ball) over the backdoor variables (right panel of Figure 1), yielding the true effect p(Open | do(TV)) = 0.5. Since historical states with causal effects are relatively sparse, removing confounding correlations can greatly reduce the scale of the history.
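As a minimal sketch of this adjustment (hypothetical counts, considering only the key as the confounder; not the paper's data), the following Python snippet contrasts the confounded correlation P(Open | TV) with the backdoor-adjusted P(Open | do(TV)):

```python
from fractions import Fraction

# Hypothetical trajectories (key, tv, open); after picking up the key the
# sampling policy prefers the upper path, which passes the TV.
samples = [
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1),   # key obtained -> TV path -> door opens
    (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0),   # no key -> door never opens
]

def prob(event, given=lambda s: True):
    """Empirical conditional probability P(event | given)."""
    pool = [s for s in samples if given(s)]
    return Fraction(sum(1 for s in pool if event(s)), len(pool))

# Naive correlation: P(Open | TV) is inflated because the key confounds both.
p_corr = prob(lambda s: s[2] == 1, given=lambda s: s[1] == 1)

# Backdoor adjustment: P(Open | do(TV)) = sum_k P(Open | TV, key=k) * P(key=k)
p_do = sum(
    prob(lambda s: s[2] == 1, given=lambda s, k=k: s[1] == 1 and s[0] == k)
    * prob(lambda s, k=k: s[0] == k)
    for k in (0, 1)
)

print("P(Open | TV)     =", float(p_corr))   # 0.8: confounded correlation
print("P(Open | do(TV)) =", float(p_do))     # 0.5: equals P(Open), so the TV has no real effect
```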

Therefore, we hope to use causal inference to remove the confounding correlations in the historical samples, and then use seq2seq to encode the remaining history, obtaining a more compact historical representation. (This is the motivation of this article.)

[1] Note: what is used here is do-calculus with backdoor adjustment; a popular-science introduction: https://blog.csdn.net/qq_31063727/article/details/118672598

Difficulty

Performing causal inference on historical sequences is different from general causal inference problems. The variables in a historical sequence have both a temporal and a spatial dimension, i.e., each variable is an observation-timestamp pair (o, t), where o is the observation and t is the timestamp (in contrast, MDPs are friendly in this respect: the Markov state has only the spatial dimension). The overlap of the two dimensions makes the scale of the history very large: if |O| denotes the number of possible observation values at each timestamp and T the total length of time, then the history can take on the order of O(|O|^T) different values (here O(·) is the complexity notation). [2]

Previous causal inference methods were based on univariate intervention and could only intervene on one variable at a time. Performing causal inference over such a large-scale history therefore has extremely high time complexity, making it difficult to combine with online RL algorithms.

[2] Note: the causal effect of a univariate intervention is formally defined as follows. Given the history, we need to estimate the causal effect of one historical observation on the response variable, in two steps: 1) intervene on that historical state with do(·); 2) take the preceding historical states as the backdoor variables, with the later variable as the response, and compute the backdoor-adjusted integral to obtain the required causal effect.
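A reconstruction of this backdoor-adjusted estimate, writing the history as $o_{1:t}$, the intervened observation as $o_i$, and the response variable as $y$ (notation ours; the paper's exact statement may differ):

$$p\big(y \mid \mathrm{do}(o_i = o)\big) \;=\; \sum_{o_{1:i-1}} p\big(y \mid o_i = o,\; o_{1:i-1}\big)\, p\big(o_{1:i-1}\big)$$

That is, the ordinary backdoor adjustment with the observations before timestep $i$ as the adjustment set.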

Since univariate intervention is difficult to combine with online RL, it is necessary to develop multivariate intervention methods.

Ideas

The core observation (hypothesis) of this article is that causal states are sparse in the spatial dimension. This observation is natural and common: for example, when opening a door with a key, many states are observed along the way, but it is the observation of the key that determines whether the door can be opened, and this observation accounts for only a sparse fraction of all observed values. Taking advantage of this sparsity, we can filter out a large number of historical states without causal effects at once through multivariate intervention. Causal effects are not sparse in the time dimension, however: in the same key-and-door example, the key can be observed by the agent at most timesteps. This density of causal effects in the time dimension prevents us from conducting multivariate interventions there, since it is impossible to remove a large number of historical states without causal effects at once.

Based on these two observations, our core idea is to infer first in the spatial dimension and then in the temporal dimension, using the sparsity in the spatial dimension to significantly reduce the number of interventions. To estimate the spatial causal effect separately, we propose to first obtain the time-averaged causal effect, i.e., the causal effect of multiple historical states averaged over time (see the original text for the formal definition).

Based on this idea, the core problem to be solved becomes how to compute the joint causal effect of intervening on variables that take the same value but sit at multiple different timesteps. The difficulty is that the backdoor criterion does not apply to a joint intervention on multiple historical variables: as shown in the figure below, for a joint intervention on two variables x_i and x_j (with i < j), part of the backdoor set of the later variable x_j contains the earlier intervened variable x_i, and the two variables have no common backdoor variable.

[Figure: causal graph for a joint intervention on two historical variables]

Method

We improve the backdoor criterion and propose a criterion suitable for estimating multivariate joint intervention effects. For any two intervened variables x_i and x_j with i < j, the following criterion applies.

Step-backdoor adjustment formula

[Formula: step-backdoor adjustment; a reconstruction is sketched below]

This criterion separates out the variables lying between the two intervened variables at adjacent timesteps; these are called step-backdoor variables. In a causal graph that satisfies this criterion, we can estimate the joint causal effect of any two intervened variables in two steps. Step 1: use the variables at timesteps earlier than i as backdoor variables to estimate the causal effect of do(x_i). Step 2: conditioning on a fixed value of x_i's backdoor variables and on the given x_i, take the variables between x_i and x_j as the new backdoor variables for x_j (i.e., the step-backdoor variables of x_i and x_j), and estimate the conditional causal effect of do(x_j). The joint causal effect is then the product-integral of these two parts. The step-backdoor criterion thus applies the ordinary backdoor criterion twice, as shown in the figure below.

[Figure: the step-backdoor criterion applies the ordinary backdoor criterion in two steps]

The formula above is written in terms of the more general variable symbol x.
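Putting the two steps together, one plausible reading of the step-backdoor adjustment (a reconstruction from the description above, with response variable $y$, the ordinary backdoor set $z_{<i}$ before timestep $i$, and the step-backdoor set $z_{(i,j)}$ between timesteps $i$ and $j$; the paper's exact notation may differ) is:

$$p\big(y \mid \mathrm{do}(x_i), \mathrm{do}(x_j)\big) \;=\; \sum_{z_{<i}} p\big(z_{<i}\big) \sum_{z_{(i,j)}} p\big(z_{(i,j)} \mid x_i,\, z_{<i}\big)\; p\big(y \mid x_j,\, z_{(i,j)},\, x_i,\, z_{<i}\big)$$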

For the case of more than two intervened variables, repeatedly applying the step-backdoor criterion (the variables between each pair of temporally adjacent intervened variables are taken as step-backdoor variables) and chaining the formula above yields the joint causal effect of the multivariate intervention:

Theorem 1. Given a set of intervened variables with different timestamps, if every two temporally adjacent variables satisfy the step-backdoor adjustment formula, then the overall causal effect can be estimated by chaining the step-backdoor adjustment across adjacent pairs (see the original text for the exact formula).
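A plausible form of this chained estimate (again our reconstruction, with intervened variables $x_{t_1}, \ldots, x_{t_n}$ ordered in time, $z_1$ the ordinary backdoor set before $x_{t_1}$, and $z_k$ the step-backdoor set between $x_{t_{k-1}}$ and $x_{t_k}$):

$$p\big(y \mid \mathrm{do}(x_{t_1}), \ldots, \mathrm{do}(x_{t_n})\big) \;=\; \sum_{z_1, \ldots, z_n} p\big(y \mid x_{t_1}, z_1, \ldots, x_{t_n}, z_n\big)\, \prod_{k=1}^{n} p\big(z_k \mid x_{t_1}, z_1, \ldots, x_{t_{k-1}}, z_{k-1}\big)$$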

Specific to the partially observable reinforcement learning problem, replacing x in the formula above with the observation o gives the following causal-effect formula:

Theorem 2. Given the appropriate history variables (the precise conditions and formula are in the original text), the causal effect of do(o) can be estimated by the observation version of the chained formula above.

At this point, the paper has given a formula for computing spatial causal effects (i.e., time-averaged causal effects). This greatly reduces the number of interventions (the exact complexity figures are given in the paper). The next step is to exploit the sparsity of spatial causal effects (mentioned in the Ideas section above) to reduce the number of interventions further, exponentially: replace the intervention on a single observation with an intervention on an observation subspace, a common way of exploiting sparsity to speed up computation (see the original text). On this basis the paper develops a fast counterfactual inference algorithm called Tree-based History Counterfactual Inference (T-HCI), which is not described in detail here (see the original text). In fact, many history counterfactual inference algorithms can be built on the step-backdoor criterion; T-HCI is just one of them. The final result is Proposition 3 (coarse-to-fine CI), which, under a sparsity condition, bounds the number of interventions required by coarse-to-fine CI (see the original text for the exact bound).
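The tree-based subspace search itself is not spelled out in this article. As a rough illustration of the coarse-to-fine idea only (our own sketch, not the paper's T-HCI algorithm), one can intervene on whole observation subspaces and prune any subspace whose estimated (time-averaged) effect is negligible; `effect_of_subspace` below is an assumed callback standing in for such an estimator:

```python
from typing import Callable, List, Sequence

def coarse_to_fine_search(
    obs_space: Sequence[int],
    effect_of_subspace: Callable[[Sequence[int]], float],
    threshold: float,
) -> List[int]:
    """Coarse-to-fine search for observations with non-negligible causal effect.

    `effect_of_subspace(subset)` is assumed to return the (time-averaged) causal
    effect of intervening on the whole subset at once; if that effect falls below
    `threshold`, the entire subset is pruned without testing its members.
    """
    causal_obs: List[int] = []
    stack = [list(obs_space)]
    while stack:
        subset = stack.pop()
        if effect_of_subspace(subset) < threshold:
            continue                      # whole subspace has ~no effect: prune it
        if len(subset) == 1:
            causal_obs.append(subset[0])  # a single observation with causal effect
            continue
        mid = len(subset) // 2            # split the subspace and recurse on both halves
        stack.append(subset[:mid])
        stack.append(subset[mid:])
    return causal_obs

# Toy usage: only observations 3 and 12 carry causal effect (sparse in space),
# so far fewer interventions are needed than testing all 16 observations.
truly_causal = {3, 12}
toy_effect = lambda subset: 1.0 if truly_causal & set(subset) else 0.0
print(coarse_to_fine_search(range(16), toy_effect, threshold=0.5))
```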

The algorithm structure diagram is as follows

[Figure: T-HCI algorithm structure diagram]

The algorithm contains two loops, a T-HCI loop and a policy-learning loop, which alternate. In the policy-learning loop, the agent samples and learns for a certain number of rounds, and the samples are stored in the replay pool; in the T-HCI loop, the stored samples are used to carry out the causal inference process described above.
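A minimal sketch of this alternating structure (the function and method names below, such as `run_history_counterfactual_inference`, are hypothetical placeholders and not the paper's implementation):

```python
def train(agent, env, replay_pool, num_iterations, rollouts_per_iter):
    """Alternate between a policy-learning loop and a history counterfactual inference loop."""
    causal_mask = None  # which observations are currently judged to have causal effect
    for _ in range(num_iterations):
        # --- policy-learning loop: sample and learn for a number of rounds ---
        for _ in range(rollouts_per_iter):
            episode = agent.rollout(env, causal_mask)   # history encoded using the causal mask
            replay_pool.add(episode)
            agent.update(replay_pool.sample_batch())
        # --- T-HCI-style loop: counterfactual inference on the stored samples ---
        causal_mask = agent.run_history_counterfactual_inference(replay_pool)
    return agent
```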

Limitations: causal inference in the spatial dimension already compresses the history sufficiently. Although causal inference in the time dimension could compress it further, to balance the computational complexity this article keeps correlation-based inference in the time dimension (an LSTM is applied end-to-end to the historical states that have spatial causal effects) rather than causal inference.
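As an illustration of this design (a sketch assuming PyTorch; the masking scheme is our simplification, not the paper's code), spatial causal inference can be viewed as producing a 0/1 mask over observations, and an LSTM then encodes the masked history end-to-end:

```python
import torch
import torch.nn as nn

class CausalHistoryEncoder(nn.Module):
    """Encode a history with an LSTM, keeping only observations flagged as causal."""

    def __init__(self, obs_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)

    def forward(self, history: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        # history: (batch, T, obs_dim); causal_mask: (batch, T) with entries in {0, 1}
        filtered = history * causal_mask.unsqueeze(-1)   # zero out non-causal observations
        _, (h_n, _) = self.lstm(filtered)
        return h_n[-1]                                    # (batch, hidden_dim) history code

# Toy usage with a sparse causal mask
enc = CausalHistoryEncoder(obs_dim=8, hidden_dim=16)
hist = torch.randn(4, 20, 8)
mask = (torch.rand(4, 20) > 0.8).float()
print(enc(hist, mask).shape)  # torch.Size([4, 16])
```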

Verification

The experiments verify three points, answering the questions raised earlier: 1) Can T-HCI improve the sample efficiency of RL methods? 2) Is the computational overhead of T-HCI acceptable in practice? 3) Can T-HCI mine the observations that carry causal effects? Please see the experimental section of the paper for details; I won't take up space here. Of course, interested readers are welcome to message or comment.


Possible directions for future expansion

Two points to start the discussion:

1. HCI is not restricted to a particular type of reinforcement learning. Although this article studies online RL, HCI can naturally be extended to offline RL, model-based RL, etc., and one can even consider applying HCI to imitation learning;

2. HCI can be regarded as a special hard-attention method: sequence points with causal effects receive an attention weight of 1, and all others receive a weight of 0. From this perspective, some sequence prediction problems might also be handled with HCI.


