Pathwise uniqueness:
Suppose the reward is a deterministic function of state and action. Define
This implies
Note that:
Similarly,
In this section, we will derive the dynamic programming principle, also known as the Bellman Equation.
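As a reference point, the standard discounted-MDP form of the Bellman equation (with state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $P$, and discount factor $\gamma$; this notation is an assumption made here) is
\[
V^*(s) \;=\; \max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a)\, V^*(s') \Big\}, \qquad s \in \mathcal{S}.
\]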
Admissible controls:
Proof. For (
Taking the supremum of the left-hand side, we obtain the desired inequality.
On the other hand, for all
Now the result follows from the arbitrariness of
Now we can define
If
Policy iteration for solving Bellman equations:
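As an illustration of the scheme, here is a minimal tabular sketch in Python (the function policy_iteration and the arrays P, r, together with their shapes, are illustrative assumptions, not taken from the notes); the policy-evaluation step uses the matrix form discussed next.

import numpy as np

def policy_iteration(P, r, gamma, max_iter=100):
    """Tabular policy iteration.
    P: transition probabilities, shape (n_actions, n_states, n_states), P[a, s, s'] = P(s' | s, a)
    r: deterministic rewards, shape (n_states, n_actions)
    """
    n_states, n_actions = r.shape
    policy = np.zeros(n_states, dtype=int)            # start from an arbitrary policy
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi (the matrix form below)
        P_pi = P[policy, np.arange(n_states), :]       # row s is P(. | s, policy[s])
        r_pi = r[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V
        Q = r + gamma * np.einsum('ast,t->sa', P, V)   # Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):         # stop once the policy is stable
            return policy, V
        policy = new_policy
    return policy, V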
Matrix Form:
Define
This implies
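In the usual notation for a fixed policy $\pi$ (assumed here: $P^{\pi}$ the induced transition matrix and $r^{\pi}$ the induced reward vector), the evaluation equation and its solution in matrix form read
\[
V^{\pi} = r^{\pi} + \gamma P^{\pi} V^{\pi}
\quad\Longrightarrow\quad
(I - \gamma P^{\pi})\, V^{\pi} = r^{\pi}
\quad\Longrightarrow\quad
V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}.
\]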
Remark:
Suppose otherwise; then the null space of
Suppose
contradicts
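A standard way to spell out this contradiction, assuming $P^{\pi}$ is row-stochastic and $0 \le \gamma < 1$: if $(I - \gamma P^{\pi})v = 0$ for some $v \neq 0$, then
\[
v = \gamma P^{\pi} v
\quad\Longrightarrow\quad
\|v\|_{\infty} = \gamma \|P^{\pi} v\|_{\infty} \le \gamma \|v\|_{\infty} < \|v\|_{\infty},
\]
which is impossible. Hence the null space of $I - \gamma P^{\pi}$ is trivial and the matrix is invertible.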
Iteration algorithms for solving linear equations:
import numpy as np

def fixed_point_iteration(A, b, x0, tol=1e-8, max_iter=1000):
    """Solve x = A x + b by the fixed-point iteration x_{k+1} = A x_k + b."""
    x = x0.copy()
    for k in range(max_iter):
        x_new = A.dot(x) + b
        # Stop once successive iterates are closer than the tolerance.
        if np.linalg.norm(x_new - x, ord=2) < tol:
            print(f"Converged after {k + 1} iterations")
            return x_new
        x = x_new
    return x

if __name__ == "__main__":
    A = np.array([[0.5, 0.1],
                  [0.2, 0.3]])
    b = np.array([1.0, 2.0])
    x0 = np.zeros(2)
    solution = fixed_point_iteration(A, b, x0)
    print("Solution obtained by iteration:", solution)
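A note on when this iteration converges: the map $x \mapsto Ax + b$ is a contraction whenever some operator norm of $A$ is below $1$, in which case the iterates converge to the unique solution of $x = Ax + b$, i.e. of $(I - A)x = b$. In the example above the row sums of $A$ are $0.6$ and $0.5$, so $\|A\|_{\infty} = 0.6 < 1$ and the iteration converges; in the policy-evaluation setting the role of $A$ is played by $\gamma P^{\pi}$ with $0 \le \gamma < 1$.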