EECS 118 - Determinism (or lack thereof)

EECS 118 • Study Notes • Determinism
Mahmoud Elfar • Fall 2025 • v0.2 (WiP)

v0.1: Initial version
v0.2: Fixed value iteration table issues


Table of Contents

  1. WHAT IS GOING ON?
  2. Lvl 1: DETERMINISTIC ENVIRONMENTS
  3. Lvl 2: NON-DETERMINISTIC ENVIRONMENTS
    3.1. Modeling - Nondeterministic TS (NDTS)
    3.2. Searching/Solving - AND-OR Search, Contingency Planning
    3.3. Solving Outcomes - Contingency Plans
  4. Lvl 3: STOCHASTIC ENVIRONMENTS
    4.0. Probability Theory
    4.1. Modeling - MDPs
    4.2. Searching/Solving MDPs

1. WHAT IS GOING ON?

Remember TS? We implicitly assumed that at state $s$, taking action $a$ leads to a single next state $s'$.
Well, life sucks and in the real world, there is a chance you end up in different states for one reason or another.
Determinism is your ability to predict exactly what state you’ll end up in.

Determinism is the degree to which an action $a$ taken in state $s$ determines the next state $s'$.

There are three levels of determinism:

  1. Deterministic: Taking action $a$ in state $s$ always leads to one and the same next state $s'$
  2. Non-deterministic: Taking action $a$ in state $s$ may lead to one of several possible next states
  3. Stochastic: Taking action $a$ in state $s$ leads to one of several possible next states according to some probability distribution (probability distribution just refers to the odds of ending up in each possible next state)

Why bother? Because the level of determinism affects:

- how we model the environment (the formalism),
- how we search/solve it (the algorithms), and
- what a solution even looks like (a plan, a contingency plan, or a policy).

As such, here is what you need to know about each level:

Levels of Determinism in a Nutshell

| Determinism Level | Definition | Formalism | Algorithms | When/Why to Use |
|---|---|---|---|---|
| Deterministic | $a$ in $s$ always leads to one $s'$ | Transition System (TS) | DFS, BFS, UCS, A* | Simple problems, fully predictable environments |
| Non-deterministic | $a$ in $s$ may lead to multiple $s'$ | Nondeterministic TS (NDTS) | AND-OR search, Contingency Planning | Environments with uncertainty but no probabilities |
| Stochastic | $a$ in $s$ leads to $s'$ with probability $P(s' \mid s, a)$ | Markov Decision Processes (MDPs) | Value Iteration, Policy Iteration, Q-learning | Real-world problems with uncertainty and probabilistic outcomes |

**Final Exam Checklist**

↑ Back to top


2. Lvl 1: DETERMINISTIC ENVIRONMENTS

Slogan: One State ~ One Action ~ One Outcome

By now, you should be very familiar with, and sick of, TSs and deterministic search algorithms (DFS, BFS, UCS, A*).
Just remember:

**Final Exam Checklist**

Kinda weird having the final exam checklist longer than the actual section, right?

↑ Back to top


3. Lvl 2: NON-DETERMINISTIC ENVIRONMENTS

Slogan: One State ~ One Action ~ Multiple Outcomes

In non-deterministic environments, taking action $a$ in state $s$ may lead to one of several possible next states. Implications, you say?

3.1. Modeling - NDTS

3.1.1. Intermission: What is $2^S$?

Q: What is $2^S$?
A: The power set of S, aka the set of all subsets of S.
Q: Wot?
A: Let’s say the robot moves between two rooms (two grid cells), rooms 1 and 2.
So the state space is $S = \lbrace s_1, s_2 \rbrace$.
So the powerset is $2^S = \lbrace \emptyset, \lbrace s_1 \rbrace, \lbrace s_2 \rbrace, \lbrace s_1, s_2 \rbrace \rbrace $.
Q: $\emptyset$-what?
A: The empty set, aka the set with no elements, also written as { }.
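
If it helps to see the power set computed, here is a tiny Python sketch (the state names are just the $s_1, s_2$ from the example above):

```python
from itertools import combinations

def power_set(states):
    """All subsets of `states`, from the empty set up to the full set."""
    return [set(c) for r in range(len(states) + 1)
                   for c in combinations(states, r)]

print(power_set(["s1", "s2"]))
# [set(), {'s1'}, {'s2'}, {'s1', 's2'}]  -- exactly 2^|S| = 4 subsets
```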

3.1.2. Nondeterministic Transition System (NDTS)

A Nondeterministic Transition System (NDTS) is a tuple $(S, A, T)$ where:

- $S$ is the set of states,
- $A$ is the set of actions,
- $T: S \times A \rightarrow 2^S$ is the nondeterministic transition function: $T(s, a)$ is the *set* of possible next states.

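As a quick sketch of what that looks like in code (the two-room robot again; the action names are made up for illustration), the key point is that $T$ returns a *set* of possible next states:

```python
# Hypothetical NDTS for the two-room robot: "move" may or may not succeed.
S = {"s1", "s2"}
A = {"move", "stay"}

# T: S x A -> 2^S  -- each entry is a SET of possible next states
T = {
    ("s1", "move"): {"s1", "s2"},   # the move might fail and leave us in s1
    ("s1", "stay"): {"s1"},
    ("s2", "move"): {"s1", "s2"},
    ("s2", "stay"): {"s2"},
}

print(T[("s1", "move")])   # {'s1', 's2'} -- more than one possible outcome
```
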
3.2. Searching/Solving - AND-OR Search, Contingency Planning *

We informally discussed AND-OR search and Contingency Planning in lecture.
What you need to know for the final exam:

3.3. Solving Outcomes - Contingency Plan *

The outcome of solving a non-deterministic environment is a contingency plan: instead of a fixed sequence of actions, the plan branches on the outcome you actually observe (if you end up in this state, do this; if you end up in that state, do that).

* Q: WAIT A MINUTE… We never covered this in lecture, did we?

A: Correct, we only covered the concepts themselves through examples and discussions. But we never mentioned AND-OR search or Contingency Planning or whatever that is.

Q: Why?

A: Because, in practice, you will never use AND-OR search. The algorithm itself is not like a novel thing that we could learn from or something. I don’t even know why this exists. Actually, I do know – it is only used in teaching as a transition (moving between topics, not our transition function) from deterministic search to stochastic search. But in practice, this is never used. So, moving on. Rant ends.

HOWEVER… We covered the important insights through examples and discussions.

Some of those important insights:

a) When searching/solving non-deterministic environments, you consider THE WORST-CASE outcome.
b) If any of the outcomes leads to failure/hazard state, then the whole action is considered to lead to failure/hazard.

Again, worst-case thinking.

c) The goal is usually to reach a goal state with absolute certainty.

Remember: we didn’t need to worry about possible outcomes when we were in deterministic environments using TS along with DFS. The good old days.
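
To make the worst-case idea above concrete, here is a small sketch (hypothetical helper, not an algorithm we formally covered): an action is acceptable only if every single one of its possible outcomes avoids the hazard states.

```python
def acceptable_actions(state, actions, T, hazards):
    """Worst-case filter: keep an action only if NONE of its possible
    outcomes is a hazard state (one bad outcome ruins the whole action)."""
    return [a for a in actions
            if all(s_next not in hazards for s_next in T[(state, a)])]

# A two-room NDTS where s2 is a hazard:
T = {("s1", "move"): {"s1", "s2"}, ("s1", "stay"): {"s1"}}
print(acceptable_actions("s1", ["move", "stay"], T, hazards={"s2"}))
# ['stay'] -- "move" is rejected because it MIGHT land in the hazard
```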

**Final Exam Checklist**

↑ Back to top


4. Lvl 3: STOCHASTIC ENVIRONMENTS

Slogan: One State ~ One Action ~ Multiple Probabilistic Outcomes

In stochastic environments, taking action $a$ in state $s$ leads to one of several possible next states, each with some probability.

4.0. Probability Theory

If you are familiar with basic probability theory, you can skip this section. Otherwise, brace for impact.

To understand probabilities, you need to get very, very comfortable with a few key concepts. You need to be able to read and understand the following notation:

Conditional Probability Notation

\[\Huge P(s' \mid s, a) = p_{ss'}^{a}\]

*The **probability** of transitioning to $s'$ **given that** we are in $s$ and take action $a$ is $p_{ss'}^{a}$.*
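
If the notation alone does not click, here is the same idea as a plain Python dictionary (the states, action, and numbers are made up for illustration): for each $(s, a)$ pair we store a distribution over next states, and those probabilities must sum to 1.

```python
import random

# P(s' | s, a) stored as: P[(s, a)][s'] = probability
P = {
    ("s1", "move"): {"s2": 0.8, "s1": 0.2},   # p_{s1 s2}^{move} = 0.8, etc.
}

dist = P[("s1", "move")]
assert abs(sum(dist.values()) - 1.0) < 1e-9   # a distribution must sum to 1

# Sampling one outcome according to the distribution:
next_state = random.choices(list(dist), weights=list(dist.values()))[0]
print(next_state)   # "s2" about 80% of the time, "s1" about 20%
```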

If you are still confused, that’s okay. Things will clear up a bit as we resume our discussion on stochastic environments in this very next section.

4.1. Modeling - MDPs

The formalism that models stochastic environments is called Markov Decision Processes (MDPs).

A Markov Decision Process (MDP) is a tuple:

\[\Huge \mathcal{M} = (S, A, T, R, \gamma)\]

where:

- $S$ is the set of states,
- $A$ is the set of actions,
- $T: S \times A \times S \rightarrow [0,1]$ is the stochastic transition function, with $T(s, a, s') = P(s' \mid s, a)$,
- $R$ is the reward function (it comes in a few flavors; see 4.1.2),
- $\gamma \in [0, 1]$ is the discount factor (see 4.1.3).

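If it helps, here is the same tuple bundled into a tiny Python structure (a sketch for intuition only, nothing official; the field types mirror the definitions above):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

State = str
Action = str

@dataclass
class MDP:
    S: Set[State]                                       # set of states
    A: Set[Action]                                      # set of actions
    T: Dict[Tuple[State, Action], Dict[State, float]]   # T[(s, a)][s'] = P(s' | s, a)
    R: Callable[[State, Action, State], float]          # R(s, a, s')
    gamma: float                                        # discount factor
```
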
I can only imagine how confusing this must be right now. That’s why I started the lectures with the card game examples so you can see the transitions, probabilities, and rewards in action first, before throwing formal definitions at ya.

In any case, let’s look at the two key changes here compared to TS:

4.1.1. Stochastic Transition Function

It’s just a way of expressing that there can be multiple possible next states (outcomes), and that we can quantify the probability of each (i.e., how likely each outcome is to happen).

Warning / Attention / Achtung / Ojo / Attenzione / Atenção / تنبيه / 注意 / 注意 / 주의 / सावधान / Внимание / Wee-Woo

All three of these functions are called “transition functions”, but they are different functions!

| Signature | Notation | Output | Determinism Level | Appears in |
|---|---|---|---|---|
| $T: S \times A \rightarrow S$ | $T(s,a) = s'$ | $s' \in S$ | Deterministic | TS |
| $T: S \times A \rightarrow 2^S$ | $T(s,a) = \lbrace s'_1, s'_2, \dots \rbrace$ | $S' \subseteq S$ | Nondeterministic | NDTS |
| $T: S \times A \times S \rightarrow [0,1]$ | $T(s,a,s') = p_{ss'}^a$ | $p \in [0,1]$ | Stochastic | MDPs, POMDPs |
Proof they are different:

All three functions have different signatures. Hence, they are different functions, which completes the proof.

4.1.2. Reward Function

It is just a way to express that some states or transitions are better / worse than others.

\[\Huge R : S \times A \times S \to \mathbb{R}\]

Warning / Attention / Achtung / Ojo / Attenzione / Atenção / تنبيه / 注意 / 注意 / 주의 / सावधान / Внимание / Wee-Woo

You will see three types of reward functions in the literature:

| Signature | Notation | Output | Significance (When to Use) | Formalism / Where They Appear |
|---|---|---|---|---|
| $R: S \rightarrow \mathbb{R}$ | $R(s)$ | $r_s \in \mathbb{R}$ | Rewards being in a state, independent of how you got there or what you do next | MDPs, goal-based planning, grid worlds with terminal rewards |
| $R: S \times A \rightarrow \mathbb{R}$ | $R(s,a)$ | $r_s^a \in \mathbb{R}$ | Rewards state-action pairs, capturing the cost/benefit of taking specific actions in specific states | MDPs, Q-learning, action costs (e.g., moving vs. staying), resource consumption |
| $R: S \times A \times S \rightarrow \mathbb{R}$ | $R(s,a,s')$ | $r_{ss'}^a \in \mathbb{R}$ | Rewards transitions; outcome-dependent rewards | MDPs, transition-based penalties, full generality |

Note: In the final exam, you will be explicitly told which of the three reward function types to use in a given question.
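
As a quick illustration (a hypothetical grid world with a single goal cell; the step cost in the second form is made up), here is the same "the goal is good" idea written in each of the three forms. The difference is only in what information the reward is allowed to depend on:

```python
GOAL = (2, 2)   # hypothetical goal cell

def R_state(s):
    """R(s): rewards BEING in a state, no matter how you got there."""
    return 10.0 if s == GOAL else 0.0

def R_state_action(s, a):
    """R(s, a): rewards state-action pairs, e.g., a small cost for moving."""
    return -1.0 if a in {"N", "E", "S", "W"} else 0.0

def R_transition(s, a, s_next):
    """R(s, a, s'): rewards the transition itself -- here, ENTERING the goal."""
    return 10.0 if s != GOAL and s_next == GOAL else 0.0
```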

But, why do we need rewards in the first place??? Because without them there is nothing to optimize: rewards are what make some outcomes (and therefore some behaviors) better than others, and as we will see in Section 4.2, solving an MDP means finding the behavior that collects as much reward as possible (in expectation).

4.1.3. Discount Factor

Ehh, don’t worry about it for now.

4.1.4. Example MDP: Four-Card Game

In this game:

Carefully study the following state transition diagram of an MDP model for this four-card game (only part of the diagram is shown).

[Figure: State Transition Diagram]

Some of the things to pay attention to:

[Figure: State Transition Closeup]

On a side note. You will see this state transition diagram referred to as “state transition graph” or “state transition model” or “state transition diagram” or “MDP graph” or “MDP model” or “MDP diagram” or whatever. They all refer to the same thing. The “state transition diagram” is the most accurate. The “MDP model”, or the “MDP” for short, is the most commonly used. So from now on, we will refer to it as the “MDP”.

↑ Back to top

4.2. Searching/Solving MDPs

Now that we know how to model stochastic environments (hint: we use MDPs), the next question is: how do we search/solve them?

Actually, before that, what does it even mean to “solve” an MDP?

Recall that back in the old days of TS and DFS, solving meant finding a plan (a sequence of actions). As we showed in lecture, that doesn’t work here because you don’t know which state you will end up in along the way. Taking actions blindly without knowing the actual outcomes is like trying to find a parking spot in APS with a blindfold on.

The solution is to adapt your actions based on the actual outcome, i.e., based on what state you actually end up in after each action. We call such a solution a policy.

4.2.1. Policy

\[\Huge \pi: S \to A\]
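
A policy maps every state to the action to take there. In code, it is literally a lookup table (hypothetical states and actions below): instead of a fixed action sequence, you check where you actually are, then act.

```python
# A hypothetical policy for a three-state problem: pi[s] = action to take in s
pi = {
    "s1": "N",
    "s2": "N",
    "s3": "E",
}

def act(current_state):
    """Following a policy: look up the action for the state you are ACTUALLY in."""
    return pi[current_state]

print(act("s3"))   # 'E' -- whatever state the world puts you in, pi has an answer
```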

4.2.2. Solving MDPs - DIWhy Method

I like to start with examples. Let’s start with the simplest example possible. Be careful though: as we go, we will add more complexity to the same example, so make sure you pay attention to what applies to which version of the example. In general:

Check out the visualization of the value iteration process here: Grid World Value Iteration Visualization

[a] Grid World - 2x2 - Precise Actions
   ┌─────┬─────┐
2  │     │   G │
   ├─────┼─────┤
1  │  R  │     │
   └─────┴─────┘
     1      2

Goal: Reach the goal cell G at $\langle 2, 2 \rangle$. Entering the goal earns a reward of +10; everything else earns 0.

Method: We’ll use Value Iteration: repeatedly apply the Bellman equation until values converge. We don’t know what any of that means yet, and that’s okay. Just remember this: a state’s value is the best total reward you can expect to collect from that state, and each iteration improves that estimate by looking one action ahead.

Iteration 0: Create an empty state-action-value table:

| State | $N$ | $E$ | $\tau$ | $V(s)$ | Optimal Action |
|---|---|---|---|---|---|
| $s_1 = \langle 1, 1 \rangle$ | 0 | 0 | 0 | 0 | ? |
| $s_2 = \langle 2, 1 \rangle$ | 0 | 0 | 0 | 0 | ? |
| $s_3 = \langle 1, 2 \rangle$ | 0 | 0 | 0 | 0 | ? |
| $s_4 = \langle 2, 2 \rangle$ | 0 | 0 | 0 | 0 | ? |

Iteration 1: Your total value = what you already have + what you expect to get from taking action $a$ in state $s$.

| State | $N$ | $E$ | $\tau$ | $V(s)$ | Optimal Action |
|---|---|---|---|---|---|
| $s_1 = \langle 1, 1 \rangle$ | 0 | 0 | 0 | 0 | ? |
| $s_2 = \langle 2, 1 \rangle$ | 0+10 | 0 | 0 | 10 | N |
| $s_3 = \langle 1, 2 \rangle$ | 0 | 0+10 | 0 | 10 | E |
| $s_4 = \langle 2, 2 \rangle$ | 0 | 0 | 0 | 0 | ? |
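
In symbols, what we just did for every cell is one common way of writing the Bellman update (using the $R(s,a,s')$ reward form and no discounting, since we are ignoring $\gamma$ for now; $V_k$ is the value column after iteration $k$). In this deterministic variant the sum has a single term:

\[ Q_{k+1}(s, a) = \sum_{s'} T(s, a, s')\,\big[ R(s, a, s') + V_k(s') \big], \qquad V_{k+1}(s) = \max_{a} Q_{k+1}(s, a) \]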

Iteration 2: Repeat the process

| State | $N$ | $E$ | $\tau$ | $V(s)$ | Optimal Action |
|---|---|---|---|---|---|
| $s_1 = \langle 1, 1 \rangle$ | 0+10 | 0+10 | 0 | 10 | E or N |
| $s_2 = \langle 2, 1 \rangle$ | 10 | 0 | 0 | 10 | N |
| $s_3 = \langle 1, 2 \rangle$ | 0 | 10 | 0 | 10 | E |
| $s_4 = \langle 2, 2 \rangle$ | 0 | 0 | 0 | 0 | ? |

Iteration 3: Repeat the process.

| State | $N$ | $E$ | $\tau$ | $V(s)$ | Optimal Action |
|---|---|---|---|---|---|
| $s_1 = \langle 1, 1 \rangle$ | 10 | 10 | 0 | 10 | N or E |
| $s_2 = \langle 2, 1 \rangle$ | 10 | 0 | 0 | 10 | N |
| $s_3 = \langle 1, 2 \rangle$ | 0 | 10 | 0 | 10 | E |
| $s_4 = \langle 2, 2 \rangle$ | 0 | 0 | 0 | 0 | ? |

Final Policy: $\pi^*$ is defined as follows:

| State | Optimal Action |
|---|---|
| $s_1 = \langle 1, 1 \rangle$ | N or E |
| $s_2 = \langle 2, 1 \rangle$ | N |
| $s_3 = \langle 1, 2 \rangle$ | E |
| $s_4 = \langle 2, 2 \rangle$ | ? |

[b] Grid World - 2x2 - Slippery Actions

What changed: actions are now slippery. With probability 0.8 the robot moves in the intended direction, and with probability 0.1 each it slips sideways (perpendicular to the intended direction) instead.

Iteration 0:

| State | $E$ | $N$ | $\tau$ | $V(s)$ | Optimal Action |
|---|---|---|---|---|---|
| $s_1 = \langle 1, 1 \rangle$ | 0 | 0 | 0 | 0 | ? |
| $s_2 = \langle 2, 1 \rangle$ | 0 | 0 | 0 | 0 | ? |
| $s_3 = \langle 1, 2 \rangle$ | 0 | 0 | 0 | 0 | ? |
| $s_4 = \langle 2, 2 \rangle$ | 0 | 0 | 0 | 0 | ? |

Iteration 1:

| State | $E$ | $N$ | $\tau$ | $V(s)$ | Optimal Action |
|---|---|---|---|---|---|
| $s_1$ | 0 | 0 | 0 | 0 | ? |
| $s_2$ | 0.8(0) + 0.1(10) + 0.1(0) = 1 | 0.8(10) + 0.1(0) + 0.1(0) = 8 | 0 | 8 | N |
| $s_3$ | 0.8(10) + 0.1(0) + 0.1(0) = 8 | 0.8(0) + 0.1(10) + 0.1(0) = 1 | 0 | 8 | E |
| $s_4$ | 0 | 0 | 0 | 0 | ? |

Iteration 2:

| State | $E$ | $N$ | $\tau$ | $V(s)$ | Optimal Action |
|---|---|---|---|---|---|
| $s_1$ | 0.8(8) + 0.1(8) + 0.1(0) = 7.2 | 0.8(8) + 0.1(8) + 0.1(0) = 7.2 | 0 | 7.2 | E or N |
| $s_2$ | 1 | 0.8(10) + 0.1(0) + 0.1(7.2) = 8.72 | 0 | 8.72 | N |
| $s_3$ | 0.8(10) + 0.1(0) + 0.1(7.2) = 8.72 | 1 | 0 | 8.72 | E |
| $s_4$ | 0 | 0 | 0 | 0 | ? |

Iteration 3:

| State | $E$ | $N$ | $\tau$ | $V(s)$ | Optimal Action |
|---|---|---|---|---|---|
| $s_1$ | 0.8(8.72) + 0.1(8.72) + 0.1(0) = 7.85 | 0.8(8.72) + 0.1(8.72) + 0.1(0) = 7.85 | 0 | 7.85 | E or N |
| $s_2$ | 1 | 0.8(10) + 0.1(0) + 0.1(7.85) = 8.79 | 0 | 8.79 | N |
| $s_3$ | 0.8(10) + 0.1(0) + 0.1(7.85) = 8.79 | 1 | 0 | 8.79 | E |
| $s_4$ | 0 | 0 | 0 | 0 | ? |

Values converge after several more iterations. Actually, you can write the table using a formula:

| State | $E$ | $N$ | $\tau$ | $V(s)$ | Optimal Action |
|---|---|---|---|---|---|
| $s_1$ | $0.8V(s_2) + 0.1V(s_3) + 0.1(0)$ | $0.8V(s_3) + 0.1V(s_2) + 0.1(0)$ | 0 | $V(s_1)$ | E or N |
| $s_2$ | $0.8(0) + 0.1(10) + 0.1V(s_1)$ | $0.8(10) + 0.1(0) + 0.1V(s_1)$ | 0 | $V(s_2)$ | N |
| $s_3$ | $0.8(10) + 0.1(0) + 0.1V(s_1)$ | $0.8(0) + 0.1(10) + 0.1V(s_1)$ | 0 | $V(s_3)$ | E |
| $s_4$ | 0 | 0 | 0 | 0 | ? |

Final Policy:
Same as variant [a], but values are lower due to uncertainty.
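
If you want to double-check these numbers (or play with different slip probabilities), here is a minimal, unofficial value iteration sketch in Python. It assumes the same setup as the tables: moves succeed with probability 0.8 and slip to each perpendicular direction with probability 0.1; slipping or driving into a wall is treated as a dead end worth 0 (that is where the 0.1(0) terms come from); entering the goal pays +10; no discounting; and only the N and E actions are considered. The sketch updates all states simultaneously each sweep while the hand tables update them in place, so intermediate numbers differ slightly, but the converged values and the final policy come out the same.

```python
# Value iteration for the slippery 2x2 grid world (variant [b]) -- a sketch.
STATES = [(1, 1), (2, 1), (1, 2), (2, 2)]
GOAL = (2, 2)
CRASH = "crash"                        # absorbing zero-value state for wall bumps
ACTIONS = {"N": (0, 1), "E": (1, 0)}   # direction vectors (dx, dy)

def move(s, d):
    """One step in direction d; stepping off the grid lands in CRASH."""
    nxt = (s[0] + d[0], s[1] + d[1])
    return nxt if nxt in STATES else CRASH

def outcomes(s, a):
    """Slippery dynamics: 0.8 intended move, 0.1 each perpendicular slip."""
    dx, dy = ACTIONS[a]
    return [(0.8, move(s, (dx, dy))),
            (0.1, move(s, (-dy, dx))),    # slip to the left of the intended move
            (0.1, move(s, (dy, -dx)))]    # slip to the right

def reward(s, a, s_next):
    """R(s, a, s'): +10 for entering the goal, 0 otherwise."""
    return 10.0 if s_next == GOAL else 0.0

def value_iteration(gamma=1.0, eps=1e-9):
    V = {s: 0.0 for s in STATES}
    V[CRASH] = 0.0
    while True:
        newV = dict(V)
        for s in STATES:
            if s == GOAL:          # terminal state: value stays 0
                continue
            newV[s] = max(sum(p * (reward(s, a, s2) + gamma * V[s2])
                              for p, s2 in outcomes(s, a))
                          for a in ACTIONS)
        if max(abs(newV[s] - V[s]) for s in V) < eps:
            return newV
        V = newV

print(value_iteration())
# roughly {(1, 1): 7.91, (2, 1): 8.79, (1, 2): 8.79, (2, 2): 0.0, 'crash': 0.0}
```

Those converged values are exactly the fixed point of the formula table above: with $V(s_2) = V(s_3)$ by symmetry, solving $V(s_2) = 8 + 0.1\,V(s_1)$ and $V(s_1) = 0.9\,V(s_2)$ gives $V(s_2) \approx 8.79$ and $V(s_1) \approx 7.91$.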

↑ Back to top

To be continued ...