# Simulated Environment: Gymnasium

For many applications of LLM agents, the environment is real (the internet, a database, a REPL, etc.). However, we can also define agents that interact with simulated environments such as text-based games. This example shows how to build a simple agent-environment interaction loop with [Gymnasium](https://github.com/Farama-Foundation/Gymnasium) (formerly [OpenAI Gym](https://github.com/openai/gym)).
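Before bringing in an LLM, it can help to see the bare shape of the loop. The sketch below uses a hypothetical `CountdownEnv` (not part of Gymnasium) that only mimics the Gymnasium convention: `reset()` returns `(observation, info)` and `step(action)` returns `(observation, reward, terminated, truncated, info)`. The fixed `policy` function is a stand-in for the agent, and `run_episode` is likewise an illustrative helper.

```python
class CountdownEnv:
    """Hypothetical toy environment that follows the Gymnasium reset/step convention."""

    def __init__(self, start=3):
        self.start = start
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state, {}  # (observation, info)

    def step(self, action):
        # Action 1 decrements the counter; any other action leaves it unchanged.
        if action == 1:
            self.state -= 1
        terminated = self.state == 0
        reward = 1.0 if terminated else 0.0
        # (observation, reward, terminated, truncated, info)
        return self.state, reward, terminated, False, {}


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the sum of rewards."""
    observation, info = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total += reward
        if terminated or truncated:
            break
    return total


episode_return = run_episode(CountdownEnv(start=3), policy=lambda obs: 1)
print(episode_return)
```

The rest of this example keeps the same loop but swaps the toy environment for a real Gymnasium one and the fixed policy for a chat model.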

In [50]:
import gymnasium as gym
import tenacity

from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage,
    BaseMessage,
)
from langchain.output_parsers import RegexParser

## Define the agent

In [51]:
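def get_docs(env):
    # Strip wrappers until we reach the base environment, whose docstring
    # describes the observation space, action space, and rewards.
    while hasattr(env, 'env'):
        env = env.env
    return env.__doc__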

class Agent():
    def __init__(self, model, env):
        self.model = model
        self.docs = get_docs(env)
        
        self.instructions = """
Your goal is to maximize your return, i.e. the sum of the rewards you receive.
I will give you an observation, reward, termination flag, truncation flag, and the return so far, formatted as:

Observation: <observation>
Reward: <reward>
Termination: <termination>
Truncation: <truncation>
Return: <sum_of_rewards>

You will respond with an action, formatted as:

Action: <action>

where you replace <action> with your actual action.
Do nothing else but return the action.
"""
        self.action_parser = RegexParser(
            regex=r"Action: (.*)", 
            output_keys=['action'], 
            default_output_key='action')
        
        self.message_history = []
        self.ret = 0
        
    def reset(self, obs):
        # Start a new episode: clear the running return and the conversation.
        self.ret = 0
        self.message_history = [
            SystemMessage(content=self.docs),
            SystemMessage(content=self.instructions),
        ]
        obs_message = f"""
Observation: {obs}
Reward: 0
Termination: False
Truncation: False
Return: 0
        """
        self.message_history.append(HumanMessage(content=obs_message))
        return obs_message
        
    def observe(self, obs, rew, term, trunc, info):
        self.ret += rew
    
        obs_message = f"""
Observation: {obs}
Reward: {rew}
Termination: {term}
Truncation: {trunc}
Return: {self.ret}
        """
        self.message_history.append(HumanMessage(content=obs_message))
        return obs_message
        
    @tenacity.retry(stop=tenacity.stop_after_attempt(2),
                    wait=tenacity.wait_none(),  # No waiting time between retries
                    retry=tenacity.retry_if_exception_type(ValueError),
                    before_sleep=lambda retry_state: print(f"ValueError occurred: {retry_state.outcome.exception()}, retrying..."),
                    retry_error_callback=lambda retry_state: 0)  # Fall back to action 0 when all retries are exhausted
    def act(self):
        # Query the model with the full conversation and record its reply.
        act_message = self.model(self.message_history)
        self.message_history.append(act_message)
        # A reply that does not parse to an integer raises ValueError, which triggers a retry.
        action = int(self.action_parser.parse(act_message.content)['action'])
        return action
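
The parse-and-retry behavior of `act` can be illustrated without calling a model. The sketch below is an approximation that uses only the standard library: `parse_action` and `act_with_retry` are hypothetical helpers, `re` stands in for `RegexParser`, and a plain loop stands in for the `tenacity` decorator (two attempts, no waiting, fall back to action 0).

```python
import re

ACTION_PATTERN = re.compile(r"Action: (.*)")


def parse_action(reply):
    """Extract the integer action from a reply such as 'Action: 1'."""
    match = ACTION_PATTERN.search(reply)
    if match is None:
        raise ValueError(f"Could not find an action in: {reply!r}")
    return int(match.group(1))  # int() raises ValueError on non-numeric text


def act_with_retry(get_reply, attempts=2, default=0):
    """Ask for a reply up to `attempts` times; return `default` if none parses."""
    for _ in range(attempts):
        try:
            return parse_action(get_reply())
        except ValueError as err:
            print(f"ValueError occurred: {err}")
    return default


# The first reply ignores the requested format, the second one follows it.
replies = iter(["I think I should hit.", "Action: 1"])
action = act_with_retry(lambda: next(replies))
print(action)
```

Falling back to a default action keeps the episode running when the model ignores the format, at the cost of occasionally taking an action the model never chose.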

## Initialize the simulated environment and agent

In [52]:
env = gym.make("Blackjack-v1")
agent = Agent(model=ChatOpenAI(temperature=0.2), env=env)

## Main loop

In [53]:
observation, info = env.reset()
obs_message = agent.reset(observation)
print(obs_message)
while True:
    action = agent.act()
    observation, reward, termination, truncation, info = env.step(action)
    obs_message = agent.observe(observation, reward, termination, truncation, info)
    print(f'Action: {action}')
    print(obs_message)
    if termination or truncation:
        print('break', termination, truncation)
        break
env.close()


Observation: (10, 3, 0)
Reward: 0
Termination: False
Truncation: False
Return: 0
        
Action: 1

Observation: (18, 3, 0)
Reward: 0.0
Termination: False
Truncation: False
Return: 0.0
        
Action: 0

Observation: (18, 3, 0)
Reward: 1.0
Termination: True
Truncation: False
Return: 1.0
        
break True False
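
The observation tuples in the output above are easier to follow once decoded. In `Blackjack-v1`, an observation is `(player's current sum, dealer's showing card, usable-ace flag)`, and the actions are 0 (stick) and 1 (hit). The helper `describe_observation` below is a hypothetical convenience function, not part of Gymnasium or LangChain, and the transcript values are copied from the run shown above.

```python
ACTION_NAMES = {0: "stick", 1: "hit"}


def describe_observation(observation):
    """Turn a Blackjack-v1 observation tuple into a readable sentence fragment."""
    player_sum, dealer_card, usable_ace = observation
    ace_text = "with" if usable_ace else "without"
    return f"player sum {player_sum}, dealer shows {dealer_card}, {ace_text} a usable ace"


# Values copied from the run above: (action taken, resulting observation, reward).
initial_observation = (10, 3, 0)
steps = [(1, (18, 3, 0), 0.0), (0, (18, 3, 0), 1.0)]

print("start:", describe_observation(initial_observation))
episode_return = 0.0
for action, observation, reward in steps:
    episode_return += reward
    print(f"{ACTION_NAMES[action]}: {describe_observation(observation)}, "
          f"reward {reward}, return {episode_return}")
```

Read this way, the agent hit on 10, reached 18, then stuck and won the hand for a final return of 1.0.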
