Because GPT-4 access is limited and its API charges are significant, it may not be feasible for individual developers to rerun these experiments. All runs from the paper, along with additional results, are logged in `./programming_runs/root` for programming, `./alfworld_runs/root` for decision-making, and `./hotpotqa_runs/root` for reasoning. Programming runs can be checked with the [Python](https://github.com/noahshinn024/reflexion/blob/main/programming/validate_py_results.py) and [Rust](https://github.com/noahshinn024/reflexion/blob/main/programming/validate_rs_results.py) validation scripts, which run the logged solutions against the unit tests provided by their respective benchmarks.
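
For a quick look at the logged runs without calling any API, a minimal sketch along the following lines may help. It assumes, without confirmation from this README, that logs are stored as `.jsonl` files with one JSON record per line and that a boolean `is_solved` field marks a solved problem. `summarize_runs` is a hypothetical helper, not part of the repository, and it does not replace the validation scripts because it does not run any unit tests.

```python
import json
from pathlib import Path


def summarize_runs(root):
    """Return {relative log path: (solved, total)} for each .jsonl file under root."""
    root = Path(root)
    summary = {}
    for log_path in sorted(root.rglob("*.jsonl")):
        solved = 0
        total = 0
        with log_path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                total += 1
                # ASSUMPTION: "is_solved" is the success flag in each record.
                if record.get("is_solved"):
                    solved += 1
        summary[log_path.relative_to(root).as_posix()] = (solved, total)
    return summary
```

Calling `summarize_runs("./programming_runs/root")` would then give one `(solved, total)` pair per log file. If the logs use a different layout or field name, adjust the glob pattern and the key accordingly.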