
Table of Contents

Title Page

Abstract

Contents

Ⅰ. INTRODUCTION 9

Ⅱ. BACKGROUND 11

2.1. Curriculum Learning 11

2.2. Approaches for Solving Unsupervised Environment Design 12

2.3. Overcooked! Game 12

2.4. Proximal Policy Optimization 13

Ⅲ. METHOD 14

3.1. Modified Prioritized Level Replay 14

3.2. Level Scoring Metric for Learning Potential 17

3.2.1. Generalized Advantage Estimate 17

3.2.2. Minimax Return 17

3.2.3. Threshold 18

3.3. Dissimilarity between Levels 18

Ⅳ. EXPERIMENT 20

4.1. Experiment Settings 20

Ⅴ. RESULT 22

5.1. Zero-shot Performance Result 22

5.2. Difficulty of Regret Approximation 24

5.3. Ablation Studies on the Dissimilarity Metric 25

Ⅵ. CONCLUSION 28

REFERENCES 29

APPENDIX 33

A. Additional information about map generation 33

B. Generated map from different generator 34

C. Experiment Detail 35

D. Self-play PPO 36

E. The network structure of PPO agent 37

List of Tables

Table 1. Hyperparameters of map generator 21

List of Figures

Figure 1. Game levels in the online Overcooked game 12

Figure 2. Overview of Modified Prioritized Level Replay. The modified part is highlighted in red 14

Figure 3. Example of Hamming distance calculation 19

Figure 4. Samples of the training map with 7×5 size 20

Figure 5. Samples of the training map with 5×5 size 20

Figure 6. Block types of Overcooked map 20

Figure 7. Train and test performance with proposed level score metric 22

Figure 8. Index count of level according to the sampling method 23

Figure 9. Train and test performance with proposed level score metric 23

Figure 10. Mean episodic test rewards on each size of Overcooked map 25

Figure 11. Mean episodic train rewards on big map 25

Figure 12. Index count of level according to the dissimilarity metric 26

Figure 13. Index count of level according to the dissimilarity metric 26

Abstract

Reinforcement learning currently faces the challenge that agents tend to overfit to their training sets and generalize poorly even to small changes in their environment. To address this issue, recent studies have explored regret-based curriculum learning approaches to enhance the robustness of agents. These methods aim to accelerate learning by gradually providing agents with more challenging environments, without requiring prior domain knowledge. However, applying regret-based curriculum learning in a cooperative multi-agent setting presents difficulties. Unlike previous curriculum learning setups, which assume single-agent or competitive settings, each agent shares the same group reward and must account for the sub-optimal policy of the other agent. This makes it difficult to accurately estimate an agent's regret, which approximates the learning potential of an environment. In this paper, we present a sampling method suited to cooperative environments by augmenting previous sampling techniques with an environment-diversity metric based on the Hamming distance. In experiments on the Overcooked environment, the sampling method based on minimizing the agents' return demonstrates better zero-shot performance than random sampling. Furthermore, the proposed metric for measuring the dissimilarity between environments effectively mitigates the overfitting caused by repeatedly replaying a specific map.
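The dissimilarity idea described in the abstract (and outlined in Section 3.3 and Figure 3) can be illustrated with a short Python sketch. This is not the thesis's code: the grid encoding, the function names, and the re-weighting scheme below are assumptions made purely for illustration of a Hamming-distance-based level dissimilarity.

```python
import numpy as np

def hamming_dissimilarity(level_a: np.ndarray, level_b: np.ndarray) -> float:
    """Normalized Hamming distance between two equally sized grid maps."""
    assert level_a.shape == level_b.shape, "levels must share the same grid size"
    return float(np.mean(level_a != level_b))

def diversity_weights(candidate: np.ndarray, buffer: list) -> np.ndarray:
    """Hypothetical re-weighting: buffered maps that differ more from the
    candidate keep a higher chance of being replayed."""
    distances = np.array([hamming_dissimilarity(candidate, lvl) for lvl in buffer])
    total = distances.sum()
    return distances / total if total > 0 else np.full(len(buffer), 1.0 / len(buffer))

if __name__ == "__main__":
    # 'X' = counter/wall, ' ' = floor, 'P' = pot, 'O' = onion pile (assumed encoding)
    a = np.array([list("XXXXX"), list("XP OX"), list("XXXXX")])
    b = np.array([list("XXXXX"), list("XO PX"), list("XXXXX")])
    print(hamming_dissimilarity(a, b))  # 2 of 15 cells differ -> ~0.133
```

In this sketch, a candidate level that closely resembles maps already in the replay buffer receives a low weight, which captures the intuition stated in the abstract: using dissimilarity keeps the sampler from repeatedly replaying a specific map.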