15.5 Deep Reinforcement Learning (DQN, A3C, etc.)


Reinforcement learning (RL) is a method in which an agent interacts with an environment and learns the actions that maximize its reward. Reinforcement learning is used in many fields, and recently it has shown strong performance when combined with deep learning. DQN (Deep Q-Network) and A3C (Asynchronous Advantage Actor-Critic) are two of the best-known deep-learning-based reinforcement learning algorithms.
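
To make the agent-environment-reward loop concrete, here is a minimal sketch of a single episode in OpenAI Gym, with a random policy standing in for a learned agent. It assumes the classic Gym API (gym < 0.26), where reset() returns just the observation and step() returns four values, the same convention used by the DQN example below.

import gym

env = gym.make('CartPole-v0')

state = env.reset()              # initial observation of the environment
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()          # random action in place of a learned policy
    state, reward, done, _ = env.step(action)   # environment returns next state and reward
    total_reward += reward

print(f"Episode finished with total reward {total_reward}")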

1. DQN (Deep Q-Network): DQN combines the Q-Learning algorithm with deep learning. Q-Learning uses a Q-function that estimates the value of each state-action pair. DQN approximates the Q-function with a neural network and introduces experience replay and a target network to make training more stable. (The example below includes experience replay but omits the target network for brevity; a sketch of adding one follows the example's explanation.)

2. A3C (Asynchronous Advantage Actor-Critic): A3C trains several copies of the agent in parallel so that they collect different, less correlated experience, which makes learning more stable and faster. A3C is an actor-critic method: the actor chooses actions, and the critic evaluates how good those actions are. A minimal sketch of an actor-critic network is shown right after this list.
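
As a rough illustration of the actor-critic structure behind A3C, the sketch below builds a small Keras model with a shared trunk and two heads: a policy head (the actor, a probability distribution over actions) and a value head (the critic, an estimate of the state's value). This is only a sketch of the network under the same TensorFlow/Keras setup as the DQN example; it does not implement the asynchronous workers or the full A3C update rule, and the function name build_actor_critic is illustrative.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_actor_critic(state_size, action_size):
    # Shared trunk: processes the state once for both heads
    inputs = layers.Input(shape=(state_size,))
    hidden = layers.Dense(24, activation='relu')(inputs)
    hidden = layers.Dense(24, activation='relu')(hidden)

    # Actor head: probability of each action (the policy)
    policy = layers.Dense(action_size, activation='softmax', name='policy')(hidden)
    # Critic head: scalar estimate of the state value
    value = layers.Dense(1, activation='linear', name='value')(hidden)

    return models.Model(inputs=inputs, outputs=[policy, value])

# CartPole has 4 state variables and 2 possible actions
model = build_actor_critic(state_size=4, action_size=2)
model.summary()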

Now let's look at example code that uses DQN to learn the 'CartPole-v0' environment provided by OpenAI Gym.

import numpy as np
import random
from collections import deque
import tensorflow as tf
from tensorflow.keras import layers, models
import gym  # assumes the classic Gym API (gym < 0.26): reset() returns obs, step() returns 4 values

class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)  # replay buffer
        self.gamma = 0.95                 # discount factor
        self.epsilon = 1.0                # exploration rate (starts fully random)
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self.build_model()

    def build_model(self):
        model = models.Sequential()
        model.add(layers.Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(layers.Dense(24, activation='relu'))
        model.add(layers.Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            # TD target: reward plus the discounted max Q-value of the next state
            target = reward
            if not done:
                target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
            target_f = self.model.predict(state)
            target_f[0][action] = target  # update only the chosen action's Q-value
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

# Create the environment
env = gym.make('CartPole-v0')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

# Create the agent
agent = DQNAgent(state_size, action_size)
EPISODES = 1000
batch_size = 32

for e in range(EPISODES):
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    for time in range(500):
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print(f"Episode: {e+1}/{EPISODES}, Score: {time}, Epsilon: {agent.epsilon:.2}")
            break

        if len(agent.memory) > batch_size:
            agent.replay(batch_size)

 

This code performs the following steps:

1. Create the CartPole environment and initialize the agent.
2. The agent selects an action based on the current state and executes it in the environment.
3. The resulting reward and the next state are stored in the agent's memory.
4. Experience replay samples a random minibatch from memory and trains the neural network that approximates the Q-function.
5. The agent's exploration rate (epsilon) is gradually reduced.
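
As noted earlier, this example uses experience replay but omits the target network for brevity. The sketch below shows one way it could be added: a second network with the same architecture supplies the TD target, and its weights are copied from the online network only periodically. The class and method names (DQNAgentWithTarget, update_target_model) and the update interval are illustrative, not part of the example above.

# Sketch: extending the DQNAgent above with a target network
class DQNAgentWithTarget(DQNAgent):
    def __init__(self, state_size, action_size):
        super().__init__(state_size, action_size)
        self.target_model = self.build_model()  # second network with the same architecture
        self.update_target_model()              # start with identical weights

    def update_target_model(self):
        # Copy the online network's weights into the target network
        self.target_model.set_weights(self.model.get_weights())

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                # The TD target now comes from the slowly updated target network
                target = reward + self.gamma * np.amax(self.target_model.predict(next_state)[0])
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

# During training, refresh the target network every few episodes, e.g.:
# if e % 10 == 0:
#     agent.update_target_model()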

Besides DQN and A3C, there are many other deep-learning-based reinforcement learning algorithms, such as PPO (Proximal Policy Optimization) and SAC (Soft Actor-Critic). These algorithms can be used to build reinforcement learning models for a wide range of problems.
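
For example, if a library such as stable-baselines3 is installed, PPO can be applied to the same CartPole task in just a few lines. This is only a sketch assuming stable-baselines3 and a compatible Gym/Gymnasium version; it is not part of the example above.

from stable_baselines3 import PPO

# Train a PPO agent on CartPole with a standard MLP policy
model = PPO("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=10_000)

# To use the trained policy on an observation obs from the environment:
# action, _ = model.predict(obs, deterministic=True)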

 