Implementing an Image Captioning Model

PyExplorer 2025. 5. 5. 21:31

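1. What Is Image Captioning?

Image captioning is the task of generating a natural-language sentence that describes the content of a given image. It sits at the intersection of computer vision and natural language processing, and it is typically implemented with a CNN (Convolutional Neural Network) combined with an RNN (Recurrent Neural Network), or with Transformer-based models.

Image captioning models are used in applications such as the following.

Automatic image description systems for visually impaired users
Image search and tagging
Automatic video caption generation
Visual understanding in robots and autonomous vehicles

This post explains how to implement a CNN+LSTM image captioning model with TensorFlow and Keras.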

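2. Model Architecture Overview

An image captioning model consists of two main parts.

CNN (image feature extractor): a pretrained CNN (for example VGG16, ResNet, or Inception) extracts a feature vector from the image.
RNN (sentence generator): an LSTM or GRU generates a sequence of words conditioned on the image feature vector.

Model Workflow

The CNN extracts a feature vector from the input image.
The feature vector is passed to the LSTM-based caption generator.
The generator predicts one word at a time until the sentence is complete.

The diagram below shows the overall structure of the image captioning model.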

[Input Image] → [CNN Feature Extractor] → [LSTM-based Caption Generator] → [Output Caption]
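
The data flow above can be traced with plain NumPy arrays. The sketch below uses random weights in place of a trained CNN and LSTM, so it only illustrates the array shapes at each stage. The sizes (4096, 256, and a vocabulary of 5000) and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 5000   # assumed vocabulary size, for illustration only
HIDDEN = 256        # assumed size of the shared hidden space

# Stage 1: the CNN turns an image into a fixed-length feature vector.
# A random vector stands in for the CNN output here.
image_feature = rng.normal(size=(4096,))

# Stage 2: a dense projection maps the image feature into the hidden space.
W_img = rng.normal(size=(4096, HIDDEN)) * 0.01
image_hidden = np.maximum(image_feature @ W_img, 0.0)   # ReLU

# Stage 3: the LSTM summarizes the words generated so far into one vector.
# A random vector stands in for the LSTM state here.
text_hidden = rng.normal(size=(HIDDEN,))

# Stage 4: the two vectors are merged and mapped to a probability
# distribution over the vocabulary; the next word is the argmax.
W_out = rng.normal(size=(HIDDEN, VOCAB_SIZE)) * 0.01
logits = (image_hidden + text_hidden) @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
next_word_id = int(np.argmax(probs))

print(image_feature.shape, image_hidden.shape, probs.shape)
```

In a real model the generator repeats stages 3 and 4, appending each predicted word to the input, until an end marker is produced.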

3. Preparing the Dataset

Widely used datasets for image captioning include Flickr8k, Flickr30k, and MS COCO. This post trains the model on the Flickr8k dataset.

Data Download and Preprocessing

import csv
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout
from tensorflow.keras.utils import to_categorical
from PIL import Image
import matplotlib.pyplot as plt
import pickle

# Dataset paths (assumes Flickr8k has already been downloaded and unpacked here)
image_dir = "./Flickr8k/Flicker8k_Dataset"
caption_file = "./Flickr8k/Flickr8k_text/Flickr8k.token.txt"

# Load the captions: each line is "<file name>#<caption index><TAB><caption>".
# QUOTE_NONE keeps quotation marks inside captions from being parsed as CSV quoting.
df = pd.read_csv(caption_file, delimiter='\t', names=['image', 'caption'],
                 quoting=csv.QUOTE_NONE)
# Drop the "#<caption index>" suffix so only the image file name remains
df['image'] = df['image'].apply(lambda x: x.split('#')[0])
print(df.head())

The code above loads the Flickr8k caption file and organizes it into image file names and captions.
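
Each image in Flickr8k comes with several captions, so it is convenient to group them by image file and normalize the text before tokenizing. The sketch below works on a few hand-written sample lines in the same "file name#index, tab, caption" layout. The helper names `clean_caption` and `group_captions` and the sample lines are made up for illustration and are not part of the dataset or of any library.

```python
import string
from collections import defaultdict

def clean_caption(text):
    """Lowercase, strip punctuation, drop non-alphabetic tokens,
    and wrap the caption in start/end markers."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    words = [w for w in words if w.isalpha()]
    return "startseq " + " ".join(words) + " endseq"

def group_captions(lines):
    """Map each image file name to the list of its cleaned captions."""
    grouped = defaultdict(list)
    for line in lines:
        image_id, caption = line.rstrip("\n").split("\t", 1)
        grouped[image_id.split("#")[0]].append(clean_caption(caption))
    return dict(grouped)

# Hypothetical sample lines, not real dataset entries
sample_lines = [
    "img_001.jpg#0\tA child in a pink dress is climbing stairs .",
    "img_001.jpg#1\tA girl going into a wooden building .",
    "img_002.jpg#0\tTwo dogs play on the grass .",
]
captions_by_image = group_captions(sample_lines)
print(captions_by_image["img_001.jpg"][0])
# startseq a child in a pink dress is climbing stairs endseq
```

The start and end markers matter later: caption generation begins from the start marker and stops when the model predicts the end marker.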

4. Extracting Image Features with a CNN

We use the VGG16 model to extract image features.

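# Load VGG16 and take the output of the 'fc2' layer as the image feature.
# This drops only the final 1000-class softmax layer, not all fully connected layers.
base_model = VGG16(weights='imagenet')
feature_extractor = Model(inputs=base_model.input, outputs=base_model.get_layer('fc2').output)

# Extract the feature vector of a single image
def extract_features(image_path):
    # Force 3-channel RGB so grayscale or RGBA files do not break the input shape
    image = Image.open(image_path).convert('RGB').resize((224, 224))
    image = np.array(image, dtype=np.float32)
    image = np.expand_dims(image, axis=0)
    image = preprocess_input(image)
    features = feature_extractor.predict(image, verbose=0)
    return features.flatten()

# Extract features from every image in the dataset
image_features = {}
for img_file in sorted(os.listdir(image_dir)):
    if not img_file.lower().endswith(('.jpg', '.jpeg', '.png')):
        continue  # skip anything that is not an image file
    image_path = os.path.join(image_dir, img_file)
    image_features[img_file] = extract_features(image_path)

# Save the feature vectors
with open("image_features.pkl", "wb") as f:
    pickle.dump(image_features, f)

Each image is now represented by a 4096-dimensional feature vector taken from the fc2 layer of VGG16.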
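
Feature extraction over the whole dataset is slow, so it usually pays to reuse the pickle file on later runs instead of recomputing it. The sketch below shows one way to do that. `load_or_compute_features` and its `compute_fn` argument are hypothetical names, and a dummy function stands in for the VGG16-based `extract_features` so the example runs without TensorFlow.

```python
import os
import pickle
import tempfile

def load_or_compute_features(cache_path, image_files, compute_fn):
    """Return {file name: feature}. Read the cache if it exists,
    otherwise compute every feature and write the cache."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    features = {name: compute_fn(name) for name in image_files}
    with open(cache_path, "wb") as f:
        pickle.dump(features, f)
    return features

# Stand-in for extract_features: records each call and returns a fake vector.
calls = []
def fake_extract(name):
    calls.append(name)
    return [float(len(name))] * 4

cache = os.path.join(tempfile.mkdtemp(), "image_features.pkl")
first = load_or_compute_features(cache, ["a.jpg", "bb.jpg"], fake_extract)
second = load_or_compute_features(cache, ["a.jpg", "bb.jpg"], fake_extract)
print(len(calls))  # 2: the second call is served from the cache
```

Only load pickle files you created yourself, since unpickling data from an untrusted source can execute arbitrary code.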

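5. Generating Captions with an LSTM

5.1 Building the Vocabulary

# Wrap each caption in start/end markers so the tokenizer learns 'startseq' and 'endseq',
# which caption generation relies on, then tokenize
captions = ['startseq ' + str(c).strip() + ' endseq' for c in df['caption']]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding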
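
The decoder needs a fixed sequence length, and a common choice is the token count of the longest training caption rather than a hard-coded number. The sketch below builds a word index in plain Python to show the idea. `build_word_index` is a hypothetical helper written for this example; it loosely mimics a frequency-ordered index that starts at 1 and is not the Keras `Tokenizer` itself.

```python
from collections import Counter

def build_word_index(captions):
    """Assign integer ids to words, most frequent first, starting at 1.
    Id 0 is left unused so it can serve as the padding value."""
    counts = Counter(word for c in captions for word in c.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Made-up captions, already wrapped in start/end markers
captions = [
    "startseq a dog runs endseq",
    "startseq a dog jumps over a log endseq",
    "startseq two dogs play endseq",
]
word_index = build_word_index(captions)
vocab_size = len(word_index) + 1                    # +1 for the padding id 0
max_length = max(len(c.split()) for c in captions)  # longest caption in tokens

print(vocab_size, max_length)  # 12 8
```

Deriving `max_length` from the data avoids silently truncating captions that are longer than a fixed guess.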

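5.2 Implementing the LSTM Model

embedding_dim = 256
max_length = 30  # maximum caption length in tokens

# Image branch: project the 4096-d VGG16 feature down to 256 dimensions
input_img = Input(shape=(4096,))
img_embedding = Dense(256, activation='relu')(input_img)

# Text branch: embed the partial caption and encode it with two stacked LSTMs
input_text = Input(shape=(max_length,))
embedding = Embedding(vocab_size, embedding_dim, mask_zero=True)(input_text)
lstm = LSTM(256, return_sequences=True)(embedding)
lstm = LSTM(256)(lstm)

# Merge the two branches and predict the next word.
# Both inputs to add() have shape (batch, 256), so the image vector is not reshaped;
# a (batch, 1, 256) tensor would give an output that does not match the one-hot labels.
decoder = tf.keras.layers.add([img_embedding, lstm])
decoder = Dense(vocab_size, activation='softmax')(decoder)

model = Model(inputs=[input_img, input_text], outputs=decoder)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()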

5.3 Training the Model
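
The call to `model.fit` in this section expects three aligned arrays: image features, padded partial captions, and one-hot encoded next words. One common way to build them is to expand every caption into (prefix, next word) pairs. The sketch below does this with NumPy only. `build_training_pairs` is a hypothetical helper, the toy ids and 4-dimensional features are invented, and left-padding with zeros is chosen to match the default behavior of `pad_sequences`.

```python
import numpy as np

def build_training_pairs(encoded_captions, features, max_length, vocab_size):
    """Expand each caption into (image feature, padded prefix, one-hot next word).

    encoded_captions: list of (image_id, [word ids]) pairs
    features: dict mapping image_id to a 1-D feature vector
    """
    X_images, X_texts, y_labels = [], [], []
    for image_id, seq in encoded_captions:
        for i in range(1, len(seq)):
            prefix = seq[:i][-max_length:]                      # keep the last max_length ids
            padded = [0] * (max_length - len(prefix)) + prefix  # left-pad with 0
            one_hot = np.zeros(vocab_size, dtype=np.float32)
            one_hot[seq[i]] = 1.0                               # the word that follows the prefix
            X_images.append(features[image_id])
            X_texts.append(padded)
            y_labels.append(one_hot)
    return np.array(X_images), np.array(X_texts), np.array(y_labels)

# Toy data: ids 1 and 2 play the role of 'startseq' and 'endseq'.
features = {"img_001.jpg": np.ones(4, dtype=np.float32)}
encoded = [("img_001.jpg", [1, 5, 6, 2])]

X_images, X_texts, y_labels = build_training_pairs(
    encoded, features, max_length=5, vocab_size=8)
print(X_images.shape, X_texts.shape, y_labels.shape)  # (3, 4) (3, 5) (3, 8)
print(X_texts[1])                                     # [0 0 0 1 5]
```

With the full Flickr8k data the one-hot label array can become too large to hold in memory, so a Python generator or a `tf.data` pipeline that yields one batch at a time is a common alternative.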

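# Train the model.
# X_images, X_texts, and y_labels must be prepared beforehand: image feature vectors,
# padded partial captions, and one-hot encoded next words, aligned row by row.
model.fit([X_images, X_texts], y_labels, epochs=20, batch_size=64, verbose=1)

6. Model Evaluation and Testing

Once the model is trained, we can generate a caption for an image.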

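def generate_caption(image_path, model, tokenizer, max_length):
    # Greedy decoding: pick the most probable next word at every step
    feature = extract_features(image_path).reshape(1, 4096)
    caption = "startseq"
    for _ in range(max_length):
        sequence = tokenizer.texts_to_sequences([caption])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        y_pred = model.predict([feature, sequence], verbose=0)
        # index_word has no entry for 0 (the padding index), so use .get to avoid a KeyError
        predicted_word = tokenizer.index_word.get(int(np.argmax(y_pred)))
        # Stop when no word maps to the index or the end marker is predicted
        if predicted_word is None or predicted_word == 'endseq':
            break
        caption += ' ' + predicted_word
    # Drop the leading 'startseq' marker before returning
    return caption[len("startseq"):].strip()

# Run a test
image_path = "./Flickr8k/Flicker8k_Dataset/example.jpg"
predicted_caption = generate_caption(image_path, model, tokenizer, max_length)
print("Generated Caption:", predicted_caption)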
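
To put a number on caption quality, generated captions are often compared with the reference captions written by people. The sketch below computes a clipped n-gram precision, the core ingredient of BLEU, in plain Python. It is a simplified stand-in rather than a full BLEU implementation (there is no brevity penalty and no averaging over n-gram orders), and the function names and sample sentences are made up for illustration.

```python
from collections import Counter

def strip_markers(caption):
    """Split a caption into tokens, dropping the startseq/endseq markers."""
    return [w for w in caption.split() if w not in ("startseq", "endseq")]

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that also occur in a reference,
    counting each n-gram at most as often as in any single reference."""
    cand = ngrams(strip_markers(candidate), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(strip_markers(ref), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

# Made-up example sentences
candidate = "startseq a dog runs on the grass endseq"
references = ["a dog is running on the grass", "a brown dog runs outside"]
print(clipped_precision(candidate, references, n=1))  # 1.0
print(clipped_precision(candidate, references, n=2))  # 0.8
```

For reported results, an established BLEU implementation evaluated over a held-out test split is the safer choice.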

7. Conclusion

This post walked through how to implement a CNN+LSTM image captioning model. As a next step, image captioning with more recent Transformer-based models (such as BERT and ViT) is also worth exploring.
