(GPT-2) Language Models are Unsupervised Multitask Learners (feat. GPT-2 model and zero-shot implementation code)


by Matthew0633 2022. 7. 22. 13:59

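(GPT-2) Language Models are Unsupervised Multitask Learners: Paper Review

These are notes I put together while taking part in the "NLP Paper Review Study" at Google Machine Learning Bootcamp 2022.

<Before we start: a short impression of the paper>
In GPT-1, OpenAI reported zero-shot results as a function of the amount of pre-training in order to verify that pre-training is useful. Until I read GPT-1, I had no idea how keenly OpenAI had its eye on this. Google, comparatively naively(?), stayed within the existing pre-training + fine-tuning frame, modified GPT's pre-training objective, and showed off BERT's improved performance on many tasks. (This is not to say that its research contribution is small.)

With GPT-2, however, OpenAI is effectively saying: "You know what? Fighting over scores on individual tasks is not what matters right now. This pre-trained language model is a big deal. It is a meta-learner that can glance at any NLP task and imitate it right away, without training." In doing so, the paper offers a higher-level insight into the abilities of language models, backed by experiments. Several experiments in the paper show that a high-capacity language model pre-trained on a large dataset can potentially pick up and imitate many tasks without supervised learning. I think the paper matters because language models started to be studied as generalized models from this point on.

Unlike earlier papers that only chased SOTA, this one proposes a new perspective and research contribution, and it strengthens the validity of its results by checking the overlap between train and test sets. For these reasons I found it fascinating and surprising to read.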

Abstract

GPT-2 (1.5B params) : by scaling up the model and training on a new dataset, it achieved language modeling SOTA on 7 of 8 datasets
It reports zero-shot results of a language model on several tasks without supervised learning, suggesting its potential as a multi-task learner

1. Introduction

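Strong performance of ML models
- Large datasets + high-capacity models + supervised learning have given ML models strong performance.
Single-task learners
- They still excel only at the specific task they were trained on (narrow experts) and require labeled datasets
Need for a generalized model
- Emergence of meta-learning : (dataset, objective) pairs are used as inputs
- Prior work confirmed that language models can perform commonsense reasoning and sentiment analysis without supervised learning
- General methods of transfer (zero-shot) : a new task can be learned without any modification of parameters or architecture (BERT, GPT, and ELMo require modification)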

2. Approach

Main objective : Language Modeling (+ self-Attention)
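
The language modeling objective named above can be written out explicitly. As framed in Section 2 of the paper, the joint probability of a symbol sequence $(s_1, \ldots, s_n)$ is factorized into a product of conditional probabilities, which is the quantity the model learns to estimate (the index $i$ is my notation):

```latex
p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})
```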

Generalized Model (Multi-task learner)
- objective : learn the task, then predict the output from that task's input
- objective : $P(output|input)$ → $P(output|input, task)$
- Feasibility : in prior work, an LM trained on dialogue learned to perform QA

GPT-2 confirms the multi-task learner ability of a large language model (zero-shot)

2.1 Training Dataset

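A large dataset covering diverse domains is required
Quality issues with the existing Common Crawl data → the authors build a new dataset
WebText
- Scrape of outbound links from Reddit (≥ 3 karma)
- text of 45 million links → (cleaning, de-duplication) → about 8 million documents (40GB)
- Wikipedia documents removed (concern about overlap with benchmark test sets)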

2.2 Input Representation

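A general language model should work on any input string without failure (i.e., the model must be able to obtain a representation for it).
One way to guarantee this is to use a byte-level vocabulary over UTF-8 bytes. This has a downside: the smallest unit carries far less meaning than a word, so byte-level models do not guarantee the best task performance. A word-level vocabulary, on the other hand, fits units of meaning well, but it needs an unbounded number of words and still always suffers from the OOV problem.
Byte Pair Encoding (BPE) is a middle ground between these two approaches. However, common BPE implementations operate on Unicode code points rather than bytes, so covering every string requires a base vocabulary of more than 130,000 symbols before any merge is added. Restricting the Unicode-level base vocabulary instead would raise the share of <Unk> tokens and cause inefficiency. Running BPE on bytes needs a base vocabulary of only 256.
Naive byte-level BPE, however, wastes vocabulary slots on variants of common words such as "dog.", "dog!", and "dog?". GPT-2 therefore adds a rule that prevents BPE from merging across character categories within any byte sequence. (Spaces are the exception, because allowing them to merge makes compression much more efficient.)

By combining the advantages of the word-level and byte-level approaches, the model can take any dataset as input.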
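
The merge restriction described above can be sketched in a few lines. This is a toy illustration, not GPT-2's tokenizer: `CHUNK_RE` is a simplified stand-in (my assumption) for the real pre-tokenization pattern, and `train_bpe` and its helpers are hypothetical names. The point is that pairs are only counted inside a chunk of one character category, so a letter run can never merge with the punctuation that follows it.

```python
import re
from collections import Counter

# Simplified stand-in for GPT-2's pre-tokenization (assumption): an optional
# leading space followed by a run of letters, digits, or other symbols.
CHUNK_RE = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+")


def to_byte_chunks(text):
    """Split text by character category, then map each chunk to single UTF-8 bytes."""
    return [
        tuple(bytes([b]) for b in chunk.encode("utf-8"))
        for chunk in CHUNK_RE.findall(text)
    ]


def most_frequent_pair(chunks):
    """Count adjacent symbol pairs inside each chunk only, never across chunks."""
    counts = Counter()
    for chunk in chunks:
        counts.update(zip(chunk, chunk[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(chunks, pair):
    """Replace every occurrence of `pair` inside a chunk with one merged symbol."""
    merged = []
    for chunk in chunks:
        out, i = [], 0
        while i < len(chunk):
            if i + 1 < len(chunk) and (chunk[i], chunk[i + 1]) == pair:
                out.append(chunk[i] + chunk[i + 1])
                i += 2
            else:
                out.append(chunk[i])
                i += 1
        merged.append(tuple(out))
    return merged


def train_bpe(text, num_merges):
    """Run up to `num_merges` BPE merge steps on byte-level chunks."""
    chunks = to_byte_chunks(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(chunks)
        if pair is None:
            break
        chunks = merge_pair(chunks, pair)
        merges.append(pair)
    return chunks, merges


chunks, merges = train_bpe("dog. dog! dog? 개", num_merges=10)
print(chunks)  # "dog" symbols and punctuation stay in separate chunks
```

Because the base symbols are the 256 possible byte values, any string (including the Korean character in the example) is representable without an unknown token. The 50,257-entry vocabulary is consistent with 256 byte tokens, 50,000 merges, and one end-of-text token.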

2.3 Model

Transformer based (similar to GPT)
Layer normalization moved (to the input of each sub-block)
Layer normalization added (after the final self-attention block)
weight initialization of residual layers : scale factor of $\frac{1}{\sqrt N}$ ($N$ is the number of residual layers)
50,257 vocab
sequence length : 1024 tokens
512 batch size
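
To see why the residual weights are scaled by $\frac{1}{\sqrt N}$, here is a toy simulation (not GPT-2's actual initialization code). It treats each residual layer as adding independent Gaussian noise to the residual stream; `base_std=0.02`, `dim`, and the function name are assumptions made for illustration. With the scaling, the spread of the accumulated stream stays near `base_std` regardless of depth; without it, the spread grows like $\sqrt N$.

```python
import math
import random


def residual_stream_std(num_layers, base_std=0.02, scale=True, dim=2048, seed=0):
    """Std of a residual stream after `num_layers` independent additive updates."""
    rng = random.Random(seed)
    # Scale each layer's contribution by 1/sqrt(N) when `scale` is set
    std = base_std / math.sqrt(num_layers) if scale else base_std
    stream = [0.0] * dim
    for _ in range(num_layers):
        for i in range(dim):
            stream[i] += rng.gauss(0.0, std)
    mean = sum(stream) / dim
    return math.sqrt(sum((x - mean) ** 2 for x in stream) / dim)


scaled = residual_stream_std(48, scale=True)     # stays close to base_std
unscaled = residual_stream_std(48, scale=False)  # grows roughly as sqrt(48) * base_std
print(scaled, unscaled)
```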

"""
< Structure of a single GPT2 block, as seen in the Huggingface/Transformers code >
"""

class GPT2Block(nn.Module):
    def __init__(self, config, layer_idx=None):
        ...
        self.ln_1 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
        self.attn = GPT2Attention(config, layer_idx=layer_idx)
        self.ln_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)

        if config.add_cross_attention:
            self.crossattention = GPT2Attention(config, is_cross_attention=True, layer_idx=layer_idx)
            self.ln_cross_attn = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)

        self.mlp = GPT2MLP(inner_dim, config)

    def forward(
        self,
        hidden_states: Optional[Tuple[torch.FloatTensor]],
       ....
    ):
        residual = hidden_states
        
        # LayerNorm applied at the input of the sub-block (start of the block)
        hidden_states = self.ln_1(hidden_states)
        attn_outputs = self.attn(
            hidden_states,
            layer_past=layer_past,
            attention_mask=attention_mask,
            head_mask=head_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        attn_output = attn_outputs[0]  # output_attn: a, present, (attentions)
        outputs = attn_outputs[1:]
        
        # Residual connection : attention output + hidden states that entered the attention sub-block
        hidden_states = attn_output + residual

        if encoder_hidden_states is not None:
            # add one self-attention block for cross-attention
            residual = hidden_states
            
            # LayerNorm applied before cross-attention
            hidden_states = self.ln_cross_attn(hidden_states)
            cross_attn_outputs = self.crossattention(
                hidden_states,
                attention_mask=attention_mask,
                head_mask=head_mask,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                output_attentions=output_attentions,
            )
            attn_output = cross_attn_outputs[0]
            
            # Residual connection : cross-attention output + hidden states that entered cross-attention
            hidden_states = residual + attn_output
            outputs = outputs + cross_attn_outputs[2:]  # add cross attentions if we output attention weights

        residual = hidden_states
        
        # (LayerNorm +) FFN layer
        hidden_states = self.ln_2(hidden_states)
        feed_forward_hidden_states = self.mlp(hidden_states)
        
        # Residual connection : FFN output + hidden states that entered the FFN sub-block
        hidden_states = residual + feed_forward_hidden_states

        if use_cache:
            outputs = (hidden_states,) + outputs
        else:
            outputs = (hidden_states,) + outputs[1:]

        return outputs  # hidden_states, present, (attentions, cross_attentions)

"""
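< Layer-wise attention scaling by layer_idx (note: this is an optional config flag; the paper's 1/sqrt(N) scaling of residual-layer weights is applied at initialization, in _init_weights, not here) >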
"""

class GPT2Model(GPT2PreTrainedModel):
    _keys_to_ignore_on_load_missing = ["attn.masked_bias"]

    def __init__(self, config):
        super().__init__(config)

        self.embed_dim = config.hidden_size

        self.wte = nn.Embedding(config.vocab_size, self.embed_dim)
        self.wpe = nn.Embedding(config.max_position_embeddings, self.embed_dim)

        self.drop = nn.Dropout(config.embd_pdrop)
        
        # layer_idx (= block index 0, 1, 2, ...) is passed to each block and used for the optional layer-wise attention scaling
        self.h = nn.ModuleList([GPT2Block(config, layer_idx=i) for i in range(config.num_hidden_layers)])
        
        
        self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)
        
        
        
class GPT2Attention(nn.Module):

    ...

    def _attn(self, query, key, value, attention_mask=None, head_mask=None):
        attn_weights = torch.matmul(query, key.transpose(-1, -2))

        if self.scale_attn_weights:
            attn_weights = attn_weights / (value.size(-1) ** 0.5)

        # Layer-wise attention scaling: divide the attention scores by (layer_idx + 1)
        if self.scale_attn_by_inverse_layer_idx:
            attn_weights = attn_weights / float(self.layer_idx + 1)

        # (rest of the method omitted in this excerpt)

3. Experiments

Experiments with 4 model sizes (the largest one : GPT-2)
Learning rate of each model tuned for the best perplexity (PPL) on a 5% held-out sample of WebText
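
Perplexity (PPL) is the metric behind both the learning-rate tuning above and the language modeling results that follow. A minimal sketch of how it is computed from per-token log probabilities (the input values below are hypothetical, and natural logarithms are assumed):

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical case: a model that assigns probability 1/4 to every held-out token
uniform = [math.log(1 / 4)] * 10
print(perplexity(uniform))  # close to 4: the model is as uncertain as a 4-way guess
```

Lower is better: a perplexity of 8.6 roughly means the model is, on average, as uncertain as if it were choosing uniformly among 8.6 tokens.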

3.1 Language Modeling

Ability of the WebText LM to transfer zero-shot to other language modeling benchmarks
WebText LM : byte-level (no tokenization or pre-processing required)
other datasets :
- use different levels of tokenization
- contain their own pre-processing artifacts
Invertible de-tokenizers are used (to remove pre-processing artifacts) : PPL improves by 2.5 ~ 5.0

# detokenizer : rule-based, so it does not always restore the original text exactly (the output below may vary by NLTK version)
import nltk.tokenize
tokens = nltk.tokenize.TreebankWordTokenizer().tokenize("I wanna watch something")
print(tokens) # ['I', 'wan', 'na', 'watch', 'something']

sentence = nltk.tokenize.treebank.TreebankWordDetokenizer().detokenize(tokens)
print(sentence) # I wannawatch something

Transfers well across domains and datasets (SOTA on 7 of 8)
Large improvements on small datasets (Penn Treebank, WikiText-2)
Large improvements on datasets that measure long-term dependencies (LAMBADA, Children's Book Test)
1BW (the one dataset without SOTA) : the largest dataset with the most destructive pre-processing (sentence-level shuffling)

3.2 Children’s Book Test (CBT)

Language modeling performance on different categories of words (named entities, nouns, verbs, prepositions)
Cloze task : 10 possible choices for the omitted word
Maximize $\log P(X_t|X_1...X_{t-1}) + \log P(X_{t+1}...X_{n}|X_1...X_{t})$ : pick the choice that maximizes the sum of the log probability of the choice itself and the log probability of the rest of the sentence given that choice
Larger model → improvement
SOTA 93.3% (common nouns), 89.1% (named entities)
A de-tokenizer is used
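
The cloze scoring rule above can be sketched with a toy model. The `BigramLM` class below is a hypothetical add-one-smoothed bigram model standing in for GPT-2, and the corpus and blank marker are made up for illustration. Scoring the whole filled-in sentence is equivalent to the formula above, because the log probability of the prefix before the blank is the same for every choice.

```python
import math
from collections import Counter


class BigramLM:
    """Toy add-one-smoothed bigram language model (stand-in for a real LM)."""

    def __init__(self, corpus_sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus_sentences:
            tokens = ["<s>"] + sentence.split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, tokens):
        """Sum of log P(token | previous token) over the sentence."""
        tokens = ["<s>"] + tokens
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def answer_cloze(lm, sentence_with_blank, choices, blank="XXXXX"):
    """Fill the blank with each choice and return the highest-scoring one."""
    def score(choice):
        filled = [choice if tok == blank else tok for tok in sentence_with_blank.split()]
        return lm.log_prob(filled)
    return max(choices, key=score)


lm = BigramLM([
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the mouse",
])
print(answer_cloze(lm, "the XXXXX chased the mouse", ["cat", "rug", "on"]))  # cat
```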

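3.3 LAMBADA

Ability to model long-range dependencies
task : predict the final word of a sentence
SOTA PPL 8.6 (cf. previous PPL 99.8), LM Acc 52.66% (cf. previous Acc 19%)
Errors : mostly valid continuations of the sentence that are not valid final words
Evaluated on the version without preprocessing; adding a stop-word filter raises accuracy further to 63.24%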

3.4 Winograd Schema Challenge

Commonsense reasoning (resolve ambiguity)
small data (273 examples)
SOTA : Acc 70.70% (+7%)

3.5 Reading Comprehension

Reading Comprehension ability
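CoQA : conversational QA over documents from 7 domains
F1 55 on the development set (not SOTA : it matches or exceeds 3 of 4 baselines without using the training examples)
Often relies on simple retrieval heuristics : e.g., answering a "who" question with a name from the document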

3.6 Summarization

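Summarization ability
The task hint "TL;DR:" is appended to the input (removing it drops the aggregate score by 6.4 points, which confirms the model's ability to infer the task)
Top-k random sampling (k=2), generating 100 tokens
The first 3 generated sentences are used as the summary
CNN and Daily Mail dataset + ROUGE 1, 2, L metrics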
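
Top-k random sampling keeps only the k most likely next tokens and samples among them in proportion to their probabilities. A minimal sketch in plain Python follows; the logits are made-up values and `top_k_sample` is a hypothetical helper, not the paper's code.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample a token id from the k highest-scoring logits only."""
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the surviving k logits (shifted by the max for stability)
    max_logit = max(logits[i] for i in top_ids)
    weights = [math.exp(logits[i] - max_logit) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]


logits = [2.0, 0.5, 3.1, -1.0, 2.9]  # hypothetical next-token scores
rng = random.Random(0)
samples = {top_k_sample(logits, k=2, rng=rng) for _ in range(200)}
print(samples)  # only ids 2 and 4, the two highest logits, can appear
```

A small k such as 2 keeps generation close to greedy decoding while still reducing repetition, which suits summaries that should stay faithful to the article.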

3.7 Translation

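Ability to translate from one language into another
WMT14 (English, French) used
ENG → FRN : BLEU 5
FRN → ENG : BLEU 11.5
Translation involves a language the model saw comparatively little of (the French data is about 500x smaller than the monolingual corpora common in prior unsupervised translation work)
Example pairs are used as input (today's few-shot)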

Input : "ENGLISH SENTENCE1 = FRENCH SENTENCE1 ENGLISH SENTENCE2 = "
Output : "FRENCH SENTENCE2"
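
The input format above can be assembled with a small helper. This is a sketch under assumptions: `build_translation_prompt` is a hypothetical function, the example sentences are made up, and joining pairs with newlines is my choice (the format shown above separates them with spaces).

```python
def build_translation_prompt(example_pairs, source_sentence):
    """Serialize example pairs and the query as 'english sentence = french sentence' lines."""
    lines = [f"{english} = {french}" for english, french in example_pairs]
    # The final line ends after '=' so the model continues with the translation
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)


prompt = build_translation_prompt(
    [("good morning", "bonjour"), ("thank you", "merci")],
    "I like cheese",
)
print(prompt)
```

The model is then asked to continue the prompt, and the text it generates after the last "=" is taken as the translation.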

3.8 Question Answering

EM 4.1% on Natural Questions (5.3x as many correct answers as the smallest model)

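# Two QA metrics : EM and F1 score
import collections
import re
import string

def normalize_text(s):
    # Not defined in the original snippet; assumed here to be SQuAD-style normalization:
    # lowercase, strip punctuation and English articles, collapse whitespace
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def compute_exact_match(prediction, truth): # 1 if the normalized strings match, else 0
    return int(normalize_text(prediction) == normalize_text(truth))

def compute_f1(prediction, truth): # token-level F1 score
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    
    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)
    
    # count shared tokens as a multiset so that repeated tokens are not collapsed
    common_tokens = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_common = sum(common_tokens.values())
    
    # if there are no common tokens then f1 = 0
    if num_common == 0:
        return 0
    
    prec = num_common / len(pred_tokens)
    rec = num_common / len(truth_tokens)
    
    return 2 * (prec * rec) / (prec + rec)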

Still far below open-domain QA (ODQA) systems, which score in the 30 ~ 50% range
Example pairs are used as input (to elicit short answers; today's few-shot)

Input : "CONTEXT1; Q:Q1; A:A1; CONTEXT2; Q:Q2; A:A2; CONTEXT3; Q:Q3; A:"
Output : "A3"

(3.5 ~ 3.8) Performance on these four tasks leaves much to be desired

4. Generalization vs Memorization

GPT-2 : WebText → (Prediction) → Benchmark Dataset (Test set)

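Because the authors built a new dataset, "WebText", and trained on it, they checked and analyzed whether test-set content from the benchmarks appears in the training data and inflates the measured performance.

This can be seen as a check on whether the generalization performance of GPT-2, trained on the new dataset, is measured objectively. The reason is that if the WebText training data contains content from a benchmark test set, high scores are more likely to reflect memorization from training than strong generalization.

The paper verifies the overlap between training data and test data in two ways, to determine whether GPT-2's performance rests on generalization or is largely driven by memorization.

Checking the overlap rate between training data and test data with an n-gram filter (Bloom filter)
Holding out part of the training data and evaluating the model's generalization on it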

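Using a Bloom filter

First, using Bloom filters built from the 8-grams of the WebText train set, the authors checked whether content from each benchmark's test set appears in the WebText train set.

A Bloom filter is a data structure that uses hashing to test, memory-efficiently, whether an element belongs to a set. Its defining property is that it has no false negatives (that is, an overlap cannot be missed). The paper also states that the false positive rate was kept minimal (upper-bounded by $1/10^8$).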
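
A minimal sketch of an 8-gram Bloom filter overlap check follows. It is an illustration, not the authors' implementation: the bit-array size, the number of hash functions, the SHA-256-based hashing, and the whitespace tokenization are all assumptions, and the two example texts are made up.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions per item by salting one hash function
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def ngrams(text, n=8):
    """Lower-cased, whitespace-delimited n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_rate(train_text, test_text, n=8):
    """Fraction of test n-grams that are (probably) present in the training text."""
    bloom = BloomFilter()
    for gram in ngrams(train_text, n):
        bloom.add(gram)
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return sum(gram in bloom for gram in test_grams) / len(test_grams)


train = "the quick brown fox jumps over the lazy dog near the river bank today"
test = "a quick brown fox jumps over the lazy dog near the old mill"
print(overlap_rate(train, test))  # 3 of the 6 test 8-grams also occur in train
```

Because every inserted 8-gram sets all of its bits, a lookup for an item that was added can never fail; the only possible error is a rare false positive, which would overstate the overlap rather than hide it.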

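The check found that the benchmark test sets overlap with the WebText train set at an average rate of 3.2% (1 ~ 6% per dataset).

What is notable, however, is that each benchmark's test set also overlaps with its own train set, and at an average of 5.9% that figure is higher than the overlap with the WebText train set.

With this, the paper argues that the overlap between the benchmark test sets and the WebText train set is not large enough to be a problem.

The overlap rate and the resulting performance gain for individual datasets are as follows.

CoQA : about 15% overlap (news domain) → gain of 0.5 ~ 1.0 F1
LAMBADA : 1.2% overlap → gain of 0.3% Acc

Therefore, it is hard to claim that memorization has a large effect on the model's performance.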

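Holding out training data and evaluating

The second check evaluates the model on held-out WebText data. If that evaluation revealed overfitting, it would mean that the model's generalization performance is weak and that it performs tasks through memorization.

For GPT-2, performance on the held-out set was similar to performance on the training set, and both kept improving as the model size grew. This means the model still underfits the training data and has not learned it so thoroughly that it leans heavily on memorization. The benchmark performance GPT-2 showed is therefore close to genuine generalization performance.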

5. Related Work

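Larger model with larger data : prior work obtained better performance by training bigger models on large amounts of data. GPT-2, a model with more than 1B parameters, confirmed a similar result.
Learned functionality : prior work showed that a model trained for generation ended up performing translation, demonstrating that a model can learn and carry out a new task.
Building larger web data : WebText was newly built, and used to train GPT-2, by following approaches from prior work on constructing large web datasets.
Pre-training : prior work showed that models pre-trained on large amounts of data perform better on NLP tasks.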

6. Discussion

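The results in the GPT-2 paper clearly demonstrate the ability to learn new tasks through unsupervised learning.

That is, a high-capacity language model sufficiently trained on a lot of data can learn and perform a new task without model modifications for supervised learning and without fine-tuning.

The results on summarization, QA, and translation are still underwhelming, but they show that the model has enough capacity to perform those tasks.

GPT-2 is meaningful in that it moves from prior work that produces outputs extractively toward fully abstractive generation.

The authors plan to examine GPT-2's fine-tuning performance, but there is no guarantee that it will beat BERT.

*Contrary to what I expected before reading the paper, GPT-2 did not take the route BERT took when it challenged GPT, namely tweaking a part within a similar frame to post better numbers than BERT.

I was deeply impressed that, instead of obsessing over performance, the authors identified the task-learner behavior in the zero-shot experiments of GPT-1, built on it, demonstrated it with GPT-2, and opened up the study of pretrained large language models as meta-learners.*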

7. Conclusion

In the zero-shot setting, GPT-2 achieved SOTA on 7 of 8 language modeling datasets.

This shows that a large language model sufficiently trained on a diverse and massive dataset can learn how to perform many tasks without supervised learning.

8. Appendix

8.1 Model capacity

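In terms of perplexity, larger models did better at generating WebText.

8.2 Text Memorization

Two tests check how much GPT-2 depends on memorization when generating long passages.

When tested on the Gettysburg Address, the model reproduced the speech from memory at first, but after 100 ~ 200 tokens it started to generate different text. This shows that GPT-2 does not rely entirely on memorization when generating text and retains diversity.
Given a portion of the WebText test set as a prompt, the authors measured how much the text generated by GPT-2 and the original test-set text each overlap with the WebText train set. As a result, the generated text had a lower rate of overlapping 8-grams than the original text.
(The share of samples with 0% overlap was higher for GPT-2's text, and more than half of the samples overlapped by less than 1%.)

8.3 Diversity

In a way similar to 8.2, when the same context from the WebText test set is given, the generated completions differ from one another after the first few words. (Table 12)

8.4 Robustness

The generated output for the talking unicorn news prompt shows that GPT-2 may be able to generate text from an out-of-distribution context. (It is not yet outstanding.)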

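Further study) Zero-shot Implementation (Text Classification)

Here are 2 ways to perform zero-shot text classification.

1. Compute the cosine similarity between the representation of the example and the representation of each label, then choose the label with the highest similarity

2. Combine the sentence to classify with a template sentence that contains each candidate label, feed the result to the model, and choose the label the model scores highest (using the Pipeline API of Huggingface/transformers; note that its zero-shot pipeline scores labels with an NLI entailment model rather than by next-token prediction)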

# Method 1 - measure the similarity between the example and each label representation
import torch
from torch.nn import functional as F
from transformers import GPT2Tokenizer, GPT2Model

# 1. Load pretrained GPT-2 model and tokenizer
model_name='cahya/gpt2-small-indonesian-522M'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2Model.from_pretrained(model_name)

# 2. Prepare a test sentence and labels
sentence = 'Para menteri kabinet memainkan permainan yang tidak menyenangkan'
labels = ['olahraga', 'politik']

# 3. Since there is no padding token in this tokenizer, reuse the EOS token.
# A separate pad token can be added using the add_special_tokens function
tokenizer.pad_token = tokenizer.eos_token

# 4. Tokenize the sentence together with the labels as one padded batch
inputs = tokenizer([sentence] + labels, return_tensors='pt', padding='longest')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
"""
input_ids  = tensor(
[[ 5461,  4529,  9212,  3609,  2485,   288,   467, 12672],
[26790,     0,     0,     0,     0,     0,     0,     0],
[26374,     0,     0,     0,     0,     0,     0,     0]])
"""

# Note: mean(dim=1) also averages over padded positions; a mask-weighted mean would exclude them
with torch.no_grad():
    output = model(input_ids, attention_mask=attention_mask)[0] # torch.Size([3, 8, 768])
sentence_rep = output[:1].mean(dim=1)                       # torch.Size([1, 768]) 
label_reps = output[1:].mean(dim=1)                         # torch.Size([2, 768])

# now find the labels with the highest cosine similarities to the sentence
similarities = F.cosine_similarity(sentence_rep, label_reps)
closest = similarities.argsort(descending=True)
for ind in closest:
    print(f'label: {labels[ind]} \t similarity: {similarities[ind].item()}')

"""
label: politik 	 similarity: 0.5492470860481262
label: olahraga 	 similarity: 0.48411038517951965
"""

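# Method 2 - insert each label into a template sentence and compare the label probabilities
from transformers import pipeline

# Not shown in the original snippet: the classifier is assumed to be the
# zero-shot-classification pipeline with its default NLI model
classifier = pipeline("zero-shot-classification")
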
sequences = [
    "Tenet is simply an incredible film with deep complex concepts to unravel well after the credits roll.",
    "The Social Dilemma is densely packed yet lively and entertaining documentary"
]
candidate_labels = ["positive", "negative"]
hypothesis_template = "The sentiment of this review is {}."

classifier(sequences, candidate_labels, hypothesis_template=hypothesis_template)

"""
[{'sequence': 'Tenet is simply an incredible film with deep complex concepts to unravel well after the credits roll.',
  'labels': ['positive', 'negative'],
  'scores': [0.9937942028045654, 0.006205802783370018]},

 {'sequence': 'The Social Dilemma is densely packed yet lively and entertaining documentary',
  'labels': ['positive', 'negative'],
  'scores': [0.9934840202331543, 0.006515993271023035]}]
"""

QnA and Discussion with study members

(2. Approach) Language modeling is also able to, in principle, learn the tasks of McCann et al. (2018) without the need for explicit supervision of which symbols are the outputs to be predicted. Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective. (What does this passage mean?)

This part of the paper is building up toward zero-shot learning, so in context the "supervised objective" in this passage can be read as "performing a downstream task via zero-shot".

With that in mind, split the passage in half and look at the first part. The language modeling objective is ultimately similar to the supervised objective in which a pretrained model obtains P(y|x) via zero-shot. ($P(output|input)$ → $P(output|input, task)$)

This is because, taking classification as an example, zero-shot feeds the sentence to classify as input so that the next token predicted corresponds to the label. Seen broadly, both are objectives of the form "predict the next token given a token sequence", so they are the same. (Just like performing zero-shot classification with a template sentence.)

The difference from the language modeling objective is that the zero-shot supervised objective is evaluated only on a subset. In language modeling, once the first token is given, every subsequent token is predicted and all predicted tokens are scored. The zero-shot supervised objective, however, is scored only on the final predicted token(s) that correspond to the label.

Since both pre-training and learning a new task via zero-shot share the language modeling objective, the two also share their optimum: a model that reaches the global minimum of the language modeling loss over every token of the sequence necessarily minimizes the loss on the output subset as well.
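
The "same objective, evaluated on a subset" idea can be made concrete with a small sketch. The per-token losses and the mask below are hypothetical numbers, and `mean_nll` is an illustrative helper rather than anything from the paper.

```python
def mean_nll(token_nlls, mask=None):
    """Average negative log-likelihood over the positions selected by mask (all if None)."""
    if mask is None:
        mask = [1] * len(token_nlls)
    selected = [nll for nll, keep in zip(token_nlls, mask) if keep]
    return sum(selected) / len(selected)


# Hypothetical per-token losses for a sequence such as "<review text> sentiment : positive"
token_nlls = [2.1, 1.7, 0.9, 1.2, 0.4, 0.3]
output_mask = [0, 0, 0, 0, 0, 1]  # only the final label token counts as the "output"

lm_loss = mean_nll(token_nlls)                       # unsupervised objective: every token
supervised_loss = mean_nll(token_nlls, output_mask)  # supervised objective: output subset only
print(lm_loss, supervised_loss)
```

Both losses are averages of the same per-token terms, so a model that drives every term to its minimum minimizes the masked average too. The reverse does not hold, which is why the supervised objective alone does not train a general language model.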

<Reference>

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.

Zero and Few Shot Learning

https://towardsdatascience.com/zero-and-few-shot-learning-c08e145dc4ed

Text Classify | Zero Shot Learning | HuggingFace

https://www.kaggle.com/code/kkhandekar/text-classify-zero-shot-learning-huggingface/notebook

Tensorflow text.Detokenizer

https://www.tensorflow.org/text/api_docs/python/text/Detokenizer

How to sample from language models

https://towardsdatascience.com/how-to-sample-from-language-models-682bceb97277

Evaluating QA: Metrics, Predictions, and the Null Response

https://qa.fastforwardlabs.com/no answer/null threshold/bert/distilbert/exact match/f1/robust predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html#Exact-Match

Why You Should Care About Byte-Level Sequence-to-Sequence Models in NLP

https://medium.com/analytics-vidhya/https-medium-com-tomkenter-why-care-about-byte-level-seq2seq-models-in-nlp-26bcf05dd7d3
