Software Vulnerability Discovery: Analyzing Commit Messages in Patches with a Transformer

In 2017, Google introduced the Transformer model in the paper "Attention Is All You Need", replacing the RNN architectures commonly used in NLP tasks with a Self-Attention structure. Experiments also showed that the Transformer outperforms traditional RNNs. The Transformer has since become an indispensable building block of many of today's popular large language models (including pre-trained models).

Earlier posts in this blog have explained the Transformer in detail (see [1] and [2]), as well as its applications in NLP (see [3]). In short, the Transformer can be viewed as a Seq2seq model consisting of an Encoder and a Decoder. Taking machine translation as an example, the Encoder reads an English sentence and the Decoder outputs a German sentence.

This post demonstrates an application of the Transformer in software security: by analyzing the commit message part of software patches, we try to uncover hidden, unreported vulnerabilities.


1. The Problem

Researchers in system or software security are certainly familiar with CVE, which stands for "Common Vulnerabilities and Exposures". It is a database that collects all kinds of known software vulnerabilities. Its value is that, once a vulnerability in some software A has been disclosed, everyone who maintains a service or application built on a vulnerable version of A can update the software immediately and deliberately, and thus avoid being attacked. Likewise, if you are a software developer and your product pulls in vulnerable third-party code, the information disclosed through CVE lets you react right away, update your code, and remove the vulnerability introduced by code reuse.

In practice, however, a large number of software vulnerabilities are fixed "silently": even when developers or maintainers discover vulnerabilities in their code and patch them promptly, the vulnerabilities are never reported to the CVE system. As a result, other software and services that still use the vulnerable code are kept in the dark, sitting on a risk they are unaware of. The task of finding such unreported fixes is called security patch detection, or secretly fixed bug detection. There is a lot of work in this area; interested readers can refer to the survey [5] for more.

2. Understanding Patch Commits

During software development and maintenance, developers submit a large number of patch commits. A patch commit may fix a vulnerability (a security-related patch), or it may merely add functionality or improve performance (collectively, non-security-related patches). Our task is to find, among the patch commits of open-source software, the security-related ones that fix vulnerabilities.

The method in this post targets repositories developed with Git (e.g., on GitHub or GitLab). Shown below is the patch commit that fixes CVE-2020-14354. Note that a patch commit typically consists of two parts, the commit message and the code change (also called the diff), separated by three dashes "---".
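For reference, the general shape of such a patch commit (in `git format-patch` style) looks roughly like the following schematic. The file name follows the CVE-2020-14354 example above, but the author, hashes, message, and hunk contents are placeholders, not the actual patch:

From a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0 Mon Sep 17 00:00:00 2001
From: Jane Doe <jane@example.com>
Date: Mon, 1 Jun 2020 10:00:00 +0000
Subject: [PATCH] ares_getaddrinfo: fix a memory-safety issue (placeholder)

Placeholder body text explaining why the change was made.

Signed-off-by: Jane Doe <jane@example.com>
---
 ares_getaddrinfo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ares_getaddrinfo.c b/ares_getaddrinfo.c
--- a/ares_getaddrinfo.c
+++ b/ares_getaddrinfo.c
@@ -100,7 +100,7 @@ static void host_callback(void *arg, int status)
-    line_removed_from_the_previous_version();
+    line_added_in_the_current_version();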

In the diff, we can see that the source code file "ares_getaddrinfo.c" is modified. The line beginning with "@@" in a chunk (shown in red in the figure) carries some specific information, namely the starting line number of the change and the enclosing function name in the original file. What follows are the code changes between the original file and the updated file: a line marked with "-" is removed from the previous version, while a line marked with "+" is newly added in the current version (for example, the green part in the figure below).

Though a diff shows you that the code has changed, only the commit message can properly tell the reason for the change. A commit message is composed of a subject, a body, and a footer, where both the body and the footer are optional (for example, the purple part in the figure).

  • The subject is a single line that best summarizes the changes made in the commit. By default, the subject of a single patch starts with "Subject: [PATCH]". But if there are multiple patches, the subject prefix will instead be "Subject: [PATCH n/m]". Sometimes the change is so simple that no further explanation is necessary, and a single subject line is fine. Thus, not every commit includes a body.
  • However, if a commit deserves a more thorough description, the body text can provide more details regarding the changes made in the commit. The body is separated from the subject by a blank line.
  • As the last component of a commit message, the footer text is found immediately below the body and is the preferred place to reference issues related to the commit changes. Between the footer and the body there is also a blank line as a separator.

3. Text Processing

Many of the approaches surveyed in [5] either consider only the diff, or consider the diff and the commit message together. We were curious: if the commit message is complete (i.e., nothing is missing), is inspecting the commit message alone enough to decide the nature of a patch? Although a commit message is written in something close to natural language, it also contains a lot of domain-specific content that needs dedicated handling. For example, a commit message may contain variable names and function names, which often cause Out-Of-Vocabulary problems. It may contain a CVE ID, which is often an important signal that the patch is vulnerability-related. It may also contain URLs as external references, as well as the committer's email address; such content matters little for understanding the text, since even patches completely unrelated to security may carry the committer's email. The paper [4] therefore proposes an empirically driven text-processing method consisting of two main parts:

1. Extract the commit message from the patch. If the diff contains code comments, extract those as well.

2. Normalize the extracted text; one plausible way of doing this is sketched below.

Both parts can be implemented with regular expressions; see [4] for the details.
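As a rough illustration only, the following sketch shows regex-based normalization of the kind hinted at above, replacing CVE IDs, URLs, email addresses, and commit hashes with placeholder tokens. The rule set and the token names are my own assumptions, not necessarily the ones used in [4]:

import re

# Hypothetical normalization rules -- patterns and placeholder tokens are
# illustrative assumptions, not the exact rules from [4].
NORMALIZATION_RULES = [
    (re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE), "cveid"),  # CVE identifiers
    (re.compile(r"https?://\S+"), "url"),                        # external links
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "email"),          # committer e-mail
    (re.compile(r"\b[0-9a-f]{7,40}\b"), "commithash"),           # commit hashes
]

def normalize_commit_message(text):
    """Lower-case the text and replace special entities with placeholder tokens."""
    text = text.lower()
    for pattern, token in NORMALIZATION_RULES:
        text = pattern.sub(token, text)
    # collapse the whitespace left over after the substitutions
    return re.sub(r"\s+", " ", text).strip()

print(normalize_commit_message(
    "Fix CVE-2020-14354, see https://example.com/issue and mail jane@example.com"))
# -> "fix cveid, see url and mail email"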

4. A Transformer-Based Classifier

In a previous post [3] we implemented a classifier with a Transformer (for sentiment classification of natural-language text). Here we take a similar approach: only the Encoder part of the Transformer is used to produce an embedding of the text, and a fully connected feed-forward network then performs the final classification. The implementation, using the Keras framework, is given below.

First, import the necessary packages:

import os
import re
import csv
import sys
import copy
import subprocess
import bisect
import shutil
import pickle
import nltk
import gensim
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from os.path import expanduser
from filecmp import dircmp
from nltk.stem import WordNetLemmatizer
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

tf.random.set_seed(59)

wordnet_lemmatizer = WordNetLemmatizer()

We omit the code that loads and preprocesses the data (readers who need the dataset and the text-processing code can contact the author). For context, a minimal sketch of how the inputs used below might be prepared is given first, and then we move on to the Transformer part:
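The sketch below is an assumption for illustration only: it presumes `texts` is a list of normalized commit messages and `labels` the corresponding 0/1 labels, and it builds the `w2v_model_50`, `maxlen`, `vocab_size`, and train/validation arrays referenced later. Parameter names follow gensim 4.x; the author's actual preprocessing may differ:

from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

maxlen = 200                      # assumed maximum sequence length
tokenized = [t.split() for t in texts]

# 50-dimensional Word2Vec embeddings; the vectors are frozen in the embedding layer below
w2v_model_50 = Word2Vec(sentences=tokenized, vector_size=50, min_count=1, seed=59)
vocab_size = len(w2v_model_50.wv)

# map each token to its index in the Word2Vec vocabulary and pad/truncate to maxlen
# (note: index 0 doubles as padding here, a simplification of this sketch)
sequences = [[w2v_model_50.wv.key_to_index[w] for w in doc] for doc in tokenized]
padded = pad_sequences(sequences, maxlen=maxlen, padding="post", truncating="post")

x_train, x_val, y_train, y_val = train_test_split(
    padded, np.array(labels), test_size=0.2, random_state=59)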

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)
 
    def call(self, inputs, training):
        attn_output = self.att(inputs, inputs)  # self-attention layer
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)  # layer norm
        ffn_output = self.ffn(out1)  #feed-forward layer
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)  # layer norm


class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, weights=[w2v_model_50.wv.vectors], output_dim=embed_dim, trainable=False)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

On top of the Transformer output, we add a fully connected neural-network classifier:

embed_dim = 50  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer

inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(2, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)

Next comes model training. A checkpoint callback keeps the model that performs best on the validation set:

filepath="model_{epoch:02d}-accuracy{accuracy:.2f}.h5"

checkpoint=ModelCheckpoint(
        filepath=filepath,
        monitor='val_accuracy',
        save_best_only=True,
        save_weights_only=True,
        save_freq='epoch'
    )


opt = keras.optimizers.Adam(learning_rate=0.0001)
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    x_train, y_train, validation_data=(x_val, y_val), epochs=60, batch_size=64, callbacks=[checkpoint])
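Incidentally, EarlyStopping is imported above but not used in this snippet. If desired, training could also stop once validation accuracy stops improving; a possible variant (the patience value is an arbitrary choice of mine):

early_stop = EarlyStopping(monitor='val_accuracy', patience=10, restore_best_weights=True)
history = model.fit(
    x_train, y_train, validation_data=(x_val, y_val),
    epochs=60, batch_size=64, callbacks=[checkpoint, early_stop])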

Finally, load the trained model and evaluate it:

model.load_weights("model_XXX-accuracyYYY.h5")

predictions = np.argmax(model.predict(x_val), axis=1)
print("predictions shape:", predictions.shape)

print('Precision: %.3f' % precision_score(y_val, predictions))
print('Recall: %.3f' % recall_score(y_val, predictions))
print('F1_score: %.3f' % f1_score(y_val, predictions))
print('Accuracy: %.3f' % accuracy_score(y_val, predictions))

The output is as follows:

Precision: 0.930
Recall: 0.855
F1_score: 0.891
Accuracy: 0.917

The figure below shows the classifier's ROC curve on the validation set, with AUC = 94.86%, indicating good predictive performance.
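For reference, the ROC curve and AUC can be computed from the predicted probability of the positive class. A minimal sketch using scikit-learn and matplotlib (matplotlib is not among the imports above, so this is an add-on):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# probability of the positive (security-related) class: column 1 of the softmax output
y_score = model.predict(x_val)[:, 1]

fpr, tpr, _ = roc_curve(y_val, y_score)
auc = roc_auc_score(y_val, y_score)

plt.plot(fpr, tpr, label="AUC = %.2f%%" % (auc * 100))
plt.plot([0, 1], [0, 1], linestyle="--")  # random-guess baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()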

References and Further Reading

[1] LSTM is Dead. Long Live Transformers! (Part 1)

[2] LSTM is Dead. Long Live Transformers! (Part 2)

[3] Text Classification with Transformer (Keras/TensorFlow)

[4] Commit Message Can Help: Security Patch Detection in Open Source Software via Transformer, International Conference on Software Engineering Research, Management and Applications, 2023

[5] Vulnerability discovery based on source code patch commit mining: a systematic literature review, International Journal of Information Security, 2024

[End of article]