1. Computing BLEU in Machine Translation

BLEU computation

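Machine translation output is usually evaluated with BLEU (Bilingual Evaluation Understudy). For every contiguous subsequence of the predicted sequence, BLEU checks whether that subsequence also appears in the label sequence.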

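Specifically, let $p_n$ denote the precision of subsequences of $n$ words. It is the ratio of the number of $n$-word subsequences of the predicted sequence that are matched in the label sequence to the total number of $n$-word subsequences in the predicted sequence. For example, suppose the label sequence is A, B, C, D, E, F and the predicted sequence is A, B, B, C, D.

Then $p_1 = 4/5,\ p_2 = 3/4,\ p_3 = 1/3,\ p_4 = 0$. The computation is as follows: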

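1-gram: the label subsequences are {A, B, C, D, E, F} and the predicted subsequences are {A, B, B, C, D}. There are 5 predictions and 4 of them are matched in the label (the second B is not counted, because the label contains only one B), so $p_1 = 4/5$.
2-gram: the label subsequences are {AB, BC, CD, DE, EF} and the predicted subsequences are {AB, BB, BC, CD}. There are 4 predictions and 3 of them appear in the label, so $p_2 = 3/4$.
3-gram: the label subsequences are {ABC, BCD, CDE, DEF} and the predicted subsequences are {ABB, BBC, BCD}. There are 3 predictions and 1 of them appears in the label, so $p_3 = 1/3$.
4-gram: computed the same way. The label subsequences are {ABCD, BCDE, CDEF} and the predicted subsequences are {ABBC, BBCD}. There are 2 predictions and none of them appears in the label, so $p_4 = 0$.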
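
The counts above can be reproduced with a short sketch. The helper name `ngram_precision` is hypothetical and not part of the original text; it assumes tokens are given as Python lists and that each occurrence of a label subsequence can be matched at most once.

```python
from collections import Counter
from fractions import Fraction


def ngram_precision(pred_tokens, label_tokens, n):
    """Return p_n as an exact fraction (hypothetical helper for illustration)."""
    # Collect all contiguous n-word subsequences of the prediction.
    pred_ngrams = [tuple(pred_tokens[i:i + n])
                   for i in range(len(pred_tokens) - n + 1)]
    # Count how often each n-word subsequence occurs in the label.
    label_counts = Counter(tuple(label_tokens[i:i + n])
                           for i in range(len(label_tokens) - n + 1))
    matches = 0
    for gram in pred_ngrams:
        # A label occurrence is used up once matched, so repeats in the
        # prediction are not counted twice.
        if label_counts[gram] > 0:
            matches += 1
            label_counts[gram] -= 1
    return Fraction(matches, len(pred_ngrams))


label = list("ABCDEF")
pred = list("ABBCD")
precisions = [ngram_precision(pred, label, n) for n in range(1, 5)]
print(precisions)
```

Using `Fraction` keeps the ratios exact, which makes them easy to compare with the hand-computed values $4/5$, $3/4$, $1/3$ and $0$.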

Let $\text{len}_\text{label}$ and $\text{len}_\text{pred}$ be the number of words in the label sequence and in the predicted sequence, respectively. BLEU is then defined as:
$$
\exp\left(\min\left(0, 1 - \frac{\text{len}_{\text{label}}}{\text{len}_{\text{pred}}}\right)\right) \prod_{n=1}^k p_n^{1/2^n} \tag{1}
$$
where $k$ is the largest subsequence length we wish to match. When the predicted sequence is identical to the label sequence, BLEU is 1.
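
As a worked instance of Eq. (1), take the example above (label A, B, C, D, E, F and prediction A, B, B, C, D) with $k = 2$, a value chosen here purely for illustration. Then $\text{len}_\text{label} = 6$, $\text{len}_\text{pred} = 5$, $p_1 = 4/5$ and $p_2 = 3/4$, so
$$
\exp\left(\min\left(0, 1 - \frac{6}{5}\right)\right) \left(\frac{4}{5}\right)^{1/2} \left(\frac{3}{4}\right)^{1/4} \approx 0.819 \times 0.894 \times 0.931 \approx 0.68
$$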

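Because matching longer subsequences is harder than matching shorter ones, BLEU gives more weight to the precision of longer subsequences. For example, with $p_n$ fixed at 0.5, the factor $p_n^{1/2^n}$ grows as $n$ increases: $0.5^{1/2} \approx 0.7,\ 0.5^{1/4} \approx 0.84,\ 0.5^{1/8} \approx 0.92,\ 0.5^{1/16} \approx 0.96$.

In addition, a model that predicts shorter sequences tends to obtain higher values of $p_n$. The factor in front of the product in Eq. (1), $\exp\left(\min\left(0, 1 - \frac{\text{len}_{\text{label}}}{\text{len}_{\text{pred}}}\right)\right)$, is therefore a penalty on short outputs. For example, with $k = 2$, suppose the label sequence is A, B, C, D, E, F and the predicted sequence is A, B. Although $p_1 = p_2 = 1$, the penalty factor is $\exp(1 - 6/2) \approx 0.14$, so BLEU is also about 0.14.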
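
Both effects can be checked numerically. The sketch below is illustrative only; the helper name `brevity_penalty` is an assumption and does not come from the original text.

```python
import math


def brevity_penalty(len_label, len_pred):
    """Factor in front of the product in the BLEU definition (hypothetical helper)."""
    return math.exp(min(0, 1 - len_label / len_pred))


# With p_n fixed at 0.5, the factor p_n ** (1 / 2 ** n) moves toward 1 as n grows.
weighted = [0.5 ** (1 / 2 ** n) for n in range(1, 5)]
print([round(w, 2) for w in weighted])

# Label of 6 words, prediction of 2 words: a strong penalty.
short_penalty = brevity_penalty(6, 2)
print(round(short_penalty, 2))

# No penalty when the prediction is at least as long as the label.
print(brevity_penalty(6, 6), brevity_penalty(6, 8))
```

The penalty never exceeds 1, so a long prediction is not rewarded for its length; it can only lose score through lower precision.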

Code for computing BLEU:

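import collections
import math

def bleu(pred_tokens, label_tokens, k):
    """Compute BLEU as in Eq. (1); assumes len(pred_tokens) >= k."""
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    # Penalty factor: below 1 only when the prediction is shorter than the label.
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        # Count the n-word subsequences of the label; tuple keys keep token
        # boundaries intact (joining with '' could merge distinct tokens).
        for i in range(len_label - n + 1):
            label_subs[tuple(label_tokens[i: i + n])] += 1
        # Each occurrence in the label can be matched at most once.
        for i in range(len_pred - n + 1):
            sub = tuple(pred_tokens[i: i + n])
            if label_subs[sub] > 0:
                num_matches += 1
                label_subs[sub] -= 1
        # Multiply in p_n raised to the power 1 / 2^n.
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score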
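
As a cross-check on the numbers quoted in the text, here is a compact variant of the same computation built on `collections.Counter`, whose `&` operator keeps the minimum count per key and therefore gives the number of matched subsequences directly. The names `ngram_counts` and `bleu_counter` are hypothetical, and the sketch assumes the prediction has at least $k$ tokens.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-word subsequences of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_counter(pred_tokens, label_tokens, k):
    """BLEU as defined in the text; hypothetical Counter-based variant."""
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        pred_counts = ngram_counts(pred_tokens, n)
        label_counts = ngram_counts(label_tokens, n)
        # Counter intersection keeps min(count_pred, count_label) per subsequence.
        num_matches = sum((pred_counts & label_counts).values())
        score *= (num_matches / (len_pred - n + 1)) ** (0.5 ** n)
    return score


label = list("ABCDEF")
short_score = bleu_counter(list("AB"), label, k=2)       # short prediction
example_score = bleu_counter(list("ABBCD"), label, k=2)  # running example
same_score = bleu_counter(label, label, k=3)             # identical sequences
print(round(short_score, 2), round(example_score, 2), same_score)
```

The short prediction A, B scores about 0.14 even though every one of its subsequences matches, which is the length penalty at work.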