2023_FINAL

1. FINAL [2023]

《FINAL: Factorized Interaction Layer for CTR Prediction》

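A plain multi-layer perceptron (MLP) is inefficient at learning multiplicative feature interactions, which has made feature interaction learning an important topic in CTR prediction. Existing feature interaction networks complement MLP learning effectively, but used on their own they often underperform an MLP, so integrating them with an MLP network is necessary to improve performance. This motivates us to explore a better alternative to the MLP backbone. Inspired by factorization machines, in this paper we propose FINAL, a factorized interaction layer that extends the widely used linear layer and can learn second-order feature interactions. As with an MLP, multiple FINAL layers can be stacked into a FINAL block, yielding feature interactions whose order grows exponentially. We unify feature interactions and the MLP in a single FINAL block and show empirically that it is an effective replacement for the MLP block. We also explore an ensemble of two FINAL blocks as an enhanced two-stream CTR model, which sets a new SOTA on open benchmark datasets. FINAL is easy to use as a building block and has delivered business-metric gains in several applications at Huawei. Our source code will be made available at MindSpore/models and FuxiCTR/model_zoo.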
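CTR prediction is usually formulated as a binary classification problem over rich but heterogeneous features, such as user profiles, item attributes, and session contexts. Feature interaction learning has therefore become an important research topic in CTR prediction. Existing methods generally model feature interactions in one of two ways.
- The first uses multi-layer perceptrons (MLPs) to learn hidden relations between features implicitly. Although MLPs have been shown to approximate any bounded continuous function in theory, in practice, with a finite network size, they are weak at modeling combinatorial feature interactions.
- The second uses multiplicative operations between features to model their interactions explicitly, as in DCN, FM, and xDeepFM. In these methods the feature combination degree usually scales linearly with the number of stacked layers, so a fairly deep architecture is needed to cover the useful feature combinations. Very deep models, however, are hard to optimize because of widely documented problems such as gradient explosion/vanishing ("Which neural net architectures give rise to exploding and vanishing gradients?") and rank collapse ("Attention is not all you need: Pure attention loses rank doubly exponentially with depth", "Rank diminishing in deep neural network").
As a result, existing methods struggle to model high-order feature interactions effectively.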
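In this paper we propose a Factorized Interaction Layer (FINAL) that learns multiplicative feature interactions explicitly and reaches very high combination orders without cumbersome layer stacking (Figure 1). Inspired by the fast exponentiation algorithm, FINAL raises the feature interaction order at an exponential rate in a hierarchical manner. Within each hierarchy, the input representations are multiplied by the output representations of a series of consecutive non-linear layers, which increases the interaction order step by step. Processing the feature representations through multiple hierarchies then increases the order exponentially. Building on the FINAL module, we design a unified framework that combines multiple FINAL blocks to learn feature interactions from different views, and we apply self-distillation, using the combined prediction as a common teacher, so that the blocks exchange the complementary knowledge they encode. Extensive experiments on four public datasets confirm the superiority of FINAL. It has also achieved notable success in online experiments across several commercial scenarios at our company. FINAL offers a new and simple option for building CTR prediction models and is expected to support a wide range of recommendation scenarios.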
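The paper's novelty is modest.
DCN V3 follows a similar idea: an Exponential Cross Network plus self-distillation. DCN V3, however, uses orders of 1, 2, 4, 8, 16, ..., whereas this paper uses 1, 3, 9, 27, .... In addition, a FINAL Block only produces the highest order and has no residual connection, so several FINAL Blocks of different depths usually have to be run in parallel and combined.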

1.1 Model

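The overall FINAL framework is shown in Figure 2. It consists mainly of several parallel FINAL blocks that are meant to model different feature interaction patterns. Each FINAL block contains several factorized interaction layers, and the maximum order of the feature interactions grows exponentially with model depth. The prediction scores from the different blocks are combined into one unified score that serves as the final prediction. This score also acts as a virtual teacher that self-teaches the blocks, so that they exchange and fuse their hidden knowledge. In this way, complex feature interactions can be captured effectively and comprehensively.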
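FINAL Block: the basic unit of our method is the FINAL block, which takes the flattened feature vector $\mathbf{\vec x}$ as input. This vector may include many kinds of fields, such as one-hot features, embedded categorical features, and numerical features. Because carefully engineered features in industrial scenarios are diverse, complex, and heterogeneous, the interactions between them can be complicated and implicit. Modeling high-order interactions is therefore essential to exploiting the features effectively. In practice, reaching a sufficiently high interaction order with minimal model depth matters for both accuracy and efficiency.
Inspired by the fast exponentiation algorithm, we design a hierarchical feature interaction mechanism that achieves exponential growth in order. Within each hierarchy, a factorized interaction layer raises the feature interaction order through several multiplicative operations. Let $\mathbf{\vec x}_{l-1}$ denote the input of the $l$-th factorized interaction layer. The layer transforms it as follows: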
$\begin{matrix} {\vec{h}}_{l, 1} = W_{l, 1} {\vec{x}}_{l - 1} + {\vec{b}}_{l, 1} \\ {\vec{h}}_{l, 2} = {\vec{h}}_{l, 1} \odot \sigma (W_{l, 2} {\vec{x}}_{l - 1} + {\vec{b}}_{l, 2}) \\ \vdots \\ {\vec{h}}_{l, N} = {\vec{h}}_{l, N - 1} \odot \sigma (W_{l, N} {\vec{x}}_{l - 1} + {\vec{b}}_{l, N}) \\ {\vec{x}}_{l} = \sum_{i = 1}^{N} {\vec{h}}_{l, i} \end{matrix}$
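where:
- $\mathbf{\vec x}_l$ is the layer output.
- $\mathbf W_{l,i}, \mathbf{\vec b}_{l,i}$ are the parameters of the $i$-th operation, and $N$ is the number of multiplicative interactions.
- $\sigma(\cdot)$ is the activation function.
Intuitively, the feature interaction order is proportional to the number of multiplicative operations. Because the intermediate result of every step is aggregated, the output of each layer contains multi-granularity feature interactions. In a FINAL block we stack several factorized interaction layers so that the order grows exponentially: starting from the initial feature interaction degree of the input, $K$ layers with $N$ multiplicative operations each let the FINAL block reach an interaction degree of $N^K$ with respect to $\mathbf{\vec x}$.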
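The layer equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`init_layer`, `final_layer`, `final_block`), the Gaussian initialization, and the choice of ReLU for $\sigma$ are assumptions, since the notes do not specify them. Vectors are stored as rows, so $W\mathbf{\vec x}$ becomes `x @ W`.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_layer(rng, in_dim, out_dim, num_ops):
    """Create N (W, b) pairs for one factorized interaction layer (assumed init)."""
    return [(rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim))
            for _ in range(num_ops)]

def final_layer(x, params, act=relu):
    """Apply one factorized interaction layer to x of shape (batch, in_dim)."""
    W1, b1 = params[0]
    h = x @ W1 + b1                 # h_{l,1}: plain linear projection
    out = h.copy()                  # running sum of h_{l,1..i}
    for W, b in params[1:]:
        h = h * act(x @ W + b)      # h_{l,i} = h_{l,i-1} * sigma(W_{l,i} x + b_{l,i})
        out = out + h
    return out                      # x_l = sum_i h_{l,i}

def final_block(x, layers):
    """Stack K factorized interaction layers."""
    for params in layers:
        x = final_layer(x, params)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                          # batch of 4 flattened inputs
layers = [init_layer(rng, 8, 16, num_ops=2),         # K = 2 layers,
          init_layer(rng, 16, 16, num_ops=2)]        # N = 2 operations each
hidden = final_block(x, layers)                      # shape (4, 16)
```

With `num_ops=1` the layer reduces to an ordinary linear layer, which matches the description of FINAL as an extension of the linear layer.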
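Cross-block Knowledge Transfer: our method favors using multiple FINAL blocks to learn feature interactions from different views.
- We first use separate linear projection layers to turn the hidden representations into output logits.
- These logits are averaged into a unified logit, which is passed through a sigmoid to give the final prediction (denoted $\hat y$).
We use the binary cross-entropy loss as the CTR prediction loss: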
$L_{c} = - \frac{1}{S} \sum_{i = 1}^{S} [y_{i} \log ({\hat{y}}_{i}) + (1 - y_{i}) \log (1 - {\hat{y}}_{i})]$
where $y_i$ and $\hat y_i$ are the label and the prediction of the $i$-th sample, and $S$ is the size of the training data.
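As a small sketch of the logit fusion and the loss $L_c$ described above: the helper names (`fuse_logits`, `bce`) are hypothetical, and the clipping constant `eps` is an assumption added only for numerical safety.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_logits(block_logits):
    """Average the per-block logits, then apply a sigmoid to get y_hat."""
    return sigmoid(np.mean(np.stack(block_logits, axis=0), axis=0))

def bce(target, pred, eps=1e-7):
    """Binary cross-entropy averaged over the S samples."""
    pred = np.clip(pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

y = np.array([1.0, 0.0, 1.0])
logit_a = np.array([2.0, -1.0, 0.5])       # logits from block a
logit_b = np.array([1.0, -3.0, 0.1])       # logits from block b
y_hat = fuse_logits([logit_a, logit_b])
loss_c = bce(y, y_hat)
```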
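To promote knowledge sharing among the FINAL blocks, we perform self-knowledge distillation to transfer inter-block knowledge: the final prediction $\hat y$ serves as the teacher, and each block is encouraged to learn from this synthesized prediction. Taking the dual-block model in Figure 2 as an example, let $\hat y_a$ and $\hat y_b$ denote the normalized prediction scores of the two blocks. Their knowledge distillation losses are: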
$\begin{matrix} L_{d} = - \frac{1}{S} \sum_{i = 1}^{S} [{\hat{y}}_{i} \log ({\hat{y}}_{a, i}) + (1 - {\hat{y}}_{i}) \log (1 - {\hat{y}}_{a, i})] \\ L_{d}^{'} = - \frac{1}{S} \sum_{i = 1}^{S} [{\hat{y}}_{i} \log ({\hat{y}}_{b, i}) + (1 - {\hat{y}}_{i}) \log (1 - {\hat{y}}_{b, i})] \end{matrix}$
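where $\hat y_{a,i}, \hat y_{b,i}$ are the block-specific predictions for the $i$-th sample.
Question: why not use the ground-truth $y_i$ as the teacher?
We optimize the model with the task loss plus the knowledge transfer regularizations, so the overall training loss is: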
$L = L_{c} + L_{d} + L_{d}^{'}$
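In this way each block is aware of both the task supervision signal and the cross-block knowledge, and can therefore cope better with complex feature interactions.
The knowledge transfer regularizations essentially force the output of each sub-network to stay close to the overall output of the model.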
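The overall objective $L = L_c + L_d + L_d'$ for the dual-block case can be sketched as follows. The function names are hypothetical. Whether gradients flow through the teacher $\hat y$ is not stated in these notes, and plain NumPy has no autograd, so that question is left open here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(target, pred, eps=1e-7):
    """Binary cross-entropy; target may be a hard label or a soft teacher score."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

def final_loss(y, logit_a, logit_b):
    """L = L_c + L_d + L_d' for two blocks with logits logit_a and logit_b."""
    y_a, y_b = sigmoid(logit_a), sigmoid(logit_b)   # block-specific predictions
    y_hat = sigmoid(0.5 * (logit_a + logit_b))      # fused prediction (teacher)
    loss_c = bce(y, y_hat)                          # task loss against labels
    loss_d_a = bce(y_hat, y_a)                      # block a learns from y_hat
    loss_d_b = bce(y_hat, y_b)                      # block b learns from y_hat
    return loss_c + loss_d_a + loss_d_b

y = np.array([1.0, 0.0, 1.0, 0.0])
total = final_loss(y, np.array([1.5, -0.5, 0.2, -2.0]),
                   np.array([0.5, -1.5, 0.8, -1.0]))
```

Because the teacher is a soft score, each distillation term is bounded below by the entropy of $\hat y$ rather than by zero, so the terms do not vanish even when both blocks agree exactly with the fused prediction.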
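Complexity: with parallel blocks, the computational cost of our method is $O(NK)$. To reach the same feature interaction order, layer-stacking methods need $O(N^K)$. When $N$ and $K$ are small, our method is about as efficient as existing methods, and it may have a significant efficiency advantage when modeling extremely high-order feature interactions. Moreover, the FINAL block is a plug-and-play module that can be inserted directly into an existing architecture or used to replace its MLP-based feature interaction module. FINAL is therefore compatible with a wide range of CTR prediction methods and can readily strengthen them.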
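A back-of-the-envelope illustration of the $O(NK)$ versus $O(N^K)$ comparison: the sketch below simply counts multiplicative operations in one FINAL block ($N$ per layer, $K$ layers) and the interaction degree $N^K$ it reaches, which is roughly the depth a method with linear degree growth would need. The function names are hypothetical and constant factors are ignored.

```python
def final_block_ops(n_ops, n_layers):
    """Multiplicative operations in one FINAL block: N per layer, K layers."""
    return n_ops * n_layers

def reachable_degree(n_ops, n_layers):
    """Interaction degree reached by the block, N**K as stated above."""
    return n_ops ** n_layers

for n_ops, n_layers in [(2, 2), (2, 4), (3, 4)]:
    ops = final_block_ops(n_ops, n_layers)
    degree = reachable_degree(n_ops, n_layers)
    print(f"N={n_ops}, K={n_layers}: {ops} operations, degree {degree}")
# N=2, K=2: 4 operations, degree 4
# N=2, K=4: 8 operations, degree 16
# N=3, K=4: 12 operations, degree 81
```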

1.2 Experiments

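Datasets: Criteo, Avazu, MovieLens, and Frappe.
For a fair comparison, we reuse the preprocessed datasets released with "Adaptive factorization network: Learning adaptive-order feature interactions" and follow the same splitting and preprocessing procedure.
Evaluation metric: AUC.
Baselines: we compare against four classes of existing models, grouped by feature interaction order:
- First order (individual features only): Logistic Regression (LR).
- Second order (pair-wise feature interactions): Factorization Machine (FM) and AFM.
- Third order (triple-wise feature interactions): CrossNet (two layers), CrossNetV2 (two layers), and CIN (two layers).
- Higher order: DCN, DCNV2, DeepFM, AutoInt, xDeepFM, and SAM.
Implementation details: we implement all the studied models on top of FuxiCTR, an open-source CTR prediction library. For a fair comparison, we follow the same experimental settings as "Adaptive factorization network: Learning adaptive-order feature interactions".
All baselines are trained with the Adam optimizer, with a learning rate of 0.001, a batch size of 4096, an embedding dimension of 10, and MLP hidden units of [400, 400, 400].
We use two FINAL blocks built from factorized interaction layers (with $K=2, N = 2$).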
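For reference, the settings listed above can be collected into a small configuration object. This is only a bookkeeping sketch: the class and field names are hypothetical and are not FuxiCTR configuration keys. Only the values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BaselineConfig:
    # Values reported for all baselines; field names are illustrative only.
    optimizer: str = "adam"
    learning_rate: float = 0.001
    batch_size: int = 4096
    embedding_dim: int = 10
    mlp_hidden_units: tuple = (400, 400, 400)

@dataclass(frozen=True)
class FinalConfig:
    # Values reported for FINAL; field names are illustrative only.
    num_blocks: int = 2
    num_layers_k: int = 2   # K
    num_ops_n: int = 2      # N

baseline_cfg = BaselineConfig()
final_cfg = FinalConfig()
```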
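Table 1 reports the evaluation results on the four datasets, from which we make the following observations:
- First, LR performs worst on all datasets, which shows that feature interaction modeling is necessary for CTR prediction.
- Second, methods that can model high-order feature interactions tend to perform better. This is intuitive, since more complex feature relatedness can be taken into account. Owing to its strength in modeling high-order feature interactions, FINAL outperforms the baselines significantly (t-test, $p\lt 0.05$). This confirms that FINAL is effective at capturing complex feature relations.
- Third, the dual-block FINAL model is slightly better than the single-block model. This may be because multiple blocks, with different structures and different initial parameters, help learn diverse feature interaction information.
- Fourth, self-knowledge distillation further improves the multi-block FINAL model. This again suggests that the knowledge encoded in different blocks is complementary, and that fusing it through knowledge distillation gives better guidance for learning each FINAL block.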
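Our FINAL block is a plug-and-play module that can improve a variety of deep CTR models. To demonstrate its compatibility, we introduce it as a replacement for the MLP block in four popular deep CTR models (MLP, DeepFM, xDeepFM, and DCN). The results are shown in Table 2.
We observe that the FINAL block consistently improves these popular deep CTR models, which confirms that FINAL captures useful signals that the models otherwise miss. Because FINAL is independent of the backbone architecture, it is a flexible component that can strengthen many CTR prediction models in real systems.
This is essentially model ensembling, so an improvement over the original models is to be expected.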
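Online evaluation: given its clear performance gains and low latency, we have deployed FINAL in several commercial scenarios at our company. In this section we pick two representative scenarios to demonstrate its advantages.
- News Feed recommendation: we run an online evaluation in a commercial news recommendation scenario in which millions of daily active users consume digital news articles. The online A/B test ran for one month, from September 25 to October 25, 2022. For online serving, we assign 5% of total traffic, covering more than 300k active users, to the experimental group. We compare our method with a carefully designed baseline model. Figure 3 summarizes the online results over 30 consecutive days. Our model shows a consistent improvement in online CTR throughout the evaluation period, with an average CTR lift of 3.17%. Online inference latency increased by 22.22%, which is acceptable in our system. These results demonstrate the effectiveness of FINAL for feed recommendation.
- Online advertisement display: online advertising needs to predict both the click-through rate and the post-click conversion rate (CVR). In our ad display scenario, a conversion corresponds to events such as installing an app, submitting registration information, or user retention.
  Multi-task learning (MTL) is a common solution for joint CTR and CVR estimation. MTL typically uses a model with a shared-bottom structure, in which the parameters of the bottom embedding layers are shared across tasks. MLP modules are then applied on top of the shared bottom to learn feature interactions and make task-specific predictions. For comparison, we replace the MLP in this MTL framework with a FINAL block. For online serving, we randomly select 5% of users as the experimental group, who receive ad recommendations from the FINAL-enhanced model. The control group is another 5% of users, served by the baseline MTL model. Over 7 consecutive days of online A/B testing, the overall CVR gain is 5.52%. The results confirm the effectiveness of FINAL for online advertising.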
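The shared-bottom setup described above, with a FINAL block in place of each task's MLP tower, might look roughly like the NumPy sketch below. It is an illustration under stated assumptions, not the deployed system: all names (`make_tower`, `tower_forward`, `mtl_forward`), the dimensions, ReLU as the activation, and the random initialization are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def final_layer(x, params):
    """One factorized interaction layer: sum of progressively multiplied terms."""
    W1, b1 = params[0]
    h = x @ W1 + b1
    out = h.copy()
    for W, b in params[1:]:
        h = h * np.maximum(x @ W + b, 0.0)   # ReLU assumed for sigma
        out = out + h
    return out

def make_tower(rng, in_dim, hidden, n_ops, n_layers):
    """Task-specific tower: K FINAL layers plus a linear output head."""
    dims = [in_dim] + [hidden] * n_layers
    layers = [[(rng.normal(scale=0.1, size=(dims[i], dims[i + 1])),
                np.zeros(dims[i + 1])) for _ in range(n_ops)]
              for i in range(n_layers)]
    head = rng.normal(scale=0.1, size=hidden)
    return layers, head

def tower_forward(x, tower):
    layers, head = tower
    for params in layers:
        x = final_layer(x, params)
    return sigmoid(x @ head)                 # one probability per sample

def mtl_forward(feature_ids, embedding, towers):
    """Shared bottom: embedding lookup and flatten, then one tower per task."""
    x = embedding[feature_ids].reshape(feature_ids.shape[0], -1)
    return {task: tower_forward(x, tower) for task, tower in towers.items()}

rng = np.random.default_rng(0)
embedding = rng.normal(scale=0.1, size=(100, 4))     # shared across tasks
feature_ids = rng.integers(0, 100, size=(8, 3))      # 8 samples, 3 fields
towers = {"ctr": make_tower(rng, 12, 16, n_ops=2, n_layers=2),
          "cvr": make_tower(rng, 12, 16, n_ops=2, n_layers=2)}
preds = mtl_forward(feature_ids, embedding, towers)
```

Only the embedding table is shared. Each task keeps its own FINAL layers and output head, mirroring the role the per-task MLP plays in the baseline.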