As machine learning models penetrate critical areas like medicine, the criminal justice system, and financial markets, the inability of humans to understand these models seems problematic. [1]

Interpretable Machine Learning

Summary of interpretability from Open ATA

Interpretability must be qualified. To be meaningful, any assertion regarding interpretability should fix a specific definition.
Transparency may be at odds with the broader objectives of AI.
Global model-agnostic explanation methods & local model-agnostic explanation methods

Shapley Value

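In cooperative game theory, the Shapley value is a method for fairly distributing the total gain or cost among a group of cooperating players. For example, in a team project where each member contributes differently, the Shapley value provides a way to determine how much credit or blame each member deserves.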

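Example: consider a simplified business. The owner $o$ provides crucial capital, in the sense that without him or her no gains can be obtained. There are $m$ workers $w_1, \dots, w_m$, each of whom contributes an amount $p$ to the total profit.

The set of players in the firm is then

$$N = \{o, w_1, \dots, w_m\}$$

Assume the value function:

$$v(S) = \begin{cases} k_S \, p, & o \in S \\ 0, & \text{otherwise} \end{cases}$$

where $k_S$ is the number of workers in $S$.

Applying the Shapley value formula gives each player's contribution:

$$\phi_o(v) = \frac{mp}{2}, \qquad \phi_{w_j}(v) = \frac{p}{2}$$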

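Shapley Value formula

Game Theory for Economic Analysis. [2]

Sum over subsets (combinations)

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \, \bigl(v(S \cup \{i\}) - v(S)\bigr)$$

where $n = |N|$ is the number of players.

Sum over all orderings $R$

$$\phi_i(v) = \frac{1}{n!} \sum_{R} \bigl(v(P_i^R \cup \{i\}) - v(P_i^R)\bigr)$$

where the sum runs over all $n!$ orderings $R$ of the players and $P_i^R$ is the set of players that precede $i$ in $R$.

(Example) Computing the owner's contribution

Take the firm with one owner $o$ and $m$ workers who each contribute $p$, so $n = m + 1$. Suppose the owner is in position $k$ of the ordering. Then $k - 1$ workers precede the owner, and the owner's marginal contribution is $(k - 1)p$. Each position $k = 1, \dots, m + 1$ is equally likely, so

$$\phi_o(v) = \frac{1}{m + 1} \sum_{k=1}^{m+1} (k - 1)\,p = \frac{1}{m + 1} \cdot \frac{m(m + 1)}{2} \cdot p = \frac{mp}{2}$$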
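The owner-and-workers example can be checked numerically. The sketch below is a minimal brute-force version of the ordering form of the Shapley value. The function names (`shapley_by_orderings`, `make_firm_game`) and the concrete numbers (3 workers, a profit of 10 per worker) are illustrative assumptions, not part of the original notes.

```python
from fractions import Fraction
from itertools import permutations


def shapley_by_orderings(players, value):
    """Average each player's marginal contribution over all orderings."""
    totals = {player: Fraction(0) for player in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for player in order:
            before = value(frozenset(coalition))
            coalition.add(player)
            totals[player] += value(frozenset(coalition)) - before
    return {player: total / len(orderings) for player, total in totals.items()}


def make_firm_game(profit_per_worker):
    """Value function: workers only generate profit if the owner is present."""
    def value(coalition):
        if "owner" not in coalition:
            return 0
        return profit_per_worker * (len(coalition) - 1)
    return value


m, p = 3, 10  # hypothetical numbers: 3 workers, profit 10 per worker
players = ["owner"] + [f"worker_{j}" for j in range(1, m + 1)]
phi = shapley_by_orderings(players, make_firm_game(p))
print(phi)
```

With these numbers the owner receives 15 and each worker 5, i.e. mp/2 and p/2. Enumerating all orderings costs n! evaluations, so this is only practical for a handful of players.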

SHAP: a local model-agnostic interpretability method

A Unified Approach to Interpreting Model Predictions [3]

https://shap.readthedocs.io/en/latest/

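Only the approximation by weighted linear regression (Kernel SHAP) is covered here; there is also a variant for deep networks (Deep SHAP).

SHAP is a game-theoretic approach to explaining the output of any machine learning model. It uses the classic Shapley values from game theory and their related extensions to connect optimal credit allocation with local explanations, i.e. attribution values.

The attribution values plus the base value should equal the prediction.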

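Three properties of additive feature attribution

Additive means that the total contribution of several features (or variables) is taken to be the sum of their individual contributions.

Local accuracy
- When the simplified input x′ corresponds to the original input x, the explanation model at least matches the output of the original model f.
- The attributions of all features sum exactly to the difference between the model output and the baseline (the mean over the background distribution).
Missingness
- For any feature $i$ that does not change the model output in any context, we must have $\phi_i = 0$.
- That is, if the model does not "use" a feature, that feature should receive no contribution.
Consistency
- If, in another model $f'$, the marginal contribution of feature $i$ is at least as large as in $f$ for every context $S$, i.e. $f'_x(S \cup \{i\}) - f'_x(S) \ge f_x(S \cup \{i\}) - f_x(S)$, then we should have $\phi_i(f', x) \ge \phi_i(f, x)$.

Most methods, such as LIME and DeepLIFT, satisfy only one or two of these properties, whereas classical Shapley value estimation methods satisfy all three.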
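Local accuracy and missingness can be checked on a toy model. The sketch below computes exact Shapley values with the subset formula, under one simplifying assumption: a "missing" feature is replaced by a fixed baseline value instead of being averaged over a background distribution. The names `exact_shapley` and `toy_model`, and the numbers, are hypothetical.

```python
from itertools import combinations
from math import factorial


def exact_shapley(f, x, baseline):
    """Exact Shapley values; features outside the subset take their baseline value."""
    n = len(x)

    def v(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (v(set(subset) | {i}) - v(set(subset)))
        phi.append(total)
    return v(set()), phi


def toy_model(z):
    # The third feature z[2] is deliberately ignored by the model.
    return 2 * z[0] + z[0] * z[1]


x = [1.0, 3.0, 5.0]
base_value, phi = exact_shapley(toy_model, x, baseline=[0.0, 0.0, 0.0])
print(base_value, phi, toy_model(x))
```

Here the base value plus the attributions reproduces the prediction (local accuracy), and the unused third feature receives an attribution of zero (missingness).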

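Attribution value φᵢ: the marginal improvement in the model prediction when feature i is "added", accumulated over all possible subsets S.
Model prediction: the v(S) used in SHAP is the conditional expectation of the model output when the values of the features in S are held fixed and the other features are randomized. It is computed relative to a base value.

Weighted average:
- Each marginal contribution term: $v(S \cup \{i\}) - v(S)$

- Weight: $\frac{|S|! \, (n - |S| - 1)!}{n!}$, where $n$ is the number of features

In practice, that conditional expectation is often estimated with a Monte Carlo approximation over background samples, or, for decision trees, computed exactly with the dedicated Tree SHAP algorithm.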
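A minimal sketch of the Monte Carlo idea, assuming permutation sampling: for each background row and each random feature ordering, the features of x are switched in one by one and the change in the model output is credited to the feature just switched. This estimates the expectation with missing features drawn from the background rows, which is itself an approximation of the conditional expectation. The function name `sampled_shapley` and the toy linear model are hypothetical.

```python
import random


def sampled_shapley(f, x, background, n_permutations=50, seed=0):
    """Estimate Shapley values by sampling feature orderings per background row."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    n_draws = 0
    for z in background:
        for _ in range(n_permutations):
            order = list(range(n))
            rng.shuffle(order)
            current = list(z)          # start from the background row
            previous = f(current)
            for i in order:
                current[i] = x[i]      # switch feature i to its real value
                updated = f(current)
                phi[i] += updated - previous
                previous = updated
            n_draws += 1
    return [total / n_draws for total in phi]


def linear_model(z):
    return 3 * z[0] - 2 * z[1] + 0.5 * z[2]


x = [1.0, 2.0, 3.0]
background = [[0.0, 0.0, 0.0], [2.0, 1.0, 1.0], [4.0, 2.0, 5.0]]
phi = sampled_shapley(linear_model, x, background)
base_value = sum(linear_model(z) for z in background) / len(background)
print(base_value, phi)
```

For a linear model the estimate is exact: each attribution equals the coefficient times the distance of the feature from its background mean. For non-linear models the estimate becomes noisy and more permutations are needed.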

Examples

Text

Image

import json

import numpy as np
import torch
import torchvision

import shap

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = torchvision.models.mobilenet_v2(pretrained=True, progress=False)
model.to(device)
model.eval()
X, y = shap.datasets.imagenet50()

# Getting ImageNet 1000 class names
url = "https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json"
with open(shap.datasets.cache(url)) as file:
    class_names = [v[1] for v in json.load(file).values()]
print("Number of ImageNet classes:", len(class_names))
# print("Class names:", class_names)

# Prepare data transformation pipeline

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]


def nhwc_to_nchw(x: torch.Tensor) -> torch.Tensor:
    if x.dim() == 4:
        x = x if x.shape[1] == 3 else x.permute(0, 3, 1, 2)
    elif x.dim() == 3:
        x = x if x.shape[0] == 3 else x.permute(2, 0, 1)
    return x


def nchw_to_nhwc(x: torch.Tensor) -> torch.Tensor:
    if x.dim() == 4:
        x = x if x.shape[3] == 3 else x.permute(0, 2, 3, 1)
    elif x.dim() == 3:
        x = x if x.shape[2] == 3 else x.permute(1, 2, 0)
    return x


transform = [
    torchvision.transforms.Lambda(nhwc_to_nchw),
    torchvision.transforms.Lambda(lambda x: x * (1 / 255)),
    torchvision.transforms.Normalize(mean=mean, std=std),
    torchvision.transforms.Lambda(nchw_to_nhwc),
]

inv_transform = [
    torchvision.transforms.Lambda(nhwc_to_nchw),
    torchvision.transforms.Normalize(
        mean=(-1 * np.array(mean) / np.array(std)).tolist(),
        std=(1 / np.array(std)).tolist(),
    ),
    torchvision.transforms.Lambda(nchw_to_nhwc),
]

transform = torchvision.transforms.Compose(transform)
inv_transform = torchvision.transforms.Compose(inv_transform)

def predict(img: np.ndarray) -> torch.Tensor:
    img = nhwc_to_nchw(torch.Tensor(img))
    img = img.to(device)
    output = model(img)
    return output

# Check that transformations work correctly
Xtr = transform(torch.Tensor(X))
out = predict(Xtr[1:3])
classes = torch.argmax(out, axis=1).cpu().numpy()
print(f"Classes: {classes}: {np.array(class_names)[classes]}")

topk = 4
batch_size = 50
n_evals = 10000

# define a masker that is used to mask out partitions of the input image.
masker_blur = shap.maskers.Image("blur(128,128)", Xtr[0].shape)

# create an explainer with model and image masker
explainer = shap.Explainer(predict, masker_blur, output_names=class_names)

# explain a single image (Xtr[1:2]), using up to n_evals = 10000 evaluations
# of the underlying model to estimate the SHAP values
shap_values = explainer(
    Xtr[1:2],
    max_evals=n_evals,
    batch_size=batch_size,
    outputs=shap.Explanation.argsort.flip[:topk],
)

(shap_values.data.shape, shap_values.values.shape)

shap_values.data = inv_transform(shap_values.data).cpu().numpy()[0]
shap_values.values = [val for val in np.moveaxis(shap_values.values[0], -1, 0)]

shap.image_plot(
    shap_values=shap_values.values,
    pixel_values=shap_values.data,
    labels=shap_values.output_names,
    true_labels=[class_names[132]],
)

SHAP for MultiModal

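The single-modality SHAP discussed above has two further important properties:

The weights should sum to 1

The attribution values plus the base value should equal the prediction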
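The first property can be verified directly. For a fixed feature there are C(n−1, s) subsets of size s among the other n−1 features, each carrying the Shapley weight s!(n−s−1)!/n!. The sketch below sums these weights with exact fractions; the helper names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial


def shapley_weight(subset_size, n_features):
    """Weight of one subset of the given size in the Shapley formula."""
    return Fraction(
        factorial(subset_size) * factorial(n_features - subset_size - 1),
        factorial(n_features),
    )


def total_weight(n_features):
    """Sum of the weights over all subsets that exclude one fixed feature."""
    return sum(
        comb(n_features - 1, size) * shapley_weight(size, n_features)
        for size in range(n_features)
    )


for n in range(1, 9):
    print(n, total_weight(n))
```

Each size class contributes exactly 1/n, so the n size classes together always sum to 1, whatever the number of features.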

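Differences between multimodal and unimodal models

A unimodal model handles only a single type of data, such as

Text
Image
Audio
Structured features (Tabular)

whereas a multimodal model handles **several types** of data at once, for example:

Image + text (e.g. image-text retrieval, image captioning)
Audio + video (e.g. video understanding, virtual presenters)
Sensor signals + time series (e.g. multi-sensor fusion for autonomous driving)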

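Multimodal model architectures and fusion strategies

Early fusion
- Features from the different modalities are concatenated or weighted at the lowest level and fed into a single unified network
- For example, a "lightweight" CNN
- The feature dimension blows up easily; when every modality is high-dimensional, this puts heavy pressure on computation and storage. After concatenation the network often cannot tell which patterns are "important", so a lot of redundant information is easily introduced.
Mid fusion
- Each modality is encoded independently first, then the modalities interact in intermediate layers (e.g. cross-modal multi-head attention)
- For example, LXMERT

Late fusion
- Each modality makes its own prediction, and the results are fused at the end by voting or weighted averaging
- For example, CLIP: image and text each use a separate encoder and are mapped into an embedding space of the same dimension, with no interaction inside the network. Their relatedness is finally judged by a similarity computation (e.g. dot product or cosine), or the scores of several tasks are produced in parallel and then weighted.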
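To make the contrast concrete, here is a deliberately tiny sketch of early and late fusion on plain Python lists. It is not how LXMERT or CLIP are implemented; the functions, weights and vectors are hypothetical and only show where in the pipeline the modalities meet.

```python
import math


def early_fusion_score(image_feats, text_feats, weights):
    """Early fusion: concatenate raw features, then apply one shared scorer."""
    fused = image_feats + text_feats  # list concatenation
    return sum(w * v for w, v in zip(weights, fused))


def late_fusion_score(image_score, text_score, image_weight=0.5):
    """Late fusion: each modality predicts alone, the scores are averaged."""
    return image_weight * image_score + (1 - image_weight) * text_score


def cosine_similarity(a, b):
    """CLIP-style matching: compare two independently produced embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


print(early_fusion_score([1.0, 2.0], [3.0], weights=[1.0, 1.0, 1.0]))
print(late_fusion_score(0.8, 0.4))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

In the early-fusion function the scorer sees one long vector and no longer knows which entries came from which modality, which is part of what makes per-modality attribution harder.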

This is why conventional (unimodal) SHAP cannot be applied to these models directly.

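Shapley value derivation for two modalities

Assume the input consists of image features and text features.

For an image feature, its Shapley value is

where the coefficient of each marginal contribution is its weight. Following the single-modality SHAP result above, the weights across the different modalities should sum to 1:

The value (prediction) of the image modality should be:

At the same time, the weights follow the combinatorial counting used above, rather than flattening all feature values into one set as in the single-modality case:

The +1 is needed because, from the point of view of the text modality, the image modality can be regarded as one of its feature inputs.

Next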

How to use SHAP to explain multimodal models

Approach 1

MM-SHAP: A performance-agnostic metric for measuring multimodal contributions in vision and language models & tasks [4]

For a single sample (sentence, image_path):
1. Use the tokenizer to convert the sentence into input_ids (length L).
2. Use FRCNN to extract 36 ROI features for the whole image, features (shape [1, 36, d]), together with the corresponding normalized_boxes.
Define custom_masker(mask, x)
1. mask is a boolean vector of length L + 36: the first L entries control the text tokens and the last 36 entries control the image patches.
Define predict
Initialize the SHAP Explainer
Compute SHAP values for a single test sample

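# ---------- 1. Setup: preprocessing function and model loading ----------

import shap
import torch

# tokenizer, image_preprocess, frcnn, model, MAX_LEN, PAD_ID and PATCH_NUM are
# assumed to be defined when the models are loaded (not shown here).

def preprocess(sentence, image_path):
    # Encode the text
    inputs = tokenizer(
        sentence,
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN,
        return_tensors="pt"
    )
    # Extract image features
    imgs, sizes, scales = image_preprocess(image_path)
    output = frcnn(imgs.cuda(), sizes, scales, return_tensors="pt")
    feats = output["roi_features"]         # [1, PATCH_NUM, feat_dim]
    boxes = output["normalized_boxes"]      # [1, PATCH_NUM, 4]
    return inputs, feats, boxes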

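# ---------- 2. Define the masker and the model prediction ----------

# Assumption, following the MM-SHAP reference implementation: shap.Explainer is
# given a plain function masker(mask, x), and each sample is one row made of
# L token ids followed by PATCH_NUM patch indicators (1 = keep, 0 = masked).

def custom_masker(mask, x):
    # mask: boolean vector of length L + PATCH_NUM (True = keep)
    mask = torch.as_tensor(mask, dtype=torch.bool)
    masked_x = torch.as_tensor(x).clone().reshape(1, -1)  # [1, L + PATCH_NUM]
    n_tokens = masked_x.shape[1] - PATCH_NUM

    # Text part: positions where mask is False are replaced by PAD_ID
    masked_x[0, :n_tokens][~mask[:n_tokens]] = PAD_ID

    # Image part: positions where mask is False are marked 0 (patch dropped)
    masked_x[0, n_tokens:][~mask[n_tokens:]] = 0

    return masked_x

def predict(x):
    # x: [B, L + PATCH_NUM] batch of masked rows; feats and boxes come from preprocess()
    x = torch.as_tensor(x)
    n_tokens = x.shape[1] - PATCH_NUM
    ids = x[:, :n_tokens].long().cuda()
    keep = x[:, n_tokens:].to(feats.dtype).cuda()       # [B, PATCH_NUM]
    masked_feats = feats.cuda() * keep.unsqueeze(-1)     # zero the dropped patches
    with torch.no_grad():
        out = model(
            input_ids=ids,
            attention_mask=torch.ones_like(ids),
            visual_feats=masked_feats,
            visual_pos=boxes.repeat(ids.shape[0], 1, 1).cuda(),
            token_type_ids=torch.zeros_like(ids),
            return_dict=True
        )
    probs = torch.softmax(out["cross_relationship_score"], dim=1)
    return probs[:, 1].cpu().numpy()  # one score per masked sample

# ---------- 3. Initialize the SHAP Explainer ----------

# The masking function itself is passed as the masker. Masked positions are
# filled with PAD_ID / zeroed features, so no background set is used here.
explainer = shap.Explainer(predict, custom_masker)

# ---------- 4. Compute SHAP values for a single sample ----------

# In the main loop, each sample only needs:
inputs, feats, boxes = preprocess(test_sentence, test_image_path)
L = inputs.input_ids.shape[1]
patch_slots = torch.ones(1, PATCH_NUM, dtype=torch.long)
x = torch.cat((inputs.input_ids, patch_slots), dim=1).unsqueeze(1)  # [1, 1, L + PATCH_NUM]
shap_vals = explainer(x)

# Reading the result (flattened, because the exact array layout may differ
# between shap versions):
vals = shap_vals.values.reshape(-1)   # length L + PATCH_NUM
text_attr = vals[:L]                  # attribution of the text tokens
image_attr = vals[L:]                 # attribution of the image patches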
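Once the token and patch attributions are available, the MM-SHAP paper [4] summarizes them as modality shares: the absolute SHAP values are summed per modality and normalized, giving a textual share (T-SHAP) and a visual share (V-SHAP). A minimal sketch follows; the function name `modality_shares` and the example values are hypothetical, and the input is assumed to be a flat sequence with the text attributions first.

```python
def modality_shares(values, n_text_tokens):
    """Return (T-SHAP, V-SHAP): each modality's share of total absolute attribution."""
    text_total = sum(abs(v) for v in values[:n_text_tokens])
    image_total = sum(abs(v) for v in values[n_text_tokens:])
    total = text_total + image_total
    if total == 0:
        raise ValueError("all attributions are zero; the shares are undefined")
    return text_total / total, image_total / total


# Hypothetical attributions: 2 text tokens followed by 2 image patches
t_shap, v_shap = modality_shares([0.2, -0.1, 0.3, -0.4], n_text_tokens=2)
print(t_shap, v_shap)
```

Absolute values are used so that positive and negative contributions both count as the model "using" that modality, which is what makes the measure independent of whether the prediction was correct.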

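Approach 2

Treat the text as a whole, i.e. as a single super-input, and compare the marginal contributions of the individual image patches against it.
Do the same once more for the image.
Merge the results into the final attribution of each sub-input.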
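One possible reading of these three steps is sketched below on a toy model. The first pass plays a game among the image patches plus one "TEXT" super-player, the second pass among the text tokens plus one "IMAGE" super-player, and the merge simply takes patch attributions from the first pass and token attributions from the second. The merge rule, the function names and the toy model are all assumptions made for illustration; the notes do not specify them.

```python
from itertools import combinations
from math import factorial


def shapley(players, value):
    """Exact Shapley values via the subset formula."""
    n = len(players)
    phi = {}
    for player in players:
        others = [q for q in players if q != player]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {player}) - value(s))
        phi[player] = total
    return phi


def two_pass_attribution(patches, tokens, model):
    """model(kept_patches, kept_tokens) -> float; 'TEXT'/'IMAGE' are reserved names."""
    all_patches, all_tokens = frozenset(patches), frozenset(tokens)

    def value_with_text_player(coalition):
        kept_tokens = all_tokens if "TEXT" in coalition else frozenset()
        return model(coalition - {"TEXT"}, kept_tokens)

    def value_with_image_player(coalition):
        kept_patches = all_patches if "IMAGE" in coalition else frozenset()
        return model(kept_patches, coalition - {"IMAGE"})

    phi_image_pass = shapley(list(patches) + ["TEXT"], value_with_text_player)
    phi_text_pass = shapley(list(tokens) + ["IMAGE"], value_with_image_player)
    merged = {p: phi_image_pass[p] for p in patches}
    merged.update({t: phi_text_pass[t] for t in tokens})
    return merged, phi_image_pass, phi_text_pass


def toy_model(kept_patches, kept_tokens):
    # Hypothetical score that needs both modalities to be non-zero.
    return float(len(kept_patches) * len(kept_tokens))


merged, phi_image_pass, phi_text_pass = two_pass_attribution(
    ["p0", "p1"], ["t0", "t1", "t2"], toy_model
)
print(merged)
```

Each pass on its own satisfies local accuracy (its values sum to the full score minus the empty score). The merged values are not guaranteed to do so in general, even though they happen to in this symmetric toy case.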

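Further reading

Attention, please! PixelSHAP reveals what vision-language models actually focus on. Roni Goldshmidt, 2025 [6]

Candidate objects or regions are generated with segmentation algorithms such as Mask R-CNN and SAM (Segment Anything Model);
individual objects are masked one at a time (the region is replaced by a baseline patch or random noise), and the difference in the text regenerated by the VLM is observed;
these differences are aggregated according to the Shapley formula to obtain each "image object"'s contribution score to the text output.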

References

[1]. Lipton, Z. C. (2016). The mythos of model interpretability. arXiv preprint arXiv:1606.03490. https://doi.org/10.48550/arXiv.1606.03490

[2]. Ichiishi, T. (1983). Game Theory for Economic Analysis. New York: Academic Press, pp. 118–120. ISBN 0-12-370180-5.

[3]. Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In I. Guyon et al. (Eds.), Advances in Neural Information Processing Systems, 30, 4765–4774. Curran Associates, Inc.

[4]. Parcalabescu, L., & Frank, A. (2023). MM-SHAP: A performance-agnostic metric for measuring multimodal contributions in vision and language models & tasks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL ’23) (pp. 4032–4059). Association for Computational Linguistics.

[5]. Lipovetsky, S., & Conklin, M. (2001). Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry, 17(4), 319–330.

[6]. Goldshmidt, R. (2025). Attention, Please! PixelSHAP Reveals What Vision-Language Models Actually Focus On. arXiv preprint arXiv:2503.06670.

[7]. Horovicz, M., & Goldshmidt, R. (2024, July 14). TokenSHAP: Interpreting large language models with Monte Carlo Shapley value estimation. arXiv preprint arXiv:2407.10114.

Multimodal Interpretation