题解 | #基于卡方检验的特征选择实现#

核心概念

卡方检验：衡量特征（词汇）与类别（情感）独立性的统计方法，计算公式： $\chi^2 = \frac{N \times (AD - BC)^2}{(A+B) \times (C+D) \times (A+C) \times (B+D)}$ 其中：

$A$ ：包含该词且正面的文档数
$B$ ：包含该词且负面的文档数
$C$ ：不包含该词且正面的文档数
$D$ ：不包含该词且负面的文档数
$N$ ：总文档数（ $A+B+C+D$ ）

统计意义：

$\chi^2$ 值越大，特征与类别相关性越强
$\chi^2 = 0$ 表示特征与类别独立

解题思路

数据读取：
- 读取文档数量 $N$
- 解析每行文档的标签和文本内容
- 读取需要选择的特征数 $k$
特征统计：
- 统计每个词汇在正/负面文档中的出现次数
- 统计正/负面文档总数
卡方计算：
- 对每个词汇构建 $2 \times 2$ 列联表
- 根据公式计算卡方统计量
特征选择：
- 按卡方值降序排序
- 卡方值相同时按字母顺序升序
- 选择前 $k$ 个特征

代码解析

import sys
from collections import defaultdict

def read_data(N):
    """读取文档数据和标签"""
    documents = []  # 文档单词列表
    labels = []     # 文档标签列表
    for _ in range(N):
        line = sys.stdin.readline().strip()
        label, text = line.split('\t', 1)  # 分割标签和文本
        words = text.split()  # 分割单词（保留大小写）
        documents.append(words)
        labels.append(label)
    return documents, labels

def compute_chi_square(documents, labels):
    """计算每个词汇的卡方统计量"""
    # 初始化统计结构
    word_set = set()  # 所有词汇集合
    word_counts = defaultdict(lambda: defaultdict(int))  # 词汇-标签计数
    label_counts = defaultdict(int)  # 标签计数
    
    # 统计标签分布
    for label in labels:
        label_counts[label] += 1
    
    # 统计词汇在各类别中的文档频率
    for words, label in zip(documents, labels):
        unique_words = set(words)  # 文档去重
        for word in unique_words:
            word_set.add(word)
            word_counts[word][label] += 1
    
    # 计算卡方值
    N = len(documents)
    chi_square_scores = {}
    for word in word_set:
        A = word_counts[word]['positive']  # 正面出现
        B = word_counts[word]['negative']  # 负面出现
        C = label_counts['positive'] - A   # 正面未出现
        D = label_counts['negative'] - B   # 负面未出现
        
        # 计算卡方分子和分母
        numerator = N * (A * D - B * C) ** 2
        denominator = (A + B) * (C + D) * (A + C) * (B + D)
        
        # 处理分母为零的情况
        chi_square = numerator / denominator if denominator != 0 else 0
        chi_square_scores[word] = chi_square
    
    return chi_square_scores

def select_top_k_features(chi_square_scores, k):
    """选择前k个特征"""
    # 排序规则：卡方值降序 -> 字母升序
    sorted_features = sorted(chi_square_scores.items(),
                            key=lambda x: (-x[1], x[0]))
    return [feature for feature, _ in sorted_features[:k]]

if __name__ == "__main__":
    # 读取文档数量
    N = int(sys.stdin.readline().strip())
    
    # 读取文档数据
    documents, labels = read_data(N)
    
    # 读取特征数量k
    k = int(sys.stdin.readline().strip())
    
    # 计算卡方统计量
    chi_scores = compute_chi_square(documents, labels)
    
    # 选择并输出特征
    top_features = select_top_k_features(chi_scores, k)
    for feature in top_features:
        print(feature)

关键函数说明

read_data(N)：
- 输入：文档数量 $N$
- 输出：文档列表（单词列表）和标签列表
- 过程：解析每行文档，分割标签和文本
compute_chi_square(documents, labels)：
- 输入：文档列表，标签列表
- 输出：词汇到卡方值的映射字典
- 过程：
  - 统计标签分布
  - 统计每个词汇在各类别中的文档频率（文档去重）
  - 构建 $2 \times 2$ 列联表
  - 计算卡方值
select_top_k_features(chi_square_scores, k)：
- 输入：卡方值字典，特征数 $k$
- 输出：前 $k$ 个特征列表
- 排序规则：
  - 主序：卡方值降序
  - 次序：词汇字母升序

应用场景

文本分类特征选择
情感分析
垃圾邮件检测
关键词提取

总结

本题实现了基于卡方检验的文本特征选择：

通过统计词汇在正/负面文档中的分布
计算每个词汇的卡方统计量
根据统计量和字母顺序筛选关键特征
输出最具判别力的词汇

该方法能有效提升文本分类模型的性能和效率，帮助团队优化客户评论分析系统，为产品改进提供数据支持。