逻辑回归需要掌握的知识点

知道逻辑回归的损失函数
知道逻辑回归的优化方法
知道sigmoid函数
知道逻辑回归的应用场景
应用LogisticRegression实现逻辑回归预测
知道精确率、召回率指标的区别
知道如何解决样本不均衡情况下的评估
了解ROC曲线的意义说明AUC指标大小
应用classification_report实现精确率、召回率计算
应用roc_auc_score实现指标计算

文章目录

逻辑回归需要掌握的知识点
基于逻辑回归的癌症预测案例
癌症分类预测-良／恶性乳腺癌肿瘤预测

基于逻辑回归的癌症预测案例

癌症分类预测-良／恶性乳腺癌肿瘤预测

数据介绍

原始数据的下载地址：https://archive.ics.uci.edu/ml/machine-learning-databases/（可以不通过手动下载，通过url的方法）

数据及源代码的Github下载地址：https://github.com/w1449550206/Cancer-prediction-based-on-logistic-regression.git

数据描述

（1）699条样本，共11列数据，第一列用语检索的id，后9列分别是与肿瘤

相关的医学特征，最后一列表示肿瘤类型的数值。

（2）包含16个缺失值，用”?”标出。

1 分析

1.获取数据（可以不通过手动下载，通过url的方法）
2.基本数据处理
2.1 缺失值处理
2.2 确定特征值,目标值
2.3 分割数据
3.特征工程标准化
4.机器学习逻辑回归的建模
5.模型评估二分类准确率精确率召回率 F1_score AUC

2 详细写代码步骤

import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

1.获取数据在线下载
2.数据基本处理缺失值处理，划分数据集
3.特征工程数据标准化
4.机器学习逻辑回归的建模
5.模型评估二分类准确率精确率召回率 F1_score AUC

2.1 获取数据在线下载

2.1.1 pandas在线下载数据

data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data')

2.1.2 查看数据

data.head()

2.1.3 修改列的名字

#给列名字
names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape','Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin','Normal Nucleoli', 'Mitoses', 'Class']

#给data增加一个names参数
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',names = names)

data.head()

2.1.4 列名解释

Sample code number 样本编号用不到
Clump Thickness 肿瘤特征1
Uniformity of Cell Size 肿瘤特征2
Uniformity of Cell Shape 肿瘤特征3
Marginal Adhesion 肿瘤特征4
Single Epithelial Cell Size 肿瘤特征5
Bare Nuclei 肿瘤特征6
Bland Chromatin 肿瘤特征7
Normal Nucleoli 肿瘤特征8
Mitoses 肿瘤特征9
Class 肿瘤的种类

在实际工作中，要弄清楚每一个肿瘤特征代表什么含义，这样才能做好异常值缺失值的处理

2.1.5 查看目标值

data.Class#说明：2表示良性，4表示恶性

2.2 数据基本处理缺失值处理，划分数据集

2.2.1 ？标志的缺失值

#替换缺失值
data = data.replace(to_replace='?',value=np.nan)

## 删除缺失值的样本
data = data.dropna() #删除有np.nan的行

2.2.2 划分数据集

#特征值
x = data.iloc[:,1:10]
x.head()

#目标值
y = data['Class']
y.head()

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)

2.3 特征工程标准化

transform = StandardScaler()#实例化转换器
#标准化
x_train = transform.fit_transform(x_train)
x_test = transform.fit_transform(x_test)

2.4 逻辑回归机器学习建模

2.4.1 建立模型

estimate = LogisticRegression()#用默认的就行

2.4.2 训练模型

estimate.fit(x_train,y_train)#得到了模型

2.5 模型评估

2.5.1 计算准确率

estimate.score(x_train,y_train)

estimate.score(x_test,y_test)

2.5.2 其他的评估方法

这个下一篇会详细讲解，但是在github的仓库中您可以看到。

3 代码

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
# 1.获取数据
names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                   'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                   'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
                  names=names)
data.head()
# 2.基本数据处理
# 2.1 缺失值处理
data = data.replace(to_replace="?", value=np.NaN)
data = data.dropna()
# 2.2 确定特征值,目标值
x = data.iloc[:, 1:10]
x.head()
y = data["Class"]
y.head()
# 2.3 分割数据
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)
# 3.特征工程(标准化)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
# 4.机器学习(逻辑回归)
estimator = LogisticRegression()
estimator.fit(x_train, y_train)
# 5.模型评估
y_predict = estimator.predict(x_test)
y_predict
estimator.score(x_test, y_test)

在很多分类场景当中我们不一定只关注预测的准确率！！！！！

比如以这个癌症举例子！！！我们并不关注预测的准确率，而是关注在所有的样本当中，癌症患者有没有被全部预测（检测）出来。

机器学习之逻辑回归（三）：基于逻辑回归的癌症预测案例——【癌症分类预测-良／恶性乳腺癌肿瘤预测】