Wisconsin Breast Cancer Dataset and Diagnosis Prediction

2025-04-17 | 2025-04-15 | Author: 金龟子

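1. Problem Description

Machine learning and deep learning are seeing steadily growing use in healthcare. If you want to practice on this kind of data and see what it can do, this dataset is a good choice. The data come from the breast cancer diagnostic dataset from Wisconsin, USA. The breast cancer data is one of three domains provided by the Oncology Institute that have repeatedly appeared in the machine learning literature. Clinicians captured digitized images of fine needle aspirates (FNA) of patients' breast masses and extracted features from those images; the features describe the cell nuclei present in each image. Each tumor is labeled as benign or malignant. This healthcare dataset is well suited to practicing classification with algorithms such as random forests and SVMs.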

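2. Dataset Contents

The dataset consists of 569 samples: 357 benign and 212 malignant. It contains three types of features, of which the real-valued features are the most interesting. They are computed from the digitized images and carry information such as area, cell radius, and texture. The data cover the period up to 1988-07-11.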
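
The class counts above can be checked with a short sketch. It assumes that the copy of the Wisconsin diagnostic data bundled with scikit-learn (`load_breast_cancer`) matches the file used in this article; the helper name `load_wdbc` is made up for illustration. Note that scikit-learn encodes malignant as 0 and benign as 1.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer


def load_wdbc():
    """Return the feature matrix, 'M'/'B' labels, and feature names."""
    data = load_breast_cancer()
    # In scikit-learn's copy, target 0 means malignant and 1 means benign.
    labels = ["M" if t == 0 else "B" for t in data.target]
    return data.data, labels, list(data.feature_names)


X, y, feature_names = load_wdbc()
class_counts = Counter(y)

print(X.shape)        # (number of samples, number of features)
print(class_counts)   # how many benign (B) and malignant (M) samples
```

scikit-learn names the columns slightly differently (for example "mean radius" instead of radius_mean), so a CSV export of the same data may need its columns renamed before the two are compared.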

Data structure

id: ID number
diagnosis: diagnosis of the breast tissue (M = malignant, B = benign)
radius_mean: mean of distances from center to points on the perimeter
texture_mean: standard deviation of gray-scale values
perimeter_mean: mean size of the core tumor
area_mean:
smoothness_mean: mean of local variation in radius lengths
compactness_mean: mean of perimeter^2 / area - 1.0
concavity_mean: mean of severity of concave portions of the contour
concave points_mean: mean for number of concave portions of the contour
symmetry_mean: mean of symmetry
fractal_dimension_mean: mean for "coastline approximation" - 1
radius_se: standard error for the mean of distances from center to points on the perimeter
texture_se: standard error for standard deviation of gray-scale values
perimeter_se:
area_se:
smoothness_se: standard error for local variation in radius lengths
compactness_se: standard error for perimeter^2 / area - 1.0
concavity_se: standard error for severity of concave portions of the contour
concave points_se: standard error for number of concave portions of the contour
symmetry_se:
fractal_dimension_se: standard error for "coastline approximation" - 1
radius_worst: "worst" or largest mean value for mean of distances from center to points on the perimeter
texture_worst: "worst" or largest mean value for standard deviation of gray-scale values
perimeter_worst:
area_worst:
smoothness_worst: "worst" or largest mean value for local variation in radius lengths
compactness_worst: "worst" or largest mean value for perimeter^2 / area - 1.0
concavity_worst: "worst" or largest mean value for severity of concave portions of the contour
concave points_worst: "worst" or largest mean value for number of concave portions of the contour
symmetry_worst:
fractal_dimension_worst: "worst" or largest mean value for "coastline approximation" - 1
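
The field list follows a regular pattern: ten base measurements, each reported as a mean, a standard error (se), and a "worst" value. As a minimal sketch, the 30 feature column names can be generated rather than typed out. The constant and function names here are illustrative, not part of the dataset.

```python
# Ten base measurements, in the order they appear in the field list.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Each measurement is reported three ways.
SUFFIXES = ["mean", "se", "worst"]


def feature_columns():
    """Build the 30 feature names: all *_mean, then *_se, then *_worst."""
    return [f"{base}_{suffix}" for suffix in SUFFIXES for base in BASE_FEATURES]


# Full column layout: identifier, label, then the 30 features.
columns = ["id", "diagnosis"] + feature_columns()
print(len(columns))
print(columns[:4])
```

Note that the "concave points" columns contain a space, which matters when selecting them by name.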

Sample record

id	842302
diagnosis	M
radius_mean	17.99
texture_mean	10.38
perimeter_mean	122.8
area_mean	1001
smoothness_mean	0.1184
compactness_mean	0.2776
concavity_mean	0.3001
concave points_mean	0.1471
symmetry_mean	0.2419
fractal_dimension_mean	0.07871
radius_se	1.095
texture_se	0.9053
perimeter_se	8.589
area_se	153.4
smoothness_se	0.006399
compactness_se	0.04904
concavity_se	0.05373
concave points_se	0.01587
symmetry_se	0.03003
fractal_dimension_se	0.006193
radius_worst	25.38
texture_worst	17.33
perimeter_worst	184.6
area_worst	2019
smoothness_worst	0.1622
compactness_worst	0.6656
concavity_worst	0.7119
concave points_worst	0.2654
symmetry_worst	0.4601
fractal_dimension_worst	0.1189
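
A record in the tab-separated "name, value" layout shown above can be parsed with the standard library alone. This is a sketch under two assumptions: fields are separated by a single tab, and malignant is encoded as 1 for modeling (the article does not state an encoding). Only a few of the fields are repeated here to keep the example short.

```python
# A few fields copied from the sample record, one "name<TAB>value" per line.
SAMPLE = (
    "id\t842302\n"
    "diagnosis\tM\n"
    "radius_mean\t17.99\n"
    "concave points_mean\t0.1471\n"
    "fractal_dimension_worst\t0.1189\n"
)


def parse_record(text):
    """Turn 'name<TAB>value' lines into a dict with typed values."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, value = line.split("\t", 1)
        if name == "id":
            record[name] = int(value)
        elif name == "diagnosis":
            record[name] = value.strip()
        else:
            record[name] = float(value)
    return record


record = parse_record(SAMPLE)
# Assumed encoding for modeling: malignant = 1, benign = 0.
label = 1 if record["diagnosis"] == "M" else 0
print(record, label)
```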

Data source

Matjaz Zwitter & Milan Soklic (physicians)
Institute of Oncology
University Medical Center
Ljubljana, Yugoslavia

Dataset license

CC BY-NC-SA 4.0

3. Machine Learning Algorithms

Installation

Package versions after installation:

python                    3.12.7               h14ffc60_0    anaconda
numpy                     1.26.4          py312hfd52020_0    anaconda
scipy                     1.13.1          py312hbb039d4_0    anaconda
pandas                    2.2.2           py312h0158946_0    anaconda
scikit-learn              1.5.1           py312h0158946_0    anaconda
matplotlib                3.9.2           py312haa95532_0    anaconda
seaborn                   0.13.2          py312haa95532_0    anaconda
statsmodels               0.14.2          py312h4b0e54e_0    anaconda
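
To compare a local environment against the versions listed above, the installed versions can be read with the standard library's `importlib.metadata`. This is only a convenience sketch; the package list is copied from the table, and the function name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the version table above.
PACKAGES = [
    "numpy", "scipy", "pandas", "scikit-learn",
    "matplotlib", "seaborn", "statsmodels",
]


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name:<15}{ver or 'not installed'}")
```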

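Summary

Using these 30 histological tumor features, breast cancer diagnosis can be predicted with about 95% accuracy.
The best-performing machine learning algorithms include the linear models logistic regression and linear discriminant analysis, and two ensemble methods: random forest and an ensemble voting classifier.
The most predictive features found by recursive feature elimination with cross-validation (using logistic regression as the classifier) are mean texture, perimeter, and area, and worst texture, perimeter, and area.
The most predictive features under random forest classification are mean texture, perimeter, and area; fractal dimension; and worst area and perimeter.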
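
The workflow summarized above could be sketched roughly as follows. This is not the article's original code: it assumes scikit-learn's bundled copy of the data, 5-fold stratified cross-validation, default-like hyperparameters, and a soft-voting ensemble, all of which are choices made here for illustration. Scores and selected features will depend on these choices, so none are quoted.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def scaled(clf):
    """Standardize features inside each fold before fitting the classifier."""
    return Pipeline([("scale", StandardScaler()), ("clf", clf)])


# The four model families named in the summary.
models = {
    "logistic_regression": scaled(LogisticRegression(max_iter=1000)),
    "lda": LinearDiscriminantAnalysis(),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
models["voting"] = VotingClassifier(
    estimators=[(name, model) for name, model in models.items()],
    voting="soft",  # average predicted probabilities
)

# Mean cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name:<20}{score:.3f}")

# Recursive feature elimination with cross-validation, logistic regression
# as the classifier; coefficients are read from the pipeline's last step.
selector = RFECV(
    estimator=scaled(LogisticRegression(max_iter=1000)),
    step=1,
    cv=cv,
    scoring="accuracy",
    importance_getter="named_steps.clf.coef_",
)
selector.fit(X, y)
selected = [n for n, keep in zip(data.feature_names, selector.support_) if keep]
print("RFECV kept:", selected)

# Impurity-based feature importances from a random forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = sorted(zip(forest.feature_importances_, data.feature_names), reverse=True)[:6]
print("Random forest top features:", [name for _, name in top])
```

Putting the scaler inside the pipeline keeps the standardization statistics from leaking across cross-validation folds, which is why the logistic regression is wrapped rather than fitted on pre-scaled data.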

4. Getting the Case Kit

You must log in to download the file package.
