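Machine Learning in Practice Series [1]: Industrial Steam Volume Prediction
- Background
- Description
The data was collected from boiler sensors at minute-level frequency and then desensitized. The task is to predict the amount of steam produced from the boiler's operating conditions.
- Data description
- Evaluation
Predictions are scored by mean squared error (MSE).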
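For reference, the mean squared error averages the squared differences between the true values and the predictions. The following is a minimal NumPy sketch of that metric; the function name `mean_squared_error` and the sample values are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Return the mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Illustrative values only (not taken from the dataset)
y_true = [0.5, 1.0, -0.5]
y_pred = [0.0, 1.0, 0.5]
print(mean_squared_error(y_true, y_pred))
```

A lower value means the predictions are closer to the true steam volume; a perfect prediction scores 0.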
1. Exploratory data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
# Download the required datasets
!wget http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_test.txt
!wget http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_train.txt
Output (abridged): both requests returned 200 OK. zhengqi_test.txt is 466959 bytes (456K) and zhengqi_train.txt is 714370 bytes (698K). In this run wget saved them as zhengqi_test.txt.1 and zhengqi_train.txt.1, which happens when files with the original names already exist.
# Read the data files
# Use the pandas `read_csv()` function to read the data; the separator is '\t'
train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')
1.1 Inspecting the data
# View feature variable information
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2888 entries, 0 to 2887
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 V0 2888 non-null float64
1 V1 2888 non-null float64
2 V2 2888 non-null float64
3 V3 2888 non-null float64
4 V4 2888 non-null float64
5 V5 2888 non-null float64
6 V6 2888 non-null float64
7 V7 2888 non-null float64
8 V8 2888 non-null float64
9 V9 2888 non-null float64
10 V10 2888 non-null float64
11 V11 2888 non-null float64
12 V12 2888 non-null float64
13 V13 2888 non-null float64
14 V14 2888 non-null float64
15 V15 2888 non-null float64
16 V16 2888 non-null float64
17 V17 2888 non-null float64
18 V18 2888 non-null float64
19 V19 2888 non-null float64
20 V20 2888 non-null float64
21 V21 2888 non-null float64
22 V22 2888 non-null float64
23 V23 2888 non-null float64
24 V24 2888 non-null float64
25 V25 2888 non-null float64
26 V26 2888 non-null float64
27 V27 2888 non-null float64
28 V28 2888 non-null float64
29 V29 2888 non-null float64
30 V30 2888 non-null float64
31 V31 2888 non-null float64
32 V32 2888 non-null float64
33 V33 2888 non-null float64
34 V34 2888 non-null float64
35 V35 2888 non-null float64
36 V36 2888 non-null float64
37 V37 2888 non-null float64
38 target 2888 non-null float64
dtypes: float64(39)
memory usage: 880.1 KB
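The training set contains 2888 samples with 38 feature variables, V0 through V37. All variables are numeric and none of them has missing values.
Because the fields were desensitized, the specific meaning of each feature has been removed. The target field is the label variable.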
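The conclusions read off `info()` can also be checked programmatically. The sketch below uses a small hypothetical DataFrame in place of `train_data`; the helper name `summarize_frame` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

def summarize_frame(df):
    """Return (row count, feature columns, all-numeric flag, missing values per column)."""
    feature_cols = [c for c in df.columns if c != "target"]
    all_numeric = all(pd.api.types.is_numeric_dtype(df[c]) for c in df.columns)
    missing = df.isnull().sum()
    return len(df), feature_cols, all_numeric, missing

# Hypothetical stand-in for train_data, with one missing value planted in V0
demo = pd.DataFrame({
    "V0": [0.1, 0.2, np.nan],
    "V1": [1.0, 2.0, 3.0],
    "target": [0.5, 0.6, 0.7],
})
n_rows, feature_cols, all_numeric, missing = summarize_frame(demo)
print(n_rows, feature_cols, all_numeric)
print(missing[missing > 0])  # columns that contain missing values
```

Applied to the real training set, the same call would be expected to report 2888 rows, 38 feature columns, and no missing values, matching the `info()` output above.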
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1925 entries, 0 to 1924
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 V0 1925 non-null float64
1 V1 1925 non-null float64
2 V2 1925 non-null float64
3 V3 1925 non-null float64
4 V4 1925 non-null float64
5 V5 1925 non-null float64
6 V6 1925 non-null float64
7 V7 1925 non-null float64
8 V8 1925 non-null float64
9 V9 1925 non-null float64
10 V10 1925 non-null float64
11 V11 1925 non-null float64
12 V12 1925 non-null float64
13 V13 1925 non-null float64
14 V14 1925 non-null float64
15 V15 1925 non-null float64
16 V16 1925 non-null float64
17 V17 1925 non-null float64
18 V18 1925 non-null float64
19 V19 1925 non-null float64
20 V20 1925 non-null float64
21 V21 1925 non-null float64
22 V22 1925 non-null float64
23 V23 1925 non-null float64
24 V24 1925 non-null float64
25 V25 1925 non-null float64
26 V26 1925 non-null float64
27 V27 1925 non-null float64
28 V28 1925 non-null float64
29 V29 1925 non-null float64
30 V30 1925 non-null float64
31 V31 1925 non-null float64
32 V32 1925 non-null float64
33 V33 1925 non-null float64
34 V34 1925 non-null float64
35 V35 1925 non-null float64
36 V36 1925 non-null float64
37 V37 1925 non-null float64
dtypes: float64(38)
memory usage: 571.6 KB
The test set contains 1925 samples with the same 38 feature variables, V0 through V37, all numeric. It has no target column.
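A quick way to confirm that the two sets differ only by the label column is to compare their column lists. This is a minimal sketch on hypothetical stand-in frames; `column_difference` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

def column_difference(train_df, test_df):
    """Return the columns that appear in only one of the two frames."""
    only_train = [c for c in train_df.columns if c not in test_df.columns]
    only_test = [c for c in test_df.columns if c not in train_df.columns]
    return only_train, only_test

# Hypothetical stand-ins for train_data and test_data
demo_train = pd.DataFrame({"V0": [0.1], "V1": [0.2], "target": [0.3]})
demo_test = pd.DataFrame({"V0": [0.4], "V1": [0.5]})
only_train, only_test = column_difference(demo_train, demo_test)
print(only_train, only_test)
```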
# View summary statistics
train_data.describe()
| | V0 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | ... | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 | 2888.000000 |
mean | 0.123048 | 0.056068 | 0.289720 | -0.067790 | 0.012921 | -0.558565 | 0.182892 | 0.116155 | 0.177856 | -0.169452 | ... | 0.097648 | 0.055477 | 0.127791 | 0.020806 | 0.007801 | 0.006715 | 0.197764 | 0.030658 | -0.130330 | 0.126353 |
std | 0.928031 | 0.941515 | 0.911236 | 0.970298 | 0.888377 | 0.517957 | 0.918054 | 0.955116 | 0.895444 | 0.953813 | ... | 1.061200 | 0.901934 | 0.873028 | 0.902584 | 1.006995 | 1.003291 | 0.985675 | 0.970812 | 1.017196 | 0.983966 |
min | -4.335000 | -5.122000 | -3.420000 | -3.956000 | -4.742000 | -2.182000 | -4.576000 | -5.048000 | -4.692000 | -12.891000 | ... | -2.912000 | -4.507000 | -5.859000 | -4.053000 | -4.627000 | -4.789000 | -5.695000 | -2.608000 | -3.630000 | -3.044000 |
25% | -0.297000 | -0.226250 | -0.313000 | -0.652250 | -0.385000 | -0.853000 | -0.310000 | -0.295000 | -0.159000 | -0.390000 | ... | -0.664000 | -0.283000 | -0.170250 | -0.407250 | -0.499000 | -0.290000 | -0.202500 | -0.413000 | -0.798250 | -0.350250 |
50% | 0.359000 | 0.272500 | 0.386000 | -0.044500 | 0.110000 | -0.466000 | 0.388000 | 0.344000 | 0.362000 | 0.042000 | ... | -0.023000 | 0.053500 | 0.299500 | 0.039000 | -0.040000 | 0.160000 | 0.364000 | 0.137000 | -0.185500 | 0.313000 |
75% | 0.726000 | 0.599000 | 0.918250 | 0.624000 | 0.550250 | -0.154000 | 0.831250 | 0.782250 | 0.726000 | 0.042000 | ... | 0.745250 | 0.488000 | 0.635000 | 0.557000 | 0.462000 | 0.273000 | 0.602000 | 0.644250 | 0.495250 | 0.793250 |
max | 2.121000 | 1.918000 | 2.828000 | 2.457000 | 2.689000 | 0.489000 | 1.895000 | 1.918000 | 2.245000 | 1.335000 | ... | 4.580000 | 2.689000 | 2.013000 | 2.395000 | 5.465000 | 5.110000 | 2.324000 | 5.238000 | 3.000000 | 2.538000 |
8 rows × 39 columns
test_data.describe()
| | V0 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | ... | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 |
mean | -0.184404 | -0.083912 | -0.434762 | 0.101671 | -0.019172 | 0.838049 | -0.274092 | -0.173971 | -0.266709 | 0.255114 | ... | -0.206871 | -0.146463 | -0.083215 | -0.191729 | -0.030782 | -0.011433 | -0.009985 | -0.296895 | -0.046270 | 0.195735 |
std | 1.073333 | 1.076670 | 0.969541 | 1.034925 | 1.147286 | 0.963043 | 1.054119 | 1.040101 | 1.085916 | 1.014394 | ... | 1.064140 | 0.880593 | 1.126414 | 1.138454 | 1.130228 | 0.989732 | 0.995213 | 0.946896 | 1.040854 | 0.940599 |
min | -4.814000 | -5.488000 | -4.283000 | -3.276000 | -4.921000 | -1.168000 | -5.649000 | -5.625000 | -6.059000 | -6.784000 | ... | -2.435000 | -2.413000 | -4.507000 | -7.698000 | -4.057000 | -4.627000 | -4.789000 | -7.477000 | -2.608000 | -3.346000 |
25% | -0.664000 | -0.451000 | -0.978000 | -0.644000 | -0.497000 | 0.122000 | -0.732000 | -0.509000 | -0.775000 | -0.390000 | ... | -0.453000 | -0.818000 | -0.339000 | -0.476000 | -0.472000 | -0.460000 | -0.290000 | -0.349000 | -0.593000 | -0.432000 |
50% | 0.065000 | 0.195000 | -0.267000 | 0.220000 | 0.118000 | 0.437000 | -0.082000 | 0.018000 | -0.004000 | 0.401000 | ... | -0.445000 | -0.199000 | 0.010000 | 0.100000 | 0.155000 | -0.040000 | 0.160000 | -0.270000 | 0.083000 | 0.152000 |
75% | 0.549000 | 0.589000 | 0.278000 | 0.793000 | 0.610000 | 1.928000 | 0.457000 | 0.515000 | 0.482000 | 0.904000 | ... | -0.434000 | 0.468000 | 0.447000 | 0.471000 | 0.627000 | 0.419000 | 0.273000 | 0.364000 | 0.651000 | 0.797000 |
max | 2.100000 | 2.120000 | 1.946000 | 2.603000 | 4.475000 | 3.176000 | 1.528000 | 1.394000 | 2.408000 | 1.766000 | ... | 4.656000 | 3.022000 | 3.139000 | 1.428000 | 2.299000 | 5.465000 | 5.110000 | 1.671000 | 2.861000 | 3.021000 |
8 rows × 38 columns
# View the first rows of the data
train_data.head()
| | V0 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.566 | 0.016 | -0.143 | 0.407 | 0.452 | -0.901 | -1.812 | -2.360 | -0.436 | -2.114 | ... | 0.136 | 0.109 | -0.615 | 0.327 | -4.627 | -4.789 | -5.101 | -2.608 | -3.508 | 0.175 |
1 | 0.968 | 0.437 | 0.066 | 0.566 | 0.194 | -0.893 | -1.566 | -2.360 | 0.332 | -2.114 | ... | -0.128 | 0.124 | 0.032 | 0.600 | -0.843 | 0.160 | 0.364 | -0.335 | -0.730 | 0.676 |
2 | 1.013 | 0.568 | 0.235 | 0.370 | 0.112 | -0.797 | -1.367 | -2.360 | 0.396 | -2.114 | ... | -0.009 | 0.361 | 0.277 | -0.116 | -0.843 | 0.160 | 0.364 | 0.765 | -0.589 | 0.633 |
3 | 0.733 | 0.368 | 0.283 | 0.165 | 0.599 | -0.679 | -1.200 | -2.086 | 0.403 | -2.114 | ... | 0.015 | 0.417 | 0.279 | 0.603 | -0.843 | -0.065 | 0.364 | 0.333 | -0.112 | 0.206 |
4 | 0.684 | 0.638 | 0.260 | 0.209 | 0.337 | -0.454 | -1.073 | -2.086 | 0.314 | -2.114 | ... | 0.183 | 1.078 | 0.328 | 0.418 | -0.843 | -0.215 | 0.364 | -0.280 | -0.028 | 0.384 |
5 rows × 39 columns
test_data.head()
| | V0 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.368 | 0.380 | -0.225 | -0.049 | 0.379 | 0.092 | 0.550 | 0.551 | 0.244 | 0.904 | ... | -0.449 | 0.047 | 0.057 | -0.042 | 0.847 | 0.534 | -0.009 | -0.190 | -0.567 | 0.388 |
1 | 0.148 | 0.489 | -0.247 | -0.049 | 0.122 | -0.201 | 0.487 | 0.493 | -0.127 | 0.904 | ... | -0.443 | 0.047 | 0.560 | 0.176 | 0.551 | 0.046 | -0.220 | 0.008 | -0.294 | 0.104 |
2 | -0.166 | -0.062 | -0.311 | 0.046 | -0.055 | 0.063 | 0.485 | 0.493 | -0.227 | 0.904 | ... | -0.458 | -0.398 | 0.101 | 0.199 | 0.634 | 0.017 | -0.234 | 0.008 | 0.373 | 0.569 |
3 | 0.102 | 0.294 | -0.259 | 0.051 | -0.183 | 0.148 | 0.474 | 0.504 | 0.010 | 0.904 | ... | -0.456 | -0.398 | 1.007 | 0.137 | 1.042 | -0.040 | -0.290 | 0.008 | -0.666 | 0.391 |
4 | 0.300 | 0.428 | 0.208 | 0.051 | -0.033 | 0.116 | 0.408 | 0.497 | 0.155 | 0.904 | ... | -0.458 | -0.776 | 0.291 | 0.370 | 0.181 | -0.040 | -0.290 | 0.008 | -0.140 | -0.497 |
5 rows × 38 columns
1.2 Visual exploration of the data
fig = plt.figure(figsize=(4, 6))  # set the figure width and height
sns.boxplot(train_data['V0'],orient="v", width=0.5)
[Figure: box plot of feature V0 in the training set]
# Draw box plots for all features
# column = train_data.columns.tolist()[:39]  # column names
# fig = plt.figure(figsize=(20, 40))  # set the figure width and height
# for i in range(38):
#     plt.subplot(13, 3, i + 1)  # 13-row by 3-column subplot grid
#     sns.boxplot(train_data[column[i]], orient="v", width=0.5)  # box plot
#     plt.ylabel(column[i], fontsize=8)
# plt.show()
# The box plots are commented out; uncomment to generate them
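The points a box plot draws beyond its whiskers can also be listed numerically. The sketch below applies the usual 1.5 × IQR whisker rule with NumPy; the helper `iqr_outlier_mask` and the sample values are illustrative assumptions, not data from the competition.

```python
import numpy as np

def iqr_outlier_mask(values, whis=1.5):
    """Boolean mask of points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - whis * iqr) | (values > q3 + whis * iqr)

# Illustrative data: nine ordinary readings and one extreme value
sample = np.array([-0.3, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, -4.3])
mask = iqr_outlier_mask(sample)
print(sample[mask])  # the values a box plot would draw as individual points
```

The same mask could be computed per feature, for example on `train_data['V0']`, to count how many samples fall outside the whiskers.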
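View the data distribution plots
- Plot the histogram of feature 'V0' and a Q-Q plot to check whether the data is approximately normally distributed
plt.figure(figsize=(10,5))
ax=plt.subplot(1,2,1)
sns.distplot(train_data['V0'],fit=stats.norm)
ax=plt.subplot(1,2,2)
res = stats.probplot(train_data['V0'], plot=plt)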
# train_cols = 6
# train_rows = len(train_data.columns)
# plt.figure(figsize=(4*train_cols,4*train_rows))
# i=0
# for col in train_data.columns:
#     i+=1
#     ax=plt.subplot(train_rows,train_cols,i)
#     sns.distplot(train_data[col],fit=stats.norm)
#     i+=1
#     ax=plt.subplot(train_rows,train_cols,i)
#     res = stats.probplot(train_data[col], plot=plt)
# plt.show()
# The Q-Q plots are commented out; uncomment to generate them
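The distribution plots above show that many feature variables (such as 'V1', 'V9', 'V24', and 'V28') are not normally distributed: their points do not follow the diagonal of the Q-Q plot. A data transformation can be applied to these features later.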
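The text does not name a specific transformation at this point. As one possible option, the sketch below applies a Yeo-Johnson power transform with `scipy.stats.yeojohnson`. That choice is an assumption made here because the features contain negative values, which a plain Box-Cox transform does not accept. The data is synthetic, not the competition data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed sample that includes negative values
x = rng.exponential(scale=1.0, size=2000) - 1.0

# With no lambda given, yeojohnson fits it by maximum likelihood
x_t, lmbda = stats.yeojohnson(x)

print("fitted lambda:", lmbda)
print("skewness before:", stats.skew(x))
print("skewness after: ", stats.skew(x_t))
```

A skewness closer to 0 after the transform indicates a more symmetric distribution; a Q-Q plot via `stats.probplot` on the transformed values would show whether the points now track the diagonal more closely.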
# Compare the distribution of V0 in the training and test sets
ax = sns.kdeplot(train_data['V0'], color="Red", shade=True)
ax = sns.kdeplot(test_data['V0'], color="Blue", shade=True)
ax.set_xlabel('V0')
ax.set_ylabel("Frequency")
ax = ax.legend(["train","test"])
# dist_cols = 6
# dist_rows = len(test_data.columns)
# plt.figure(figsize=(4*dist_cols,4*dist_rows))
# i=1
# for col in test_data.columns:
#     ax=plt.subplot(dist_rows,dist_cols,i)
#     ax = sns.kdeplot(train_data[col], color="Red", shade=True)
#     ax = sns.kdeplot(test_data[col], color="Blue", shade=True)
#     ax.set_xlabel(col)
#     ax.set_ylabel("Frequency")
#     ax = ax.legend(["train","test"])
#     i+=1
# plt.show()
# Commented out; uncomment to generate the plots for all features
View the distributions of features 'V5', 'V17', 'V28', 'V22', 'V11', and 'V9'
drop_col = 6
drop_row = 1
plt.figure(figsize=(5*drop_col,5*drop_row))
i=1
for col in ["V5","V9","V11","V17","V22","V28"]:
    ax =plt.subplot(drop_row,drop_col,i)
    ax = sns.kdeplot(train_data[col], color="Red", shade=True)
    ax = sns.kdeplot(test_data[col], color="Blue", shade=True)
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    ax = ax.legend(["train","test"])
    i+=1
plt.show()
# Features selected for removal (see the train/test density plots above)
drop_columns = ['V5','V9','V11','V17','V22','V28']
# Merge the training and test sets and visualize their feature distributions
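Visual comparison of the density curves can be complemented with a numeric check. The sketch below runs a two-sample Kolmogorov-Smirnov test per column with `scipy.stats.ks_2samp`. This test is an addition made here as an assumption, not the method used in the original notebook, and the helper `ks_by_column` and the synthetic frames are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

def ks_by_column(train_df, test_df, columns):
    """Two-sample Kolmogorov-Smirnov statistic and p-value for each column."""
    rows = []
    for col in columns:
        res = stats.ks_2samp(train_df[col], test_df[col])
        rows.append({"feature": col, "ks_stat": res.statistic, "p_value": res.pvalue})
    return pd.DataFrame(rows).set_index("feature")

rng = np.random.default_rng(0)
# Synthetic stand-ins: column A matches between the sets, column B is shifted
demo_train = pd.DataFrame({"A": rng.normal(0, 1, 1000), "B": rng.normal(0, 1, 1000)})
demo_test = pd.DataFrame({"A": rng.normal(0, 1, 1000), "B": rng.normal(1.5, 1, 1000)})

result = ks_by_column(demo_train, demo_test, ["A", "B"])
print(result.sort_values("ks_stat", ascending=False))
```

A large KS statistic with a small p-value marks a feature whose training and test distributions differ, which is the property the density plots are used to spot by eye.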
Visualize linear regression relationships
- View the linear regression relationship between feature 'V0' and the 'target' variable
fcols = 2
frows = 1
plt.figure(figsize=(8,4))
ax=plt.subplot(1,2,1)
sns.regplot(x='V0', y='target', data=train_data, ax=ax,
            scatter_kws={'marker':'.','s':3,'alpha':0.3},
            line_kws={'color':'k'})
plt.xlabel('V0')
plt.ylabel('target')
ax=plt.subplot(1,2,2)
sns.distplot(train_data['V0'].dropna())
plt.xlabel('V0')
plt.show()
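The straight line in a regression plot of this kind is an ordinary least-squares fit, and its strength can be summarized by the Pearson correlation coefficient. The following sketch computes both with NumPy on synthetic data; the arrays `x` and `y` are stand-ins for a feature and the target, not values from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for a feature and the target
x = rng.normal(size=500)
y = 0.9 * x + rng.normal(scale=0.3, size=500)

# Least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

On the real data, replacing `x` and `y` with `train_data['V0']` and `train_data['target']` would give the numbers behind the fitted line in the left panel.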
1.2.1 Linear regression relationships between variables
# fcols = 6
# frows = len(test_data.columns)
# plt.figure(figsize=(5*fcols,4*frows))
# i=0
# for col in test_data.columns:
#     i+=1
#     ax=plt.subplot(frows,fcols,i)
#     sns.regplot(x=col, y='target', data=train_data, ax=ax,
#                 scatter_kws={'marker':'.','s':3,'alpha':0.3},
#                 line_kws={'color':'k'})
#     plt.xlabel(col)
#     plt.ylabel('target')
#     i+=1
#     ax=plt.subplot(frows,fcols,i)
#     sns.distplot(train_data[col].dropna())
#     plt.xlabel(col)
# Plot generation is commented out; uncomment to run it
1.2.2 Correlation between feature variables
data_train1 = train_data.drop(['V5','V9','V11','V17','V22','V28'],axis=1)
train_corr = data_train1.corr()
train_corr
| | V0 | V1 | V2 | V3 | V4 | V6 | V7 | V8 | V10 | V12 | ... | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
V0 | 1.000000 | 0.908607 | 0.463643 | 0.409576 | 0.781212 | 0.189267 | 0.141294 | 0.794013 | 0.298443 | 0.751830 | ... | 0.302145 | 0.156968 | 0.675003 | 0.050951 | 0.056439 | -0.019342 | 0.138933 | 0.231417 | -0.494076 | 0.873212 |
V1 | 0.908607 | 1.000000 | 0.506514 | 0.383924 | 0.657790 | 0.276805 | 0.205023 | 0.874650 | 0.310120 | 0.656186 | ... | 0.147096 | 0.175997 | 0.769745 | 0.085604 | 0.035129 | -0.029115 | 0.146329 | 0.235299 | -0.494043 | 0.871846 |
V2 | 0.463643 | 0.506514 | 1.000000 | 0.410148 | 0.057697 | 0.615938 | 0.477114 | 0.703431 | 0.346006 | 0.059941 | ... | -0.275764 | 0.175943 | 0.653764 | 0.033942 | 0.050309 | -0.025620 | 0.043648 | 0.316462 | -0.734956 | 0.638878 |
V3 | 0.409576 | 0.383924 | 0.410148 | 1.000000 | 0.315046 | 0.233896 | 0.197836 | 0.411946 | 0.321262 | 0.306397 | ... |