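Machine Learning in Practice Series [1]: Industrial Steam Volume Prediction (Latest Version, Part 1), Covering Data Exploration, Feature Engineering, and More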

Background

1. Exploratory data analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

import warnings
warnings.filterwarnings("ignore")
 
%matplotlib inline

# Download the datasets used in this tutorial
!wget http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_test.txt
!wget http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_train.txt

--2023-03-23 18:10:23--  http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_test.txt
Resolving tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)... 49.7.22.39
Connecting to tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)|49.7.22.39|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 466959 (456K) [text/plain]
Saving to: ‘zhengqi_test.txt.1’

zhengqi_test.txt.1  100%[===================>] 456.01K  --.-KB/s    in 0.04s   

2023-03-23 18:10:23 (10.0 MB/s) - ‘zhengqi_test.txt.1’ saved [466959/466959]

--2023-03-23 18:10:23--  http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_train.txt
Resolving tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)... 49.7.22.39
Connecting to tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)|49.7.22.39|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 714370 (698K) [text/plain]
Saving to: ‘zhengqi_train.txt.1’

zhengqi_train.txt.1 100%[===================>] 697.63K  --.-KB/s    in 0.04s   

2023-03-23 18:10:24 (17.9 MB/s) - ‘zhengqi_train.txt.1’ saved [714370/714370]

# Read the data files
# Use the pandas `read_csv()` function to read the data; the separator is '\t'
train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"

train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')

1.1 Viewing data information

# View information about the feature variables
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2888 entries, 0 to 2887
Data columns (total 39 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V0      2888 non-null   float64
 1   V1      2888 non-null   float64
 2   V2      2888 non-null   float64
 3   V3      2888 non-null   float64
 4   V4      2888 non-null   float64
 5   V5      2888 non-null   float64
 6   V6      2888 non-null   float64
 7   V7      2888 non-null   float64
 8   V8      2888 non-null   float64
 9   V9      2888 non-null   float64
 10  V10     2888 non-null   float64
 11  V11     2888 non-null   float64
 12  V12     2888 non-null   float64
 13  V13     2888 non-null   float64
 14  V14     2888 non-null   float64
 15  V15     2888 non-null   float64
 16  V16     2888 non-null   float64
 17  V17     2888 non-null   float64
 18  V18     2888 non-null   float64
 19  V19     2888 non-null   float64
 20  V20     2888 non-null   float64
 21  V21     2888 non-null   float64
 22  V22     2888 non-null   float64
 23  V23     2888 non-null   float64
 24  V24     2888 non-null   float64
 25  V25     2888 non-null   float64
 26  V26     2888 non-null   float64
 27  V27     2888 non-null   float64
 28  V28     2888 non-null   float64
 29  V29     2888 non-null   float64
 30  V30     2888 non-null   float64
 31  V31     2888 non-null   float64
 32  V32     2888 non-null   float64
 33  V33     2888 non-null   float64
 34  V34     2888 non-null   float64
 35  V35     2888 non-null   float64
 36  V36     2888 non-null   float64
 37  V37     2888 non-null   float64
 38  target  2888 non-null   float64
dtypes: float64(39)
memory usage: 880.1 KB

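The training set contains 2,888 samples with 38 feature variables, V0 through V37. All variables are numeric, and no feature has missing values.
Because the fields have been anonymized, the concrete meaning of each feature has been removed. The target field is the label variable.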
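
The facts read off the `info()` output above (sample count, feature count, numeric dtypes, no missing values) can also be checked programmatically. The following is a minimal sketch on a small synthetic DataFrame; the helper name `summarize` and the demo data are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training set (hypothetical values, not the real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["V0", "V1", "target"])


def summarize(df, label_col="target"):
    """Collect the same facts that are read manually from DataFrame.info()."""
    feature_cols = [c for c in df.columns if c != label_col]
    return {
        "n_samples": len(df),
        "n_features": len(feature_cols),
        # Total number of missing cells across all columns
        "n_missing": int(df.isnull().sum().sum()),
        # True only if every column has a numeric dtype
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes),
    }


summary = summarize(demo)
print(summary)
```

Applied to the real data, the same call would be `summarize(train_data)`; for the test set, which has no `target` column, every column counts as a feature.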

test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1925 entries, 0 to 1924
Data columns (total 38 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V0      1925 non-null   float64
 1   V1      1925 non-null   float64
 2   V2      1925 non-null   float64
 3   V3      1925 non-null   float64
 4   V4      1925 non-null   float64
 5   V5      1925 non-null   float64
 6   V6      1925 non-null   float64
 7   V7      1925 non-null   float64
 8   V8      1925 non-null   float64
 9   V9      1925 non-null   float64
 10  V10     1925 non-null   float64
 11  V11     1925 non-null   float64
 12  V12     1925 non-null   float64
 13  V13     1925 non-null   float64
 14  V14     1925 non-null   float64
 15  V15     1925 non-null   float64
 16  V16     1925 non-null   float64
 17  V17     1925 non-null   float64
 18  V18     1925 non-null   float64
 19  V19     1925 non-null   float64
 20  V20     1925 non-null   float64
 21  V21     1925 non-null   float64
 22  V22     1925 non-null   float64
 23  V23     1925 non-null   float64
 24  V24     1925 non-null   float64
 25  V25     1925 non-null   float64
 26  V26     1925 non-null   float64
 27  V27     1925 non-null   float64
 28  V28     1925 non-null   float64
 29  V29     1925 non-null   float64
 30  V30     1925 non-null   float64
 31  V31     1925 non-null   float64
 32  V32     1925 non-null   float64
 33  V33     1925 non-null   float64
 34  V34     1925 non-null   float64
 35  V35     1925 non-null   float64
 36  V36     1925 non-null   float64
 37  V37     1925 non-null   float64
dtypes: float64(38)
memory usage: 571.6 KB

The test set contains 1,925 samples with 38 feature variables, V0 through V37, all of numeric type.

# View summary statistics
train_data.describe()

	V0	V1	V2	V3	V4	V5	V6	V7	V8	V9	...	V29	V30	V31	V32	V33	V34	V35	V36	V37	target
count	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	...	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000
mean	0.123048	0.056068	0.289720	-0.067790	0.012921	-0.558565	0.182892	0.116155	0.177856	-0.169452	...	0.097648	0.055477	0.127791	0.020806	0.007801	0.006715	0.197764	0.030658	-0.130330	0.126353
std	0.928031	0.941515	0.911236	0.970298	0.888377	0.517957	0.918054	0.955116	0.895444	0.953813	...	1.061200	0.901934	0.873028	0.902584	1.006995	1.003291	0.985675	0.970812	1.017196	0.983966
min	-4.335000	-5.122000	-3.420000	-3.956000	-4.742000	-2.182000	-4.576000	-5.048000	-4.692000	-12.891000	...	-2.912000	-4.507000	-5.859000	-4.053000	-4.627000	-4.789000	-5.695000	-2.608000	-3.630000	-3.044000
25%	-0.297000	-0.226250	-0.313000	-0.652250	-0.385000	-0.853000	-0.310000	-0.295000	-0.159000	-0.390000	...	-0.664000	-0.283000	-0.170250	-0.407250	-0.499000	-0.290000	-0.202500	-0.413000	-0.798250	-0.350250
50%	0.359000	0.272500	0.386000	-0.044500	0.110000	-0.466000	0.388000	0.344000	0.362000	0.042000	...	-0.023000	0.053500	0.299500	0.039000	-0.040000	0.160000	0.364000	0.137000	-0.185500	0.313000
75%	0.726000	0.599000	0.918250	0.624000	0.550250	-0.154000	0.831250	0.782250	0.726000	0.042000	...	0.745250	0.488000	0.635000	0.557000	0.462000	0.273000	0.602000	0.644250	0.495250	0.793250
max	2.121000	1.918000	2.828000	2.457000	2.689000	0.489000	1.895000	1.918000	2.245000	1.335000	...	4.580000	2.689000	2.013000	2.395000	5.465000	5.110000	2.324000	5.238000	3.000000	2.538000

8 rows × 39 columns

test_data.describe()

	V0	V1	V2	V3	V4	V5	V6	V7	V8	V9	...	V28	V29	V30	V31	V32	V33	V34	V35	V36	V37
count	1925.000000	1925.000000	1925.000000	1925.000000	1925.000000	1925.000000	1925.000000	1925.000000	1925.000000	1925.000000	...	1925.000000	1925.000000	1925.000000	1925.000000	1925.000000	1925.000000	1925.000000	1925.000000	1925.000000	1925.000000
mean	-0.184404	-0.083912	-0.434762	0.101671	-0.019172	0.838049	-0.274092	-0.173971	-0.266709	0.255114	...	-0.206871	-0.146463	-0.083215	-0.191729	-0.030782	-0.011433	-0.009985	-0.296895	-0.046270	0.195735
std	1.073333	1.076670	0.969541	1.034925	1.147286	0.963043	1.054119	1.040101	1.085916	1.014394	...	1.064140	0.880593	1.126414	1.138454	1.130228	0.989732	0.995213	0.946896	1.040854	0.940599
min	-4.814000	-5.488000	-4.283000	-3.276000	-4.921000	-1.168000	-5.649000	-5.625000	-6.059000	-6.784000	...	-2.435000	-2.413000	-4.507000	-7.698000	-4.057000	-4.627000	-4.789000	-7.477000	-2.608000	-3.346000
25%	-0.664000	-0.451000	-0.978000	-0.644000	-0.497000	0.122000	-0.732000	-0.509000	-0.775000	-0.390000	...	-0.453000	-0.818000	-0.339000	-0.476000	-0.472000	-0.460000	-0.290000	-0.349000	-0.593000	-0.432000
50%	0.065000	0.195000	-0.267000	0.220000	0.118000	0.437000	-0.082000	0.018000	-0.004000	0.401000	...	-0.445000	-0.199000	0.010000	0.100000	0.155000	-0.040000	0.160000	-0.270000	0.083000	0.152000
75%	0.549000	0.589000	0.278000	0.793000	0.610000	1.928000	0.457000	0.515000	0.482000	0.904000	...	-0.434000	0.468000	0.447000	0.471000	0.627000	0.419000	0.273000	0.364000	0.651000	0.797000
max	2.100000	2.120000	1.946000	2.603000	4.475000	3.176000	1.528000	1.394000	2.408000	1.766000	...	4.656000	3.022000	3.139000	1.428000	2.299000	5.465000	5.110000	1.671000	2.861000	3.021000

8 rows × 38 columns

# View the first few rows of the data
train_data.head()

	V0	V1	V2	V3	V4	V5	V6	V7	V8	V9	...	V29	V30	V31	V32	V33	V34	V35	V36	V37	target
0	0.566	0.016	-0.143	0.407	0.452	-0.901	-1.812	-2.360	-0.436	-2.114	...	0.136	0.109	-0.615	0.327	-4.627	-4.789	-5.101	-2.608	-3.508	0.175
1	0.968	0.437	0.066	0.566	0.194	-0.893	-1.566	-2.360	0.332	-2.114	...	-0.128	0.124	0.032	0.600	-0.843	0.160	0.364	-0.335	-0.730	0.676
2	1.013	0.568	0.235	0.370	0.112	-0.797	-1.367	-2.360	0.396	-2.114	...	-0.009	0.361	0.277	-0.116	-0.843	0.160	0.364	0.765	-0.589	0.633
3	0.733	0.368	0.283	0.165	0.599	-0.679	-1.200	-2.086	0.403	-2.114	...	0.015	0.417	0.279	0.603	-0.843	-0.065	0.364	0.333	-0.112	0.206
4	0.684	0.638	0.260	0.209	0.337	-0.454	-1.073	-2.086	0.314	-2.114	...	0.183	1.078	0.328	0.418	-0.843	-0.215	0.364	-0.280	-0.028	0.384

5 rows × 39 columns

test_data.head()

	V0	V1	V2	V3	V4	V5	V6	V7	V8	V9	...	V28	V29	V30	V31	V32	V33	V34	V35	V36	V37
0	0.368	0.380	-0.225	-0.049	0.379	0.092	0.550	0.551	0.244	0.904	...	-0.449	0.047	0.057	-0.042	0.847	0.534	-0.009	-0.190	-0.567	0.388
1	0.148	0.489	-0.247	-0.049	0.122	-0.201	0.487	0.493	-0.127	0.904	...	-0.443	0.047	0.560	0.176	0.551	0.046	-0.220	0.008	-0.294	0.104
2	-0.166	-0.062	-0.311	0.046	-0.055	0.063	0.485	0.493	-0.227	0.904	...	-0.458	-0.398	0.101	0.199	0.634	0.017	-0.234	0.008	0.373	0.569
3	0.102	0.294	-0.259	0.051	-0.183	0.148	0.474	0.504	0.010	0.904	...	-0.456	-0.398	1.007	0.137	1.042	-0.040	-0.290	0.008	-0.666	0.391
4	0.300	0.428	0.208	0.051	-0.033	0.116	0.408	0.497	0.155	0.904	...	-0.458	-0.776	0.291	0.370	0.181	-0.040	-0.290	0.008	-0.140	-0.497

5 rows × 38 columns

1.2 Visual data exploration

fig = plt.figure(figsize=(4, 6))  # set the figure width and height
sns.boxplot(y=train_data['V0'], width=0.5)  # vertical box plot of V0

<matplotlib.axes._subplots.AxesSubplot at 0x7faf89f46950>

# Draw box plots for every feature
# column = train_data.columns.tolist()[:39]  # column names
# fig = plt.figure(figsize=(20, 40))  # set the figure width and height
# for i in range(38):
#     plt.subplot(13, 3, i + 1)  # 13 rows x 3 columns of subplots
#     sns.boxplot(y=train_data[column[i]], width=0.5)  # box plot
#     plt.ylabel(column[i], fontsize=8)
# plt.show()
# Uncomment the block above to draw the box plots
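
By default a box plot draws its whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and marks anything outside them as an individual point. The sketch below counts such points numerically on a tiny synthetic series. The helper name `iqr_outlier_mask` and the sample values are assumptions for illustration; the original notebook only inspects the plots visually.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical values: nine well-behaved points and one extreme value
demo = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 10.0])
mask = iqr_outlier_mask(demo)
n_outliers = int(mask.sum())
print(n_outliers)
```

On the real data, `iqr_outlier_mask(train_data['V0']).sum()` would give the number of points drawn beyond the whiskers for V0.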

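Viewing data distribution plots

Plot the histogram of feature 'V0' and draw a Q-Q plot to check whether the data is approximately normally distributed.

plt.figure(figsize=(10, 5))

ax = plt.subplot(1, 2, 1)
# sns.distplot is deprecated since seaborn 0.11; it is kept here for its fit= option
sns.distplot(train_data['V0'], fit=stats.norm)
ax = plt.subplot(1, 2, 2)
res = stats.probplot(train_data['V0'], plot=plt)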

# train_cols = 6
# train_rows = len(train_data.columns)
# plt.figure(figsize=(4*train_cols, 4*train_rows))

# i = 0
# for col in train_data.columns:
#     i += 1
#     ax = plt.subplot(train_rows, train_cols, i)
#     sns.distplot(train_data[col], fit=stats.norm)

#     i += 1
#     ax = plt.subplot(train_rows, train_cols, i)
#     res = stats.probplot(train_data[col], plot=plt)
# plt.show()
# Uncomment the block above to draw the histogram and Q-Q plot of every column

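The distribution plots above show that many feature variables (for example 'V1', 'V9', 'V24', and 'V28') are not normally distributed: their points do not follow the diagonal of the Q-Q plot. A data transformation can be applied to them later.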
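
As a rough illustration of what such a transformation can do, the sketch below applies a Yeo-Johnson power transform to a synthetic right-skewed sample and compares the skewness before and after. Yeo-Johnson is only one possible choice and is an assumption here, as is the synthetic data; the text above does not say which transformation is used later.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample that also contains negative values
# (hypothetical data, not one of the V-features)
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=2000) - 1.0

# Yeo-Johnson accepts negative inputs; with no lambda given,
# SciPy estimates it by maximum likelihood and returns it
transformed, lmbda = stats.yeojohnson(skewed)

skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

A skewness closer to zero after the transform suggests a more symmetric distribution, which can then be confirmed with `stats.probplot` as in the Q-Q plot code above.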

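# Compare the distribution of 'V0' in the training and test sets
ax = sns.kdeplot(train_data['V0'], color="Red", fill=True)  # fill= replaces the deprecated shade=
ax = sns.kdeplot(test_data['V0'], color="Blue", fill=True)
ax.set_xlabel('V0')
ax.set_ylabel("Frequency")
ax.legend(["train", "test"])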

# dist_cols = 6
# dist_rows = len(test_data.columns)
# plt.figure(figsize=(4*dist_cols, 4*dist_rows))

# i = 1
# for col in test_data.columns:
#     ax = plt.subplot(dist_rows, dist_cols, i)
#     ax = sns.kdeplot(train_data[col], color="Red", fill=True)
#     ax = sns.kdeplot(test_data[col], color="Blue", fill=True)
#     ax.set_xlabel(col)
#     ax.set_ylabel("Frequency")
#     ax.legend(["train", "test"])

#     i += 1
# plt.show()
# Uncomment the block above to compare the train and test distributions of every feature

View the training and test set distributions of features 'V5', 'V17', 'V28', 'V22', 'V11', and 'V9'

drop_col = 6
drop_row = 1

plt.figure(figsize=(5*drop_col, 5*drop_row))

i = 1
for col in ["V5", "V9", "V11", "V17", "V22", "V28"]:
    ax = plt.subplot(drop_row, drop_col, i)
    ax = sns.kdeplot(train_data[col], color="Red", fill=True)
    ax = sns.kdeplot(test_data[col], color="Blue", fill=True)
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    ax.legend(["train", "test"])

    i += 1
plt.show()

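# Features to drop; they are removed from the training data before the correlation analysis
drop_columns = ['V5','V9','V11','V17','V22','V28']
# Merge the training and test sets and visualize the feature distributions of both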
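
The KDE comparison above is visual. One way to put a number on how far a feature's training and test distributions differ is the two-sample Kolmogorov-Smirnov test. This is a sketch under assumptions: the KS test is not used in the original notebook, and the helper name `ks_report` and the synthetic frames are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic train/test frames (hypothetical): "same" shares one distribution,
# "shifted" has a different mean in the test set
rng = np.random.default_rng(0)
train_demo = pd.DataFrame({
    "same": rng.normal(0.0, 1.0, 1000),
    "shifted": rng.normal(0.0, 1.0, 1000),
})
test_demo = pd.DataFrame({
    "same": rng.normal(0.0, 1.0, 1000),
    "shifted": rng.normal(1.5, 1.0, 1000),
})


def ks_report(train, test):
    """Run a two-sample KS test per column; a larger statistic means a larger mismatch."""
    rows = []
    for col in test.columns:
        stat, p_value = stats.ks_2samp(train[col], test[col])
        rows.append({"feature": col, "ks_stat": stat, "p_value": p_value})
    return pd.DataFrame(rows).set_index("feature")


report = ks_report(train_demo, test_demo)
print(report.sort_values("ks_stat", ascending=False))
```

On the real data, `ks_report(train_data, test_data)` would rank all 38 features by how strongly their distributions differ between the two sets.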

Visualizing linear regression relationships

View the linear regression relationship between feature 'V0' and the 'target' variable

fcols = 2
frows = 1

plt.figure(figsize=(8, 4))

ax = plt.subplot(1, 2, 1)
sns.regplot(x='V0', y='target', data=train_data, ax=ax,
            scatter_kws={'marker':'.','s':3,'alpha':0.3},
            line_kws={'color':'k'})
plt.xlabel('V0')
plt.ylabel('target')

ax = plt.subplot(1, 2, 2)
sns.distplot(train_data['V0'].dropna())
plt.xlabel('V0')

plt.show()
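
`sns.regplot` draws a fitted straight line but does not report its coefficients. If the slope and correlation are needed as numbers, a simple least-squares fit can be computed separately. The sketch below uses `scipy.stats.linregress` on synthetic data; the data and the true slope of 0.9 are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic feature/target pair with a known linear relationship (hypothetical data)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.9 * x + rng.normal(scale=0.3, size=500)

# Ordinary least-squares fit of y on x
fit = stats.linregress(x, y)
print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.3f}, r={fit.rvalue:.3f}")
```

For the real data the equivalent call would be `stats.linregress(train_data['V0'], train_data['target'])`.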

1.2.1 Viewing linear regression relationships between variables

# fcols = 6
# frows = len(test_data.columns)
# plt.figure(figsize=(5*fcols, 4*frows))

# i = 0
# for col in test_data.columns:
#     i += 1
#     ax = plt.subplot(frows, fcols, i)
#     sns.regplot(x=col, y='target', data=train_data, ax=ax,
#                 scatter_kws={'marker':'.','s':3,'alpha':0.3},
#                 line_kws={'color':'k'})
#     plt.xlabel(col)
#     plt.ylabel('target')

#     i += 1
#     ax = plt.subplot(frows, fcols, i)
#     sns.distplot(train_data[col].dropna())
#     plt.xlabel(col)
# Plot generation is commented out; uncomment the block above to draw the plots

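1.2.2 Viewing correlations between feature variables

# Drop the six features examined above, then compute the pairwise correlation matrix
data_train1 = train_data.drop(['V5','V9','V11','V17','V22','V28'], axis=1)
train_corr = data_train1.corr()
train_corr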

	V0	V1	V2	V3	V4	V6	V7	V8	V10	V12	...	V29	V30	V31	V32	V33	V34	V35	V36	V37	target
V0	1.000000	0.908607	0.463643	0.409576	0.781212	0.189267	0.141294	0.794013	0.298443	0.751830	...	0.302145	0.156968	0.675003	0.050951	0.056439	-0.019342	0.138933	0.231417	-0.494076	0.873212
V1	0.908607	1.000000	0.506514	0.383924	0.657790	0.276805	0.205023	0.874650	0.310120	0.656186	...	0.147096	0.175997	0.769745	0.085604	0.035129	-0.029115	0.146329	0.235299	-0.494043	0.871846
V2	0.463643	0.506514	1.000000	0.410148	0.057697	0.615938	0.477114	0.703431	0.346006	0.059941	...	-0.275764	0.175943	0.653764	0.033942	0.050309	-0.025620	0.043648	0.316462	-0.734956	0.638878
V3	0.409576	0.383924	0.410148	1.000000	0.315046	0.233896	0.197836	0.411946	0.321262	0.306397	...
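
A correlation matrix like the one above is often reduced to a single column, the correlation of each feature with the target, and then filtered by absolute value. The sketch below shows that step on synthetic data. The 0.5 threshold, the column names, and the data are all assumptions for illustration and are not taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): one feature drives the target, one is pure noise
rng = np.random.default_rng(0)
n = 500
strong = rng.normal(size=n)
noise = rng.normal(size=n)
demo = pd.DataFrame({
    "strong": strong,
    "noise": noise,
    "target": 2.0 * strong + rng.normal(scale=0.5, size=n),
})

# Pearson correlation of every feature with the target
corr_with_target = demo.corr()["target"].drop("target")

# Keep the features whose absolute correlation exceeds an illustrative threshold
threshold = 0.5
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()
print(corr_with_target.sort_values(key=abs, ascending=False))
print(selected)
```

With the real data, the same two lines would start from `train_corr["target"].drop("target")`.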
