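Table of Contents

  • Preliminaries
  • Goal
  • Dataset Overview
  • Modeling Approach
  • Scenario Analysis
  • Data Preprocessing
  • Importing Libraries
  • Loading the Data
  • Data Analysis
  • Positive and Negative Sample Distribution
  • Normal vs. Fraudulent Transactions
  • Fraud and Transaction Amount
  • Transactions over Time
  • Analysis of Fields V1-V28
  • Feature Engineering
  • Feature Importance Analysis
  • Dimensionality Reduction and Clustering
  • Model Training
  • Handling Class Imbalance
  • How SMOTE Works
  • Oversampling the Imbalanced Data
  • Training the Classifier
  • Building the Training and Test Sets
  • Model Training (Baseline)
  • Model Tuning
  • Plotting the Learning Curve
  • Model Evaluation
  • Confusion Matrix
  • Plotting the Precision-Recall Curve
  • Review and Summary
  • References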

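Preliminaries

Goal

Use historical credit card transaction data to train a machine learning model that predicts credit card fraud, so that fraudulent use of a customer's card can be detected early.

Dataset Overview

The dataset (Credit Card Fraud Detection) contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions.

The defining characteristic of credit card fraud detection is this class imbalance. Fraudulent transactions are rare, which makes the dataset a good place to practice techniques for handling imbalanced samples.

Due to confidentiality constraints, the original features and further background information about the data are not provided. The target variable takes the value 1 for a fraudulent transaction and 0 otherwise.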
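
To make the imbalance concrete, the short sketch below recomputes the fraud rate from the two counts quoted above and shows the accuracy a trivial model would reach by labeling every transaction as normal. The helper name `always_normal_accuracy` is illustrative and not part of the original notebook.

```python
# Counts quoted in the dataset description above.
n_total = 284_807
n_fraud = 492

fraud_rate = n_fraud / n_total

def always_normal_accuracy(total: int, fraud: int) -> float:
    """Accuracy of a hypothetical model that labels every transaction as normal (0)."""
    return (total - fraud) / total

baseline_acc = always_normal_accuracy(n_total, n_fraud)
print(f"fraud rate: {fraud_rate:.3%}")
print(f"accuracy of always predicting 0: {baseline_acc:.3%}")
```

Because such a useless model already scores very high on accuracy, accuracy alone is a poor yardstick here, which is why the later sections focus on recall and precision.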

Modeling Approach

[Figure: modeling approach diagram]

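Scenario Analysis

  • The data consists of cardholders' credit card transactions over two days; the problem is to predict whether a cardholder's transaction is fraudulent.
  • Deciding whether a transaction is fraudulent is a binary classification problem.
  • We therefore choose a classification algorithm (here, Logistic Regression as the baseline).

Tip: Features V1 to V28 have already been transformed with PCA, while Time and Amount are on very different scales from the other features and need feature scaling. This matters especially for scale-sensitive algorithms such as LR.

Amount: can be scaled directly, for example to the range (0, 1).

Time: given in seconds; consider converting it to hours (corresponding to the time of day).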
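
The Time conversion suggested above can be sketched as follows. The sketch assumes that Time counts seconds elapsed since the first transaction in the dataset; the clock time of that first transaction is not given in the data, so `hour_of_day` is only meaningful under the additional assumption that the recording starts at midnight. Both function names are hypothetical.

```python
SECONDS_PER_HOUR = 3600

def hours_since_start(seconds: float) -> int:
    """Whole hours elapsed since the first transaction in the dataset."""
    return int(seconds // SECONDS_PER_HOUR)

def hour_of_day(seconds: float) -> int:
    """Hour within a 24-hour cycle, assuming the recording starts at midnight."""
    return hours_since_start(seconds) % 24

# 172792.0 is the maximum Time value reported by describe() later in the article.
for t in (0.0, 3599.0, 94813.0, 172792.0):
    print(t, hours_since_start(t), hour_of_day(t))
```

The notebook itself keeps the first form (hours since the start, 0 to 47); folding it with `% 24` is one option if a time-of-day feature is wanted.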

Data Preprocessing

Importing Libraries

# Imports
# Numpy,Pandas
import numpy as np
import pandas as pd
import datetime

# matplotlib, seaborn
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

# import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler


# Suppress warnings
import warnings
warnings.filterwarnings('ignore')  

pd.set_option('display.float_format', lambda x: '%.4f' % x)

Loading the Data

data_df = pd.read_csv("creditcard.csv")
print(data_df.shape)
data_df.head()
(284807, 31)



     Time       V1       V2       V3       V4       V5       V6       V7       V8       V9  ...      V21      V22      V23      V24      V25      V26      V27      V28    Amount  Class
0  0.0000  -1.3598  -0.0728   2.5363   1.3782  -0.3383   0.4624   0.2396   0.0987   0.3638  ...  -0.0183   0.2778  -0.1105   0.0669   0.1285  -0.1891   0.1336  -0.0211  149.6200      0
1  0.0000   1.1919   0.2662   0.1665   0.4482   0.0600  -0.0824  -0.0788   0.0851  -0.2554  ...  -0.2258  -0.6387   0.1013  -0.3398   0.1672   0.1259  -0.0090   0.0147    2.6900      0
2  1.0000  -1.3584  -1.3402   1.7732   0.3798  -0.5032   1.8005   0.7915   0.2477  -1.5147  ...   0.2480   0.7717   0.9094  -0.6893  -0.3276  -0.1391  -0.0554  -0.0598  378.6600      0
3  1.0000  -0.9663  -0.1852   1.7930  -0.8633  -0.0103   1.2472   0.2376   0.3774  -1.3870  ...  -0.1083   0.0053  -0.1903  -1.1756   0.6474  -0.2219   0.0627   0.0615  123.5000      0
4  2.0000  -1.1582   0.8777   1.5487   0.4030  -0.4072   0.0959   0.5929  -0.2705   0.8177  ...  -0.0094   0.7983  -0.1375   0.1413  -0.2060   0.5023   0.2194   0.2152   69.9900      0

5 rows × 31 columns

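As shown above, the data is structured, so no feature extraction or conversion is needed.

  • V1-V28 are a set of anonymized indicators (what they represent is not disclosed); they have already been processed with PCA.
  • Amount is the transaction amount and needs feature scaling.
  • Class is the label: Class=0 is a normal transaction and Class=1 is a fraudulent one.
data_df.info()  # basic information about the data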
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64  
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
data_df.describe().T  # basic summary statistics



              count        mean         std        min         25%         50%          75%          max
Time    284807.0000  94813.8596  47488.1460     0.0000  54201.5000  84692.0000  139320.5000  172792.0000
V1      284807.0000      0.0000      1.9587   -56.4075     -0.9204      0.0181       1.3156       2.4549
V2      284807.0000      0.0000      1.6513   -72.7157     -0.5985      0.0655       0.8037      22.0577
V3      284807.0000     -0.0000      1.5163   -48.3256     -0.8904      0.1798       1.0272       9.3826
V4      284807.0000      0.0000      1.4159    -5.6832     -0.8486     -0.0198       0.7433      16.8753
V5      284807.0000     -0.0000      1.3802  -113.7433     -0.6916     -0.0543       0.6119      34.8017
V6      284807.0000      0.0000      1.3323   -26.1605     -0.7683     -0.2742       0.3986      73.3016
V7      284807.0000     -0.0000      1.2371   -43.5572     -0.5541      0.0401       0.5704     120.5895
V8      284807.0000     -0.0000      1.1944   -73.2167     -0.2086      0.0224       0.3273      20.0072
V9      284807.0000     -0.0000      1.0986   -13.4341     -0.6431     -0.0514       0.5971      15.5950
V10     284807.0000      0.0000      1.0888   -24.5883     -0.5354     -0.0929       0.4539      23.7451
V11     284807.0000      0.0000      1.0207    -4.7975     -0.7625     -0.0328       0.7396      12.0189
V12     284807.0000     -0.0000      0.9992   -18.6837     -0.4056      0.1400       0.6182       7.8484
V13     284807.0000      0.0000      0.9953    -5.7919     -0.6485     -0.0136       0.6625       7.1269
V14     284807.0000      0.0000      0.9586   -19.2143     -0.4256      0.0506       0.4931      10.5268
V15     284807.0000      0.0000      0.9153    -4.4989     -0.5829      0.0481       0.6488       8.8777
V16     284807.0000      0.0000      0.8763   -14.1299     -0.4680      0.0664       0.5233      17.3151
V17     284807.0000     -0.0000      0.8493   -25.1628     -0.4837     -0.0657       0.3997       9.2535
V18     284807.0000      0.0000      0.8382    -9.4987     -0.4988     -0.0036       0.5008       5.0411
V19     284807.0000      0.0000      0.8140    -7.2135     -0.4563      0.0037       0.4589       5.5920
V20     284807.0000      0.0000      0.7709   -54.4977     -0.2117     -0.0625       0.1330      39.4209
V21     284807.0000      0.0000      0.7345   -34.8304     -0.2284     -0.0295       0.1864      27.2028
V22     284807.0000      0.0000      0.7257   -10.9331     -0.5424      0.0068       0.5286      10.5031
V23     284807.0000      0.0000      0.6245   -44.8077     -0.1618     -0.0112       0.1476      22.5284
V24     284807.0000      0.0000      0.6056    -2.8366     -0.3546      0.0410       0.4395       4.5845
V25     284807.0000      0.0000      0.5213   -10.2954     -0.3171      0.0166       0.3507       7.5196
V26     284807.0000      0.0000      0.4822    -2.6046     -0.3270     -0.0521       0.2410       3.5173
V27     284807.0000     -0.0000      0.4036   -22.5657     -0.0708      0.0013       0.0910      31.6122
V28     284807.0000     -0.0000      0.3301   -15.4301     -0.0530      0.0112       0.0783      33.8478
Amount  284807.0000     88.3496    250.1201     0.0000      5.6000     22.0000      77.1650   25691.1600
Class   284807.0000      0.0017      0.0415     0.0000      0.0000      0.0000       0.0000       1.0000

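The Time feature is in seconds. We convert it to whole hours elapsed since the first transaction and store the result in a new Hour column (0 to 47 over the two days).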

data_df['Hour'] = data_df['Time'].apply(lambda x:divmod(x,3600)[0])
data_df.sample(5)



               Time       V1       V2       V3       V4       V5       V6       V7       V8       V9  ...      V22      V23      V24      V25      V26      V27      V28    Amount  Class     Hour
265802  162055.0000   1.8019  -0.5296  -0.3982   0.5047  -0.7187  -0.7168  -0.2809  -0.2235   1.0216  ...   0.8718   0.0374   0.1065  -0.1285  -0.2624   0.0251  -0.0156  106.7200      0  45.0000
126177   77952.0000  -1.2488   0.3134   0.3555  -0.7949  -1.0377  -0.6684   0.2091   0.0347  -1.2898  ...  -0.3017   0.0967   0.0746  -0.6347   0.9844  -0.7203  -0.5310  100.0000      0  21.0000
163920  116322.0000   1.9908  -1.2415  -0.5690  -0.9741  -1.0472  -0.2112  -1.0302  -0.0320  -0.2351  ...   1.2542  -0.0194  -0.4268  -0.1706  -0.0678   0.0017  -0.0431   95.0000      0  32.0000
190144  128705.0000   2.2632  -0.8175  -1.3416  -1.0346  -0.3259  -0.4674  -0.5986  -0.2146  -0.1352  ...   0.4663   0.0271  -1.0325   0.0740  -0.0944  -0.0134  -0.0678   10.0000      0  35.0000
133830   80543.0000  -0.4457   0.3107   2.4817   0.1151  -0.4481   0.4889  -0.0565   0.2281   0.4648  ...   0.3047  -0.0858   0.2381  -0.3820   0.2383  -0.2520  -0.1992    8.0400      0  22.0000

5 rows × 32 columns

data_df.columns
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class', 'Hour'],
      dtype='object')
x_feature = ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount','Hour']
# Build the feature matrix X and the target y
X = data_df[x_feature]
y = data_df["Class"]

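Data Analysis

Positive and Negative Sample Distribution

Class=0 is the negative class (not fraud) and Class=1 is the positive class (fraud). Let's look at the number of samples in each class.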

data_df['Class'].value_counts()
0    284315
1       492
Name: Class, dtype: int64
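# Visualize the distribution of the target variable
fig, axs = plt.subplots(1,2,figsize=(14,7))
## Bar chart
sns.countplot(x='Class',data=data_df,ax=axs[0])
axs[0].set_title("Frequency of each Class")

## Pie chart
data_df['Class'].value_counts().plot(x=None,y=None, kind='pie', ax=axs[1],autopct='%1.2f%%')
axs[1].set_title("Percentage of each Class")
plt.show()

[Figure: bar chart "Frequency of each Class" and pie chart "Percentage of each Class"]

Of the 284,807 transactions in the dataset, 492 are fraudulent, which is 0.17% of the total.
Normal and fraudulent transactions are heavily imbalanced, and this imbalance affects how a classifier learns. We will use oversampling to address it.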

Normal vs. Fraudulent Transactions

# Split the data by class
fraud = data_df[data_df['Class'] == 1]
nonFraud = data_df[data_df['Class'] == 0]

# Compute the feature correlations within each class
correlationNonFraud = nonFraud.loc[:, data_df.columns != 'Class'].corr()
correlationFraud = fraud.loc[:, data_df.columns != 'Class'].corr()

# Mask the upper triangle so that each pair of features is shown only once
mask = np.zeros_like(correlationNonFraud)  # start with nothing masked
indices = np.triu_indices_from(correlationNonFraud)  # indices of the upper triangle
mask[indices] = True
grid_kws = {"width_ratios": (.9, .9, .05), "wspace": 0.2}
f, (ax1, ax2, cbar_ax) = plt.subplots(1, 3, gridspec_kw=grid_kws, figsize = (14, 9))

# Feature correlations for normal transactions
cmap = sns.diverging_palette(220, 8, as_cmap=True)
ax1 =sns.heatmap(correlationNonFraud, ax = ax1, vmin = -1, vmax = 1, \
    cmap = cmap, square = False, linewidths = 0.5, mask = mask, cbar = False)
ax1.set_xticklabels(ax1.get_xticklabels(), size = 16); 
ax1.set_yticklabels(ax1.get_yticklabels(), size = 16); 
ax1.set_title('Normal', size = 20)

# Feature correlations for fraudulent transactions
ax2 = sns.heatmap(correlationFraud, vmin = -1, vmax = 1, cmap = cmap, \
ax = ax2, square = False, linewidths = 0.5, mask = mask, yticklabels = False, \
    cbar_ax = cbar_ax, cbar_kws={'orientation': 'vertical', \
                                 'ticks': [-1, -0.5, 0, 0.5, 1]})
ax2.set_xticklabels(ax2.get_xticklabels(), size = 16); 
ax2.set_title('Fraud', size = 20);

[Figure: correlation heatmaps, "Normal" (left) and "Fraud" (right)]

As the figure shows, the correlations between some variables are more pronounced among fraudulent transactions.

In particular, V1, V2, V3, V4, V5, V6, V7, V9, V10, V11, V12, V14, V16, V17, V18 and V19 vary together in a fairly regular pattern in the fraud samples.

Fraud and Transaction Amount

f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(16,4))
bins = 30
ax1.hist(data_df["Amount"][data_df["Class"]== 1], bins = bins)
ax1.set_title('Fraud')

ax2.hist(data_df["Amount"][data_df["Class"] == 0], bins = bins)
ax2.set_title('Normal')

plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.yscale('log')
plt.show()

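[Figure: histograms of transaction Amount, "Fraud" (top) and "Normal" (bottom)]

Compared with normal transactions, fraudulent transaction amounts are small and scattered.

This suggests that fraudsters tend to favor small purchases, possibly to avoid drawing the cardholder's attention.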

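Transactions over Time

# Number of transactions per hour
sns.catplot(x="Hour", data=data_df, kind="count", height=6, aspect=3)  # catplot/height replace the older factorplot/size
<seaborn.axisgrid.FacetGrid at 0x1f6f9550>

[Figure: number of transactions per Hour]

The data spans two days, so Hour ranges from 0 to 47. The plot shows that 9 a.m. to 11 p.m. each day is the peak period for credit card spending.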

Analysis of Fields V1-V28

# Collect the V1-V28 columns

v_feat_col = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15',
         'V16', 'V17', 'V18', 'V19', 'V20','V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28']
v_feat_col_size = len(v_feat_col)


plt.figure(figsize=(16,v_feat_col_size*4))
gs = gridspec.GridSpec(v_feat_col_size, 1)
for i, cn in enumerate(data_df[v_feat_col]):
    ax = plt.subplot(gs[i])
    sns.distplot(data_df[cn][data_df["Class"] == 1], bins=50)  # fraud (Class == 1), shown in green in the original figure
    sns.distplot(data_df[cn][data_df["Class"] == 0], bins=100)  # normal (Class == 0), shown in orange in the original figure
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(cn))

[Figure: per-feature histograms of V1-V28, fraud vs. normal]

We look for variables whose distributions differ clearly between the two card states (1 = fraud, 0 = normal) and keep the features with clear discriminating power.
Based on the plots above, we drop V8, V13, V15, V20, V21, V22, V23, V24, V25, V26, V27 and V28, because these features do not separate the classes well.
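
The selection above was made by eye. One way to put a number on "separates the classes well" is a standardized mean difference between a feature's values in the two classes. The sketch below is not what the original notebook does: it uses synthetic data, and `standardized_mean_diff` is a hypothetical helper shown only to illustrate the idea.

```python
import numpy as np

def standardized_mean_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Absolute difference of means divided by the pooled standard deviation."""
    pooled_std = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return float(abs(a.mean() - b.mean()) / pooled_std)

rng = np.random.default_rng(0)
# Synthetic stand-ins for one feature's values in each class.
normal_vals = rng.normal(loc=0.0, scale=1.0, size=5000)
fraud_shifted = rng.normal(loc=-4.0, scale=2.0, size=500)   # clearly shifted distribution
fraud_overlap = rng.normal(loc=0.0, scale=1.0, size=500)    # overlaps the normal class

d_shifted = standardized_mean_diff(normal_vals, fraud_shifted)
d_overlap = standardized_mean_diff(normal_vals, fraud_overlap)
print(f"shifted feature: {d_shifted:.2f}, overlapping feature: {d_overlap:.2f}")
```

A feature whose fraud distribution is shifted gets a large score, while one whose distributions overlap scores near zero. A mean-based score misses differences in shape or spread, so it complements the histograms rather than replacing them.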

data_df.head()



     Time       V1       V2       V3       V4       V5       V6       V7       V8       V9  ...      V22      V23      V24      V25      V26      V27      V28    Amount  Class    Hour
0  0.0000  -1.3598  -0.0728   2.5363   1.3782  -0.3383   0.4624   0.2396   0.0987   0.3638  ...   0.2778  -0.1105   0.0669   0.1285  -0.1891   0.1336  -0.0211  149.6200      0  0.0000
1  0.0000   1.1919   0.2662   0.1665   0.4482   0.0600  -0.0824  -0.0788   0.0851  -0.2554  ...  -0.6387   0.1013  -0.3398   0.1672   0.1259  -0.0090   0.0147    2.6900      0  0.0000
2  1.0000  -1.3584  -1.3402   1.7732   0.3798  -0.5032   1.8005   0.7915   0.2477  -1.5147  ...   0.7717   0.9094  -0.6893  -0.3276  -0.1391  -0.0554  -0.0598  378.6600      0  0.0000
3  1.0000  -0.9663  -0.1852   1.7930  -0.8633  -0.0103   1.2472   0.2376   0.3774  -1.3870  ...   0.0053  -0.1903  -1.1756   0.6474  -0.2219   0.0627   0.0615  123.5000      0  0.0000
4  2.0000  -1.1582   0.8777   1.5487   0.4030  -0.4072   0.0959   0.5929  -0.2705   0.8177  ...   0.7983  -0.1375   0.1413  -0.2060   0.5023   0.2194   0.2152   69.9900      0  0.0000

5 rows × 32 columns

# Also drop Time and keep the Hour field
droplist = ['V8', 'V13', 'V15', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28','Time']
data_df_new = data_df.drop(droplist, axis = 1)
print(data_df_new.shape)  # features reduced from 31 to 18 (excluding the target variable)
data_df_new.tail()
(284807, 19)



              V1        V2        V3        V4        V5        V6        V7        V9       V10       V11       V12       V14       V16       V17       V18       V19    Amount  Class     Hour
284802  -11.8811   10.0718   -9.8348   -2.0667   -5.3645   -2.6068   -4.9182    1.9144    4.3562   -1.5931    2.7119    4.6269    1.1076    1.9917    0.5106   -0.6829    0.7700      0  47.0000
284803   -0.7328   -0.0551    2.0350   -0.7386    0.8682    1.0584    0.0243    0.5848   -0.9759   -0.1502    0.9158   -0.6751   -0.7118   -0.0257   -1.2212   -1.5456   24.7900      0  47.0000
284804    1.9196   -0.3013   -3.2496   -0.5578    2.6305    3.0313   -0.2968    0.4325   -0.4848    0.4116    0.0631   -0.5106    0.1407    0.3135    0.3957   -0.5773   67.8800      0  47.0000
284805   -0.2404    0.5305    0.7025    0.6898   -0.3780    0.6237   -0.6862    0.3921   -0.3991   -1.9338   -0.9629    0.4496   -0.6086    0.5099    1.1140    2.8978   10.0000      0  47.0000
284806   -0.5334   -0.1897    0.7033   -0.5063   -0.0125   -0.6496    1.5770    0.4862   -0.9154   -1.0405   -0.0315   -0.0843   -0.3026   -0.6604    0.1674   -0.2561  217.0000      0  47.0000

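Feature Engineering

Hour and Amount are on very different scales from the other features, so we apply feature scaling to them.

# Scale Amount and Hour
col = ['Amount','Hour']
from sklearn.preprocessing import StandardScaler  # import the scaler
sc =StandardScaler()  # standardizes each feature (column) to zero mean and unit variance; it works per feature, not per sample
data_df_new[col] =sc.fit_transform(data_df_new[col])  # standardize the data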
data_df_new.tail()



              V1        V2        V3        V4        V5        V6        V7        V9       V10       V11       V12       V14       V16       V17       V18       V19    Amount  Class     Hour
284802  -11.8811   10.0718   -9.8348   -2.0667   -5.3645   -2.6068   -4.9182    1.9144    4.3562   -1.5931    2.7119    4.6269    1.1076    1.9917    0.5106   -0.6829   -0.3502      0   1.6044
284803   -0.7328   -0.0551    2.0350   -0.7386    0.8682    1.0584    0.0243    0.5848   -0.9759   -0.1502    0.9158   -0.6751   -0.7118   -0.0257   -1.2212   -1.5456   -0.2541      0   1.6044
284804    1.9196   -0.3013   -3.2496   -0.5578    2.6305    3.0313   -0.2968    0.4325   -0.4848    0.4116    0.0631   -0.5106    0.1407    0.3135    0.3957   -0.5773   -0.0818      0   1.6044
284805   -0.2404    0.5305    0.7025    0.6898   -0.3780    0.6237   -0.6862    0.3921   -0.3991   -1.9338   -0.9629    0.4496   -0.6086    0.5099    1.1140    2.8978   -0.3132      0   1.6044
284806   -0.5334   -0.1897    0.7033   -0.5063   -0.0125   -0.6496    1.5770    0.4862   -0.9154   -1.0405   -0.0315   -0.0843   -0.3026   -0.6604    0.1674   -0.2561    0.5144      0   1.6044

data_df_new.describe().T



              count     mean     std        min      25%      50%      75%       max
V1      284807.0000   0.0000  1.9587   -56.4075  -0.9204   0.0181   1.3156    2.4549
V2      284807.0000   0.0000  1.6513   -72.7157  -0.5985   0.0655   0.8037   22.0577
V3      284807.0000  -0.0000  1.5163   -48.3256  -0.8904   0.1798   1.0272    9.3826
V4      284807.0000   0.0000  1.4159    -5.6832  -0.8486  -0.0198   0.7433   16.8753
V5      284807.0000  -0.0000  1.3802  -113.7433  -0.6916  -0.0543   0.6119   34.8017
V6      284807.0000   0.0000  1.3323   -26.1605  -0.7683  -0.2742   0.3986   73.3016
V7      284807.0000  -0.0000  1.2371   -43.5572  -0.5541   0.0401   0.5704  120.5895
V9      284807.0000  -0.0000  1.0986   -13.4341  -0.6431  -0.0514   0.5971   15.5950
V10     284807.0000   0.0000  1.0888   -24.5883  -0.5354  -0.0929   0.4539   23.7451
V11     284807.0000   0.0000  1.0207    -4.7975  -0.7625  -0.0328   0.7396   12.0189
V12     284807.0000  -0.0000  0.9992   -18.6837  -0.4056   0.1400   0.6182    7.8484
V14     284807.0000   0.0000  0.9586   -19.2143  -0.4256   0.0506   0.4931   10.5268
V16     284807.0000   0.0000  0.8763   -14.1299  -0.4680   0.0664   0.5233   17.3151
V17     284807.0000  -0.0000  0.8493   -25.1628  -0.4837  -0.0657   0.3997    9.2535
V18     284807.0000   0.0000  0.8382    -9.4987  -0.4988  -0.0036   0.5008    5.0411
V19     284807.0000   0.0000  0.8140    -7.2135  -0.4563   0.0037   0.4589    5.5920
Amount  284807.0000   0.0000  1.0000    -0.3532  -0.3308  -0.2653  -0.0447  102.3622
Class   284807.0000   0.0017  0.0415     0.0000   0.0000   0.0000   0.0000    1.0000
Hour    284807.0000  -0.0000  1.0000    -1.9603  -0.8226  -0.2158   0.9218    1.6044
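
The summary above shows Amount and Hour with mean 0 and standard deviation 1 after scaling. As a minimal sketch of what that standardization computes, the snippet below applies the z-score formula with NumPy to a few illustrative amounts, then cross-checks one value against the article's own tables. The helper `standardize` is hypothetical; the mean and standard deviation in the cross-check are the Amount statistics from the earlier describe() output.

```python
import numpy as np

def standardize(values: np.ndarray) -> np.ndarray:
    """Z-score scaling: subtract the mean, divide by the population std (ddof=0)."""
    return (values - values.mean()) / values.std()

# A handful of amounts, used only for illustration.
amounts = np.array([2.69, 10.0, 69.99, 123.5, 149.62, 378.66])
scaled = standardize(amounts)
print(scaled.round(4))

# Cross-check with the article's tables: row 284802 has Amount 0.77 before
# scaling and -0.3502 after; Amount has mean 88.3496 and std 250.1201.
z_check = (0.77 - 88.3496) / 250.1201
print(round(z_check, 4))
```

The describe() output reports the sample standard deviation, while the scaler divides by the population one; with 284,807 rows the two agree to the displayed precision.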

Feature Importance Analysis

We rank the features using the feature importances of a random forest.

x_feature = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V9', 'V10', 'V11', 'V12', 'V14', 'V16', 'V17', 'V18', 'V19', 'Amount',  'Hour']
x_val = data_df_new[x_feature]
y_val = data_df_new['Class']
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators=10,random_state=123,max_depth=4)  # build the random forest classifier
clf.fit(x_val, y_val)  # fit on the features and the target
RandomForestClassifier(max_depth=4, n_estimators=10, random_state=123)
for feature in zip(x_feature,clf.feature_importances_):
    print(feature)
('V1', 0.0008826091438778425)
('V2', 0.0021058185061093608)
('V3', 0.009750867340434583)
('V4', 0.01751094043420745)
('V5', 0.008600547467227002)
('V6', 0.013298075656335426)
('V7', 0.0086835897086001)
('V9', 0.023090145788325165)
('V10', 0.08528888657921369)
('V11', 0.06537921978883558)
('V12', 0.14194613523236163)
('V14', 0.13109127164220205)
('V16', 0.19729822871872432)
('V17', 0.27966491161168533)
('V18', 0.009405287105749225)
('V19', 0.0002669771829968763)
('Amount', 0.0017493348363684953)
('Hour', 0.003987153256745854)
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (12,6)

## Visualize the feature importances ##
importances = clf.feature_importances_
feat_names = data_df_new[x_feature].columns
indices = np.argsort(importances)[::-1]
fig = plt.figure(figsize=(20,6))
plt.title("Feature importances by RandomForestClassifier")

x = list(range(len(indices)))

plt.bar(x, importances[indices], color='lightblue',  align="center")
plt.step(x, np.cumsum(importances[indices]), where='mid', label='Cumulative')
plt.xticks(x, feat_names[indices], rotation='vertical',fontsize=14)
plt.xlim([-1, len(indices)])
(-1, 18)

[Figure: bar chart of feature importances with a cumulative step line]

from sklearn import tree
# Extract a single tree from the random forest
estimator = clf.estimators_[5]

# Decision tree visualization reference: https://blog.csdn.net/shenfuli/article/details/108492095
# Import the visualization helpers
import pydotplus
from IPython.display import display, Image

# Note: install Graphviz for your operating system and adjust the path below to match
import os       
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

dot_data = tree.export_graphviz(estimator, 
                                out_file=None, 
                                feature_names=x_feature,
                                class_names = ['0-normal', '1-fraud'],
                                filled = True,
                                rounded =True
                               )
graph = pydotplus.graph_from_dot_data(dot_data)
display(Image(graph.create_png()))

[Figure: one decision tree extracted from the random forest]

Dimensionality Reduction and Clustering

Understanding t-SNE requires the following concepts:

  • Euclidean distance
  • Conditional probability
  • Normal and t-distribution plots
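
The three concepts listed above come together in how t-SNE measures similarity. The sketch below is a simplified illustration, not scikit-learn's implementation: it turns squared Euclidean distances into Gaussian conditional probabilities in the original space and into Student-t probabilities in the embedding. It assumes a single fixed `sigma` for all points, whereas real t-SNE tunes one per point from the perplexity setting. All function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(points: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)

def gaussian_conditional_probs(points: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """p(j|i): Gaussian similarity of j to i, normalized over all j != i."""
    affinities = np.exp(-pairwise_sq_dists(points) / (2.0 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)
    return affinities / affinities.sum(axis=1, keepdims=True)

def student_t_joint_probs(embedding: np.ndarray) -> np.ndarray:
    """q(i,j): Student-t (one degree of freedom) similarity, normalized over all pairs."""
    kernel = 1.0 / (1.0 + pairwise_sq_dists(embedding))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum()

# Three toy points: the first two are close together, the third is far away.
high_dim = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 5.0, 5.0]])
low_dim = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0]])

P = gaussian_conditional_probs(high_dim)
Q = student_t_joint_probs(low_dim)
print(P.round(4))
print(Q.round(4))
```

t-SNE then moves the low-dimensional points so that Q matches a symmetrized version of P as closely as possible. The heavier tails of the t-distribution let dissimilar points sit far apart in the embedding.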

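Conclusions

  • t-SNE separates the fraud and non-fraud cases in the dataset into clearly distinct clusters.
  • Although the subsample is small, t-SNE finds the clusters quite accurately in each run (I shuffle the dataset before running t-SNE).
  • This suggests that downstream predictive models should do reasonably well at telling fraud from non-fraud cases.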
# Lets shuffle the data before creating the subsamples
df = data_df_new.sample(frac=1)
# amount of fraud classes 492 rows.
fraud_df = df.loc[df['Class'] == 1]
non_fraud_df = df.loc[df['Class'] == 0][:492]

normal_distributed_df = pd.concat([fraud_df, non_fraud_df])

# Shuffle dataframe rows
new_df = normal_distributed_df.sample(frac=1, random_state=42)
print(new_df.shape)
new_df.head()
(984, 19)



              V1        V2        V3        V4        V5        V6        V7        V9       V10       V11       V12       V14       V16       V17       V18       V19    Amount  Class     Hour
147662    2.0090   -0.4316   -1.7964    0.0436    0.5059    0.1105   -0.0201    0.6397    0.2503   -0.3630   -0.1701    0.7224    0.3486   -0.7336    0.1952    0.8910   -0.1528      0  -0.1400
95534     1.1939   -0.5711    0.7425   -0.0146   -0.6246    0.8322   -0.8334    1.1694   -0.3717   -0.2457    1.3759   -0.8193    0.1259   -0.3972    0.2724    1.2260   -0.2257      1  -0.5951
38764     1.1490   -0.2724    0.2268    0.7082   -0.4065   -0.1700   -0.1213    0.7598   -0.2049   -1.6016   -0.4125    0.0845    0.1235   -0.2379   -0.2917    0.5235   -0.0534      0  -1.2018
252774   -1.2014    4.8645   -8.3288    7.6524   -0.1674   -2.7677   -3.1764   -4.3672   -5.5334    4.1064   -6.3318  -12.1566   -2.1109   -1.5585    0.1960    0.5025   -0.3502      1   1.3011
15225   -19.8563   12.0959  -22.4641    6.1155  -15.1480   -4.3467  -15.6485   -3.9742   -8.8592    5.7308   -8.0880   -8.5790   -6.9477  -13.4729   -4.9402    1.2301    0.0465      1  -1.4293

import time
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA,TruncatedSVD

X = new_df.drop('Class', axis=1)
y = new_df['Class']

# T-SNE Implementation
t0 = time.time()
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("T-SNE took {:.2} s".format(t1 - t0))

# PCA Implementation
t0 = time.time()
X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("PCA took {:.2} s".format(t1 - t0))

# TruncatedSVD
t0 = time.time()
X_reduced_svd = TruncatedSVD(n_components=2, algorithm='randomized', random_state=42).fit_transform(X.values)
t1 = time.time()
print("Truncated SVD took {:.2} s".format(t1 - t0))
T-SNE took 1.1e+01 s
PCA took 0.003 s
Truncated SVD took 0.004 s
import matplotlib.patches as mpatches

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24,6))
# labels = ['No Fraud', 'Fraud']
f.suptitle('Clusters using Dimensionality Reduction', fontsize=14)


blue_patch = mpatches.Patch(color='#0A0AFF', label='No Fraud')
red_patch = mpatches.Patch(color='#AF0000', label='Fraud')


# t-SNE scatter plot
ax1.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax1.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax1.set_title('t-SNE', fontsize=14)

ax1.grid(True)

ax1.legend(handles=[blue_patch, red_patch])


# PCA scatter plot
ax2.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax2.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax2.set_title('PCA', fontsize=14)

ax2.grid(True)

ax2.legend(handles=[blue_patch, red_patch])

# TruncatedSVD scatter plot
ax3.scatter(X_reduced_svd[:,0], X_reduced_svd[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax3.scatter(X_reduced_svd[:,0], X_reduced_svd[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax3.set_title('Truncated SVD', fontsize=14)

ax3.grid(True)

ax3.legend(handles=[blue_patch, red_patch])

plt.show()

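[Figure: "Clusters using Dimensionality Reduction", with panels t-SNE, PCA and Truncated SVD]

Model Training

Handling Class Imbalance

Common remedies for class imbalance are listed below. In this project (1 = fraud, 0 = normal) we oversample the fraud class.

  • Oversampling: add positive samples so that the positive and negative counts are close, then train.
  • Undersampling: remove some negative samples so that the positive and negative counts are close, then train.

For oversampling we use SMOTE (Synthetic Minority Oversampling Technique).

How SMOTE Works

SMOTE stands for Synthetic Minority Oversampling Technique: it generates new, synthetic minority-class samples rather than duplicating existing ones.

A Python implementation is available in the imbalanced-learn package (install it with pip install -U imbalanced-learn).

from imblearn.over_sampling import SMOTE  # import the SMOTE implementation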
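
To show the idea behind SMOTE, here is a minimal NumPy sketch of its core step: pick a minority sample, pick one of its k nearest minority neighbors, and create a new point somewhere on the line segment between them. This is an illustration under simplifying assumptions (brute-force neighbor search, a hypothetical function name `smote_sketch`), not the imbalanced-learn implementation.

```python
import numpy as np

def smote_sketch(minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic rows by interpolating between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    n = len(minority)
    k = min(k, n - 1)
    # Squared distances between all minority points; a point is not its own neighbor.
    diff = minority[:, None, :] - minority[None, :, :]
    dists = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]       # indices of the k nearest neighbors

    base_idx = rng.integers(0, n, size=n_new)           # which minority point to start from
    nb_idx = neighbors[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                        # interpolation weight in [0, 1)
    return minority[base_idx] + gap * (minority[nb_idx] - minority[base_idx])

# Four toy minority points on the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(minority, n_new=10, k=2)
print(synthetic.shape)
print(synthetic.round(3))
```

Because every synthetic row lies between two real minority rows, the new samples stay inside the region the minority class already occupies.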

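Oversampling the Imbalanced Data

# Build the feature matrix X and the target y
# Note: data_df still holds the unscaled Amount and Hour; data_df_new holds the standardized columns
X = data_df[x_feature]
y = data_df["Class"]

n_sample = y.shape[0]
n_pos_sample = y[y == 1].shape[0]
n_neg_sample = y[y == 0].shape[0]
print('Samples: {}; positive: {:.2%}; negative: {:.2%}'.format(n_sample,
                                                   n_pos_sample / n_sample,
                                                   n_neg_sample / n_sample))
print('Number of features:', X.shape[1])
Samples: 284807; positive: 0.17%; negative: 99.83%
Number of features: 18
from imblearn.over_sampling import SMOTE  # import the SMOTE implementation
# Handle the imbalanced data
sm = SMOTE(random_state=42)    # the oversampler
X, y = sm.fit_resample(X, y)   # fit_resample replaces the older fit_sample
print('After balancing the classes with SMOTE')
n_sample = y.shape[0]
n_pos_sample = y[y == 1].shape[0]
n_neg_sample = y[y == 0].shape[0]
print('Samples: {}; positive: {:.2%}; negative: {:.2%}'.format(n_sample,
                                                   n_pos_sample / n_sample,
                                                   n_neg_sample / n_sample))
print('Number of features:', X.shape[1])
After balancing the classes with SMOTE
Samples: 568630; positive: 50.00%; negative: 50.00%
Number of features: 18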

Training the Classifier

Building the Training and Test Sets

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,stratify = y,test_size= 0.3,random_state=42)
len(X_train),len(X_test)
(398041, 170589)
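
One caveat is worth keeping in mind. The notebook oversamples before splitting, so the test set also contains synthetic fraud rows, some of which may have been interpolated from real rows that ended up in the training set. A commonly recommended alternative is to split first and oversample only the training portion. The sketch below illustrates that ordering with NumPy on synthetic data. For brevity it duplicates random minority rows instead of running SMOTE, and both helper names are hypothetical.

```python
import numpy as np

def stratified_split(y: np.ndarray, test_size: float, seed: int = 0):
    """Return (train_idx, test_idx) keeping the class ratio in both parts."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(len(idx) * test_size))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

def oversample_minority(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """Duplicate random minority rows until both classes have the same count."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    keep = np.concatenate([neg, pos, extra])
    return X[keep], y[keep]

# Synthetic imbalanced data: 1000 normal rows and 20 fraud rows.
rng = np.random.default_rng(42)
X_all = rng.normal(size=(1020, 3))
y_all = np.array([0] * 1000 + [1] * 20)

train_idx, test_idx = stratified_split(y_all, test_size=0.3)
X_tr, y_tr = oversample_minority(X_all[train_idx], y_all[train_idx])
X_te, y_te = X_all[test_idx], y_all[test_idx]   # left untouched: real rows only
print(np.bincount(y_tr), np.bincount(y_te))
```

With this ordering, the metrics computed on the test part reflect the original class ratio rather than the balanced 50/50 mix.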

Model Training (Baseline)

#help(LogisticRegression)
# Train the model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()  # build the logistic regression classifier
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Evaluate the model
from sklearn.metrics import confusion_matrix,classification_report
print('<--------Confusion Matrix-------->\n',confusion_matrix(y_test,y_pred))
print('<--------Classification Report-------->\n',classification_report(y_test,y_pred))
<--------Confusion Matrix-------->
 [[84062  1233]
 [ 5712 79582]]
<--------Classification Report-------->
               precision    recall  f1-score   support

           0       0.94      0.99      0.96     85295
           1       0.98      0.93      0.96     85294

    accuracy                           0.96    170589
   macro avg       0.96      0.96      0.96    170589
weighted avg       0.96      0.96      0.96    170589

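Model Tuning

We tune the model with grid search to find the best hyperparameters.

The available parameters can be found via help(LogisticRegression) or in the official documentation:

__init__(self, penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1,
		class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto',
		verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
# Build the parameter grid
# Note: the default lbfgs solver does not support the l1 penalty, so the 'l1' candidates
# cannot be fitted with LogisticRegression() as written; solver='liblinear' or 'saga' accepts l1
param_grid = {'C': [0.1, 1, 10,100],  # rule of thumb: increase by factors of 10
                            'penalty': [ 'l1', 'l2']}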

clf = GridSearchCV(LogisticRegression(),  param_grid, cv=5)
clf.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': [0.1, 1, 10, 100], 'penalty': ['l1', 'l2']})
clf.best_params_
{'C': 10, 'penalty': 'l2'}
# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
from sklearn.metrics import confusion_matrix,classification_report
print('<--------Confusion Matrix-------->\n',confusion_matrix(y_test,y_pred))
print('<--------Classification Report-------->\n',classification_report(y_test,y_pred))
<--------Confusion Matrix-------->
 [[84049  1246]
 [ 5782 79512]]
<--------Classification Report-------->
               precision    recall  f1-score   support

           0       0.94      0.99      0.96     85295
           1       0.98      0.93      0.96     85294

    accuracy                           0.96    170589
   macro avg       0.96      0.96      0.96    170589
weighted avg       0.96      0.96      0.96    170589
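
For reference, the grid search above simply enumerates every combination in the parameter grid and keeps the one with the best cross-validated score. The pure-Python sketch below shows that enumeration for the same grid. The scoring function `toy_score` is a made-up stand-in, not a real model evaluation; it is only there so that the selection step has something to rank.

```python
from itertools import product

param_grid = {"C": [0.1, 1, 10, 100], "penalty": ["l1", "l2"]}

def expand_grid(grid: dict) -> list:
    """Every combination of the listed values, one dict per candidate."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

def toy_score(params: dict) -> float:
    """Hypothetical stand-in for a mean cross-validation score; not a real model."""
    penalty_bonus = 0.01 if params["penalty"] == "l2" else 0.0
    return 0.95 - abs(params["C"] - 10) / 1000 + penalty_bonus

candidates = expand_grid(param_grid)
best = max(candidates, key=toy_score)
print(len(candidates), "candidates; 5-fold CV would fit", len(candidates) * 5, "models")
print("best under the toy score:", best)
```

The cost grows multiplicatively with each added parameter, which is why grids are usually kept coarse, for example by stepping C in factors of 10 as above.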

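Plotting the Learning Curve

Grid search is a convenient way to pick hyperparameters, and you can confidently try it on the other models as well.

We should also check whether the model is overfitting or underfitting.

As before, we use a learning curve for this.

In the learning curve plotted below, the gap between the training score and the cross-validation score is small, so the model fits well.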

from sklearn.model_selection import ShuffleSplit 
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    f, ax1 = plt.subplots(1,1, figsize=(10,6), sharey=True)
    if ylim is not None:
        plt.ylim(*ylim)
    # First Estimator
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    ax1.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="#ff9124")
    ax1.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="#2492ff")
    ax1.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124",
             label="Training score")
    ax1.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff",
             label="Cross-validation score")
    ax1.set_title("Logistic Regression Learning Curve", fontsize=14)
    ax1.set_xlabel('Training size (m)')
    ax1.set_ylabel('Score')
    ax1.grid(True)
    ax1.legend(loc="best")

    return plt

title = "Learning Curves (lr C:10, penalty: l2)"

estimator = LogisticRegression(penalty='l2', C=10.0)  # refit with the best parameters to check for overfitting

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
plot_learning_curve(estimator,  X, y, (0.87, 1.01), cv=cv, n_jobs=4)
<module 'matplotlib.pyplot' from 'D:\\opt\\anaconda3\\lib\\site-packages\\matplotlib\\pyplot.py'>

[Figure: "Logistic Regression Learning Curve", training score vs. cross-validation score]

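Model Evaluation

Confusion Matrix

Different problems usually call for different metrics to measure model performance.
For example, suppose we want an algorithm to predict whether credit card transactions are fraudulent, and 5 out of 100 transactions are fraud. For risk control, raising the model's recall matters more than raising its precision, because from a risk-control standpoint missing a fraud is more serious than raising a false alarm.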
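
To connect these two metrics to a confusion matrix, the snippet below recomputes recall and precision from the matrix printed for the tuned model earlier in the article. Only the variable names are new; the four counts are taken from that output.

```python
# Confusion matrix reported for the tuned model earlier in the article:
# rows are true labels (0, 1), columns are predicted labels (0, 1).
cm = [[84049, 1246],
      [5782, 79512]]

tn, fp = cm[0]
fn, tp = cm[1]

recall = tp / (tp + fn)        # share of actual fraud that was caught
precision = tp / (tp + fp)     # share of fraud alerts that were correct

print(f"recall={recall:.4f}, precision={precision:.4f}")
print(f"missed fraud (FN)={fn}, false alarms (FP)={fp}")
```

The recall value agrees with the figure the notebook prints for threshold 0.5 in the next step.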

import itertools
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
from sklearn.metrics import confusion_matrix


y_pred_proba = clf.predict_proba(X_test)  # predict_proba returns class probabilities
thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]  # thresholds to try
plt.figure(figsize=(15,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_proba[:,1] > i  # flag as fraud when the predicted probability exceeds the threshold
    plt.subplot(3,3,j)  # 3 x 3 grid of subplots; j is the index of the current subplot
    j += 1
    cnf_matrix = confusion_matrix(y_test, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    
    x1 = cnf_matrix[1,1]  # true positives: actual fraud predicted as fraud
    x2 = (cnf_matrix[1,0]+cnf_matrix[1,1])  # all actual positives
    print("threshold:{},Recall metric in the testing dataset {}->{}->{} ".format( i, x1/x2,x1,x2))
    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix ,classes=class_names)
threshold:0.1,Recall metric in the testing dataset 0.9827772176237485->83825->85294 
threshold:0.2,Recall metric in the testing dataset 0.9658709874082585->82383->85294 
threshold:0.3,Recall metric in the testing dataset 0.9521771754167937->81215->85294 
threshold:0.4,Recall metric in the testing dataset 0.9416606091870472->80318->85294 
threshold:0.5,Recall metric in the testing dataset 0.9322109409806082->79512->85294 
threshold:0.6,Recall metric in the testing dataset 0.9277674865758435->79133->85294 
threshold:0.7,Recall metric in the testing dataset 0.9218936853706005->78632->85294 
threshold:0.8,Recall metric in the testing dataset 0.9142612610500153->77981->85294 
threshold:0.9,Recall metric in the testing dataset 0.9019391750885174->76930->85294

[Figure: confusion matrices for thresholds 0.1 to 0.9 in a 3 x 3 grid]

Plotting the Precision-Recall Curve

from itertools import cycle

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
colors = cycle(['navy', 'turquoise', 'darkorange', 'cornflowerblue', 'teal', 'red', 'yellow', 'green', 'blue','black'])

plt.figure(figsize=(12,7))

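for i,color in zip(thresholds,colors):
    y_test_predictions_prob = y_pred_proba[:,1] > i  # binarize the predicted probabilities at threshold i

    # Note: the scores passed in here are already binarized, so each curve has only a few points;
    # passing y_pred_proba[:,1] directly would trace the full precision-recall curve
    precision, recall, _ = precision_recall_curve(y_test, y_test_predictions_prob)
    area = auc(recall, precision)  # area under the precision-recall curve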
    
    # Plot Precision-Recall curve
    plt.plot(recall, precision, color=color,
                 label='Threshold: %s, AUC=%0.5f' %(i , area))
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('Precision-Recall Curve')
    plt.legend(loc="lower left")

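[Figure: "Precision-Recall Curve", one curve per threshold]

The PR curves tell us the following:

  • Precision and recall pull against each other.
  • The confusion matrices and PR curves above show that the lower the threshold, the higher the recall: the model catches more fraudulent transactions, at the cost of more false alarms.
  • As the threshold rises, recall gradually falls while precision rises, and the number of false alarms drops.
  • Adjusting the threshold controls how aggressive the model's fraud detection is: set a lower threshold to catch more fraud, and a higher one otherwise.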
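
The trade-off described in the list above can be reproduced on a tiny example. The labels and probabilities below are made up purely for illustration, and `recall_precision_at` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np

def recall_precision_at(y_true: np.ndarray, proba: np.ndarray, threshold: float):
    """Recall and precision when scores above the threshold are flagged as fraud."""
    flagged = proba > threshold
    tp = np.sum(flagged & (y_true == 1))
    fp = np.sum(flagged & (y_true == 0))
    fn = np.sum(~flagged & (y_true == 1))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return float(recall), float(precision)

# Hypothetical labels and predicted fraud probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
proba = np.array([0.05, 0.15, 0.30, 0.45, 0.35, 0.55, 0.65, 0.80, 0.20, 0.95])

for t in (0.2, 0.5, 0.7):
    r, p = recall_precision_at(y_true, proba, t)
    print(f"threshold={t}: recall={r:.2f}, precision={p:.2f}")
```

Raising the threshold from 0.2 to 0.7 lowers recall and raises precision on this toy data, the same direction of change as in the notebook's threshold sweep.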

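Review and Summary

  • Evaluation metrics: when should you use recall, and when accuracy?

There is no fixed rule. For example, in news text classification we mainly want the predicted categories to be highly accurate.

In credit card fraud, however, we would rather recall as much fraud data as possible, even at the cost of some false positives.

  • Class imbalance in classification: in this case study the positive samples are scarce, so we oversample them with SMOTE.
  • Binary classification predicts a probability for each sample. There is no fixed rule for setting the threshold; it should be decided together with the business, because different thresholds change recall and precision, and the choice depends on which metric the business wants to improve. For credit card fraud, higher recall is preferred (that is, intercept as many potentially fraudulent transactions as possible).
  • Binary classification can use either traditional machine learning or deep learning. Here we choose traditional machine learning with LR as the baseline model, because it shows which features are useful and is easy to explain to the business.
  • This kind of task shows how much feature engineering matters. Analyzing features such as V1-V28 directly affects model performance. In short, the data is what matters most.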
