Taobao Ad Display and Click Logs (2017) and an Analysis Example

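1. Problem Description

In the era of big data, advertising has become the mainstream channel for operations and promotion across the internet industry. For e-commerce, the effectiveness of an ad campaign depends on how many conversions it brings to the platform. Conversions require traffic (clicks) first, so targeting ads precisely and raising the click-through rate (CTR) is essential to precision marketing.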

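2. Dataset Contents

The Taobao Ad Display and Click Logs (2017) data were built by randomly sampling 1,140,000 users from the Taobao website and collecting 8 days of their ad display/click logs (26 million records), which form the raw sample skeleton. The first 7 days (2017-05-06 to 2017-05-12) serve as the training set, and the last day (2017-05-13) serves as the test set.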
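The train/test split described above can be sketched as a cut at the start of the test day. This is only an illustration: the function name split_by_day is hypothetical, and it assumes the day boundary is local midnight in the Asia/Shanghai time zone, which the dataset description does not state explicitly.

```python
import pandas as pd

def split_by_day(df, test_day="2017-05-13", tz="Asia/Shanghai"):
    """Split log rows into (train, test) at local midnight of test_day.

    Assumption: the day boundary is midnight in `tz`.
    """
    # Interpret the integer time_stamp column as Unix seconds, then shift to tz.
    local_time = pd.to_datetime(df["time_stamp"], unit="s", utc=True).dt.tz_convert(tz)
    cutoff = pd.Timestamp(test_day, tz=tz)
    return df[local_time < cutoff], df[local_time >= cutoff]

# Three rows copied from the raw_sample.csv sample data in this article.
demo = pd.DataFrame({
    "user": [581738, 449818, 914836],
    "time_stamp": [1494137644, 1494638778, 1494650879],
})
train, test = split_by_day(demo)
print(len(train), len(test))
```

Comparing against a timezone-aware cutoff avoids hard-coding a Unix timestamp for the boundary.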

raw_sample.csv

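Field descriptions:

(1) user: user ID (integer);
(2) time_stamp: Unix timestamp in seconds (Bigint; 1494032110 represents 2017-05-06 08:55:10 in Beijing time, UTC+8);
(3) adgroup_id: ad group ID (integer);
(4) pid: scenario;
(5) nonclk: 1 means not clicked, 0 means clicked;
(6) clk: 0 means not clicked, 1 means clicked.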
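The time_stamp example in the field list can be checked with the standard library. The sketch below assumes a fixed UTC+8 offset for Beijing time; the helper name to_cst is hypothetical.

```python
from datetime import datetime, timezone, timedelta

# Fixed UTC+8 offset (China Standard Time); an assumption for this sketch.
CST = timezone(timedelta(hours=8))

def to_cst(time_stamp):
    """Convert Unix seconds to a timezone-aware datetime in UTC+8."""
    return datetime.fromtimestamp(time_stamp, tz=CST)

example = to_cst(1494032110)
print(example.strftime("%Y-%m-%d %H:%M:%S"))  # 2017-05-06 08:55:10
```

Passing tz explicitly keeps the result independent of the time zone of the machine that runs the code.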

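If we use the user ID and timestamp as a primary key, we find many duplicate records. This is because the different types of behavior data were collected by different departments, and small deviations were introduced when they were packaged together (that is, two identical timestamps may correspond to two slightly different actual times).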
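One way to see the effect is to count the rows whose (user, time_stamp) pair occurs more than once. The toy log below is hypothetical and only illustrates the check; adding adgroup_id to the key is one possible workaround, not something the dataset documentation prescribes.

```python
import pandas as pd

# Hypothetical toy log: user 1 has two rows sharing the same time_stamp.
log = pd.DataFrame({
    "user":       [1, 1, 2, 3],
    "time_stamp": [1494032110, 1494032110, 1494032110, 1494032200],
    "adgroup_id": [10, 11, 10, 12],
    "clk":        [0, 1, 0, 0],
})

key = ["user", "time_stamp"]
# keep=False marks every row that belongs to a duplicated key, not just the repeats.
dup_mask = log.duplicated(subset=key, keep=False)
print(dup_mask.sum(), "rows share a (user, time_stamp) key")

# A wider key that also includes adgroup_id separates these rows in the toy data.
wider_key = key + ["adgroup_id"]
print(log.duplicated(subset=wider_key).any())
```

On the real file the same two calls show how many rows are affected before deciding whether to drop, keep, or re-key them.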

Sample data:

user	time_stamp	adgroup_id	pid	nonclk
581738	1494137644	1	430548_1007	1
449818	1494638778	3	430548_1007	1
914836	1494650879	4	430548_1007	1
914836	1494651029	5	430548_1007	1
399907	1494302958	8	430548_1007	1
628137	1494524935	9	430548_1007	1
298139	1494462593	9	430539_1007	1

ad_feature.csv

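This dataset contains the basic information of every ad referenced by the adgroup_id field in the ad click log raw_sample.

Field definitions:

(1) adgroup_id: ad ID (integer);
(2) cate_id: category ID;
(3) campaign_id: campaign ID;
(4) brand: brand ID;
(5) customer_id: advertiser ID (the column is named customer in the sample data and in the code in this article);
(6) price: item price.

Each ad ID corresponds to one item; each item belongs to one category and one brand.

Sample data: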

adgroup_id	cate_id	campaign_id	customer	brand	price
63133	6406	83237	1	95471	170
313401	6406	83237	1	87331	199
248909	392	83237	1	32233	38
208458	392	83237	1	174374	139
110847	7211	135256	2	145952	32.99
607788	6261	387991	6	207800	199
375706	4520	387991	6	NULL	99
11115	7213	139747	9	186847	33
24484	7207	139744	9	186847	19
28589	5953	395195	13	NULL	428
23236	5953	395195	13	NULL	368
300556	5953	395195	13	NULL	639
92560	5953	395195	13	NULL	368
590965	4284	28145	14	454237	249
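The sample rows contain NULL in the brand column. As a minimal sketch, the snippet below parses four of those rows with pandas to show how they load: read_csv treats the string NULL as a missing value by default, so brand becomes a float column, which is why brand IDs print with a trailing .0 (for example 353787.0) in the analysis output. The rows are tab-separated here to match the listing; the separator of the real file may differ.

```python
import io
import pandas as pd

# Four rows copied from the ad_feature.csv sample listing (tab-separated here).
sample = (
    "adgroup_id\tcate_id\tcampaign_id\tcustomer\tbrand\tprice\n"
    "63133\t6406\t83237\t1\t95471\t170\n"
    "110847\t7211\t135256\t2\t145952\t32.99\n"
    "375706\t4520\t387991\t6\tNULL\t99\n"
    "28589\t5953\t395195\t13\tNULL\t428\n"
)
ads = pd.read_csv(io.StringIO(sample), sep="\t")

print(ads["adgroup_id"].is_unique)   # each ad ID should appear only once
print(ads["brand"].isna().sum())     # "NULL" is parsed as a missing value
print(ads["brand"].dtype)            # float64, because NaN forces a float column
```

Rows with a missing brand drop out of any groupby on brand unless dropna=False is passed.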

user_profile.csv

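This dataset contains the basic information of the 1,060,000 users referenced by the user field in the ad click log raw_sample.

Field definitions:

(1) userid: user ID;
(2) cms_segid: micro-group ID;
(3) cms_group_id: cms_group_id;
(4) final_gender_code: gender, 1 = male, 2 = female;
(5) age_level: age level;
(6) pvalue_level: consumption level, 1 = low, 2 = medium, 3 = high;
(7) shopping_level: shopping depth, 1 = shallow user, 2 = moderate user, 3 = deep user;
(8) occupation: whether the user is a college student, 1 = yes, 0 = no;
(9) new_user_class_level: city tier.

Sample data: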

userid	cms_segid	cms_group_id	final_gender_code	age_level	pvalue_level	shopping_level	occupation	new_user_class_level
234	0	5	2	5		3	0	3
523	5	2	2	2	1	3	1	2
612	0	8	1	2	2	3	0
1670	0	4	2	4		1	0
2545	0	10	1	4		3	0
3644	49	6	2	6	2	3	0	2
5777	44	5	2	5	2	3	0	2
6211	0	9	1	3		3	0	2
6355	2	1	2	1	1	3	0	4
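Several rows in the sample have empty pvalue_level or new_user_class_level cells. The sketch below, on a hypothetical three-row profile with a shortened column name city_level, shows what that implies for one-hot encoding: the missing value makes the column float, so the generated dummy columns carry a .0 suffix, which matches names such as city_2.0 used in the aggregation code of this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy profile; NaN stands for an empty new_user_class_level cell.
# "city_level" is a shortened, illustrative column name.
profile = pd.DataFrame({
    "userid": [234, 523, 612],
    "city_level": [3.0, 2.0, np.nan],
})

dummies = pd.get_dummies(profile, columns=["city_level"], prefix=["city"])
print(list(dummies.columns))  # ['userid', 'city_2.0', 'city_3.0']

# The user with a missing city tier gets no dummy set at all.
print(dummies.loc[2, ["city_2.0", "city_3.0"]].sum())  # 0
```

Passing dummy_na=True to get_dummies would add an explicit indicator column for the missing values instead of leaving the row all zeros.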

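Dataset License

Data provider: Alimama. Alimama is the commercial digital marketing platform of Alibaba Group. Drawing on the group's core commercial data and its media matrix, it provides customers with full-funnel consumer operations solutions, with the aim of making commercial marketing simpler and more efficient.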

License: Creative Commons Attribution-ShareAlike 4.0 International

Citation Requirements

References and related publications:
1) Gai K, Zhu X, Li H, et al. Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction[J]. arXiv preprint arXiv:1704.05194, 2017.
2) Zhou G, Song C, Zhu X, et al. Deep Interest Network for Click-Through Rate Prediction[J]. arXiv preprint arXiv:1706.06978, 2017. https://arxiv.org/abs/1706.06978

3. Analysis Example

Below we analyze the Taobao display ad click-through rate dataset provided by Alibaba as an example.

Import libraries

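import numpy as np
import pandas as pd
from datetime import datetime
import pytz
# Time zone objects used when converting timestamps
cst = pytz.timezone('Asia/Shanghai')
utc = pytz.utc

# Jupyter magic: render plots inline (remove this line when running as a plain script)
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  # apply the default seaborn theme
matplotlib.rcParams['figure.dpi'] = 120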

Load ad_feature.csv

feature_df = pd.read_csv('data/ad_feature.csv')
feature_df.head(2)

View summary statistics

# Row count and the number of distinct values in each ID column
print('Ads: %d' % len(feature_df))
print('Categories: %d' % feature_df['cate_id'].nunique())
print('campaigns: %d' % feature_df['campaign_id'].nunique())
print('Advertisers : %d' % feature_df['customer'].nunique())
print('Brands : %d' % feature_df['brand'].nunique())

Output:

Ads: 846811
Categories: 6769
campaigns: 423436
Advertisers : 255875
Brands : 99814

Load user_profile.csv

user_df = pd.read_csv('data/user_profile.csv')
user_df.head(2)

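Load raw_sample_1m.csv

The example loads only 1 million records and converts the Unix timestamps to China Standard Time (Asia/Shanghai), stored in the time_CST column.

n_rows = 1000 * 1000  # read only the first 1 million records
raw_df = pd.read_csv('data/raw_sample_1m.csv', nrows=n_rows)
# time_stamp holds Unix seconds; convert to timezone-aware Asia/Shanghai datetimes
raw_df['time_CST'] = pd.to_datetime(raw_df['time_stamp'], unit='s', utc=True)\
                       .dt.tz_convert('Asia/Shanghai')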

Data analysis

Check the click-through rate

ratio = raw_df['clk'].mean()  # clk is 0/1, so its mean is the click-through rate
print('Only about {:.2f}% of results are "click"'.format(ratio*100))

Output:

Only about 4.69% of results are "click"
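The same calculation can be broken down by scenario (pid). The sketch below uses hypothetical toy impressions, with the two pid values that appear in the sample data, and also checks that clk and nonclk are complements as the field descriptions state.

```python
import pandas as pd

# Hypothetical toy impressions; only the pid values come from the sample data.
impressions = pd.DataFrame({
    "pid": ["430548_1007", "430548_1007", "430548_1007", "430539_1007", "430539_1007"],
    "clk": [0, 1, 0, 0, 0],
})
impressions["nonclk"] = 1 - impressions["clk"]

# The two flags are complements, so their sum is 1 on every row.
assert (impressions["clk"] + impressions["nonclk"]).eq(1).all()

# Per-scenario impression count and click-through rate.
ctr_by_pid = impressions.groupby("pid")["clk"].agg(impressions="count", ctr="mean")
print(ctr_by_pid)
```

Reporting the impression count next to each rate makes it easier to judge how reliable a per-group CTR is.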

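Checking some intuitions

Target audiences differ across demographic groups (gender, age);
Price and brand correlate with income (and also with city tier, occupation, and so on);
Users are more likely to click on items of the same or a similar category or brand.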
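As a minimal sketch of how the first intuition could be checked, the snippet below joins a toy click log to a toy profile table and compares CTR by gender. All values are hypothetical. A left join is used because user_profile covers fewer users than raw_sample, so some impressions have no profile.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
logs = pd.DataFrame({"user": [1, 1, 2, 2, 3, 4], "clk": [1, 0, 0, 0, 1, 0]})
users = pd.DataFrame({"userid": [1, 2, 3], "final_gender_code": [1, 2, 2]})

# Left join keeps impressions from users without a profile (user 4 here).
joined = logs.merge(users, how="left", left_on="user", right_on="userid")

# dropna=False keeps the impressions with unknown gender as their own group.
ctr_by_gender = (
    joined.groupby("final_gender_code", dropna=False)["clk"]
          .agg(impressions="count", clicks="sum", ctr="mean")
)
print(ctr_by_gender)
```

A gap between groups on real data would still need a significance check before it is read as a genuine difference in audience.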

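The code below preprocesses, joins, and aggregates the data to obtain per-brand user statistics, and then keeps the brands that meet a given condition.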

# One-hot encode the categorical profile columns.
# Note: the last column name ends with a space, as written in the original code.
profile_cols = ['final_gender_code', 'age_level', 'shopping_level', 'occupation', 'new_user_class_level ']
user_df_oh = pd.get_dummies(user_df[['userid'] + profile_cols],
                            columns=profile_cols,
                            prefix=['gender', 'age', 'shopping', 'student', 'city'])

# Attach the user dummies and the ad attributes to every impression.
raw_joined = raw_df.join(user_df_oh.set_index('userid'), on='user')\
                   .join(feature_df[['adgroup_id', 'cate_id', 'brand', 'price']].set_index('adgroup_id'), on='adgroup_id')

# 'count' on user gives the number of impression rows; 'sum' on a dummy gives the rows with that attribute.
f = {'user': ['count'], 'gender_1': ['sum'], 'age_1': ['sum'], 'age_2': ['sum'], 'age_3': ['sum'], 'age_4': ['sum'],
     'age_5': ['sum'], 'age_6': ['sum'], 'shopping_2': ['sum'], 'shopping_3': ['sum'], 'student_1': ['sum'],
     'city_2.0': ['sum'], 'city_3.0': ['sum'], 'city_4.0': ['sum']}

raw_by_brand_all = raw_joined.groupby(['brand']).agg(f)
raw_by_brand = raw_joined.groupby(['brand', 'clk']).agg(f)
# Flatten the two-level column index, keeping only the original column names.
raw_by_brand.columns = raw_by_brand.columns.get_level_values(0)
raw_by_brand_all.columns = raw_by_brand_all.columns.get_level_values(0)

# Brands with more than 5000 impression rows
raw_by_brand_all[raw_by_brand_all['user'] > 5000].head()['user']

Output:

brand
353787.0    12195
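Brand 353787 has 12,195 rows in the sample. Because the aggregation applies count to the user column, this number counts ad impressions rather than distinct users.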
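The difference between counting rows and counting distinct users can be seen on a small example. The toy rows below are hypothetical; nunique is a standard pandas aggregation, but using it here is a suggestion rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical toy rows: user 7 saw ads of brand 100 three times.
rows = pd.DataFrame({
    "brand": [100.0, 100.0, 100.0, 100.0, 200.0],
    "user":  [7, 7, 7, 8, 9],
})

# count = number of impression rows, nunique = number of distinct users.
per_brand = rows.groupby("brand")["user"].agg(impressions="count", distinct_users="nunique")
print(per_brand)
```

For brand 100 the two columns differ (4 impressions, 2 distinct users), which is the gap that a plain count hides.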

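Next, filter the rows of one brand (raw_by_brand['brand'] == 220468), compute the ratios, and select specific columns to obtain a new DataFrame that contains the ratio information.

shopping_2_ratio and shopping_3_ratio are the shares of rows that come from users at shopping depth 2 and 3. Each ratio is computed within one (brand, clk) group: shopping_2_ratio is the shopping_2 sum divided by the group's row count (the user column), and shopping_3_ratio is the shopping_3 sum divided by the same row count.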

raw_by_brand.reset_index(inplace=True)  # turn brand and clk back into ordinary columns
temp = raw_by_brand[raw_by_brand['brand'] == 220468].copy()
temp.columns  # inspect the available columns

cols = ['city_2.0', 'city_3.0', 'age_3', 'age_4', 'gender_1', 'student_1', 'age_5', 'age_1', 'age_2', 'city_4.0',
        'age_6', 'shopping_3', 'shopping_2']

# Convert each count into a share of the group's rows.
for col in cols:
    temp[col + '_ratio'] = temp[col] / temp['user']
temp.columns  # the *_ratio columns are now present

temp[['shopping_2_ratio', 'shopping_3_ratio']]

Output:

    shopping_2_ratio    shopping_3_ratio
13440    0.045582    0.897966
13441    0.081633    0.884354

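The two columns (shopping_2_ratio and shopping_3_ratio) show, for brand 220468, the share of impressions that went to users at shopping depth 2 and 3. The two rows come from grouping by brand and clk; since groupby sorts its keys, row 13440 is presumably clk = 0 and row 13441 is clk = 1. For this brand, most impressions went to shopping_3 (deep) users, about 88% to 90%, while shopping_2 users account for a much smaller share, about 4.56% to 8.16%. Comparing such ratios across brands shows how their audiences differ.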
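The ratio calculation can also be written without a loop. The sketch below uses hypothetical counts for a single brand, with one row per clk value, and divides all count columns by the row count in one step.

```python
import pandas as pd

# Hypothetical counts for one brand, one row per clk value.
counts = pd.DataFrame({
    "clk": [0, 1],
    "user": [200, 50],
    "shopping_2": [10, 4],
    "shopping_3": [180, 44],
})

cols = ["shopping_2", "shopping_3"]
# Divide each count column by the row's impression count, aligned on the index.
ratios = counts[cols].div(counts["user"], axis=0).add_suffix("_ratio")
result = pd.concat([counts[["clk", "user"]], ratios], axis=1)
print(result)
```

Keeping the user column next to the ratios is a deliberate choice: with a low overall click rate, the clicked row usually rests on far fewer impressions, so its ratios are noisier than those of the non-clicked row.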

4. References

Ad Display/Click Data on Taobao.com

Taobao display ad click-through rate analysis (Taobao display ad CTR prediction dataset), CSDN blog

5. Getting the Case Package

You must log in before you can download the file package.
