Taobao 2022 Search Dataset and Algorithm Competition

Collection: AI Cases - NLP - Retail
Competition: Alibaba2022 - Ali Lingjie (阿里灵杰) - Wentian Engine E-commerce Search Algorithm Competition
Homepage: https://tianchi.aliyun.com/specials/promotion/opensearch
AI problem: semantic similarity recognition
Dataset: query-to-product corpus mappings from real Taobao search business scenarios
Dataset value: improves the precision and intelligence of e-commerce search and optimizes product ranking so users find what they need faster
Solution: natural-language query understanding, semantic matching, spelling correction, synonym expansion, etc.

I. Competition Description

Background

Catalyzed by the pandemic, global e-commerce and online retail have entered a period of rapid growth over the past year. As an important purchase entry point for online shopping, search behavior reflects strong purchase intent, and search quality directly determines final transaction outcomes. In the AI era, building intelligent search capabilities to improve online GMV conversion has therefore become an important research topic for e-commerce developers. (GMV conversion refers to the process, or efficiency, of turning potential GMV, such as browsing or add-to-cart behavior, into actual transactions.) This competition centers on e-commerce search algorithms: using Wentian Engine (问天引擎), Alibaba's self-developed high-performance distributed search engine providing an industrial-grade e-commerce search platform, participants can iterate on search algorithms quickly without having to build the full retrieval pipeline themselves.

Goals

  • Improve the precision and intelligence of e-commerce search and optimize product ranking so users find what they need faster; reduce the display of irrelevant or low-quality products to raise conversion rates.
  • Solve long-tail query matching. User queries may be imprecise (typos, vague descriptions), and the algorithm must infer the user's real intent. For example, a user who types "苹果手机克" instead of "苹果手机壳" (iPhone case) should still get the correct results.
  • Optimize multimodal search. Combine text, images, user behavior, and other multimodal signals to improve relevance. For example, for the query "红色连衣裙" (red dress), the system should factor in color, style, and user preferences.
  • Handle cold start. How can newly listed or niche products be matched accurately to search requests?
  • Combine personalization with search. Different users may mean different things by the same query ("苹果" may be the fruit or an Apple phone), so ranking should incorporate user history.

Task

The evaluation data come from real Taobao search traffic. The product corpus was randomly sampled by product category to ensure diversity, and the query-product pairs were extracted from click logs and verified by a model plus manual review to ensure the accuracy of the training and test data. The competition has a preliminary round (vector-based recall) and a final round (re-ranking models). Submission format: the evaluation upload must contain exactly two files, named doc_embedding and query_embedding; the file names are fixed.

II. Dataset Description

The data files are described below:

Corpus: corpus.tsv

  • Description: the corpus, about 1 million documents (docs) randomly sampled from Taobao product-search title data.
  • Format: doc_id is numbered from 1; title is the product title.

Sample data:

1 铂盛弹盖文艺保温杯学生男女情侣车载时尚英文锁扣不锈钢真空水杯
2 可爱虎子华为荣耀X30i手机壳荣耀x30防摔全包镜头honorx30max液态硅胶虎年情侣女卡通手机套插画呆萌个性创意
3 190色素色亚麻棉平纹布料 衬衫裙服装定制手工绣花面料 汇典亚麻
4 松尼合金木工开孔器实木门开锁孔木板圆形打空神器定位打孔钻头
5 微钩绿蝴蝶材料包非成品 赠送视频组装教程 需自备钩针染料
6 春秋薄绒黑色打底袜女外穿高腰显瘦大码胖mm纯棉踩脚一体连袜裤
7 New Balance/NB时尚长款过膝连帽保暖羽绒服女外套NCNPA/NPA46032
8 2021博洋高级l拉舍尔云毯结婚庆毛毯子冬季加厚保暖被子珊瑚
9 玉手牌平安无事牌天然翡翠a货男女款调节编织玉手链冰种玉石手串
10 欧货加绒拼接开叉纽扣烟管裤女潮2021秋季高腰显瘦九分直筒牛仔裤

Training queries: train.query.txt

  • Description: training-set queries; the training set contains about 100,000 queries.
  • Format: query_id is numbered from 1; query is a query string extracted from search logs.

Sample data:

1 unidays
2 溪木源樱花奶盖身体乳
3 除尘布袋工业
4 双层空气层针织布料
5 4812锂电
6 鈴木雨燕方向機總成
7 福特翼搏1.5l变速箱电脑模块
8 a4红格纸
9 岳普湖驴乃
10 婴儿口罩0到6月医用专用婴幼儿

Query-doc relevance: qrels.train.tsv

  • Description: mappings between training-set queries and corpus docs; about 100,000 pairs.
  • Format: query_id \t doc_id. The pairs come from search click logs, with manual annotation confirming high query-doc relevance; they are used to train the model.

Sample data:

1 28
2 37
3 51
4 52
5 77

Test queries: dev.query.txt

  • Description: test-set queries; the test set contains 1,000 queries.
  • Format: query_id \t query. Training-set ids start from 1 and test-set ids start from 200001; query is a query string extracted from search logs.

Sample data:

200001 鈴木雨燕方向機總成
200002 福特翼搏1.5l变速箱电脑模块
200003 a4红格纸
200004 岳普湖驴乃
200005 婴儿口罩0到6月医用专用婴幼儿

Note: columns in all competition data files are separated by tab characters (\t).

Dataset License

CC BY-NC-SA 4.0
https://creativecommons.org/licenses/by-nc-sa/4.0/deed.zh-hans

III. Sample Solution

Environment

conda create --name gensim432-p10 python=3.10
conda activate gensim432-p10
conda install -c conda-forge gensim=4.3.2
conda install -c conda-forge jieba
conda install -c conda-forge pandas tqdm joblib scikit-learn
conda install -c conda-forge ipywidgets


conda list gensim
# Name                   Version                   Build Channel
gensim                   4.3.2           py312h2ab9e98_1   conda-forge

conda list scipy
# packages in environment at C:\AppData\Conda-Data\envs\gensim432-p10:
#
# Name                   Version                   Build Channel
scipy                     1.15.2         py310h15c175c_0   conda-forge

# Compatibility note: scipy >= 1.11.0 removed scipy.linalg.triu, which gensim 4.3.2 still depends on.
# Downgrade scipy to 1.10.1 (compatible with gensim 4.3.2):
conda install -c conda-forge scipy=1.10.1

conda list jieba
jieba                     0.42.1             pyhd8ed1ab_1   conda-forge

Imports

import numpy as np
import pandas as pd
import os
import jieba
from gensim.models import Word2Vec
from tqdm import tqdm_notebook
from joblib import Parallel, delayed
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings

Code Structure

1. Load the datasets

corpus_data = pd.read_csv( "./data/corpus.tsv", sep="\t", names=["doc", "title"])
dev_data = pd.read_csv("./data/dev.query.txt", sep="\t", names=["query", "title"])
train_data = pd.read_csv("./data/train.query.txt", sep="\t", names=["query", "title"], on_bad_lines='skip')
qrels = pd.read_csv("./data/qrels.train.tsv", sep="\t", names=["query", "doc"])

Four main data files are loaded:

  • corpus.tsv: product documents (doc_id, title)
  • train.query.txt / dev.query.txt: training/test queries (query_id, query_text)
  • qrels.train.tsv: query-doc relevance labels (query_id, doc_id)

Preprocessing

  • Tokenize titles and queries with jieba (Chinese word segmentation)
  • Speed up tokenization with parallel processing (Parallel with n_jobs=4)

2. Train the word vector model

if os.path.exists("word2vec.model"):
    model = Word2Vec.load("word2vec.model")
else:
    model = Word2Vec(
        sentences=list(corpus_title) + list(train_title) + list(dev_title),
        vector_size=128,
        window=5,
        min_count=1,
        workers=4,
    )
    model.save("word2vec.model")

  • Train word vectors with gensim's Word2Vec
  • Parameters:
    • vector_size=128
    • window=5
    • min_count=1
    • workers=4
  • Training data: product titles plus the training and test query texts
  • The model is saved for reuse (word2vec.model)

3. Train the TF-IDF model

TF-IDF (Term Frequency-Inverse Document Frequency) is a technique widely used in natural language processing (NLP) and information retrieval to estimate how important a term is within a document collection or corpus.

TF (Term Frequency)

  • Definition: the number of times a term occurs in a document.
  • Formula: TF(t, d) = (occurrences of term t in document d) / (total number of terms in d)
  • Purpose: measures how common a term is within a single document. For example, in the document "苹果 香蕉 苹果", TF("苹果") = 2/3.

IDF (Inverse Document Frequency)

  • Definition: measures how rare a term is across the collection; rare terms get higher IDF values.
  • Formula: IDF(t) = log(total number of documents / number of documents containing term t)
  • Purpose: down-weights common terms and highlights informative ones. For example, if "的" appears in every document, its IDF approaches 0.
idf = TfidfVectorizer(analyzer=lambda x: x)
idf.fit(train_title + corpus_title)
  • Uses sklearn's TfidfVectorizer
  • analyzer=lambda x: x passes the pre-tokenized token lists straight through
  • The fitted IDF weights are used later for feature selection and weighting

4. Sentence encoding

def unsuper_w2c_encoding(s, pooling="max"):
    # drop filtered tokens, then pool the remaining word vectors
    corpus_query_word = [x for x in s if x not in drop_token_ids]
    if len(corpus_query_word) == 0:
        return np.zeros(128)

    feat = model.wv[corpus_query_word]

    if pooling == "max":
        return np.array(feat).max(0)
    if pooling == "avg":
        return np.array(feat).mean(0)


corpus_mean_feat = [
    unsuper_w2c_encoding(s) for s in tqdm_notebook(corpus_w2v_ids[:1000])
]
corpus_mean_feat = np.vstack(corpus_mean_feat)

train_mean_feat = [
    unsuper_w2c_encoding(s) for s in tqdm_notebook(train_w2v_ids[:100])
]
train_mean_feat = np.vstack(train_mean_feat)

dev_mean_feat = [
    unsuper_w2c_encoding(s) for s in tqdm_notebook(dev_w2v_ids[:100])
]
dev_mean_feat = np.vstack(dev_mean_feat)

Feature selection

  • Filter out tokens with IDF below 10, i.e., very common, uninformative tokens
  • Manually add extra tokens to filter, such as "领券" ("claim coupon")
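
A minimal sketch of how such a drop list could be assembled. The function name `build_drop_tokens` and the toy IDF weights are hypothetical; the original notebook builds `drop_token_ids` from the fitted vectorizer:

```python
def build_drop_tokens(idf_weights, threshold=10.0, extra=("领券",)):
    # drop very common tokens (low IDF) plus a hand-maintained stop list
    drop = {tok for tok, w in idf_weights.items() if w < threshold}
    drop.update(extra)
    return drop

# toy IDF weights for illustration only
idf_weights = {"的": 1.1, "包邮": 3.2, "连衣裙": 12.3, "手机壳": 11.8}
drop_token_ids = build_drop_tokens(idf_weights)
# the common tokens "的"/"包邮" and the manual entry "领券" are dropped
```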

Sentence encoding

  • The unsuper_w2c_encoding function converts a sentence into a vector:
    1. Drop filtered tokens (drop_token_ids)
    2. Look up each remaining token's Word2Vec vector
    3. Aggregate the word vectors with max pooling or average pooling
  • Both product titles and query texts are encoded this way

5. Initial retrieval

with open('query_embedding', 'w') as up:
    for id, feat in zip(dev_data.index, dev_mean_feat):
        # format each component explicitly: the original str(x)[:6] truncation
        # corrupts values printed in scientific notation
        # (e.g. str(-7.98e-05)[:6] == '-7.984')
        up.write('{0}\t{1}\n'.format(id, ','.join(['{:.4f}'.format(x) for x in feat])))

with open('doc_embedding', 'w') as up:
    for id, feat in zip(corpus_data.index, corpus_mean_feat):
        up.write('{0}\t{1}\n'.format(id, ','.join(['{:.4f}'.format(x) for x in feat])))

Recall strategy

Vector normalization

from sklearn.preprocessing import normalize

Vectors are L2-normalized with sklearn's normalize.

Similarity computation

  • The dot product of query and document vectors is used as the similarity score
  • Documents are ranked by similarity to produce the recall list
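
The two steps can be illustrated on toy vectors (the data are made up; after L2 normalization the dot product equals cosine similarity):

```python
import numpy as np
from sklearn.preprocessing import normalize

query_feat = normalize(np.array([[1.0, 2.0, 0.0]]))
doc_feat = normalize(np.array([[1.0, 2.0, 0.0],    # same direction as the query
                               [0.0, 1.0, 0.0],
                               [-1.0, 0.0, 1.0]]))

scores = query_feat @ doc_feat.T        # shape (1, n_docs), cosine similarities
ranking = np.argsort(-scores[0])        # doc indices, best match first
```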

Evaluation metric

  • MRR (Mean Reciprocal Rank) is computed to evaluate recall quality
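
MRR averages the reciprocal rank of the first relevant document over all queries; a minimal sketch (the function name and data are illustrative, not from the original notebook):

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    # ranked_lists: per query, doc ids ordered best-first
    # relevant: per query, the single relevant doc id (as in qrels.train.tsv)
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant):
        for pos, doc in enumerate(ranked, start=1):
            if doc == rel:
                total += 1.0 / pos   # contributes 1/rank of the first hit
                break                # a query with no hit contributes 0
    return total / len(ranked_lists)

mrr = mean_reciprocal_rank([[28, 3, 5], [9, 37, 1]], [28, 37])
# → (1/1 + 1/2) / 2 = 0.75
```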

Results

As required by the preliminary round, the following two files are produced:

  • doc_embedding: corpus embeddings; the 1 million corpus docs converted by the participant's recall model into vectors of at most 128 dimensions.
1   -0.037,0.0108,-0.060,-0.186,0.0266,0.0462,-0.009,0.0775,0.0973,0.1036,-0.017,-0.006,-0.072,0.0730,-0.147,-0.172,0.0933,-0.225,0.0040,0.1069,-0.016,-0.088,0.2392,-0.063,0.0169,-0.042,-0.152,0.0553,-0.124,0.0892,-0.069,-0.034,-7.984,-0.084,0.0314,-0.180,0.0322,0.0436,0.1317,0.0579,-0.060,-0.079,-0.043,0.0537,-0.028,-0.056,0.1506,0.0064,-0.032,-0.020,-0.087,0.0825,0.0893,0.1984,-0.154,-0.003,-0.038,0.1687,0.0446,0.0083,-0.076,0.0243,-0.109,0.0611,-0.098,0.0050,0.0072,0.1201,-0.075,0.0904,0.0551,-0.068,0.1408,0.0436,0.1182,-0.026,-0.037,0.0700,0.1961,-0.022,-0.005,-0.060,-0.072,0.0082,-0.057,-0.046,-0.161,0.0796,-0.065,-0.085,-0.100,0.0031,0.0610,0.0271,0.0047,-0.038,0.0936,0.0360,0.0981,0.0505,0.0166,-0.053,-0.131,-0.183,0.0073,0.0852,0.0748,-0.047,0.1050,0.1488,-0.151,-0.006,0.0927,-0.007,-0.145,-0.063,-0.008,0.1193,0.0094,0.0487,-0.015,-0.034,-0.009,0.0637,0.0987,0.0756,0.0128,-0.036
2 0.0108,0.0239,0.1327,0.0258,0.0823,0.0572,0.0535,-0.010,0.0198,0.1242,0.0568,0.1003,0.0508,-0.005,0.0952,0.1271,0.0112,0.1636,0.1389,0.0592,0.0664,0.1742,0.0022,-0.023,-0.019,0.0610,0.0292,0.1737,0.1185,0.0200,0.0545,0.1662,0.0453,0.0586,0.0060,0.0211,0.0696,0.0039,0.0236,0.0976,0.0017,0.0746,0.0702,0.0100,0.1110,0.0422,-0.017,0.0167,0.0621,0.1251,0.0977,0.0578,0.0591,0.0731,-0.019,0.0160,0.0853,0.1814,-0.001,0.0822,0.0806,0.0354,0.0025,-0.001,0.1832,0.0015,0.0419,0.0423,0.0589,-0.008,0.0607,0.0020,-0.006,-0.002,0.0142,-0.011,0.0690,0.1823,0.0094,0.3633,-0.018,-0.007,0.1711,0.2120,0.2525,0.0935,0.0464,0.0748,0.0655,0.1513,-0.002,-0.010,0.0394,0.0931,0.0346,0.0956,0.1043,0.0014,0.0604,-0.009,-0.012,0.1373,0.0082,0.0251,0.0255,0.0255,0.0503,0.0263,0.0851,-0.009,0.1332,0.0118,0.0082,0.0676,0.0035,-0.000,0.1004,0.1003,0.0488,0.3012,0.0049,-0.002,-0.009,0.0428,-0.012,0.0282,0.0080,0.0052
3 0.0555,0.0059,0.0445,0.2261,0.1566,0.1399,0.0001,0.0250,0.0010,0.2106,0.1796,0.0221,-0.009,0.0207,0.1171,0.0909,0.1104,0.0621,0.0439,0.0093,0.0368,-0.001,0.1604,0.0186,0.0198,0.0039,0.1088,0.1899,0.0520,0.0934,0.0767,-0.007,0.0756,0.0533,0.0043,0.1245,0.1046,0.0517,0.2007,-0.000,0.1741,0.0392,0.0171,0.0102,-0.002,0.0393,-0.003,0.0199,-0.004,0.0411,0.1674,0.1213,0.0235,0.2684,-0.004,0.0188,-0.002,0.0592,-0.004,0.0023,0.0844,0.1195,0.0755,0.0247,0.0507,0.1384,0.1259,0.0786,0.0866,0.0390,0.0119,0.0043,0.0792,-0.006,0.0238,0.0547,0.0029,0.2538,0.0405,0.1645,0.0394,0.0786,0.0145,0.1162,0.0196,0.0148,0.0323,0.0025,-0.002,0.0116,0.1574,0.0024,0.0158,0.0973,0.0578,0.0641,0.0001,-0.009,0.1074,0.0335,-0.016,0.0037,0.0201,0.0066,0.1696,0.1011,-0.009,0.1108,0.0115,0.1697,0.0995,0.0021,0.2287,0.0177,0.1628,-0.008,0.1075,0.0050,0.0467,-0.003,0.0816,0.0141,0.0042,0.0428,0.0023,0.0086,-0.005,0.0113
  • query_embedding: test-set embeddings; the 1,000 test queries converted by the participant's recall model into vectors of at most 128 dimensions.
200001  0.0584,-0.010,0.0847,0.1275,0.1650,-0.014,0.0692,-0.001,0.0375,0.1136,0.2321,-0.016,-0.021,-0.008,0.1570,0.2328,-0.000,0.0146,-0.033,0.1298,0.1979,0.0927,-0.015,-0.003,-0.021,0.0804,-0.016,0.1946,0.0509,-0.011,-1.238,0.0040,-0.002,0.0424,0.0032,0.0812,0.1875,-0.000,0.1247,0.0084,0.0702,0.1478,-0.007,-0.030,0.1271,0.0623,-0.026,-0.008,0.0412,0.1681,0.1187,0.0519,0.0448,0.1579,0.0003,0.0235,0.1668,-0.006,-0.029,-0.028,-0.020,0.1007,0.0773,-0.015,0.2344,-0.013,0.0839,0.0419,0.0027,-0.005,0.0883,-0.011,-0.030,-0.001,0.0230,-0.024,-0.011,0.0203,-0.035,0.2484,-0.019,-0.006,0.1601,0.2025,0.0466,0.1101,0.2108,-0.006,0.0938,0.2353,0.0492,-0.034,0.0552,0.0648,0.1057,0.0577,-0.016,-0.012,-0.010,-0.009,-0.027,-0.009,-0.001,0.0490,0.1329,0.0138,-0.012,0.0106,0.1588,-0.013,0.0498,0.0058,0.0432,0.0050,0.0014,0.0038,0.0900,0.0802,0.0731,0.0476,-0.005,-0.030,-0.033,0.1101,-0.003,-0.021,0.0009,0.0925
200002 0.0506,-0.029,0.2861,0.0941,0.1468,-0.022,0.0768,0.0004,-0.007,0.0740,0.2206,-0.018,0.0007,-0.010,0.2014,0.0590,-0.014,0.0047,-0.026,0.0796,0.1709,0.1767,0.0010,-0.015,-0.004,0.1267,0.0018,0.2055,0.0238,-0.017,-0.000,0.0831,0.0124,0.0496,0.0076,-0.000,0.1349,0.0020,0.1217,0.0016,-0.002,0.1456,0.0065,0.0488,0.0689,0.0344,-0.008,-0.000,0.0052,0.1989,0.1017,0.0487,-0.004,0.0088,-0.007,0.0743,0.2140,-0.004,-0.036,0.0279,8.4322,0.0015,0.0822,-0.002,0.1392,-0.000,0.0024,0.0397,0.0052,-0.004,0.1630,-0.009,-0.033,-0.019,0.0184,-0.019,-0.006,-0.006,-0.018,0.1630,-0.016,0.0012,0.1622,0.3004,0.1054,0.1677,0.1869,-0.001,0.0314,0.2971,0.0308,-0.009,0.1172,-0.004,0.1696,0.0504,-0.004,-0.033,0.0483,-0.012,-0.022,-0.006,-0.007,0.0233,0.0681,-0.001,0.0094,0.0180,0.0370,-0.019,0.0084,-0.017,0.0091,0.1716,-0.022,0.0032,0.0441,0.0510,0.1043,0.0129,-0.014,-0.003,-0.014,0.0669,-0.025,0.0048,-0.007,0.0055
200003 0.0279,-0.036,0.0131,0.0705,0.1110,-0.029,-0.019,-0.051,0.1161,0.0463,0.0027,-0.113,-0.047,-0.194,0.0380,0.1982,-0.111,-0.023,-0.166,-0.005,0.1529,0.1717,0.0456,-0.142,-0.053,0.0241,-0.046,0.1103,0.1892,0.0693,-0.039,-0.025,-0.043,-0.086,0.0273,0.0338,0.0406,-0.094,0.0101,-0.011,0.0197,0.0009,-0.045,-0.104,-0.074,0.1697,-0.087,-0.109,0.0168,0.0042,0.0742,-0.056,0.0153,0.0730,-0.094,-0.086,0.1073,0.0178,-0.089,0.0574,-0.077,-0.111,0.0180,-0.054,0.0682,-0.045,-0.015,0.0402,-0.005,0.0848,-0.027,-0.185,-0.078,-0.080,0.0075,-0.099,0.0225,0.0341,-0.156,0.1968,-0.063,-0.056,0.1211,0.1891,0.0772,0.0823,0.0537,-0.031,0.0900,0.1801,0.0150,-0.016,0.0553,0.0944,0.1389,0.0416,0.0321,-0.090,-0.075,-0.084,-0.018,0.1620,-0.070,0.1436,0.0590,-0.171,0.0365,0.0309,-0.007,-0.010,0.1056,-0.118,0.1003,-0.147,0.0525,0.0399,0.1226,0.0182,0.0742,-0.044,-0.040,-0.108,-0.046,-0.016,-0.088,-0.141,-0.015,-0.013

Source Code License

GPL-v3
https://zhuanlan.zhihu.com/p/608456168

IV. Getting the Case Bundle

The file bundle can be downloaded only after logging in.
