DCIC 2020 Fishing Vessel BeiDou Dataset and Smart Ocean Construction Algorithm Competition

Collection: ML-Transportation

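I. Competition Description

Competition: Digital China Innovation Contest (DCIC) 2020 Algorithm Competition, Smart Ocean Construction
Organizer: Organizing Committee of the Digital China Summit
Homepage: https://tianchi.aliyun.com/competition/entrance/231768/introduction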

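Background

The 2020 Digital China Innovation Contest (DCIC 2020), themed "Cultivating new drivers of the digital economy and advancing the new development of Digital China", runs several tracks in parallel under a "4+1" structure: four tracks (Digital Government, Smart Healthcare, Kunpeng Computing, and Cybersecurity) plus a track for primary and secondary school students. It aims to be a top-tier contest with nationwide influence.

As one of the main tracks of the contest, the Digital Government track takes the digital upgrade of government services in Fujian Province as its starting point. It centers on government big data and focuses on four areas: smart ocean, government services, smart communities, and urban management. The track is application-oriented, brings together top technical talent from around the world, and seeks innovative applications that combine advanced artificial intelligence with government affairs. Its goals are to use big data to improve the capacity and quality of government, to accelerate the formation of new digital-economy business models in Fuzhou and across Fujian Province, and to continue to give substance to the national "Digital China" strategy.

The Digital Government track adopts a "1+3" dual-competition format, split into two arenas: the "Intelligent Algorithm Competition" and the "Innovative Application Competition". Teams from this track that reach the grand final will be invited to the DCIC 2020 grand final, held during the 3rd Digital China Summit, where they will give a public roadshow presentation.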

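Task

This problem belongs to the Intelligent Algorithm Competition and is a set-topic contest on the theme "Smart ocean construction: empowering the modernization of maritime safety governance". The first requirement for improving maritime safety governance is to "see clearly", that is, to know "what it is, who is using it, and what it is doing". Communication and navigation equipment such as vessel collision-avoidance terminals (AIS) and BeiDou positioning terminals has made maritime traffic and operations far more convenient. At the same time, improper use of device information has led to heavy loss of life and property, which poses new challenges for maritime safety governance.

The task is to identify maritime targets and analyze their operating behavior from position data. Participants analyze the position data reported by fishing vessels' BeiDou devices and infer each vessel's fishing operation, specifically whether it is trawling, purse seining, or drift gillnetting. Participants are also encouraged to use visual data analysis to uncover further value in maritime communication and navigation equipment.

The competition is open mainly to individual participants such as university faculty and students, researchers, and enterprise developers; it aims to attract participation from across society and to encourage broad-based innovation.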

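II. Dataset Description

Data description

The preliminary round provides 11,000 fishing-vessel BeiDou data records. Each contains the anonymized vessel ID, coordinates (longitude and latitude), report time, speed, and heading. Because real conditions at sea are complex, signal loss and device faults are common, which leads to incorrect reported coordinates, missing reports, and in some cases devices that report at an excessive rate.

Data sample: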

渔船ID	x	y	速度	方向	time	type
1102	6283649.656204367	5284013.963699763	3	12.1	0921 09:00	围网
1	6076254.189784355	5061742.567340344	3.99	278	1110 11:40	拖网
1	6077380.014386079	5061818.947242365	4.26	257	1110 11:33	拖网
6337	6336123.942614388	5376720.822300289	0.0	115	1123 23:58	围网
6337	6336123.942614388	5376720.822300289	0.0	0	1123 23:48	围网

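- 渔船ID (vessel ID): unique identifier of the fishing vessel; the result file is keyed by this ID
- x: x-coordinate of the vessel in the planar coordinate system
- y: y-coordinate of the vessel in the planar coordinate system
- 速度 (speed): the vessel's speed at the current moment, in knots
- 方向 (heading): the vessel's heading at the current moment, in degrees
- time: time at which the record was reported, in the format MMDD HH:MM (month, day, hour:minute)
- type: the vessel's label, i.e. its operation type: 拖网 (trawl), 围网 (purse seine), or 刺网 (gillnet)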
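
As a minimal sketch of how records in this layout might be loaded, the snippet below parses the sample rows above with pandas and computes the gap between consecutive reports of each vessel. It assumes comma-separated input with the column names shown in the sample (the actual delimiter and file layout are not stated here), and the helper name `load_track` is hypothetical. Because the time field carries no year, the parsed timestamps fall back to the default year 1900, which is harmless as long as only time differences are used.

```python
import io
import pandas as pd

# Sample rows copied from the table above; the comma-separated layout is an assumption.
SAMPLE = """渔船ID,x,y,速度,方向,time,type
1102,6283649.656204367,5284013.963699763,3,12.1,0921 09:00,围网
1,6076254.189784355,5061742.567340344,3.99,278,1110 11:40,拖网
1,6077380.014386079,5061818.947242365,4.26,257,1110 11:33,拖网
6337,6336123.942614388,5376720.822300289,0.0,115,1123 23:58,围网
6337,6336123.942614388,5376720.822300289,0.0,0,1123 23:48,围网
"""

def load_track(source):
    """Read vessel records and return them sorted by vessel ID and report time."""
    df = pd.read_csv(source)
    # The time field has no year, so the parsed year defaults to 1900.
    df["time"] = pd.to_datetime(df["time"], format="%m%d %H:%M")
    return df.sort_values(["渔船ID", "time"]).reset_index(drop=True)

tracks = load_track(io.StringIO(SAMPLE))
# Seconds between consecutive reports of the same vessel (NaN for each vessel's first row).
tracks["dt_s"] = tracks.groupby("渔船ID")["time"].diff().dt.total_seconds()
print(tracks[["渔船ID", "time", "速度", "dt_s"]])
```

Sorting by time first matters because, as the sample shows, rows within a file are not necessarily in chronological order.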

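The raw data has been anonymized: vessel information is hidden, and the precision and position of coordinates and other fields have been transformed and offset. Participants may study domain knowledge about purse seining, gillnetting, and trawling to help with data processing.

AIS data from vessel collision-avoidance terminals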

ais_id	lon	lat	船速	航向	time
110	119.6705	26.5938	3	12.1	0921 09:00

ais_id: unique identifier of the AIS device. The 船速 and 航向 columns hold the vessel's speed and course.

Dataset license

CC BY-NC-SA 4.0
https://creativecommons.org/licenses/by-nc-sa/4.0/deed.zh-hans

III. Sample Solution

How it works

LightGBM (Light Gradient Boosting Machine)

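A gradient boosting framework proposed by Microsoft Research Asia (MSRA), designed for efficient handling of large-scale data and high-dimensional features.

Key strengths: fast training, low memory usage, native support for categorical features, and regularization options that help limit overfitting.
Typical use cases: recommender systems, ad click-through-rate prediction, financial risk control, and other machine learning tasks that demand high performance.
Background:
- Earlier gradient boosting frameworks (such as XGBoost) can be computationally expensive on large datasets.
- LightGBM improves performance markedly through a histogram-based decision tree algorithm and optimizations such as Gradient-based One-Side Sampling (GOSS).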

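Core principles

Histogram-based decision tree algorithm:
- Continuous feature values are discretized into a fixed number of histogram bins, which reduces memory usage and computation.
- For example, the feature values [1.2, 3.4, 5.6] are mapped to the bin indices [0, 1, 2], where each bin covers a range of values.
Gradient-based One-Side Sampling (GOSS):
- When computing information gain, all samples with large gradients (for example the top 20%) are kept, together with a random subset of the small-gradient samples. The sampled small-gradient samples are up-weighted to compensate, so computation drops while the gain estimate stays close to the full-data value.
Exclusive Feature Bundling (EFB):
- Mutually exclusive sparse features (features that are rarely nonzero at the same time) are merged into a single feature, which lowers dimensionality and speeds up training.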
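
The first two ideas can be illustrated with a few lines of NumPy. This is a simplified sketch, not LightGBM's internal implementation: `bin_feature` and `goss_sample` are hypothetical helper names, equal-width bin edges are an assumption made for brevity, and the 20% / 10% rates are illustrative choices.

```python
import numpy as np

def bin_feature(values, n_bins):
    """Map continuous values to integer bin indices using equal-width bin edges."""
    values = np.asarray(values, dtype=float)
    # Keep interior edges only, so the indices run from 0 to n_bins - 1.
    edges = np.linspace(values.min(), values.max(), n_bins + 1)[1:-1]
    return np.digitize(values, edges)

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by gradient-based one-side sampling."""
    gradients = np.asarray(gradients, dtype=float)
    n = len(gradients)
    n_top = int(n * top_rate)
    n_other = int(n * other_rate)
    order = np.argsort(-np.abs(gradients))   # largest |gradient| first
    top_idx = order[:n_top]                  # large-gradient rows are always kept
    rng = np.random.default_rng(seed)
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)
    weights = np.ones(n_top + n_other)
    # Up-weight the sampled small-gradient rows so that their total contribution
    # approximates that of all small-gradient rows.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return np.concatenate([top_idx, other_idx]), weights

print(bin_feature([1.2, 3.4, 5.6], n_bins=3))
idx, w = goss_sample(np.arange(100, dtype=float))
print(len(idx), w[:3], w[-3:])
```

With 100 rows and these rates, 30 rows are used per split search instead of 100: the 20 with the largest gradients at weight 1, and 10 randomly chosen others at weight (1 - 0.2) / 0.1 = 8.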

Runtime environment

External library	Version
python	3.12.8
sklearn-compat	0.1.3
lightgbm	3.3.0

Source code structure

1. Load the dataset

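import glob
import pandas as pd
from joblib import Parallel, delayed

# read_feat is not shown in this excerpt; it is assumed to turn one vessel file into one feature row.
train_feat = Parallel(n_jobs=10)(delayed(read_feat)(path, True)
                                 for path in glob.glob('./data/hy_round1_train_20200102/*'))
train_feat = pd.DataFrame(train_feat)

test_feat = Parallel(n_jobs=10)(delayed(read_feat)(path, False)
                                for path in glob.glob('./data/hy_round1_testA_20200102/*'))
test_feat = pd.DataFrame(test_feat)
# Sort by column 0 so that the rows written out later are in a fixed order.
test_feat = test_feat.sort_values(by=0)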
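
The `read_feat` helper called above is not included in this excerpt. The following is a hypothetical stand-in, assuming one CSV file per vessel with the columns shown in the data sample. It returns `[id, label, features...]` for training files and `[id, features...]` for test files, a layout inferred from how the later steps slice `train_feat` and `test_feat`. The four summary statistics per column are illustrative only and do not reproduce the original feature set.

```python
import os
import tempfile
import pandas as pd

# Inverse of the index-to-label mapping used when the submission file is written.
LABELS = {"围网": 0, "刺网": 1, "拖网": 2}
STAT_COLS = ["x", "y", "速度", "方向"]

def read_feat(path, is_train):
    """Summarise one vessel file as [id, (label,) feature_1, ..., feature_k]."""
    df = pd.read_csv(path)
    row = [int(df["渔船ID"].iloc[0])]
    if is_train:
        row.append(LABELS[df["type"].iloc[0]])
    for col in STAT_COLS:
        # Four simple statistics per column; the real feature set is not shown.
        row.extend([df[col].mean(), df[col].std(), df[col].min(), df[col].max()])
    return row

# Tiny demonstration file built from two sample rows of vessel 1.
demo = (
    "渔船ID,x,y,速度,方向,time,type\n"
    "1,6076254.189784355,5061742.567340344,3.99,278,1110 11:40,拖网\n"
    "1,6077380.014386079,5061818.947242365,4.26,257,1110 11:33,拖网\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "1.csv")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(demo)
    train_row = read_feat(path, True)
    test_row = read_feat(path, False)
print(len(train_row), len(test_row))
```

Returning plain lists keeps the function cheap to run in parallel workers; `pd.DataFrame` then assembles them into a table with integer column names, which is why the later code refers to columns by position.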

2. Define and train the LightGBM model

params = {
    'learning_rate': 0.01,
    'min_child_samples': 5,
    'max_depth': 7,
    'lambda_l1': 2,
    'boosting': 'gbdt',
    'objective': 'multiclass',
    'n_estimators': 2000,
    'metric': 'multi_error',
    'num_class': 3,
    'feature_fraction': .75,
    'bagging_fraction': .85,
    'seed': 99,
    'num_threads': 20,
    'verbose': -1
}

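import lightgbm as lgb

# Column 1 of train_feat is passed as the label and columns 2+ as features;
# for test_feat, columns 1+ are the features (column 0 appears to be the vessel ID).
# run_oof and skf are not shown in this excerpt.
train_pred, test_pred = run_oof(lgb.LGBMClassifier(**params),
                                train_feat.iloc[:, 2:].values,
                                train_feat.iloc[:, 1].values,
                                test_feat.iloc[:, 1:].values,
                                skf)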

Output:

LGBMClassifier(bagging_fraction=0.85, boosting='gbdt', feature_fraction=0.75,
               lambda_l1=2, learning_rate=0.01, max_depth=7,
               metric='multi_error', min_child_samples=5, n_estimators=2000,
               num_class=3, num_threads=20, objective='multiclass', seed=99,
               verbose=1)
[LightGBM] [Warning] feature_fraction is set=0.75, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.75
[LightGBM] [Warning] boosting is set=gbdt, boosting_type=gbdt will be ignored. Current value: boosting=gbdt
[LightGBM] [Warning] lambda_l1 is set=2, reg_alpha=0.0 will be ignored. Current value: lambda_l1=2
[LightGBM] [Warning] bagging_fraction is set=0.85, subsample=1.0 will be ignored. Current value: bagging_fraction=0.85
[LightGBM] [Warning] feature_fraction is set=0.75, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.75
[LightGBM] [Warning] boosting is set=gbdt, boosting_type=gbdt will be ignored. Current value: boosting=gbdt
[LightGBM] [Warning] lambda_l1 is set=2, reg_alpha=0.0 will be ignored. Current value: lambda_l1=2
[LightGBM] [Warning] bagging_fraction is set=0.85, subsample=1.0 will be ignored. Current value: bagging_fraction=0.85
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003714 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 6205
[LightGBM] [Info] Number of data points in the train set: 6300, number of used features: 33
[LightGBM] [Warning] feature_fraction is set=0.75, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.75
[LightGBM] [Warning] boosting is set=gbdt, boosting_type=gbdt will be ignored. Current value: boosting=gbdt
[LightGBM] [Warning] lambda_l1 is set=2, reg_alpha=0.0 will be ignored. Current value: lambda_l1=2
[LightGBM] [Warning] bagging_fraction is set=0.85, subsample=1.0 will be ignored. Current value: bagging_fraction=0.85
[LightGBM] [Info] Start training from score -1.462798
[LightGBM] [Info] Start training from score -1.927197
[LightGBM] [Info] Start training from score -0.473438
Training until validation scores don't improve for 500 rounds
...
--------------------------------------------------
Train0.90835_Test0.65570
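
The `run_oof` helper and the `skf` splitter used in the call above are not defined in this excerpt. As a rough sketch of what an out-of-fold routine of this shape might look like, the version below depends only on NumPy and on two duck-typed interfaces: a model with `fit` / `predict_proba` and a splitter with `split(X, y)`. The names `PriorClassifier` and `SimpleKFold` are stand-ins invented for the demonstration; in the original they would presumably be `lgb.LGBMClassifier` and a stratified K-fold splitter. The early stopping visible in the log is omitted here.

```python
import numpy as np

def run_oof(clf, X_train, y_train, X_test, kf):
    """Fit on each fold's training part, predict its held-out part,
    and average the test-set probabilities over all folds."""
    n_class = len(np.unique(y_train))
    train_pred = np.zeros((X_train.shape[0], n_class))
    test_pred = np.zeros((X_test.shape[0], n_class))
    n_folds = 0
    for fit_idx, val_idx in kf.split(X_train, y_train):
        clf.fit(X_train[fit_idx], y_train[fit_idx])
        # Each training row is predicted by a model that never saw it.
        train_pred[val_idx] = clf.predict_proba(X_train[val_idx])
        test_pred += clf.predict_proba(X_test)
        n_folds += 1
    return train_pred, test_pred / n_folds

class PriorClassifier:
    """Stand-in model that predicts the class frequencies seen during fit."""
    def fit(self, X, y):
        self.prior_ = np.bincount(y, minlength=3) / len(y)
        return self
    def predict_proba(self, X):
        return np.tile(self.prior_, (X.shape[0], 1))

class SimpleKFold:
    """Stand-in splitter exposing a split(X, y) generator of index pairs."""
    def __init__(self, n_splits):
        self.n_splits = n_splits
    def split(self, X, y=None):
        idx = np.arange(X.shape[0])
        for k in range(self.n_splits):
            val = idx[k::self.n_splits]
            yield np.setdiff1d(idx, val), val

rng = np.random.default_rng(0)
X, y = rng.normal(size=(30, 4)), np.repeat([0, 1, 2], 10)
oof, test = run_oof(PriorClassifier(), X, y, rng.normal(size=(5, 4)), SimpleKFold(5))
print(oof.shape, test.shape)
```

The out-of-fold matrix gives an honest estimate of generalization on the training set, while averaging the per-fold test probabilities acts as a small ensemble.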

3. Predict the operation type and write the output

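# Pick the most probable class per vessel and map the class index back to its label
# (围网 = purse seine, 刺网 = gillnet, 拖网 = trawl).
test_feat['label'] = np.argmax(test_pred, axis=1)
test_feat['label'] = test_feat['label'].map({0: '围网', 1: '刺网', 2: '拖网'})
# Write column 0 and the label without an index column or header row.
test_feat[[0, 'label']].to_csv('submit.csv', index=False, header=False)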

Sample rows of the output file submit.csv are shown in the table below:

渔船ID	type
7000	围网
7001	拖网
7002	拖网
7003	刺网
7004	拖网
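
Before uploading, a quick sanity check on the file can catch formatting slips. The sketch below is one possible check, assuming a headerless two-column `ID,label` file as produced by the code above; the function name `check_submission` and the specific rules (unique IDs, labels limited to the three operation types) are assumptions, not official submission requirements.

```python
import io
import pandas as pd

ALLOWED = {"围网", "刺网", "拖网"}

def check_submission(source):
    """Return a list of problems found in a headerless ID,label file (empty if none)."""
    df = pd.read_csv(source, header=None, names=["id", "label"])
    problems = []
    if df["id"].duplicated().any():
        problems.append("duplicate vessel IDs")
    bad = set(df["label"]) - ALLOWED
    if bad:
        problems.append(f"unknown labels: {sorted(bad)}")
    return problems

good = "7000,围网\n7001,拖网\n7002,拖网\n7003,刺网\n7004,拖网\n"
bad = "7000,围网\n7000,渔网\n"
print(check_submission(io.StringIO(good)))
print(check_submission(io.StringIO(bad)))
```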

Source code license

GPL-v3
https://zhuanlan.zhihu.com/p/608456168

IV. Get the Case Package

You must log in before the file package can be downloaded.