iFLYTEK 2022: Text Classification and Query-Based Question Answering on Paper Abstracts

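Summary:

Collection: AI Cases - NLP - Media Industry
Competition: Text Classification and Query-Based Question Answering on Paper Abstracts
Organizer: Shenyang Pharmaceutical University
Homepage: http://challenge.xfyun.cn/topic/info?type=abstract
AI problem: text classification
Dataset: a collection of papers with title, authors, citation, abstract, DOI, and Topic (label) fields.
Dataset value: supports paper classification.
Solution: TF-IDF features, which weight how important a term is within a document collection or corpus, fed to a linear classifier; and a fine-tuned bert-base-uncased model.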

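I. Task Description

Background

Reading the relevant literature is an essential part of studying artificial intelligence. Finding the important, relevant papers quickly and efficiently in a vast library of publications is therefore the first difficulty a learner has to solve.

Task

The system reads a paper's abstract and related information and assigns the paper to a category.

Input:

Paper information in the following format: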

Sensors (Basel). 2022 May 20;22(10):3899. doi: 10.3390/s22103899.

Smart Sensorization Using Propositional Dynamic Logic.

Merino S(1), Burrieza A(2), Guzman F(3), Martinez J(1).

Author information:

(1)Department of Applied Mathematics, University of Malaga, 29071 Malaga, Spain.

(2)Department of Philosophy, University of Malaga, 29071 Malaga, Spain.

(3)Department of Electrical Engineering, University of Malaga, 29071 Malaga, Spain.

The current high energy prices pose a serious challenge, especially in the domestic economy. In this respect, one of the main problems is obtaining domestic hot water. For this reason, this article develops a heating system applied to a conventional water tank in such a way as to minimize the necessary energy supply by converting it, under certain circumstances, into atmospheric. For this purpose, the domotic system has been equipped with sensors that automate the pressurization of the compartment and solenoid valves that regulate the external water supply. This design, to which different level sensors are applied, sends the information in real time to an artificial intelligence system, by means of deductive control, which recognizes the states of the system. This work shows the introduction of an extension of propositional dynamic logic in the field of energy efficiency. Thanks to this formalism, a qualitative control of the program variables is achieved by incorporating qualitative reasoning tools. On the otherhand, it solves preventive maintenance systems through the early detection of faults in the installation. This research has led to the patenting of an intelligent domestic hot water system that considerably reduces energy consumption by setting disjointed heating intervals that, powered by renewable or non-renewable sources, are controlled by a propositional dynamic logic.

DOI: 10.3390/s22103899

PMID: 35632307

Output: 电气 (Electrical)

II. Dataset Description

Data description

The training and test sets are CSV files. The fields are Title, Authors, Citation, Abstract, DOI, and Topic(Label).

Sample record:

Title: The Value of First-trimester Maternal Abdominal Visceral Adipose Tissue Thickness in Predicting the Subsequent Development of Gestational Diabetes Mellitus
Authors: [‘Seyhmus Tunc’, ‘Suleyman Cemil Oglak’, …]
Citation: 2022 Jun;32(6):722-727.
Abstract: Objective: To examine the performance of first-trimester visceral (pre-peritoneal), subcutaneous, and total adipose tissue thickness (ATT) to …
DOI: doi: 10.29271/jcpsp.2022.06.722.
Topic(Label): Abdominal+Fat

Dataset license

CC BY-NC-SA 4.0
https://creativecommons.org/licenses/by-nc-sa/4.0/deed.zh-hans

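III. Sample Solution

How it works

1. TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a technique widely used in natural language processing (NLP) and information retrieval to measure how important a term is within a document collection or corpus.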

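TF (Term Frequency)
  • Definition: how often a term occurs in a document, relative to the document's length.
  • Formula: $\mathrm{TF}(t, d) = \dfrac{\text{number of occurrences of } t \text{ in } d}{\text{total number of terms in } d}$
  • Purpose: measures how common a term is within a single document. For example, in the document "apple banana apple", the TF of "apple" is $2/3$.
IDF (Inverse Document Frequency)
  • Definition: measures how rare a term is across the document collection. Rare terms have higher IDF values.
  • Formula: $\mathrm{IDF}(t) = \log\left(\dfrac{\text{total number of documents}}{\text{number of documents containing } t}\right)$
  • Purpose: down-weights common terms and highlights distinctive ones. For example, a function word such as "the" that appears in every document has an IDF of 0 under this formula.
The TF-IDF weight of a term in a document is the product of the two: $\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)$.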
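
The TF and IDF definitions can be checked by hand. The following is a minimal sketch in plain Python, not the implementation used by scikit-learn; the helper names `tf`, `idf`, and `tf_idf` and the three toy documents are assumptions made for illustration. Note that `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its values differ from these textbook ones.

```python
import math

# Hypothetical toy corpus, already split into tokens.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["banana", "apple", "date"],
]

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(total docs / docs containing `term`).

    Assumes `term` occurs in at least one document of `corpus`.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    """TF-IDF weight: the product of the two quantities above."""
    return tf(term, doc) * idf(term, corpus)

print(tf("apple", docs[0]))            # 2 of the 3 tokens are "apple"
print(idf("banana", docs))             # "banana" is in every document
print(tf_idf("apple", docs[0], docs))  # "apple" is in 2 of the 3 documents
```

A term that occurs in every document, such as "banana" here, gets an IDF of 0 and therefore contributes nothing to the TF-IDF vector, no matter how often it appears.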

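2. BERT-base-Uncased

  • bert-base-uncased is an English pre-trained BERT model released by Google. It is based on the Transformer architecture and is uncased: text is lowercased before pre-training.
    • Pre-training: trained on a large corpus made up of BooksCorpus (800 million words) and English Wikipedia (2.5 billion words) to learn general-purpose language representations.
    • Model size:
      • Base version: 12 Transformer layers, hidden size 768, 12 attention heads, about 110M parameters.
      • Compared with bert-large-uncased, it has fewer layers and fewer parameters, which makes it a practical choice for general-purpose tasks.
  • Key characteristics:
    • Input text is converted to lowercase (uncased), which simplifies preprocessing.
    • Bidirectional context modeling, learned through the Masked Language Model task.
    • Suitable for English NLP tasks such as classification, NER, and question answering.
  • Text classification with bert-base-uncased comes down to fine-tuning: the model's pre-trained language understanding is adapted to the class boundaries of the specific task. The main advantage is that the model does not have to be trained from scratch and can often reach strong performance with a relatively small amount of labeled data.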
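
The "about 110M parameters" figure can be reproduced from the architecture numbers. The sketch below is plain arithmetic, not a call into any library. It assumes the standard bert-base-uncased configuration values that the text does not state (a vocabulary of 30,522 tokens, 512 positions, 2 segment types, and a feed-forward size of 3,072), and it counts the encoder with its pooler but without a task-specific classification head.

```python
# Values given in the text
hidden, layers = 768, 12
# Assumed standard bert-base-uncased configuration values
vocab, positions, segments, ffn = 30522, 512, 2, 3072

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Token, position, and segment embeddings, followed by a LayerNorm
embeddings = (vocab + positions + segments) * hidden + layer_norm

# One Transformer layer. The number of attention heads does not change
# the count, because the heads split the same 768-wide projections.
per_layer = (
    4 * linear(hidden, hidden)  # query, key, value, and output projections
    + layer_norm
    + linear(hidden, ffn)       # feed-forward expansion
    + linear(ffn, hidden)       # feed-forward contraction
    + layer_norm
)

pooler = linear(hidden, hidden)

total = embeddings + layers * per_layer + pooler
print(f"{total:,}")  # 109,482,240
```

Under these assumptions roughly 78% of the parameters sit in the 12 Transformer layers and most of the rest in the token embedding table.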

Runtime environment

Library          Version
python           3.12.3
sklearn-compat   0.1.3
torch            2.5.1
transformers     4.49.0

Importing the libraries

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer  # tokenizer and vocabulary
# transformers.AdamW is deprecated; torch.optim.AdamW is the usual replacement in newer versions
from transformers import AutoModelForSequenceClassification, AdamW

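Workflow

The solution implements a text classification system based on paper abstracts using two different approaches: TF-IDF with a linear classifier, and a BERT deep learning model.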

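1. Data preparation

Loading and preprocessing

  • Load the training set (train.csv) and the test set (test.csv) from CSV files
  • The training set contains the title, authors, citation, abstract, DOI, and category label
  • The test set has no category label and is used for prediction

Data cleaning

  • Strip leading and trailing whitespace from the text (strip())
  • Replace NaN values in the abstract with an empty string
  • Concatenate title and abstract into a single text field
  • Convert all text to lowercase

Label handling

  • Use pd.factorize() to convert the text labels into integer codes
  • Keep the label mapping so that predictions can be converted back to label text

train_df = pd.read_csv('./data/train.csv', sep=',')
test_df = pd.read_csv('./data/test.csv', sep=',')

# Encode labels as integers; lbl holds the original label text for each code
train_df['Topic(Label)'], lbl = pd.factorize(train_df['Topic(Label)'])
# Clean the training text and build one lowercase field from title and abstract
train_df['Title'] = train_df['Title'].apply(lambda x: x.strip())
train_df['Abstract'] = train_df['Abstract'].fillna('').apply(lambda x: x.strip())
train_df['text'] = train_df['Title'] + ' ' + train_df['Abstract']
train_df['text'] = train_df['text'].str.lower()

# Apply the same cleaning to the test set
test_df['Title'] = test_df['Title'].apply(lambda x: x.strip())
test_df['Abstract'] = test_df['Abstract'].fillna('').apply(lambda x: x.strip())
test_df['text'] = test_df['Title'] + ' ' + test_df['Abstract']
test_df['text'] = test_df['text'].str.lower()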
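
The label round trip through pd.factorize() can be illustrated on a few values. This is a small sketch on a hypothetical four-row label column, not part of the original solution. It also shows that a missing label is encoded as -1, which is a reason to drop unlabeled rows before training, as the BERT listing does.

```python
import pandas as pd

# Hypothetical label column; the last row has no label.
raw = pd.Series(["Fasting", "Inflammation", "Fasting", None])

# codes: one integer per row; uniques: the label text for each code
codes, uniques = pd.factorize(raw)
print(codes.tolist())    # [0, 1, 0, -1]
print(uniques.tolist())  # ['Fasting', 'Inflammation']

# Decoding: index into uniques, skipping the -1 sentinel for missing labels.
# Without the check, uniques[-1] would silently return the last label.
decoded = [uniques[c] for c in codes if c != -1]
print(decoded)           # ['Fasting', 'Inflammation', 'Fasting']
```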

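2. Training the TF-IDF model

Feature extraction

  • Use TfidfVectorizer to convert the text into TF-IDF feature vectors
  • Cap the vocabulary at 2,500 features to keep the dimensionality under control

Classifier

  • Use SGDClassifier (a linear classifier trained with stochastic gradient descent)
  • Evaluate with 5-fold cross-validation; accuracy is about 88-89%

Prediction and output

  • Predict on the test set
  • Convert the numeric predictions back to the original label text
  • Save the predictions to submit.csv

tfidf = TfidfVectorizer(max_features=2500)
train_tfidf = tfidf.fit_transform(train_df['text'])

# 5-fold cross-validation accuracy on the training set
clf = SGDClassifier()
scores = cross_val_score(clf, train_tfidf, train_df['Topic(Label)'], cv=5)
print(scores)
test_tfidf = tfidf.transform(test_df['text'])

# Refit on the full training set and predict the test labels
clf = SGDClassifier()
clf.fit(train_tfidf, train_df['Topic(Label)'])
test_df['Topic(Label)'] = clf.predict(test_tfidf)

# Map the integer codes back to label text and write the submission file
test_df['Topic(Label)'] = test_df['Topic(Label)'].apply(lambda x: lbl[x])
test_df[['Topic(Label)']].to_csv('submit.csv', index=False)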
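
One detail worth noting: the listing fits the vectorizer on the full training set before cross-validating, so every validation fold has already influenced the vocabulary and IDF weights. A possible variant, sketched below on a hypothetical eight-document toy corpus, wraps both steps in a scikit-learn pipeline so that the vectorizer is refit inside each fold. This is an alternative arrangement, not the author's code, and its effect on the reported 88-89% accuracy is not measured here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: two classes, four short documents each.
texts = [
    "gut microbiome bacteria diversity",
    "intestinal bacteria and the microbiome",
    "microbiome composition in the gut",
    "bacteria colonize the intestinal tract",
    "neural network image classification",
    "deep learning model for classification",
    "machine learning with neural models",
    "training a deep neural network",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline refits TfidfVectorizer on the training part of every fold,
# so the held-out documents never contribute to the IDF statistics.
pipeline = make_pipeline(
    TfidfVectorizer(max_features=2500),
    SGDClassifier(random_state=0),
)

scores = cross_val_score(pipeline, texts, labels, cv=2)
print(scores)         # one accuracy value per fold
print(scores.mean())
```

The same pipeline object can then be fit on all training texts and used with predict() on the raw test texts, which removes the separate transform step.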

3. Training the BERT model

Model initialization

  • Use the pre-trained bert-base-uncased model
  • Add a 12-way classification output layer (this assumes the data has 12 categories)

Data encoding

  • Encode the text with the BERT tokenizer
  • Set the maximum length to 512 tokens (the maximum input length of bert-base-uncased)
  • Truncate and pad automatically

Dataset wrapper

  • Define a custom XunFeiDataset class that inherits from torch.utils.data.Dataset
  • Wrap the encoded data in a PyTorch DataLoader for batch processing

Training

  • Use the AdamW optimizer with a learning rate of 1e-5
  • Record and print the loss during training
  • Clip gradients on every batch (max_norm=1.0) to prevent exploding gradients
  • Train for 1 epoch (about 2,060 batches)

Prediction and output

  • Predict on the test set
  • Take the index of the largest logit as the predicted class
  • Convert the numeric predictions back to the original label text
  • Save the predictions to bert_submit.csv

train_df = pd.read_csv('./data/train.csv', sep=',')
test_df = pd.read_csv('./data/test.csv', sep=',')
# Drop rows without a label, then encode the labels as integers
train_df = train_df[~train_df['Topic(Label)'].isnull()]
train_df['Topic(Label)'], lbl = pd.factorize(train_df['Topic(Label)'])

train_df['Title'] = train_df['Title'].apply(lambda x: x.strip())
train_df['Abstract'] = train_df['Abstract'].fillna('').apply(lambda x: x.strip())
train_df['text'] = train_df['Title'] + ' ' + train_df['Abstract']
train_df['text'] = train_df['text'].str.lower()

test_df['Title'] = test_df['Title'].apply(lambda x: x.strip())
test_df['Abstract'] = test_df['Abstract'].fillna('').apply(lambda x: x.strip())
test_df['text'] = test_df['Title'] + ' ' + test_df['Abstract']
test_df['text'] = test_df['text'].str.lower()

# Load the tokenizer from a local copy of the model and encode both sets
tokenizer = AutoTokenizer.from_pretrained('./bert-base-uncased')
train_encoding = tokenizer(train_df['text'].to_list(), truncation=True, padding=True, max_length=512)
test_encoding = tokenizer(test_df['text'].to_list(), truncation=True, padding=True, max_length=512)

# Dataset wrapper around the tokenizer output
class XunFeiDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    # Return a single sample as a dict of tensors
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx])
                for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = XunFeiDataset(train_encoding, train_df['Topic(Label)'].to_list())
# The test set has no labels, so placeholder zeros are passed instead
test_dataset = XunFeiDataset(test_encoding, [0] * len(test_df))

# Batch the individual samples
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False)

# Accuracy: fraction of rows whose argmax matches the label
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# num_labels=12 assumes the training data has 12 categories
model = AutoModelForSequenceClassification.from_pretrained('./bert-base-uncased', num_labels=12)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optimizer
optim = AdamW(model.parameters(), lr=1e-5)
total_steps = len(train_loader) * 1  # batches per epoch x 1 epoch; only needed if a scheduler is added
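
To make the effect of truncation=True, padding=True, and max_length in the tokenizer call concrete, here is a simplified sketch in plain Python. The helper `pad_and_truncate`, the token ids, and the pad id of 0 are hypothetical and for illustration only; a real tokenizer also keeps its special start and end tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Three hypothetical token-id sequences of length 5, 2, and 8
batch = [[11, 12, 13, 14, 15], [21, 22], [31, 32, 33, 34, 35, 36, 37, 38]]
enc = pad_and_truncate(batch, max_length=6)

for ids, mask in zip(enc["input_ids"], enc["attention_mask"]):
    print(ids, mask)
# [11, 12, 13, 14, 15, 0] [1, 1, 1, 1, 1, 0]
# [21, 22, 0, 0, 0, 0] [1, 1, 0, 0, 0, 0]
# [31, 32, 33, 34, 35, 36] [1, 1, 1, 1, 1, 1]
```

Because the listing tokenizes each dataset in a single call, every example is padded to the longest sequence in that whole dataset (at most 512 tokens), not just to the longest in its batch of 8.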

Training the model

# Training function (reads the global `epoch` set by the training loop)
def train():
    model.train()
    total_train_loss = 0
    iter_num = 0
    total_iter = len(train_loader)

    for batch in train_loader:
        # Forward pass
        optim.zero_grad()

        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        total_train_loss += loss.item()

        # Backward pass, followed by gradient clipping
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Parameter update
        optim.step()
        # scheduler.step()

        iter_num += 1
        if (iter_num % 20 == 0):
            print("Epoch: %d, iter_num: %d, loss: %.4f, progress: %.2f%%" %
                  (epoch, iter_num, loss.item(), iter_num/total_iter*100))

    print("Epoch: %d, Average training loss: %.4f" % (epoch, total_train_loss/len(train_loader)))

# Validation function. Note: test_dataloader carries placeholder labels, so the
# metrics below are only meaningful when run on a labeled hold-out set.
def validation():
    model.eval()
    total_eval_accuracy = 0
    total_eval_loss = 0
    for batch in test_dataloader:
        with torch.no_grad():
            # Forward pass without gradient tracking
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

        loss = outputs[0]
        logits = outputs[1]

        total_eval_loss += loss.item()
        logits = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
        total_eval_accuracy += flat_accuracy(logits, label_ids)

    avg_val_accuracy = total_eval_accuracy / len(test_dataloader)
    print("----------------- Validation -----------------")
    print("Accuracy: %.4f" % (avg_val_accuracy))
    print("Average testing loss: %.4f" % (total_eval_loss/len(test_dataloader)))
    print("----------------------------------------------")

for epoch in range(1):
    print("---------------- Epoch: %d ----------------" % epoch)
    train()
    # validation()
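
The gradient clipping step in train() rescales all gradients together when their combined L2 norm exceeds max_norm. The sketch below shows that rule in plain Python on hypothetical gradient values. The helper `clip_by_global_norm` is an illustration written for this article, and the small eps term is an assumption about how division by zero is avoided, not a quote of the PyTorch source.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: gradients already within the limit are left unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

def global_norm(grads):
    return math.sqrt(sum(g * g for vec in grads for g in vec))

# Hypothetical gradients of two parameter tensors; joint norm is 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)           # 13.0
print(global_norm(clipped))  # just under 1.0

# Small gradients pass through untouched
small, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
print(small)                 # [[0.3, 0.4]]
```

Because every gradient is multiplied by the same factor, clipping changes the step size but not the update direction.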

Prediction

def prediction():
    model.eval()
    test_label = []
    for batch in test_dataloader:
        with torch.no_grad():
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            # The predicted class is the index of the largest logit
            pred = model(input_ids, attention_mask=attention_mask).logits
            test_label += list(pred.argmax(1).data.cpu().numpy())
    return test_label


test_predict = prediction()

# Map the integer codes back to label text and write the submission file
test_df['Topic(Label)'] = [lbl[x] for x in test_predict]
test_df[['Topic(Label)']].to_csv('bert_submit.csv', index=False)
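
The step from logits to label text, and the accuracy computed by flat_accuracy, can be traced on a tiny example. The logits, the true labels, and the three-label subset below are made-up values for illustration; only the label names are taken from the dataset.

```python
import numpy as np

# Hypothetical subset of the label mapping (code -> label text)
label_names = np.array(["Abdominal+Fat", "Fasting", "Inflammation"])

# Made-up logits for three documents, one column per class
logits = np.array([
    [0.2, 2.1, -1.0],
    [1.5, 0.1, 0.3],
    [-0.3, 0.0, 0.9],
])

# The predicted class is the column index of the largest logit in each row
pred_ids = logits.argmax(axis=1)
print(pred_ids.tolist())     # [1, 0, 2]

# Convert the integer codes back to label text
pred_labels = label_names[pred_ids]
print(pred_labels.tolist())  # ['Fasting', 'Abdominal+Fat', 'Inflammation']

# Accuracy against made-up true labels: the share of matching rows
true_ids = np.array([1, 0, 1])
accuracy = float((pred_ids == true_ids).mean())
print(accuracy)              # 2 of 3 predictions match
```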

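Comparison of the two approaches

  1. TF-IDF + SGDClassifier:
    • Simple and fast
    • Based on word-frequency statistics
    • About 88-89% accuracy in 5-fold cross-validation
    • Does not capture the context between words
  2. BERT model:
    • A pre-trained deep learning model
    • Captures the context between words
    • Requires more compute
    • Accuracy was not evaluated on a validation set (the validation call is commented out in the code)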

Run output

---------------- Epoch: 0 ----------------
Epoch: 0, iter_num: 20, loss: 2.6277, progress: 0.97%
Epoch: 0, iter_num: 40, loss: 2.4040, progress: 1.94%
...

Sample output: submit.csv

Topic(Label)
Gastrointestinal+Microbiome
Artificial+Intelligence
Gastrointestinal+Microbiome
Inflammation
Gastrointestinal+Microbiome
Inflammation
Artificial+Intelligence
Diabetes+Mellitus
Fasting
...

Sample output: bert_submit.csv

Topic(Label)
Gastrointestinal+Microbiome
Artificial+Intelligence
Gastrointestinal+Microbiome
Inflammation
Gastrointestinal+Microbiome
Inflammation
Artificial+Intelligence
Artificial+Intelligence
Abdominal+Fat
...

Source code license

GPL-v3
