Academic paper

      Data-driven method for fine-grained property alignment between Chinese open datasets
      (Original Chinese title: 数据驱动的细粒度中文属性对齐方法)

      Abstract:
      To improve property alignment between heterogeneous Chinese open datasets, a data-driven method for fine-grained alignment is proposed. It exploits the extension and domain information of properties to identify equivalence, subsumption and relevance relations between properties in a unified way. First, the data type of each property is determined using statistical theory, and a type-aware metric is given for computing property similarity. On that basis, property relation recognition is modeled as a multi-class classification problem, and effective features are extracted to describe the different relations and to build a random forest classifier. Experimental results show that the method reaches a precision of 94.6% in determining property data types, and F1 scores of 71.3%, 57.3% and 59.9% in recognizing equivalent, subsumptive and relevant properties, respectively. Compared with traditional approaches that focus only on equivalent properties, the fine-grained method both improves the precision of equivalent-property recognition and identifies subsumptive and relevant properties, which demonstrates its effectiveness on Chinese open datasets.
      Authors: Huang Tinglei (黄廷磊) [1], Zhang Weili (张伟莉) [2], Liang Xiao (梁霄) [1], Fu Kun (付琨) [1]
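      Affiliations:
      [1] Key Laboratory of Technology in Geo-spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190; Institute of Electronics, Chinese Academy of Sciences, Beijing 100190
      [2] Key Laboratory of Technology in Geo-spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190; Institute of Electronics, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049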
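      Year, volume (issue): 2017, 47(4)
      Classification number: TP182
      Online publication date: August 15, 2017
      Funding: National High Technology Research and Development Program of China (863 Program)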