
Assessing the Quality of Enterprise Characteristics and Online Job Advertisement Classification Derived from Web Data.pdf

Uploaded by: Fl****zo  ID: 718627  2025-06-22  19 pages  773.92 KB

Assessing the Quality of Enterprise Characteristics and Online Job Advertisements derived from Web Data
Ville Auno, Statistics Finland
Johannes Gussenbauer, Statistics Austria
WIN Conference, Gdansk 06/02/2025
Trusted Smart Statistics Web Intelligence Network

Introduction
 - Assessing the quality and usability of web scraped data for official statistics production was one of the tasks carried out in the Web Intelligence Network (WIN) project
 - Focus on two different data:
   - Online Job Advertisements (OJA)
   - Online-Based Enterprise Characteristics (OBEC)
 - Findings provide insights into the challenges and strengths of web scraped data

Quality Assessment of OJA Data
 - Quality of OJA data was assessed in two different ways:
   - Use of pre-defined quality indicators for source evaluation
   - Manual annotation exercises for evaluating classification accuracy
 - Quality indicators:
   - Number of relevant (500 OJAs) and very relevant (5000 OJAs) sources over time
   - Ranking of the relevant sources over time
   - Time series plots for number of OJAs for all very relevant sources
   - Stability of data over different versions of data

OJA: Quality indicators
 - Relevant sources
   - Fairly stable
   - Some fluctuation in Portugal, for example
 - Very relevant sources
   - Similar to relevant sources
   - Larger fluctuations in relative terms in smaller countries

First table. Columns: Year, AT, BG, DE, FI, FR, IT, NL, PL, PT, RO. The ten values of each row are fused into one digit run in the extracted text:
2018  19739638272312514
2019  18164410443229141024
2020  2116405413126122516
2021  2721549473434144421
2022  2021617563136155119
2023  1516435422826154414

Second table. Columns: Year, AT, BG, DE, FI, FR, IT, NL, PL, PT, RO. Values fused in the same way:
2018  7127224169722
2019  13531625231511411
2020  942842217128107
2021  842842521158148
2022  7526331191411169
2023  4421326171311126

OJA: Quality indicators
 - Stability of the relevant and very relevant sources was analyzed further:
   - Very relevant sources do not remain the same over the years
   - Relative significance of the sources varies from year to year
source
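The first quality indicator above, the number of relevant and very relevant sources per country and year, could be computed roughly as follows. This is only a sketch and not the WIN project's implementation: the record layout, the function name `count_sources`, and the sample data are hypothetical, and the slides give the thresholds 500 and 5000 without a comparison operator, so "at least" is assumed here, with very relevant sources also counting as relevant.

```python
from collections import defaultdict

# Thresholds named on the slide; treating them as ">=" is an assumption.
RELEVANT_MIN = 500
VERY_RELEVANT_MIN = 5000


def count_sources(records):
    """Count relevant and very relevant sources per (country, year).

    records: iterable of (country, year, source, n_ads) tuples
    (a hypothetical layout). Rows for the same source are summed first.
    """
    totals = defaultdict(int)
    for country, year, source, n_ads in records:
        totals[(country, year, source)] += n_ads

    counts = defaultdict(lambda: {"relevant": 0, "very_relevant": 0})
    for (country, year, _source), n_ads in totals.items():
        entry = counts[(country, year)]  # create the entry even if nothing qualifies
        if n_ads >= RELEVANT_MIN:
            entry["relevant"] += 1
        if n_ads >= VERY_RELEVANT_MIN:
            entry["very_relevant"] += 1
    return dict(counts)


# Hypothetical sample data, not taken from the presentation.
sample = [
    ("FI", 2022, "site_a", 6200),
    ("FI", 2022, "site_b", 800),
    ("FI", 2022, "site_c", 120),
    ("FI", 2023, "site_a", 300),
    ("FI", 2023, "site_a", 300),
]
counts_by_country_year = count_sources(sample)
```

Comparing these counts across years for one country gives the kind of stability view the slide describes, where a small absolute change is a large relative change in a small country.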

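This report examines the quality and usability of web scraped data for the production of official statistics, focusing on two kinds of data: online job advertisements (OJA) and online-based enterprise characteristics (OBEC). The key points are:
 1. OJA data quality assessment: the data were assessed with pre-defined quality indicators and manual annotation exercises. The stability of the data sources is a concern, and classification accuracy is not satisfactory. For example, the highest accuracy for 1-digit classification was 71.07% (Poland) and the lowest was 48.28% (Lithuania).
 2. OBEC data quality assessment: automatic and manual methods for detecting enterprise characteristics (such as URLs, social media links, and online shops) were fairly accurate, especially for URLs.
 - The stability and relevance of OJA data sources change over time, and the relative importance of the sources varies from year to year.
 - Manual annotation shows that the classification accuracy of OJA data is low, so the data are not suitable for direct use in official statistics.
 - OBEC data are comparatively accurate for URLs, social media links, and e-commerce.
In summary, web scraped data should be handled with caution when used for official statistics, and the quality of implementation differs markedly between countries. There is potential to combine the best practices of the individual countries into a single software package.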
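The 1-digit accuracy figures quoted in the summary can be read as the share of annotated advertisements whose automatically assigned code agrees with the manual annotation on its first digit. The sketch below illustrates that reading only. It assumes hierarchical codes stored as strings, and the function name `accuracy_at_level` and the sample codes are hypothetical. The source does not describe how its percentages were actually computed.

```python
def accuracy_at_level(pairs, digits):
    """Share of (predicted, annotated) code pairs that agree on the first `digits` characters."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no annotated pairs given")
    hits = sum(1 for predicted, annotated in pairs
               if predicted[:digits] == annotated[:digits])
    return hits / len(pairs)


# Hypothetical (predicted, manually annotated) codes, not data from the report.
annotated_sample = [
    ("2512", "2511"),  # agrees on the first three digits
    ("2512", "2166"),  # agrees on the first digit only
    ("5223", "5223"),  # full agreement
    ("4110", "3343"),  # no agreement
]

one_digit_accuracy = accuracy_at_level(annotated_sample, 1)
four_digit_accuracy = accuracy_at_level(annotated_sample, 4)
```

Accuracy can only stay the same or fall as more digits are compared, which is why the coarse 1-digit level is the most favourable of the levels one could report.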