Web Intelligence Network Conference: From Web to Data
4-5 February 2025, Gdansk, Poland

Quality Guidelines for acquiring and using web scraped data
ESSnet WIN, WP4
Magdalena Six, Alexander Kowarik, Manveer Mangat, Johannes Gussenbauer (AUT)

Outline
o Organisational background
o Statistical production process incl. web-data
o Theoretical Framework for Landscaping
o Examples of quality guidelines in the throughput phase
o Guidelines for a centralized webscraping platform

Organisational background
Subgroups of WP4 of ESSnet WIN
o Methodology
  - Deliverable 4.6: WP4 Methodology report on using webscraped data
o Architecture
  - Deliverable D4.7: BREAL - Big Data REference Architecture and Layers for web scraped data
o Quality
  - Deliverable 4.5: Quality Guidelines for acquiring and using web scraped data
o Quality Assessment
  - Deliverable 4.8: Quality Assessment for the Statistical Use of Web Scraped Data
All deliverables of WP4 at https:/

Quality-relevant processes along the production process

Spotlight: Landscaping
Definition: Landscaping refers to the cataloguing and measurement of all web-based data sources relevant for the topic of interest.
The effort of landscaping varies depending on the topic of interest:
o All needed data might be available on one website.
  - Example: satellite data
o The sheer number of existing websites, and the impossibility of scraping and combining them all, makes it necessary to select websites.
  - Examples: online job advertisements, real estate prices or price statistics
o All websites relevant to the topic of interest should be scraped; combining the ingested information is possible.
  - Example: enterprise characteristics

Landscaping: Selection of websites
Which websites to scrape?
- Most important ones? Highest quality?
- A score is needed
Three groups of information to take into account:
Information from the we