Selective scraping, sampling and other methods to minimize known causes of biases of web data
Web Intelligence Network Conference
Alexander Kowarik, Piet Daas
05 February 2025
Trusted Smart Statistics Web Intelligence Network

Overview
- Sampling in the Context of Webscraped Statistics
- Methods specific to web scraped data and causes of bias

Co-financed by Web Intelligence Network: 101035829 2020-PL-SmartStat

Contributions to deliverables by several colleagues: Olav ten Bosch, Jacek Maslankowski, Magdalena Six, Johannes Gussenbauer, Sonia Quaresma and more
All deliverables of WP4 at https:/

In Memoriam: Prof. dr. Piet Daas
- Methodology lead and
- Main author of "Deliverable 4.6: WP4 Methodology report on using webscraped data", on which this presentation is based.

Sampling what for?
- Sampling for Quality Assessment
- Estimation: Probability and Non-Probability Sampling
- Methodology for estimation and error estimation is very well developed, and we know sampling methodology
- Selective Scraping
- Optimized Scraping Strategy

Sampling for Quality Assessment
Why Sampling Matters in Quality Assessment:
- Labor-intensive nature of manual annotation.
- Need for high-quality, representative annotated datasets.
Optimization Strategies:
- Reducing annotation volume with strategic sampling.
- Ensuring representative marginal distributions.
More on this in the deliverable.

Probability Sampling
- Use probability sampling if the process of deriving a target variable is not easily scalable, e.g. when a statistical classification needs costly manual intervention.
- The situation is thus similar to a survey where each interview has a high cost and cannot be extended easily to the full population.
- There is a rich body of methodology developed for inference from random samples, from which a method for the sampling design and the applied estimation can be selected.

Non-Probability