Automating Multilingual SDG Event Extraction from Regional Portals Using Web Scraping and LangChain Frameworks
Bhawna Singla and Neha Bansal
Abstract
This research presents a scalable, multilingual data extraction framework for collecting and structuring Sustainable Development Goal (SDG) event information from regional portals and language-specific SDG websites. SDG-related activities are increasingly organized and reported by diverse stakeholders worldwide, from local governments to international NGOs. As a result, event data is dispersed across decentralized platforms, published in different languages, and presented in unstructured or semi-structured formats. Traditional data collection methods struggle to keep up with the volume, variability, and linguistic diversity of these sources.
To address these challenges, this study combines web scraping techniques with the LangChain framework, which integrates large language models (LLMs) for downstream natural language understanding tasks. The proposed automated pipeline performs end-to-end data extraction in four steps: it scrapes event content from HTML pages, detects the source language, applies automatic translation when necessary, and then uses prompt-based LLM reasoning to extract key event attributes (e.g., title, date, location, thematic focus).
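The four pipeline stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `html_to_text`, `extract_event`, `SDGEvent`, and `PROMPT_TEMPLATE` are hypothetical, the prompt wording is invented for the example, and the language detector, translator, and LLM are passed in as plain callables so that no particular library API is assumed. The sketch uses only the Python standard library; in a LangChain-based setup, the `llm` callable is where a chain or model client would be wrapped.

```python
import json
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Optional


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Stage 1: reduce a scraped HTML page to its visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


@dataclass
class SDGEvent:
    """Hypothetical record holding the attributes named in the abstract."""
    title: Optional[str]
    date: Optional[str]
    location: Optional[str]
    thematic_focus: Optional[str]
    source_language: str


# Assumed prompt wording, for illustration only.
PROMPT_TEMPLATE = (
    "Extract the SDG event described below. Reply with a JSON object with "
    "the keys title, date, location, thematic_focus; use null if unknown.\n\n"
    "Text:\n{text}"
)


def extract_event(
    html: str,
    detect_language: Callable[[str], str],
    translate: Callable[[str, str, str], str],
    llm: Callable[[str], str],
    pivot_language: str = "en",
) -> SDGEvent:
    """Run scrape -> detect -> translate (if needed) -> LLM extraction."""
    text = html_to_text(html)
    lang = detect_language(text)           # Stage 2: e.g. returns "es"
    if lang != pivot_language:             # Stage 3: translate only when needed
        text = translate(text, lang, pivot_language)
    raw = llm(PROMPT_TEMPLATE.format(text=text))  # Stage 4: prompt-based extraction
    fields = json.loads(raw)               # assumes the model returns valid JSON
    return SDGEvent(
        title=fields.get("title"),
        date=fields.get("date"),
        location=fields.get("location"),
        thematic_focus=fields.get("thematic_focus"),
        source_language=lang,
    )
```

Injecting the three model-dependent steps as callables keeps the control flow testable with stubs and lets the detector, translator, or LLM be swapped per region. A production version would also need to handle malformed model output, which this sketch does not.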
This approach accelerates the collection and curation of SDG event data while supporting cross-lingual scalability and adaptation to region-specific formats. By enabling structured data extraction from multilingual and heterogeneous sources, the framework contributes to a more unified and comprehensive dataset of global SDG activities. Ultimately, this work illustrates the role that AI-enhanced data pipelines can play in supporting evidence-based policy-making, enhancing transparency, and enabling real-time monitoring of progress toward the 2030 Agenda for Sustainable Development.