Bottom-Up Wrapper is a completely unsupervised information extraction system for extracting lists of data records from semi-structured web pages. Data records in a semi-structured web page, e.g., lists of products or services, are generally generated from databases and encoded into HTML with fixed templates or layouts by server-side scripts. However, these data records are presented without structural information, which makes it difficult for software tools to access them as structured data. On this website, we present a novel technique for extracting data records from semi-structured web pages. Many existing techniques take a top-down approach: they start by identifying the data region in the web page, then discover the pattern of data records in that region, and finally align these records to extract data items. In contrast, the Bottom-Up Wrapper solves the problem bottom-up: it starts by discovering the repetitive patterns of data items, uses these patterns to identify data records, and finally identifies the relevant data region. As a result, the technique requires only one input page and is a completely unsupervised wrapper.
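To make the bottom-up order concrete, the following is a minimal toy sketch in Python. It is not the Bottom-Up Wrapper implementation or the published algorithm. It assumes that a "repetitive pattern of data items" can be approximated by the most frequent tag path among text nodes, and that each record sits in its own child element of the data region. The names `TextPathCollector`, `extract_records`, and `SAMPLE_PAGE` are hypothetical and exist only for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical sample page: one repeated list plus unrelated text around it.
SAMPLE_PAGE = """
<html><body>
<h1>Shop</h1>
<ul>
<li><b>Pen</b><span>$1</span></li>
<li><b>Book</b><span>$5</span></li>
<li><b>Lamp</b><span>$9</span></li>
</ul>
<p>Contact us</p>
</body></html>
"""


class TextPathCollector(HTMLParser):
    """Collect every non-empty text node together with its ancestor stack."""

    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements as (tag, unique node id)
        self.items = []    # (ancestor stack, text) for each text node
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return  # void elements never contain text
        self.stack.append((tag, self.next_id))
        self.next_id += 1

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((tuple(self.stack), text))


def common_prefix(stacks):
    """Return the longest ancestor chain shared by all stacks."""
    prefix = list(stacks[0])
    for stack in stacks[1:]:
        n = 0
        while n < len(prefix) and n < len(stack) and prefix[n] == stack[n]:
            n += 1
        prefix = prefix[:n]
    return tuple(prefix)


def extract_records(html):
    """Return (region tag path, list of records) for a single input page."""
    collector = TextPathCollector()
    collector.feed(html)
    collector.close()
    items = collector.items
    if not items:
        return None, []

    # Step 1: find a repetitive data-item pattern (assumption: the most
    # frequent tag path among text nodes).
    def tag_path(stack):
        return tuple(tag for tag, _ in stack)

    counts = Counter(tag_path(stack) for stack, _ in items)
    pattern, frequency = counts.most_common(1)[0]
    if frequency < 2:
        return None, []  # nothing repeats, so no record list is detected
    pattern_stacks = [s for s, _ in items if tag_path(s) == pattern]

    # Step 2: identify records. Each repeated item belongs to the child
    # element just below the ancestors shared by all repeated items.
    shared = common_prefix(pattern_stacks)
    depth = len(shared)
    records = {}
    for stack, text in items:
        if len(stack) > depth and stack[:depth] == shared:
            records.setdefault(stack[depth][1], []).append(text)

    # Step 3: the data region is the element that encloses all records.
    region = "/".join(tag for tag, _ in shared)
    return region, list(records.values())


if __name__ == "__main__":
    region, records = extract_records(SAMPLE_PAGE)
    print(region)
    for record in records:
        print(record)
```

The sketch needs only one page and no labeled examples, which mirrors the unsupervised, single-page setting described above. A real wrapper would also have to handle noisy markup, nested lists, and records that do not share a single parent element.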
- Video Clips
- Live Demos: Try it yourself by extracting lists of data records from web pages.
- Thamviset, W., Wongthanavasu, S.: Bottom-Up Region Extractor for Semi-Structured Web Pages. The 2014 International Computer Science and Engineering Conference (ICSEC2014).
Digital Object Identifier [10.1109/ICSEC.2014.6978209]
- Thamviset, W., Wongthanavasu, S.: Information Extraction for Deep Web Using Repetitive Subject Pattern. World Wide Web. 1-31 (2013).
Digital Object Identifier [10.1007/s11280-013-0248-y]
- Thamviset, W., Wongthanavasu, S.: Structured web information extraction using repetitive subject pattern. Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), 2012 9th International Conference on. pp. 1-4 (2012).
Digital Object Identifier [10.1109/ECTICon.2012.6254247]
Bottom-Up Wrapper: Unsupervised Information Extraction System