Web semantization

The Internet is an open, effectively limitless space that anyone may join and contribute to, so the amount of information on the web grows every second. It holds all kinds of data, such as daily newspapers, discussion forums, shop catalogs, images, and videos, accessible to almost anyone. Extracting this data automatically is a major problem today.

Web pages holding large amounts of information are designed for human readers, which makes them difficult for computers to process automatically. Moreover, web pages are alive: their content keeps changing. Developing such applications is highly complex, because the nature of the data does not conform to common programming paradigms.

Our researchers working on this problem keep devising methods that make these tasks easier and faster. Several projects related to web data extraction and web semantization have been designed, developed, or contributed to, originally at the university. Several of them are currently publicly available:

LinqToWeb

LinqToWeb is a framework for web data extraction. Its design lets you define a strongly typed object model that transparently reflects data on the living web. This mechanism provides access to raw web data in a fully object-oriented way through Language Integrated Query (LINQ). With this framework, developing web-based applications such as data semantization tools is more efficient and type-safe, and the resulting product is easy to maintain and extend.
With LinqToWeb, .NET and LINQ give you easy access to information from various sources. The project focuses on generating an abstraction over web resources, generating .NET proxy classes, and allowing LINQ to be used to read the web in a strongly typed, object-oriented way. The project development status is available at linqtoweb.codeplex.com. If you would like to integrate LinqToWeb into your .NET application, please feel free to contact us.
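
The idea of a typed object model queried like a collection can be illustrated with a minimal sketch. It is written in Python rather than .NET and does not use or reproduce the LinqToWeb API: the Product class, the products() function, and the sample HTML are all hypothetical, and a generator expression stands in for a LINQ query.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Dict, Iterator, List, Optional

# Inline sample page (made-up data) so the sketch runs without network access.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Keyboard</span><span class="price">25.00</span></li>
  <li class="product"><span class="name">Monitor</span><span class="price">180.50</span></li>
  <li class="product"><span class="name">Mouse</span><span class="price">12.90</span></li>
</ul>
"""


@dataclass(frozen=True)
class Product:
    """Hypothetical typed view of one product entry on the page."""
    name: str
    price: float


class _ProductParser(HTMLParser):
    """Collects <span class="name"> / <span class="price"> pairs inside each <li>."""

    def __init__(self) -> None:
        super().__init__()
        self.products: List[Product] = []
        self._field: Optional[str] = None      # span currently being read
        self._current: Dict[str, str] = {}     # fields gathered for the open <li>

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li":
            if "name" in self._current and "price" in self._current:
                self.products.append(
                    Product(self._current["name"], float(self._current["price"]))
                )
            self._current = {}


def products(html: str) -> Iterator[Product]:
    """Expose the page as an iterable of typed Product objects."""
    parser = _ProductParser()
    parser.feed(html)
    parser.close()
    yield from parser.products


# Query the page the way one would query an in-memory collection:
# filter by price, order by price, project to the name.
cheap = sorted((p for p in products(SAMPLE_HTML) if p.price < 100), key=lambda p: p.price)
cheap_names = [p.name for p in cheap]
```

The calling code only sees Product objects with typed fields; the HTML parsing stays hidden behind products(), which is the kind of separation the framework description above aims for.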

AgentMat

The AgentMat system is designed for efficient extraction of large amounts of data from web pages. Processing is driven by an XML-based language that describes the given extraction task declaratively. A task description consists of system components which, connected together, perform the desired functionality on a general web page. With this scraping system, raw content from irregularly updated, unstructured web pages can be kept categorized and accessed together with its semantic metadata.
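
As a rough illustration of a declarative task built from connected components, the following Python sketch parses a small XML description and chains the components it names. The XML vocabulary (task, component, type, pattern, category) and every function name are invented for this example; they are assumptions and not AgentMat's actual language or API.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Hypothetical task description: element and attribute names are invented here.
TASK_XML = """
<task name="headlines">
  <component type="regex" pattern="&lt;h2&gt;(.*?)&lt;/h2&gt;"/>
  <component type="strip"/>
  <component type="annotate" category="news"/>
</task>
"""


def make_regex(attrs: Dict[str, str]) -> Callable[[list], list]:
    """Component that extracts every match of the configured pattern."""
    pattern = re.compile(attrs["pattern"], re.S)
    return lambda items: [m for text in items for m in pattern.findall(text)]


def make_strip(attrs: Dict[str, str]) -> Callable[[list], list]:
    """Component that trims whitespace and drops empty values."""
    return lambda items: [s.strip() for s in items if s.strip()]


def make_annotate(attrs: Dict[str, str]) -> Callable[[list], list]:
    """Component that attaches a category label as simple metadata."""
    return lambda items: [{"value": s, "category": attrs["category"]} for s in items]


# Registry mapping the declarative component type to its implementation.
COMPONENTS = {"regex": make_regex, "strip": make_strip, "annotate": make_annotate}


def build_pipeline(task_xml: str) -> List[Callable[[list], list]]:
    """Turn the XML description into an ordered list of processing steps."""
    root = ET.fromstring(task_xml)
    return [COMPONENTS[c.get("type")](c.attrib) for c in root.findall("component")]


def run(task_xml: str, page: str) -> list:
    """Feed one page through the components in document order."""
    items: list = [page]
    for step in build_pipeline(task_xml):
        items = step(items)
    return items


# Made-up page content for the demonstration.
PAGE = "<h2> Flood warning issued </h2><p>...</p><h2>Local team wins</h2>"
records = run(TASK_XML, PAGE)
```

Changing what is extracted then means editing the XML description rather than the Python code, which is the main appeal of a declarative task format.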

Publications

  • Jakub Míšek, Filip Zavoral:
    High-Level Web Data Abstraction Using Language Integrated Query,
    Intelligent Distributed Computing IV – IDC 2010, Studies in Computational Intelligence 315, Springer Verlag, 2010
  • Miloslav Beňo, Jakub Míšek, Filip Zavoral:
    AgentMat: Framework for Data Scraping and Semantization,
    Proceedings of the Third International Conference on Research Challenges in Information Science, IEEE Computer Society Press, 2009
