Because the Internet is an open space where anyone may join and contribute, the amount of information on the web grows every second. It is full of varied data, such as daily newspapers, discussion forums, shop catalogs, images, and videos, accessible to almost anyone. Automatic extraction of this data is a major problem today.
Web pages hold huge amounts of information but are designed for human readers, which makes automatic processing by computers difficult. Moreover, web pages are alive: their content keeps changing. Developing applications that extract this data is enormously complex, because the nature of the data does not conform to common programming paradigms.
Our colleagues working on this problem keep inventing methods that make these tasks easier and faster. Several projects related to web data extraction and web semantization have been designed, developed, or contributed to, originally on university premises. Several of these projects are currently publicly available:
LinqToWeb is a framework for web data extraction. Its design lets you define a strongly typed object model that transparently reflects data on the live web. This mechanism provides access to raw web data in a fully object-oriented way through Language Integrated Query (LINQ). With this framework, developing web-based applications such as data semantization tools is more efficient and type-safe, and the resulting product is easy to maintain and extend.
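The idea of a typed object model over page content can be illustrated with a minimal sketch. This is not LinqToWeb's API, and it is written in Python rather than .NET purely for illustration: the `Article` class, the `articles` function, and the sample HTML are all hypothetical, and a list comprehension stands in for a LINQ query.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Iterator, List, Optional

# Hypothetical sample page; a real extractor would download the HTML instead.
SAMPLE_HTML = """
<ul>
  <li><a class="article" href="/news/1">LINQ basics</a></li>
  <li><a class="article" href="/news/2">Web scraping notes</a></li>
  <li><a class="ad" href="/promo">Buy now</a></li>
</ul>
"""


@dataclass(frozen=True)
class Article:
    """Typed view of one article link on the page (hypothetical model)."""
    title: str
    url: str


class _ArticleParser(HTMLParser):
    """Collects every <a class="article"> element as an Article object."""

    def __init__(self) -> None:
        super().__init__()
        self.articles: List[Article] = []
        self._href: Optional[str] = None  # set while inside a matching <a>
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("class") == "article":
            self._href = attr.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.articles.append(Article(title, self._href))
            self._href = None


def articles(html: str) -> Iterator[Article]:
    """Expose the page as a sequence of typed objects instead of raw markup."""
    parser = _ArticleParser()
    parser.feed(html)
    parser.close()
    return iter(parser.articles)


# Declarative query over the typed model, comparable in spirit to
# a LINQ "from ... where ... select" expression.
web_titles = [a.title for a in articles(SAMPLE_HTML) if "web" in a.title.lower()]
print(web_titles)
```

The calling code only sees `Article` objects with named, typed fields; the parsing details stay hidden behind `articles`, which is the separation the framework description above aims for.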
With LinqToWeb, .NET, and Language Integrated Query, you can easily access information from various sources. The project focuses on building an abstraction over web resources, generating .NET proxy classes, and allowing LINQ to be used to read the web in a strongly typed, object-oriented way. The project's development status is available at linqtoweb.codeplex.com. If you would like to integrate LinqToWeb into your .NET application, please feel free to contact us.
- Jakub Míšek, Filip Zavoral:
High-Level Web Data Abstraction Using Language Integrated Query,
Intelligent Distributed Computing IV – IDC 2010, Studies in Computational Intelligence 315, Springer Verlag, 2010
- Miloslav Beňo, Jakub Míšek, Filip Zavoral:
AgentMat: Framework for Data Scraping and Semantization,
Proceedings of the Third International Conference on Research Challenges in Information Science, IEEE Computer Society Press, 2009