Introduction to Semantic Web Technologies
The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.
For newcomers to the Semantic Web, the above definition, taken from the article that is often regarded as the founding statement of the research area, is as good a starting point as any. The goal of the Semantic Web can in some sense be seen as a counterpoint to the Web of 2001. That Web was designed as a global document repository with very easy routes to access, publish, and link documents, and Web documents were created to be accessed and read by humans.
The Semantic Web is a machine-readable Web. As implied above, a machine-readable Web facilitates human-computer cooperation: as appropriate and required, certain classes of tasks can be delegated to machines and thus processed automatically. Of course, the design space for a machine-readable Web is very large, and a number of design decisions were taken in developing the Semantic Web as it is seen today. The trade-offs in this design space are discussed later in this chapter and in the rest of the book. Two of the most significant are worth mentioning up front, though. Firstly, as captured in the quote above, the Semantic Web is an extension of the Web. In particular, the Semantic Web builds upon the principles and technologies of the Web. It reuses the Web's global indexing and naming scheme, and Semantic Web documents can be accessed through standard Web browsers as well as through semantically aware applications. A global naming scheme means that in principle every semantic concept has a unique identifier, although in practice identity resolution is still a research area, and the Semantic Web language OWL contains a specific relation (owl:sameAs) to deal with this issue.
A second design choice is related to the fact that the Web is a shared resource, and therefore, within a machine-readable Web, meaning should be shared too. To this end, the Semantic Web incorporates the notion of an ontology, which by definition is a shared machine-readable representation (see Sect. 1.3.6). Through ontologies and ontology-related technologies, the meaning of and relationships between concepts within published Web pages can be processed and understood by software-based reasoners.
After about a decade of dedicated Semantic Web research, we are now entering a new phase for the technology. In short, it can now be claimed that the Semantic Web has arrived. There are a number of indicators of this. For example, semantic search engines now claim to index many millions of Semantic Web documents. Of course, this number of documents is small when compared to the size of the overall Web, but the trend resembles the early days of the Web, and if one counts the contained semantic statements (triples - see Sect. 1.3.4), the estimate exceeds a hundred billion.
Later in this chapter, and also in most of the other chapters of this book, evidence is given of the take-up of Semantic Web technology. Semantic technology can be seen being deployed in a wide variety of settings, including the enterprise, government, media, and science arenas. We are thus at a tipping point in the timeline of the Semantic Web, where the technology can be seen to be moving out of research labs and into the mainstream in a nontrivial fashion.
To mark this juncture, this book describes the main technological components of the Semantic Web, the vertical areas in which the technology is being applied, and emerging trends in the medium and long term. Each chapter covers general scientific and technical principles and also gives examples of application and pointers to relevant resources.
The rest of this chapter gives an introductory account of the notions of the Web and semantics from a technical perspective. Also, a brief history of the research area is given, pointers to a number of general Semantic Web resources are provided, and some highlights in terms of the deployment of semantic technology are outlined. The final section contains pointers to the future of the topic in general terms.
What Is the Web?
With over one trillion pages and billions of users, the Web is one of the most successful engineering artifacts ever created. At the end of 2009, there were 234 million websites, of which 47 million were added in that year. The Web is now a rich media repository: the current upload to Flickr is equivalent to 30 billion new photos per year, and YouTube now serves over one billion videos per day.
The Problem to Be Solved
The Web originated at CERN, where Tim Berners-Lee's 1989 proposal for an information management system was motivated by the following characteristics of the organization:
- The projects carried out were large and complex, involving several different types of technologies.
- Work was carried out by teams that crossed CERN's departmental and unit structures.
- The knowledge involved was not static but rather changed over time.
- There was a rotation of staff. Workers came and went periodically - the typical length of stay at CERN at the time was 2 years.
- Workers needed to be able to easily find and access relevant documents containing technical knowledge.
- The content of the documents needed to be easily changeable and the changes propagated across the organization quickly.
- The structure of the document collection could not be predetermined and had to be adapted easily.
CERN meets now some problems which the rest of the world will have to face soon.
Principles of the Web
As succinctly coined in the phrase "For a hammer everything is a nail" (originally from [43, p. 15]), one has to be careful to differentiate between technological biases and the true underlying principles of any generic framework. Nevertheless, a significant portion of the design of the Web is based upon hypertext, a term originally coined by Ted Nelson, with roots going back to Doug Engelbart's oNLine System and Vannevar Bush's Memex system. Another stream of innovation for the Web is communication protocols, notably TCP/IP, which provides the bottom layers of the communication protocol stack for the Web.
- Openness Anyone or any organization can engage with the Web as a provider or consumer of information. Openness is an essential criterion for the success of the Web as a platform and incorporates:
- Accessibility Web content can be accessed remotely from a wide variety of hardware and software platforms.
- Nonproprietary The Web itself is not owned by any individual or organization, minimizing the cost of participation.
- Consensual control The Web structure is itself controlled and managed by an open body, the World Wide Web Consortium (W3C), which has a well-defined consensual process model for decision making.
- Usable Usage of this infrastructure as a provider or user is kept as simple, smooth, and unrestricted as possible.
- Interoperability The Web is neutral to hardware and software platforms. A layer of protocols provides an integration mechanism, enabling heterogeneous proprietary and legacy solutions to interoperate through common interfaces.
- Decentralized authorship and editorship Content can appear, be modified, or be removed in an uncontrolled fashion. That is, the provisioning and modification of content are under the distributed control of peers rather than being controlled by a central authority. Central control would hamper access and therefore scalability. A consequence of this principle is that an element of chaos or "untidiness" needs to be tolerated. It is hard to imagine now, but in the early days of the Web one of the most common criticisms was that it would never take off because some Web pages could be found that were either incorrect or below some quality threshold, and also that some links were broken (two of the editors know of Computer Science professors who made this complaint).
- Automated mechanisms are provided to route requests and responses In order to scale, routing between requests and responses is handled in an automated fashion. Manual indexes or repositories are inherently nonscalable and costly, and immediately become outdated. The way that Web pages are accessed has changed over the past 10 years. At the beginning, one was required to know the IP address of the desired page, and later the URL (see below for a description). In this period, bookmark lists (especially lists of useful pages for a particular topic) were considered valuable intellectual property. Later, search engines such as AltaVista and Google raised access to the level of keywords.
- Enabling n:m relationships to maximize interaction In contrast to email, where the content is targeted at specific receivers, the Web is based on anonymous distribution through publication. In principle, the information is disseminated to any potential reader, something that e-mail can only attempt to achieve through spam. The use of content for purposes not foreseen by content producers facilitates serendipity on the Web and is one of the Web's key success enablers.
- A worldwide addressing schema, which enables every document to have a unique globally addressable identifier. For the Web, this is provided by URLs (Uniform Resource Locators). A URL serves the purposes of both identifying a resource and also describing its network location so that it can be found. URIs (Uniform Resource Identifiers) encompass both URLs and URNs (Uniform Resource Names), where URNs denote the name of a resource.
- A transfer protocol, HTTP (HyperText Transfer Protocol), which supports remote access to content over a network layer (TCP/IP). HTTP functions as a request-response protocol in a client-server computing model: a Web browser typically acts as the client, while an application running on the hosting computer acts as the server.
- A platform-independent interface, which enables users to easily access any online resource. In the case of the Web, this is HTML (HyperText Markup Language) together with the Web browsers that interpret and display the described content. HTML is thus a text and image formatting language, remotely served by Web host applications and used by Web browsers to display the Web content.
Integral to the makeup of the Web is the hyperlink, which has its origins in the hypertext field. Hyperlinks allow a Web resource to point to any other Web resource by embedding the URL within an HTML construct (the "<a>" or anchor element). Links on the Web are unidirectional and are not verified, which means that links may break - the target Web resource may have been removed or the URL itself may be incorrect - leading to the "untidiness" mentioned earlier. However, not forcing links to be verified is widely accepted as being one of the design choices that enabled the Web to scale so quickly.
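As an illustration of how these building blocks fit together, the following minimal Python sketch (the URL is a placeholder and error handling is omitted) parses a URL into its components, retrieves the resource behind it with an HTTP GET request, and extracts the hyperlinks embedded in the returned HTML:

# Parse a URL, fetch the resource over HTTP, and list its hyperlinks.
from urllib.parse import urlparse, urljoin
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collects the href attribute of every <a> (anchor) element.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

url = "http://example.org/"                    # placeholder URL
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)  # URL = scheme + location + path

html = urlopen(url).read().decode("utf-8")     # HTTP GET request over TCP/IP
extractor = LinkExtractor()
extractor.feed(html)                           # interpret the HTML
for link in extractor.links:
    print(urljoin(url, link))                  # resolve relative links against the base URL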
What Are the Problems with the Web?
The amount of information on the Web is staggering. The one trillion Web resources encompass practically every topic of human interest: from the life cycle of earthworms in New Zealand, to UK Pop Hits in the 1950s, to the Constitution of Mauritius.
- Accessing data - the "standard Web" is limited in that:
- Documents are indexed and accessed via plain text, that is, a string-based matching algorithm is used to retrieve documents according to a given request. This creates problems for ambiguous terms: for example, "Paris" can denote the capital of France; towns in Canada, Kiribati, and the USA; a number of films, including "Paris, Texas" by Wim Wenders; fictional characters, including the legendary figure from the Trojan War; and a number of celebrities, including the daughter of Michael Jackson and Paris Hilton, the socialite and heiress. Moreover, complex matching involving inference is not feasible without additional technology. For example, correctly answering the query "where can I go on holiday next week for 10 days with two young children for less than 1000 Euros in total?" is not possible with current search engines.
- The current paradigm is dominated by returning single "best fit" documents for a search. Often, the answer to a query is available on the Web but requires the combination and integration of the content of multiple source documents. The dominant search engines today leave this integration of content to the user.
- Underlying data are not available. A significant number of websites are generated from databases, but the underlying data are hidden behind the presented HTML. This phenomenon is sometimes termed "the dark Web" and significantly hinders the usability and reusability of the underlying information. A way to overcome this problem is to "Web scrape" the data by parsing the presented HTML. This process, though, is error-prone and unstable with regard to changes in the way the page is displayed, for example, if the layout or color scheme is altered (the sketch after this list illustrates the problem). It should be noted that making legacy database data available was specified as a requirement in the original proposal from Sir Tim Berners-Lee.
- Enabling delegation - the Web can be viewed as a very large collection of static documents. When users browse the Web, their computers act simply as rendering devices displaying text and graphics and sometimes audio and video content. All inference and computation is left to the user. To a large extent, the computational abilities of the device are not used. Coupled with the above onus on users to carry out their own inferences, the sheer volume and growth of data available create a strong need for at least some level of automation. For example, current estimates are that the 281 exabytes (an exabyte is 10^6 TB) of information created or replicated worldwide in 2007 will grow tenfold by 2011 to 1 zettabyte (10^9 TB) per year. Delegating tasks such as the integration of information, data analysis, and sense-making to machines, at least partially, is the only way forward for users, communities, and businesses to continue to make the most of the information available on the Web.
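To illustrate the fragility of Web scraping noted above, the following minimal Python sketch (the HTML snippets and the extraction pattern are invented for illustration) recovers a phone number by matching on the page layout; the extraction silently fails as soon as the markup changes:

# Scrape a phone number out of presented HTML by pattern matching on layout.
import re

pattern = r'<td class="label">Phone</td><td class="value">([^<]+)</td>'

page = '<td class="label">Phone</td><td class="value">01-444444</td>'
match = re.search(pattern, page)
print(match.group(1) if match else "extraction failed")  # -> 01-444444

# The same information after a purely cosmetic redesign of the page:
restyled = '<span class="phone-label">Phone</span><span>01-444444</span>'
match = re.search(pattern, restyled)
print(match.group(1) if match else "extraction failed")  # -> extraction failed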
Given the above requirements, the Semantic Web extends the Web with "meaning," supporting access to data at Web scale and enabling the delegation of certain classes of tasks. Where the Web has documents at its center, the Semantic Web places data and the semantics of data at its core. An overview of the architecture of the Semantic Web is given in Semantic Web Architecture.
What Are Semantics?
Knowledge of the specific task domain in which the program is to do its problem solving was more important as a source of power for competent problem solving than the reasoning method employed.
A number of areas of Computer Science focus on capturing the meaning of data in a machine-processable manner, and these provide the historical context from which semantic technology was developed. The following briefly discusses the essence of semantic technology, as well as its form and substance.
Semantics, the Science of (Meaning)²
Semantic technology provides machine-understandable (or better, machine-processable) descriptions of data, programs, and infrastructure, enabling computers to reflect on these artifacts. Now, what does machine-processable semantics really mean? Let us ask Wikipedia, the world's leading resource of human knowledge. Let us specifically ask for machine-processable semantics. Unfortunately, there is no direct response. Okay, let us ask for its three elements instead.
- "In all such energy transformation processes, the total energy remains the same" . What was meant by consuming energy? "Energy is a quantity that can be assigned to every particle" . Here, proceedings become a bit philosophical. Trying to find out what a quantity is and why it is that it can be assigned to all particles will be resisted. Not to mention that the notion of an assignment should really be investigated and delved into whether particles or waves are the final truth? It does not really help to distinguish between a machine and a device. That is, machines remain defined as being machines (more precisely, it is learnt only that basic machines are simple machines).
- "Purpose is a result, end, aim, or goal of an action intentionally undertaken" . So what is an intention? "An agent's intention in performing an action is his or her specific purpose" . No, there will be no attempt to find out what an agent is.
Processable does not have an entry at Wikipedia. This saves both time and space.
"Semantics is the study of meaning,… This problem of understanding has been the subject of many formal inquiries… most notably in the field of formal semantics" . Also from the same source: "The word 'semantics' itself denotes a range of ideas." Fortunately only the word. And no, we will not try to understand what an idea is, since already in the narrowest sense "an idea is just whatever is before the mind when one thinks" . Let us try to find out the meaning of formal semantics: "Formal semantics is the study of the semantics" . Okay, formal semantics is the study of semantics and semantics is the study of meaning. Obviously meaning is the study of? No, meaning "is the end, purpose, or significance of something" . So, formal semantics is the study of the study of purpose. Purpose is to remember the attribute used to distinguish a device from generic equipment (which is a machine if it consumes energy).
We have naively entered an infinite regress of circular definitions written in natural language. This would be an opportune moment to refer to the importance of cooperation as a grounding mechanism for communication and to conduct a detailed analysis of the role of vocal and nonvocal communication mechanisms (cf. [45, 56]) in order to escape this regress. However, this is not the focus here. Obviously, life is a circle and one needs to be pragmatic. Let us instead try to understand the essence of semantic technology through its usage, starting with a number of predecessor technologies.
What is the main value of a traditional relational database? According to Wikipedia, "a database is a collection of data" and "the term data means groups of information" … "Information as a concept has many meanings …" The authors do not tell us whether information that is not viewed as a concept would have less meaning. According to Wikipedia, meaning also has many meanings. Still, Oracle is able to successfully sell bases of collections of groups of information that have many meanings when viewed as a concept, not to mention the fact that meaning itself has many meanings. Moreover, Oracle makes billions of dollars per annum with this rather vague business.
In a relational database, everything is represented in a table, where a row has a key and a column has a name. With this, even with a very simple machine, one can find the phone number of Mr. X if X is the value of the name column and phone number is the heading of another column. Unfortunately, with an average Web page, this is far more difficult. As mentioned earlier, hidden in various HTML tags there is a name (a random alphanumerical string similar to many others) and somewhere else a phone number (a set of integers including some special characters). A browser is required to render the information and a human reader to understand the information based on the layout of the website. This is the solution as implemented in the Web, introduced 20 years ago. As outlined earlier, this sheer simplicity has made the Web an incredible success story with now more than one billion users. Its simplicity also leaves room for improvement.
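The point can be made concrete with a minimal Python sketch (the table and its contents are invented): once the column headings fix the meaning of the values, even the simplest machinery can answer the question:

# Once "name" and "phone" carry the meaning, a trivial query finds the number.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (name TEXT, phone TEXT)")
db.execute("INSERT INTO people VALUES ('Mr. X', '01-444444')")
row = db.execute("SELECT phone FROM people WHERE name = 'Mr. X'").fetchone()
print(row[0])  # -> 01-444444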
Semantic technology adds tags to semistructured information just as database technology adds column headings to tabular information. Let us use a small example:
<name>Sir Tim</name>
<phone number>01-444444</phone number>
These annotations allow a computer "to understand" that "Sir Tim" is the name of a person and "01-444444" is his telephone number. In a similar fashion, programs and other computational resources can be described through semantic annotations. This is the essence of Semantic Web technology.
What can be seen from this example is that one needs two things to define the semantics of information: a language such as <X>Y</X> to define the meaning of Y, and terms such as X to denote this meaning. This is investigated in more detail in the following.
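A minimal Python sketch shows how a machine can exploit such an annotation (the element names are adapted here to well-formed XML, since tag names may not contain spaces):

# Extract annotated values by their tags rather than by page layout.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<person><name>Sir Tim</name><phoneNumber>01-444444</phoneNumber></person>")
print(doc.findtext("name"))         # -> Sir Tim
print(doc.findtext("phoneNumber"))  # -> 01-444444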
Logic is a more than 2,000-year-old discipline for formally capturing meaning. Over this long history, and especially relatively recently, a large number of logics have been developed, each suitable for a specific purpose. The focus here is on a small number of these languages, in particular, on those that provide insights into the overall design issues associated with logical languages and those that have been applied in a Semantic Web context. A number of languages used to express the meaning of data on the Semantic Web are then examined. Finally, open issues and problems in applying logic to the Web are discussed.
From an algorithmic perspective, implementing logical-reasoning systems demonstrates clearly how difficult decidability and computational complexity are to manage (cf. [29, 35]). First, the logical paradigms are briefly described in increasing order of expressiveness, and then it is shown how computer scientists have identified reasonable subsets that can be handled in practice.
Propositional Logic is a rather simple logical language providing propositions such as A, B, C, … and logical connectives such as AND and OR. Its interpretations are simply the enumerations of all possible true and false assignments to these propositions. Therefore, propositional logic is decidable, although already NP-hard.
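The enumeration argument translates directly into code. The following minimal Python sketch (the formula is an arbitrary illustration) decides satisfiability by trying all 2^n truth assignments, which also makes the exponential growth in effort plain:

# Decide satisfiability of a propositional formula by brute-force enumeration:
# always terminates (decidability), but costs 2**n in the worst case.
from itertools import product

def satisfiable(formula, propositions):
    for values in product([False, True], repeat=len(propositions)):
        assignment = dict(zip(propositions, values))
        if formula(assignment):
            return assignment
    return None

# (A OR B) AND NOT (A AND B), i.e., exclusive or, chosen only as an example
print(satisfiable(lambda v: (v["A"] or v["B"]) and not (v["A"] and v["B"]),
                  ["A", "B"]))  # -> {'A': False, 'B': True}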
First-Order Predicate Logic provides a richer means to define such propositions by providing terms such as c, f(c, X), … and predicate symbols that can be applied to these terms, for example, P(c), Q(c, f(c, X)), …. Terms can make use of variables that can be existentially or universally quantified (i.e., either there must exist a term fulfilling a formula, or all terms must fulfill it). First-order predicate logic is still semi-decidable. That is, there are complete and correct proof procedures; however, it is not possible to guarantee that they terminate. An important feature of first-order logic is the distinction between terms and predicates, that is, one is not allowed to apply predicates to other predicates.
Second-Order Predicate Logic and comparable languages drop this limitation. Here, one can apply predicates to other predicates or to entire formulae, and interpret variables as sets rather than as individuals of a domain of interpretation. Unfortunately, for these languages already unification, that is, the question of whether two terms can be made equal by substitution, is only semi-decidable, which means that there is not even an effective basis for implementing inference in these languages. The question of how far one can make progress in simulating second-order features syntactically (statements over statements, or classes that can be instances of other classes) within a semantically first-order framework has been explored in F-Logic and more generally in HiLog.
In layman's terms, propositional logic is reasoning about individuals; it is decidable, but the effort grows exponentially with the number of individuals. First-order logic is reasoning over sets of individuals (each predicate is interpreted as a set); it is complete but does not guarantee a terminating decision procedure. Second-order logic is concerned with reasoning about sets whose elements may again be sets. The focus of computational logic is on identifying subsets of logic that can be handled by computers. Unfortunately, what one gets here is not necessarily what one needs.
Most approaches in automatic theorem proving and software verification reason with variants of first-order logic. Here, based on the transformation of formulae into clause form, resolution and unification provide a complete although only semi-decidable proof procedure. Obviously, for this level of expressiveness, only incomplete reasoning requiring heuristic guidance can be achieved in the general case.
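The unification step can be illustrated with a minimal Python sketch. The term encoding - variables as capitalized strings, compound terms as tuples - is a convention invented for this example, and the occurs check is omitted for brevity:

# Unify two first-order terms, returning a substitution or None.
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def substitute(t, s):
    if is_var(t):
        return substitute(s[t], s) if t in s else t
    if isinstance(t, tuple):
        return tuple(substitute(a, s) for a in t)
    return t

def unify(t1, t2, s=None):
    s = dict(s or {})
    t1, t2 = substitute(t1, s), substitute(t2, s)
    if t1 == t2:
        return s
    if is_var(t1):
        s[t1] = t2
        return s
    if is_var(t2):
        s[t2] = t1
        return s
    if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
        for a, b in zip(t1, t2):
            s = unify(a, b, s)
            if s is None:
                return None
        return s
    return None

# P(c, f(c, X)) unifies with P(Y, f(Y, d)) under {Y -> c, X -> d}
print(unify(("P", "c", ("f", "c", "X")), ("P", "Y", ("f", "Y", "d"))))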
The practical complexity can be restricted by limiting first-order logic to Horn logic and applying Selective Linear Definite (SLD) resolution. There are also variants that forbid or cleverly restrict the usage of function symbols, creating a decidable language - in effect, propositional logic with some additional syntactic sugar. Most work on Horn logic alters the model theory by considering not all models but only those defined through certain minimality criteria (such a model is unique in the case where negation in the bodies of Horn clauses is either restricted or absent). In layman's terms, this model assumes that only facts which can be inferred are true and that all other facts are false. This is called the closed-world assumption and originates from the database area. A well-known implementation of this paradigm is Prolog. Interestingly enough, this paradigm extends the expressiveness of these syntactically restricted first-order languages beyond first-order logic, as it becomes possible to express the transitive closure of a relationship.
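A minimal Python sketch of this paradigm (the partOf facts are invented): facts are derived by forward chaining over a single Horn rule until a fixpoint is reached, and under the closed-world assumption everything that is not derived counts as false. The rule computes the transitive closure mentioned above:

# Forward chaining for the Horn rule:
#   partOf(X, Z) <- partOf(X, Y) AND partOf(Y, Z)
facts = {("partOf", "valve", "wheel"), ("partOf", "wheel", "car")}

def closure(facts):
    derived = set(facts)
    while True:
        new = {("partOf", x, z)
               for (_, x, y1) in derived
               for (_, y2, z) in derived if y1 == y2}
        if new <= derived:
            return derived  # fixpoint reached
        derived |= new

for fact in sorted(closure(facts)):
    print(fact)  # includes the inferred ('partOf', 'valve', 'car')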
Description Logics provide a whole family of sublanguages of first-order logic of differing complexity. Common to these languages is the restriction of the formalism to unary and binary predicates (concepts and properties) and the restriction of the usage of function symbols and logical connectives in building complex formulae. The different levels of complexity and the decidability of these languages follow from the precise definition of these restrictions. Accordingly, many different languages have been defined and implemented, many of which have intractable worst-case behavior but which nevertheless work for many practical applications.
Semantic Web Languages
HTML provides a number of ways to express the semantics of data. An obvious one is the META tag:
<META name="Author" lang="fr" content="Arnaud Le Hors">
In the time before the wider usage of RDF, systems such as Ontobroker used an attribute of the anchor tag to encode semantic information (see Sect. 1.5). It is also possible to interpret the semantics of HTML documents indirectly. For example, information captured in a heading tag of level one (<H1>) may be used to encode concepts that are significantly important for describing the content of a document. Still, HTML was not designed to provide descriptions of documents beyond informing the browser how to render the contents. Within efforts to stretch the use of HTML to include meaning, the term semantic HTML was coined.
The Extensible Markup Language (XML) has been developed as a generic way to structure documents on the Web. It generalizes HTML by allowing user-defined tags. This flexibility of XML, however, reduces the possibilities for the type of semantic interpretation that was possible with the predefined tags of HTML.
The Resource Description Framework (RDF) is a simple data model for semantically describing resources on the Web. Binary properties interlink terms, forming a directed graph. The terms as well as the properties are identified by URIs. Since a property is itself identified by a URI, it can in turn be used as a term interlinked with other properties. That is, unlike in most logical languages or databases, it is not possible to distinguish the language or schema from statements in the language or schema. For example, the statement <rdf:type, rdf:type, rdf:Property> states that type is of type property. Also, unlike conventional hypertext, in RDF, URIs can refer to any identifiable thing (e.g., a person, vehicle, business, or event). This very flexible data model is obviously suitable in the context of a free and open Web; however, it generates quite a headache for logicians who wish to layer a language on top. More details on RDF can be found in Semantic Annotation and Retrieval: RDF.
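The data model can be illustrated with a minimal sketch using the Python rdflib library (the example.org namespace and the resources in it are invented); note the self-describing statement about rdf:type discussed above:

# Build a small RDF graph: terms and properties are URIs, and the vocabulary
# itself (rdf:type) can be the subject of a statement.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")  # invented namespace
g = Graph()
g.add((EX.TimBL, RDF.type, EX.Person))             # ex:TimBL rdf:type ex:Person
g.add((EX.TimBL, EX.phone, Literal("01-444444")))  # ex:TimBL ex:phone "01-444444"
g.add((RDF.type, RDF.type, RDF.Property))          # rdf:type rdf:type rdf:Property
for s, p, o in g:
    print(s, p, o)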
RDF Schema (RDFS) uses basic RDF statements and defines a simple ontology language. Specifically, it defines entities such as rdfs:Class, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, and rdfs:range, enabling one to model classes, properties with domain and range restrictions, and hierarchies of classes and properties. RDFS is a specific RDF vocabulary for this purpose and is simply RDF plus some more definitions (statements) in RDF.
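A minimal rdflib sketch of such an RDFS vocabulary (the namespace and classes are invented; rdflib 6+ is assumed, where serialize returns a string):

# Define a tiny schema: a class hierarchy and a property with domain and range.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Person, RDF.type, RDFS.Class))
g.add((EX.Researcher, RDFS.subClassOf, EX.Person))
g.add((EX.phone, RDF.type, RDF.Property))
g.add((EX.phone, RDFS.domain, EX.Person))
g.add((EX.phone, RDFS.range, RDFS.Literal))
print(g.serialize(format="turtle"))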
The Web Ontology Language OWL extends this vocabulary to a full-fledged spectrum of Description Logics defined in RDF, namely, OWL Lite, OWL DL, and OWL Full. Mechanisms are provided to define properties as inverse, transitive, symmetric, or functional. Properties can be used to define the membership of instances in classes, as well as hierarchies of classes and of properties. Frankly, OWL Lite is already quite an expressive Description Logic, which makes the development of efficient implementations for large data sets quite challenging and, in practice, as difficult as implementing OWL DL. However, neither of these languages can make use of full RDF; that is, some valid RDF statements are not valid in Lite or DL. This is due to the fact that logical languages such as Description Logics exclude meta statements, that is, statements over statements. For RDF and RDFS, this was not a problem since neither language provided mechanisms to define complex logical definitions. In a nutshell, Lite and DL define a vocabulary in RDF and restrict the usage of RDF. OWL Full drops these restrictions: it provides the vocabulary of OWL DL, that is, an expressive Description Logic, and allows any valid RDF statement. For example, in OWL Full, a class can be treated simultaneously as a set of individuals and as an individual. Therefore, OWL Full is beyond the expressive scope of Description Logic and minimally requires a theorem-prover type of inference such as that for first-order logic (i.e., it is semi-decidable).
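A minimal rdflib sketch of this vocabulary in use (the namespace and resources are invented): a transitive property, and an identity statement of the owl:sameAs kind mentioned at the start of this chapter:

# Declare a transitive property and state that two URIs denote the same thing.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.partOf, RDF.type, OWL.TransitiveProperty))
g.add((EX.TimBL, OWL.sameAs, EX.TimothyBernersLee))
print(g.serialize(format="turtle"))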
Still, OWL Full can be used as a basis for finding useful restrictions (OWL DL is an example of such a restriction) and for deriving useful languages such as the Simple Knowledge Organization System (SKOS). SKOS is a data model for knowledge organization systems that uses keywords to describe resources. SKOS is defined as an OWL Full ontology; that is, it uses a sub-vocabulary of OWL Full to define a vocabulary for simple resource descriptions based on controlled structured vocabularies.
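A minimal rdflib sketch of a small SKOS vocabulary (the concepts are invented):

# Describe a concept with a preferred label and a broader concept.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.SemanticWeb, RDF.type, SKOS.Concept))
g.add((EX.SemanticWeb, SKOS.prefLabel, Literal("Semantic Web", lang="en")))
g.add((EX.SemanticWeb, SKOS.broader, EX.WorldWideWeb))
print(g.serialize(format="turtle"))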
Work on OWL2 started in 2007 to address some of the issues around OWL. In particular, OWL Lite had been defined as an overly expressive Description Logic. This hampered the implementation of Lite reasoning on top of existing semantic repository technologies and also made the layering of rules on top of the language unfeasible. Specifically, there was too big a gap between RDFS and OWL Lite. In consequence, three new sub-languages were defined: OWL2 EL provides polynomial-time algorithms for all the standard reasoning tasks of Description Logic; OWL2 QL enables efficient query answering over large instance populations; and OWL2 RL restricts the expressiveness with respect to extensibility toward rule languages. OWL2 RL links seamlessly with rule-based representations of RDFS and extensions to simple rule languages. This is currently the route that most industrial semantic repository developers follow, and, together with OWL2 QL, it will probably define the most important Semantic Web representation languages from a technological point of view.
The Rule Interchange Format (RIF) complements OWL with a language framework centered on the rule paradigm. Like OWL, it comes not as a single language but as a number of sub-languages. The framework incorporates RIF-BLD, which defines a simple logic-oriented rule language; RIF-PRD, which captures most aspects of production rule systems; and RIF-Core, which is the intersection of the two. This split is due to the fact that the W3C working group had to cover two very different paradigms which are similar only at the surface level: rules based on a declarative interpretation of logic, and rules that model event-action systems based on the production rule paradigm. The former usually have a declarative semantics in terms of a variation of a minimal Herbrand model and formed the basis of an alternative database paradigm, deductive databases. The latter normally have only an operational semantics and are used to express the dynamic aspects of processes. Production rules are in essence a kind of programming language based on a blackboard architecture and event triggers. Since these production systems are no longer called expert system shells but business rule engines (suitable for implementing business processes), they have gained significant commercial interest. Creating a merger of these two different paradigms was a nontrivial task. Finally, these three dialects are complemented by the Framework for Logic Dialects (RIF-FLD) as a way to define new RIF dialects. RIF uses XML as its exchange syntax and unfortunately does not layer directly on top of RDF.