Semi-structured and semantic data

HTML: an incredibly important, but semi-structured language

Web pages are built with HTML. HTML is not a programming language but a structured way of "writing" in which you carefully mark up every part of the page. All web pages you visit on the Internet, as well as many apps (social media apps, for example), are built with HTML. That is why HTML is such an important language.

With the markup language HTML you build web pages and link them together. You don't build a web page with a word processor; you write it as code. A browser (Chrome, Firefox, Internet Explorer, Opera, Safari...) reads that code and converts it into a human-readable display. In your HTML code you have to specify exactly where a title, paragraph, link, image or any other part is, because a browser is a piece of software that otherwise has no way of knowing what you consider a title or a paragraph.

Therefore, you have to select each part and say to the browser, "Look, this is a title!" It's a bit like studying a text: you indicate the most important parts with a highlighter. Of course, a highlighter is of no use when building a web page. In HTML (HyperText Markup Language), we mark an element by indicating where the element begins and where it ends. You do this in the following way:

<element>
        The content of the element
</element>

Of course, you should not enter "element" itself, but one of the agreed-upon names, which are usually abbreviations: for example, "p" stands for "paragraph" and "h1" stands for a large "headline." HTML defines many elements, but you really don't need to know them all. With a limited number of basic elements, you can build just about any web page.

Suppose you want to write a text with a large headline and, below it, a paragraph with a picture. The HTML code would look like this:

<h1>Chapter 1: HTML is a semi-structured language</h1>
<p>HTML is not a programming language...</p>
<img src="assets/images/picture.jpg"/>
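
To make the idea of begin and end markers more concrete, here is a minimal sketch in Python of how a program might walk through the example above. It uses the HTMLParser class from Python's standard library; the names MarkupLogger and PAGE are made up for this illustration, and a real browser does far more than this.

```python
from html.parser import HTMLParser

# The three-line example from the text, as one string.
PAGE = """<h1>Chapter 1: HTML is a semi-structured language</h1>
<p>HTML is not a programming language...</p>
<img src="assets/images/picture.jpg"/>"""


class MarkupLogger(HTMLParser):
    """Hypothetical helper: records start tags, end tags and text runs."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip the line breaks between elements; keep real text only.
        text = data.strip()
        if text:
            self.events.append(("text", text))


logger = MarkupLogger()
logger.feed(PAGE)
logger.close()
events = logger.events

for kind, value in events:
    print(kind, value)
```

The list of events starts with the h1 start tag, followed by its text and its end tag. The software sees where each element begins and ends, and that is all the markup tells it.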

Everything is a link, including photos and movies

You can also "link" media such as pictures, movies and maps in HTML. Unlike in a presentation file such as PowerPoint, those media are not stored in the file itself; the HTML only links to them. The web browser in which you view the HTML page knows where to find and place the media thanks to an HTML instruction:

<img src="assets/images/picture.jpg"/>
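
The following Python sketch illustrates that the HTML contains only a reference to the picture, not the picture itself. It reads the src attribute and combines it with the address of the page. The page address PAGE_URL and the class name ImageLinkCollector are assumptions made for this example; they do not come from the text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical address of the page; it is not given in the text.
PAGE_URL = "https://example.org/book/chapter1.html"
SNIPPET = '<img src="assets/images/picture.jpg"/>'


class ImageLinkCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every img tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs.
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


collector = ImageLinkCollector()
collector.feed(SNIPPET)
collector.close()

# The HTML holds only a path; the picture lives in a separate file.
absolute_urls = [urljoin(PAGE_URL, src) for src in collector.sources]
print(collector.sources)
print(absolute_urls)
```

A browser does something comparable: it works out the full address of the picture from the path in the src attribute, fetches that file separately and then places it in the page.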

Semi-structured data

Why do we say HTML contains semi-structured information? Unlike with XML or JSON, you can't "tell" from the HTML structure what the data or information is about. An h1 tag indicates a title, but you don't know if it's the title of a news story or the name of a pair of shoes in a web shop. Of course, as a human "visitor" to a website, you can see and know that, but a search engine crawler like Googlebot has to put a lot more effort into discovering the "semantics."

Semantics means that the document itself "tells" what the content is about. Compare the examples below in XML and HTML.

XML

<book>
    <title>Uit het hoofd</title>
    <author>Kris Merckx</author>
    <description>What evolutionary advantage does being able to remember information give us? Why do big tech companies like Google and Facebook collect masses of data? How is it that 90% of all the data humans have produced throughout history dates from the last decade?</description>
</book>

HTML

<article>
    <h1>Uit het hoofd</h1>
    <p>Kris Merckx</p>
    <p>What evolutionary advantage does being able to remember information give us? Why do big tech companies like Google and Facebook collect masses of data? How is it that 90% of all the data humans have produced throughout history dates from the last decade?</p>
</article>

In the XML file, the structure and the element names tell us that the content is about a "book." In the HTML document, you find that out as a human reading the file. A piece of software can only recognize the structure as an "article" with a title and two paragraphs, and learns little about the type of content.
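
The difference can be sketched in a few lines of Python. The example below reads both snippets with xml.etree.ElementTree from the standard library and lists the element names a program gets to see. It assumes the HTML snippet happens to be well-formed XML, which is true here but not for HTML in general, and it shortens the long description to "..." to keep the example small. The function name describe is made up for this illustration.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
    <title>Uit het hoofd</title>
    <author>Kris Merckx</author>
    <description>...</description>
</book>"""

BOOK_HTML = """<article>
    <h1>Uit het hoofd</h1>
    <p>Kris Merckx</p>
    <p>...</p>
</article>"""


def describe(markup):
    """Return the root element name and a (name, text) pair per child."""
    root = ET.fromstring(markup)
    return root.tag, [(child.tag, child.text) for child in root]


xml_root, xml_fields = describe(BOOK_XML)
html_root, html_fields = describe(BOOK_HTML)

print(xml_root, [name for name, _ in xml_fields])
print(html_root, [name for name, _ in html_fields])

# In the XML, the author can be looked up by name.
print(dict(xml_fields)["author"])
```

In the XML version, a program can ask for the "author" directly. In the HTML version, it only finds two p elements and has no way of telling from the markup which of them holds the author's name.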
