Structured and unstructured data

The difference between unstructured and structured data

What is the difference between structured and unstructured data?
How can you structure data?
When are data semantic?

Most data and files contain somewhat unstructured data for computers. This sounds strange when you realize that digital data is now created for and by computers. However, this does not mean that the computer also understands the content of that data.

A picture, for example, is constructed by a computer by filling a grid of columns and rows with color information. A pixel (picture element) is a box in such a grid and to a computer contains three numeric data: a color value for red, a value for green and a value for blue. A computer will additionally assign each pixel an x and y position in the grid and thus remembers five numerical data for that pixel. So that is structured in a sense, but that still tells nothing about the content of the image. Is there a cow, an apple or a bottle of beer in the picture? The data doesn't tell you that in this case. You could add that information so that the "image source" no longer just contains pixel values and positions, but also "labels."

This is also the case for texts. Each letter in a text is stored as a cipher, but the computer knows nothing about the content of the text

What can a computer "recognize"?

Binary encoded files are more difficult for a computer system to "recognize" (or "understand") than text files (=ASCII, UTF-8...). After all, a text file only contains "text" that you can also type with a standard keyboard. So you can tell a piece of software to only look at the "text". ASCII/UTF-8 text files can be read by any computer system. You don't need special software to decode the binary codes.

You can, however, "teach" software to recognize certain patternsin text files:

line breaks: hard returns
punctuation: having people look at repeating patterns of commas, colons, quotation marks
certain repeating patterns or marks in text form
"agreed" characters or keywords
brackets and parentheses
tabs

Sometimes several of the above recognition patterns are present in a single file. But they are not always "relevant" to the computer itself. Line breaks or tabs are also often used in code to keep the code or data "'readable" by humans. In the examples at the bottom of this lesson, consider which recognition patterns are relevant to the software itself.

Structured data

So what do we really mean by structured data? It involves

data with clear patterns that a machine/computer can easily recognize.
For example: a list of rows and columns, credit card numbers, zip codes,
structured communications on wire transfers....
For example: data with key and value pairs.
Data with key and value pairs.
For example, a JSON file.

Unstructured data

Unstructured data offer little or no guidance about the content of the data or its "semantics."

Data that are not in a table with a recognizable number of rows and columns or keys with their values.
For example: text documents, web pages, presentations....
Data with less clear patterns for computer systems (this does not mean that there are no patterns at all)
For example: Images, audio, video...

Types of structured data

Over time, a number of standards have developed for structuring text data.

Markup languages such as HTML(hypertext markup language) and XML(extensible markup language) mark up elements in a text file by indicating where a part begins and where one ends. In this way, you mark up the various parts of your data. That way, software can find more quickly where a particular element is.
You can also separate text data with punctuation, spaces or tabs. A CSVfile separates rows with line breaks (hard returns), columns with a comma.
YAML(YAML Ain't Markup Language) separates data with tabs and line breaks.
JSON(Javascript object notation) uses various punctuation (commas, quotation marks...) and choppers.