Text files and binary files

We ask ourselves three questions:

  1. What are text files? 
  2. What is the difference between text files and binary files?
  3. Why can we say that most file types contain unstructured data?

To avoid a misunderstanding: by text files, in this case, we don't mean the documents you type into MS Word or LibreOffice, although the two do have some things in common. A file from MS Word is a word processing file. If you save it to your hard drive, it is stored as a binary and/or compressed file. When we talk about text files in the title, we are referring to "plain text" files: files that contain only characters you can type on a keyboard, with no formatting or graphics.

As an aspiring data expert, it is extremely important that you know what we mean by plain text. As a data analyst, you will be working with plain text files in text editors in any case. What exactly plain text is, you will read in a moment...

ASCII and plain text

The computer stores all files in binary, as a series of zeros and ones, because that is the only code most modern computers understand. Each file consists of an often very long, yet finite series of zeros and ones. The computer stores them on a storage medium (hard disk, USB stick...). 

As I type this sentence on my keyboard, the computer translates the keystrokes into zeros and ones: bits, which in turn are grouped into bytes. One byte is a group of eight bits. For example, the computer translates the letter a into the byte 01100001, which is 97 in decimal. Each key and each combination of keys (e.g. SHIFT + A for the capital A) is thus assigned a byte. Since a byte consists of 8 bits and each bit can be either 1 or 0, there are 2^8 (2 to the eighth power) = 256 combinations: every value from 00000000 = 0 to 11111111 = 255.
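The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch; the helper name to_bits is made up for this illustration and is not part of any library.

```python
def to_bits(char):
    """Return the 8-bit binary string for a single ASCII character."""
    code = ord(char)             # the character's number, e.g. 97 for "a"
    return format(code, "08b")   # that number written as eight binary digits


print(ord("a"))              # 97
print(to_bits("a"))          # 01100001
print(to_bits("A"))          # 01000001 (SHIFT + A gives a different byte)
print(2 ** 8)                # 256 possible values in one byte
print(int("11111111", 2))    # 255, the largest value one byte can hold
```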

Which keyboard keys are associated with which number is based on international conventions. Basically, you can connect any keyboard to any digital system that allows text input, and keystrokes are recognized automatically. This is because virtually all of these systems support ASCII. ASCII (American Standard Code for Information Interchange) is a standard that links Latin letters, digits, punctuation marks, a number of other characters and a series of control codes to a number.

The ASCII table is a two-dimensional grid, or table. Find the desired character, take the binary code indicated in its row header, and then append the contents of its column header. Thus, a space consists of 0010 followed by 0000, or concatenated 00100000, which corresponds to the decimal number 32.
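The table lookup can also be done in reverse: start from a character and split its byte into the row part and the column part. A small sketch, assuming the table layout described above (first four bits in the row header, last four bits in the column header); the function name table_position is invented for this example.

```python
def table_position(char):
    """Split a character's 8-bit code into its row part and column part."""
    bits = format(ord(char), "08b")
    return bits[:4], bits[4:]    # first four bits, last four bits


row, column = table_position(" ")
print(row, column)               # 0010 0000
print(int(row + column, 2))      # 32
```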

In a text editor, you can enter only ASCII characters (or UTF-8 characters, about which more later). You can't format your text or add pictures to it, for example. I can hear you thinking: what's the point of that?

The big advantage is that this kind of file is perfectly interchangeable. You do not need any special software to read the contents of those documents. A simple editor such as Windows Notepad or Visual Studio Code suffices to open the file. Programming code and web pages can be opened perfectly in an editor, because they contain only characters supported by the ASCII or UTF standards.

You may have already experienced it. You copy a text from Wikipedia and paste it into an MS Word document. All the links and formatting that were in the web page are copied along with it. If you want to avoid this, first paste the text into a text editor. That way, you only have the "plain text".

[Figure: The ASCII table]

UNICODE and UTF-8

ASCII works perfectly for English, but speakers of Arabic, Chinese and Japanese, in short everybody who did not write in English or another Western language, were in trouble. No binary codes existed for Arabic or Chinese characters (a "standard" Chinese reader knows about 7,000 different characters). The recent Hanyu Cidian covers as many as 56,000 Chinese characters! Extensions to the ASCII character set were necessary.

The need to include all other language systems and codes led to UNICODE. The UNICODE character set allows the use of all conceivable alphabets and script systems in a single document. Not that you need this often, because you don't write or type every day in Pau Cin Hau, Bamum or Sharada, but dingbats, mathematical symbols, etc. are something you come across a bit more often. UNICODE is aligned with an ISO standard (ISO/IEC 10646). The best-known encoding of UNICODE is UTF-8 (8-bit Unicode Transformation Format). Modern web pages are encoded according to this standard, allowing them to display virtually any symbol by default.
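To get a feeling for UTF-8, you can look at the bytes it produces. UTF-8 uses between one and four bytes per character, and the ASCII characters keep their familiar single byte. The sketch below is only an illustration; the sample characters and the helper name utf8_bits are chosen for this example.

```python
def utf8_bits(char):
    """Return the UTF-8 bytes of one character as groups of eight bits."""
    encoded = char.encode("utf-8")    # the bytes that end up in the file
    return " ".join(format(byte, "08b") for byte in encoded)


for char in ["a", "é", "€", "中"]:
    print(char, len(char.encode("utf-8")), utf8_bits(char))
```

The letter a still takes one byte (01100001), while é takes two bytes and the euro sign and the Chinese character take three bytes each.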

Binary files 

Although ASCII files offer a lot of advantages, they are not always the best way to store data. When one talks about BINARY files, one is talking about files that are not encoded according to the ASCII character set. There are several good reasons to store files this way. To preserve most files, such as photos, documents from a word processor, presentations, movies..., one uses a "binary encoding". ASCII in that case would lead to files that are too large. In addition, people very often use compression techniques to make files a lot smaller so they take up less space. Just think how quickly your smartphone fills up if you take pictures or movies all day.

Why not ASCII/UNICODE?

An example: a letter "a" (not the capital letter) on your keyboard translates as the byte 01100001. One character is replaced by a string of 8 zeros and/or ones. So in this case, a computer uses 8 bits where a human only needs one character. This also applies to the digits on your keyboard. If you type a 0, it becomes 00110000 according to ASCII/UNICODE rules and a 1 becomes 00110001. The computer treats the number 256, according to the ASCII character set, as 3 separate ASCII/UNICODE characters or 3 bytes, and it then looks like this: 00110010 00110101 00110110.
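You can verify this yourself. A minimal sketch in Python, where the variable names are chosen just for this example:

```python
text = "256"
encoded = text.encode("ascii")    # one byte per typed character

print(len(encoded))                                        # 3
print(list(encoded))                                       # [50, 53, 54]
print(" ".join(format(byte, "08b") for byte in encoded))   # 00110010 00110101 00110110
```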

A rather crazy way of doing things if we only want to store numbers, because internally ASCII already makes the translation to decimal digits by attaching a decimal number to each key or character. Numbers can be stored much more simply and compactly by storing them not as a series of individual characters, but as one whole, as an actual number. 255 will then look like 1111 1111. A single byte can contain 256 different values. Going to 4 bytes we get more than 4 billion possible combinations (2 to the power 32, or 4,294,967,296). The number 4000000000 (4 billion) would take up only 4 bytes that way.
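The difference between storing a number as text and storing it as a number can be sketched as follows. This assumes an unsigned integer of 4 bytes in big-endian order; real file formats make their own choices here.

```python
number = 4_000_000_000

as_text = str(number).encode("ascii")             # every digit becomes one byte
as_binary = number.to_bytes(4, byteorder="big")   # the whole number in 4 bytes

print(len(as_text))                               # 10
print(len(as_binary))                             # 4
print(2 ** 32)                                    # 4294967296 possible values in 4 bytes
print(int.from_bytes(as_binary, byteorder="big")) # 4000000000, nothing was lost
```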

Storing the number 4000000000 as ASCII/UNICODE would eat up 10 bytes of space, one for each digit. For this reason, people use many different forms of encoding. It would be downright silly, for example, to store an image as an array of ASCII/UNICODE values (albeit binary encoded). An image of 1024 by 768 pixels has a total of 786432 pixels. Since each pixel consists of a mix of 256 values of red, 256 values of green and 256 values of blue, encoding it according to the ASCII/UNICODE character set would take up an enormous amount of space.
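A rough calculation makes the difference concrete. The text layout below is an assumption made only for this sketch: every colour value is written as up to three digits followed by one separator character, so at most 4 bytes per value. Compression is left out.

```python
WIDTH, HEIGHT = 1024, 768
CHANNELS = 3                          # red, green and blue

pixels = WIDTH * HEIGHT
binary_size = pixels * CHANNELS       # one byte per colour value (0 to 255)

# Assumed text layout: up to three digits plus one separator per colour value.
text_size_worst_case = pixels * CHANNELS * 4

print(pixels)                  # 786432
print(binary_size)             # 2359296
print(text_size_worst_case)    # 9437184
```

Under these assumptions the text version can be up to four times as large as the raw binary version.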

Disadvantages of ASCII/UNICODE:  

  • Files become too large when stored according to ASCII/UNICODE.
  • When you type a number on your keyboard, it is first translated into another number. A 0 is translated to "48" that way.

Binary files are any files that contain more than just pure text characters. A few examples: docx, pptx, xlsx, jpg, mp4, mp3, pdf, zip...

Why not store all the files in ASCII/UNICODE format? 

  • Binary formats reduce the required storage space.
  • Commercial software companies want to prevent people from being able to see how they "store" their files and what information is kept in them. That way they can "force" users to buy their software if they want to open a specific file.

Advantages of ASCII/UNICODE:   

  • You do not need special software licenses to edit the files.
  • Text files are an easy and simple way to store structured data.
  • You can easily develop your own software that can import or export text files.
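To illustrate the last two points: a few lines of Python are enough to export structured data to plain text and import it again. This is only a sketch. The choice of the CSV format, the sample records and the function names export_rows and import_rows are all assumptions made for this example.

```python
import csv
import io

# Invented sample records, purely for illustration.
rows = [
    {"name": "Ada", "age": "36"},
    {"name": "Linus", "age": "54"},
]


def export_rows(records):
    """Write the records as plain CSV text and return that text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "age"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def import_rows(text):
    """Read CSV text back into a list of dictionaries."""
    return [dict(record) for record in csv.DictReader(io.StringIO(text))]


text = export_rows(rows)
print(text)                 # readable in any text editor
print(import_rows(text))    # the same records again
```

Because the result is plain text, you can open it in Notepad or Visual Studio Code without any special software.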