Text Book Problem

S_Deshpande · February 5, 2021, 4:41pm

Hi, I am a mechanical engineer learning Python, so I am not familiar with the jargon. I was solving ex23 from the book “Learn Python 3 the Hard Way”.

Can someone tell me what exactly utf-8 is? I have tried to google it, but I am not able to understand it. An explanation would really help me get ahead with my exercise.

florian · February 5, 2021, 9:01pm

It’s a text encoding.

Under the hood, strings are just sequences of numbers. An encoding tells the computer how to read the numbers: it specifies the width of individual numbers (8 bit, 16 bit etc.) and maps numbers to characters, a bit like a dictionary:

{ 97: 'a', 98: 'b', ... }

A raw text file in memory might look like an array of numbers, e.g. [104, 101, 108, 108, 111], but if you know the encoding is ASCII, you can decipher the string "hello".
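To make that concrete, here is a minimal Python 3 sketch of the same idea. The variable names are just for illustration; `bytes`, `decode`, `encode`, `ord` and `chr` are Python built-ins.

```python
# The raw numbers from the example above, as they might sit in memory.
raw_numbers = [104, 101, 108, 108, 111]

# bytes() packs the numbers into a byte string; each value must be in 0..255.
raw_bytes = bytes(raw_numbers)

# Decoding applies the encoding's number-to-character mapping.
text = raw_bytes.decode("ascii")
print(text)  # hello

# Encoding goes the other way: characters back to numbers.
round_trip = list(text.encode("ascii"))
print(round_trip)  # [104, 101, 108, 108, 111]

# ord() and chr() expose the mapping for a single character.
print(ord("h"), chr(111))  # 104 o
```

The `.encode()` and `.decode()` calls in ex23 do the same thing, only with utf-8 instead of ASCII.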

gpkesley · February 8, 2021, 1:11pm

Ah yes. I remember Zed explaining about this in a live session once, and it expanded my understanding of computing history!

A bit can be 1 or 0, as we know. 8 bits (one byte) can provide 256 distinct values, ranging from the bitstring 00000000 to 11111111. These numbers are undefined (just numbers) without a reference that describes how they should be decoded. “utf-8” is one such agreed reference: it maps the values 0-127 to the same characters as ASCII, and uses sequences of two to four bytes for every other character.

“utf-16” encodes the same set of characters using 16-bit units instead of 8-bit ones (one unit for most characters, two for the rest). Unicode is the character set that both encodings map to; it was created in recognition of other languages, symbols etc., and is currently being expanded frequently with little emoji - that we all know and love.
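If it helps, here is a small sketch you can run to see how many bytes each encoding actually uses per character. The sample characters are my own choice; `"utf-16-le"` is used rather than `"utf-16"` only so that Python does not prepend a byte-order mark to every result.

```python
# One plain ASCII letter, one accented letter, one currency sign, one emoji.
samples = ["a", "é", "€", "😀"]

utf8_lengths = []
utf16_lengths = []

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # little-endian, no byte-order mark
    utf8_lengths.append(len(utf8))
    utf16_lengths.append(len(utf16))
    # Print the Unicode code point and the bytes each encoding produces.
    print(f"U+{ord(ch):04X}  utf-8: {list(utf8)}  utf-16: {list(utf16)}")

print(utf8_lengths)   # [1, 2, 3, 4]
print(utf16_lengths)  # [2, 2, 2, 4]
```

So both encodings can represent every one of these characters; they just spend a different number of bytes doing it.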

As @florian says, without the encoding type, it's just a bitstring. With the encoding type known, it can be recognised as a text file, or sound wave, or image, etc…
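A quick way to see this for yourself is to decode the same bytes with different encodings. This is only a sketch with a made-up sample string, but the calls are standard Python 3.

```python
# Five bytes: the word "café" as utf-8 would store it.
raw = "café".encode("utf-8")
print(list(raw))  # [99, 97, 102, 195, 169]

# Read back with the right encoding, the text is recovered.
as_utf8 = raw.decode("utf-8")
print(as_utf8)  # café

# Read back with the wrong encoding, the same bytes become different text.
as_latin1 = raw.decode("latin-1")
print(as_latin1)  # cafÃ©

# ASCII only defines 0..127, so it cannot interpret bytes 195 and 169 at all.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as err:
    ascii_failed = True
    print("ascii cannot decode this:", err)
```

Nothing about the bytes changed between the three attempts, only the rule used to read them.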

florian · February 8, 2021, 6:00pm

It’s not history! It’s how computers work now. The fact that modern comfortable languages hide pretty much all the nitty gritty details doesn’t change that one bit.

gpkesley · February 9, 2021, 8:00am

Agreed @florian. The history element was ‘why’ computers do what they do now. I find it fascinating how some of the apparent oddness in software design makes much more sense when viewed as historical evolution.

Before ASCII things must have been a Wild West.