Hi, I am a mechanical engineer learning Python, so I am not familiar with the jargon. I was solving ex23 from the book “Learn Python 3 the hard way”.
Can someone tell me what exactly utf-8 is? I have tried to google it, but I am not able to understand it. This will really help me get ahead with my exercise.
Under the hood strings are just sequences of numbers. An encoding tells the computer how to read the numbers: it specifies the width of individual numbers (8 bit, 16 bit etc.) and maps characters to numbers, a bit like a dictionary:
{ 97: 'a', 98: 'b', ... }
A raw text file in memory might look like an array of numbers, e.g. [104, 101, 108, 108, 111], but if you know the encoding is ASCII, you can decipher the string "hello".
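If it helps to see that in Python, here is a minimal sketch of the same idea. The variable names are just mine for illustration; `bytes`, `decode`, `encode` and `ord` are standard Python 3 built-ins.

```python
# The raw numbers from the example above.
raw = [104, 101, 108, 108, 111]

# Pack the numbers into a bytes object, then read them using the ASCII table.
text = bytes(raw).decode("ascii")
print(text)

# Going the other way: turn the string back into its numbers.
back = list(text.encode("ascii"))
print(back)

# ord() looks up the number for a single character.
print(ord("a"), ord("b"))
```

The first print should show the word hello, and the second the original list of numbers.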
Ah yes. I remember Zed explaining about this in a live session once, and it expanded my understanding of computing history!
A bit can be 1 or 0 as we know. A byte is 8 bits, which can provide 256 distinct values ranging from the bitstring 00000000 to 11111111. These numbers are undefined (just numbers) without a reference that describes how they should be decoded. “utf-8” is one such agreed reference: byte values 0-127 mean the same characters as in ASCII, and every other character is written as a sequence of two to four bytes.
“utf-16” is another encoding, built from 16-bit units (one or two per character) instead of single bytes. Both utf-8 and utf-16 are ways of writing down Unicode, which is the character set itself: it assigns a number to every character in recognition of other languages, symbols etc., and is currently being expanded frequently with little emoji - that we all know and love.
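A rough sketch of how many bytes a character takes in each encoding, in case that makes it more concrete. The helper name `byte_lengths` and the sample characters are my own choices; I use "utf-16-le" so that Python does not add its two-byte order mark to the count.

```python
# Sample characters, written as escapes so the file itself stays plain ASCII.
samples = [
    "a",           # Latin small letter a
    "\u00e9",      # e with acute accent
    "\u20ac",      # euro sign
    "\U0001F600",  # grinning face emoji
]

def byte_lengths(ch):
    """Return (utf-8 length, utf-16 length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    u8, u16 = byte_lengths(ch)
    print(f"U+{ord(ch):04X}  utf-8: {u8} bytes  utf-16: {u16} bytes")
```

Plain English letters stay at one byte in utf-8, which is a big part of why it is the usual default.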
As @florian says, without the encoding type, it’s just a bitstring. With the encoding type known, it can be recognised as a text file, or sound wave, or image, etc…
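To illustrate with text only: the same two bytes read differently depending on which encoding you assume. This is just a sketch, and the variable names are mine.

```python
# Two bytes with no meaning attached yet.
data = bytes([0xC3, 0xA9])

# Read as utf-8, the pair forms a single character (e with acute accent).
as_utf8 = data.decode("utf-8")

# Read as latin-1, each byte is its own character, giving two characters.
as_latin1 = data.decode("latin-1")

print(len(as_utf8), repr(as_utf8))
print(len(as_latin1), repr(as_latin1))
```

This is the usual cause of garbled accented letters: the bytes are fine, but they were decoded with the wrong encoding.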
It’s not history! It’s how computers work now. The fact that modern comfortable languages hide pretty much all the nitty gritty details doesn’t change that one bit.
Agreed @florian. The history element was ‘why’ computers do what they do now. I find it fascinating how some of the apparent oddness in software design makes much more sense when viewed as historical evolution.