Exercise 23 - incorrect decoding/encoding


#1

Hi,
When I run the exercise 23 code in Jupyter Notebook, the decoding and encoding look funny, and the bytes in between the b’…’ are different from Zed’s. Do you know if this might be a problem caused by Jupyter? It should be able to read Unicode, as it is based on a browser, but somehow it seems to go wrong.

(I know I should not use Jupyter to do the exercises, but I am kind of forced to, as we use Anaconda at my work. It has actually taught me a lot, since I have had to debug and rewrite code to make it work in Jupyter.)

Here is my code and output:

import sys

# Jupyter has no command line, so fake the script arguments here.
sys.argv = ['script', 'utf-8', 'strict']

def main(language_file, encoding, errors):
    # Read one line at a time and recurse until readline() returns ''.
    line = language_file.readline()

    if line:
        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)

def print_line(line, encoding, errors):
    # Encode the text to bytes, then decode the bytes back to text.
    next_lang = line.strip()
    raw_bytes = next_lang.encode(encoding, errors=errors)
    cooked_string = raw_bytes.decode(encoding, errors=errors)

    print(raw_bytes, "<===>", cooked_string)

languages = open("languages.txt", encoding="utf-8")

main(languages, sys.argv[1], sys.argv[2])

And then the output (I’ll just show you the top two lines).
b'\xef\xbb\xbfAfrikaans' <===> Afrikaans
b'\xc3\xa1\xc5\xa0 \xc3\xa1\xcb\x86\xe2\x80\xba\xc3\xa1\xcb\x86\xc2\xad\xc3\xa1\xc5\xa0\xe2\x80\xba' <===> አማርኛ

The “error” is not major, and my script runs. But the decoding/encoding becomes incorrect, and it bugs the hell out of me.

Hope someone knows what went wrong,

Heidi


#2

You should not be using Jupyter notebooks to do any of these exercises. It’ll cause all kinds of problems like this and isn’t really how anyone writes software. It’s fine for hacking up and testing ideas but then falls flat when you want to work with any other tools.
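
If you want to poke at those bytes anyway, here is a rough sketch of what they look like to me. It is a guess from the output in the post, not a confirmed diagnosis: \xef\xbb\xbf is the UTF-8 byte order mark, and the Amharic bytes follow the pattern you get when UTF-8 bytes are decoded as cp1252 somewhere and then encoded as UTF-8 again. The variable names and the idea that cp1252 is the culprit are my assumptions.

```python
# Hypothetical reproduction; the sample text is copied from the post above.
import codecs

amharic = "አማርኛ"

# 1) A UTF-8 file saved with a byte order mark starts with EF BB BF.
#    Decoding as plain "utf-8" keeps the mark as U+FEFF;
#    decoding as "utf-8-sig" drops it.
first_line = codecs.BOM_UTF8 + "Afrikaans".encode("utf-8")
with_bom = first_line.decode("utf-8")
without_bom = first_line.decode("utf-8-sig")
print(with_bom.encode("utf-8"))     # b'\xef\xbb\xbfAfrikaans'
print(without_bom.encode("utf-8"))  # b'Afrikaans'

# 2) Assumed cause of the long byte string: the UTF-8 bytes were read
#    as cp1252 text, and that text was then encoded as UTF-8 again.
good_bytes = amharic.encode("utf-8")
mojibake = good_bytes.decode("cp1252")
doubled = mojibake.encode("utf-8")
print(good_bytes)  # 12 bytes, three per character
print(doubled)     # longer, begins with \xc3\xa1\xc5\xa0 like the post

# Reversing the two steps recovers the original text.
repaired = doubled.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == amharic)  # True
```

If the first part is what happened, opening the file with encoding="utf-8-sig" instead of "utf-8" should get rid of the leading \xef\xbb\xbf. The second part would point at how languages.txt was saved or copied rather than at the script.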


#3

I appreciate the prompt reply. I have now requested admin rights on my computer, so that I can install Python and continue working with Atom and PowerShell.


#4

That’s great. Also, I prefer https://code.visualstudio.com/ these days over Atom but both will work.