Exercise 23 - incorrect decoding/encoding

Hi,
When I run the exercise 23 code in Jupyter Notebook, the decoding and encoding look funny, and the bytes between the b'…' are different from Zed’s. Do you know if this might be a problem caused by Jupyter? It should be able to read Unicode since it is based on a browser, but somehow it seems to go wrong.

(I know I should not use Jupyter to do the exercises, but I am kind of forced to, as we use Anaconda at my work. It has actually taught me a lot, since I have had to debug and rewrite code to make it work in Jupyter.)

Here is my code and output:

import sys

sys.argv=['script', 'utf-8', 'strict'] 

def main(language_file, encoding, errors):
    line = language_file.readline()

    if line:
        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)

def print_line(line, encoding, errors):
    next_lang = line.strip()
    raw_bytes = next_lang.encode(encoding, errors=errors)
    cooked_string = raw_bytes.decode(encoding, errors=errors)

    print(raw_bytes, "<===>", cooked_string)

languages = open("languages.txt", encoding = "utf-8")

main(languages, sys.argv[1], sys.argv[2])

And then the output (I’ll just show you the top two lines).
b'\xef\xbb\xbfAfrikaans' <===> Afrikaans
b'\xc3\xa1\xc5\xa0 \xc3\xa1\xcb\x86\xe2\x80\xba\xc3\xa1\xcb\x86\xc2\xad\xc3\xa1\xc5\xa0\xe2\x80\xba' <===> አማርኛ

The “error” is not major, and my script runs. But the decoding/encoding becomes incorrect, and it bugs the hell out of me.

Hope someone knows what went wrong,

Heidi

You should not be using Jupyter notebooks to do any of these exercises. It’ll cause all kinds of problems like this and isn’t really how anyone writes software. It’s fine for hacking up and testing ideas but then falls flat when you want to work with any other tools.

I appreciate the prompt reply. I have now requested admin rights on my computer so that I can install Python and continue working with Atom and PowerShell.

That’s great. Also, I prefer https://code.visualstudio.com/ these days over Atom but both will work.


Hello everyone,

I just wanted to double-check: should the code in the book really work, or is it a hidden challenge to fix it first? I’ve been pondering how to solve it for the last 3 days, and I can’t google anything that makes even a bit more sense. Or am I so bad that I can’t even copy what’s in the book?

The last error message that I got:
AttributeError: 'bytes' object has no attribute 'encode'

Complete code:

import sys
script, encoding, error = sys.argv


def main(language_file, encoding, errors):
    line = language_file.readline()

    if line:
        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)


def print_line(line, encoding, errors):
    next_lang = line.strip()
    raw_bytes = next_lang.encode(encoding, errors=errors)
    cooked_string = raw_bytes.decode(encoding, errors=errors)

    print(raw_bytes, "<===>", cooked_string)


languages = open("languages.txt", "rb")

main(languages, encoding, error)

Thank you so much in advance!

You are opening the file as a stream of raw bytes, hence next_lang is already a bytes object. You can decode bytes objects to get strings, or you can encode strings to get bytes. You are trying to encode bytes, which doesn’t really make sense, I guess.
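To make the direction of each call concrete, here is a minimal sketch. The sample string is just an illustration I picked, not a line from the book's languages.txt:

```python
# Illustrative value only; any non-ASCII string would do.
text = "Español"

# str -> bytes: encode
raw_bytes = text.encode("utf-8", errors="strict")
print(type(raw_bytes).__name__, raw_bytes)

# bytes -> str: decode
cooked_string = raw_bytes.decode("utf-8", errors="strict")
print(type(cooked_string).__name__, cooked_string)

# bytes objects have no .encode() method, which is the AttributeError
# reported in the question.
try:
    raw_bytes.encode("utf-8")
except AttributeError as exc:
    print("AttributeError:", exc)
```

So if next_lang is already bytes, the first call has to be decode, not encode.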

Thank you, Florian, your comment makes perfect sense. I’m really new to this and I’ve tried doing it the other way around, decoding raw_bytes and encoding cooked_string. However, I couldn’t get it working that way, so I suppose there’s something more to it.

UPDATE: with the help of a real developer, we managed to crack it. Once open("languages.txt", "rb") was changed to open("languages.txt"), it started working like a charm!

Thank you all for the help!


Hmmmmmm, are you on Windows? I think the 'rb' might be a Windows-only thing, and I should have mentioned that, so let me know if that’s the problem and whether I mentioned it.

Hello Zed,

No, I’m actually on Mac. Might that have been the issue?

Thanks.

No. 'rb' used to matter mostly on Windows, but in Python 3 it is part of the Unicode support on every platform, since open() needs a way to distinguish between opening a file as text and as bytes.
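A quick way to check how Python 3 treats the two modes is to read the same file both ways and look at the types. This is only a sketch: the temporary file and its contents are made up for the test and stand in for languages.txt.

```python
import os
import tempfile

# Hypothetical sample file standing in for languages.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Français\n")

    # Text mode: Python decodes the file for you, so readline() returns str.
    with open(path, encoding="utf-8") as text_file:
        text_line = text_file.readline()

    # Binary mode: nothing is decoded, so readline() returns bytes.
    with open(path, "rb") as binary_file:
        bytes_line = binary_file.readline()

print(type(text_line).__name__, repr(text_line))
print(type(bytes_line).__name__, repr(bytes_line))
```

The exercise code calls .encode() on each line first, so it needs the text-mode version.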