[SOLVED, sort of] Help with Ex23 study drill 3

CodeNinja · March 25, 2020, 8:14am

Hello, I am not able to do the third study drill of exercise 23. It asks us to rewrite this script using b’ ’ bytes instead of UTF-8 strings, reversing this script. What is meant by that and how to do it?

Capture

Thanks!

gpkesley · March 25, 2020, 12:49pm

Have a read of the book explaining how the code works in this chapter. Try adding comments to your code so you can explain what each line is doing.

Then have a think about what you might need to do so you could use the same script with amendments to encode from a ‘bytes’ source.

CodeNinja · March 27, 2020, 5:38pm

Okay, understood. I will do it and update my progress here.

CodeNinja · March 29, 2020, 12:30pm

Okay… So after countless hours of trial and error, endless frustration, and surfing the web, I found another thread on this forum with the same questions ([Solved?] LPTHW3 Exercise 23 Extra Credit 3 -- Reverse Script - #5 by namninja). I read it all and came up with this:-

NOTE: The bytes.txt file I use here is just the encoded form of the text in the file languages.txt in UTF-8

def read_byte(file, encoding, errors):
    line = bytes(file.readline()) 

    if line:
        print_string(line, 'utf-8', 'strict')
        return read_byte(file, encoding, errors)


def print_string(line, encoding, errors):
    byte = line.strip()
    string = byte.decode(encoding, errors)

    print(byte, '--->', string)

bytes = open('bytes.txt', 'r') 
encoding = 'utf-8'
errors = 'strict'

read_byte(bytes, encoding, errors)

But when I ran it, I got this thing

Capture

So, I looked at my code and made a couple tweaks:-

def read_byte(file, encoding, errors):
    line = file.readline() # Did not change it to bytes data type.

    if line:
        print_string(line, 'utf-8', 'strict')
        return read_byte(file, encoding, errors)


def print_string(line, encoding, errors):
    byte = line.strip()
    string = byte.decode(encoding, errors)

    print(byte, '--->', string)

bytes = open('bytes.txt', 'rb') # Used 'rb' instead of 'r'.
encoding = 'utf-8'
errors = 'strict'

read_byte(bytes, encoding, errors)

It worked, but my output was this (this is just a part of it as it is too large):

b"b’Afrikaans’" —> b’Afrikaans’


b"b’\xc3\xa1\xc5\xa0 \xc3\xa1\xcb\x86\xe2\x80\xba\xc3\xa1\xcb\x86\xc2\xad\xc3\xa1\xc5\xa0\xe2\x80\xba’" —> b’\xc3\xa1\xc5\xa0 \xc3\xa1\xcb\x86\xe2\x80\xba\xc3\xa1\xcb\x86\xc2\xad\xc3\xa1\xc5\xa0\xe2\x80\xba’
b"b’\xc3\x90\xc2\x90\xc3\x92\xc2\xa7\xc3\x91\xc2\x81\xc3\x91\xcb\x86\xc3\x93\xe2\x84\xa2\xc3\x90\xc2\xb0’" —> b’\xc3\x90\xc2\x90\xc3\x92\xc2\xa7\xc3\x91\xc2\x81\xc3\x91\xcb\x86\xc3\x93\xe2\x84\xa2\xc3\x90\xc2\xb0’
b"b’\xc3\x98\xc2\xa7\xc3\x99\xe2\x80\x9e\xc3\x98\xc2\xb9\xc3\x98\xc2\xb1\xc3\x98\xc2\xa8\xc3\x99\xc5\xa0\xc3\x98\xc2\xa9’" —> b’\xc3\x98\xc2\xa7\xc3\x99\xe2\x80\x9e\xc3\x98\xc2\xb9\xc3\x98\xc2\xb1\xc3\x98\xc2\xa8\xc3\x99\xc5\xa0\xc3\x98\xc2\xa9’

b"b’Aragon\xc3\x83\xc2\xa9s’" —> b’Aragon\xc3\x83\xc2\xa9s’

I got something to work but it still doesn’t decode the bytes to give me strings. What am I doing wrong here and how can I fix it???

gpkesley · March 30, 2020, 7:40am

Did you look at this thread? Search…

Ex23 Study Drill 3: Found two solutions, but is one of them better?

CodeNinja · March 31, 2020, 3:07pm

I looked at the thread you mentioned and copied both the solutions to see what their output was and I got the same output as mine. Does that mean I did what we were supposed to do???

CodeNinja · March 31, 2020, 3:43pm

I tried to go and see what was wrong with my code. I thought I get back my encoded file because python thinks it is a string data type as it always put the text in b" " and then takes it as a byte. So, I wrote a program to see if that was the case.

 def read_byte(file):
    text = file.seek(0)
    text1 = file.readline()
    print(type(text1))
    print(text1)


def read_byte1(file):
    text = file.seek(0)
    text1 = file.readline()
    line = bytes(text1, 'utf-8', 'strict')
    print(type(line))
    print(line)

file = open('bytes.txt')

read_byte(file)
read_byte1(file)

And this is the output I got:

PS C:\Users\admin\lpthw> python ex23-2.py                                  
<class 'str'>
b'Afrikaans'

<class 'bytes'>
b"b'Afrikaans'\n"

So, apparantely, Python treats my file as a string rather than bytes as seen in the first function [read_bytes()] but when I convert it to bytes in the second function [read_bytes1()], it does treat them as bytes. But why does it not treat them as bytes the first time?

CodeNinja · March 31, 2020, 4:56pm

Okay, I tried one more thing. I saw my bytes.txt file and removed all the b’ ’ from it. I made some changes to my program(the one which checked the data type).

def read_byte(file):
    text = file.seek(0)
    text1 = file.readline()
    print(type(text1))
    print(text1)


def read_byte1(file):
    text = file.seek(0)
    text1 = file.readline()
    line = bytes(text1, 'utf-8', 'strict')
    print(type(line))
    print(line)

file = open('bytes.txt', 'rt', 1, 'utf-8', 'strict') # I opened it in read + text mode and specified the other parameters.

read_byte(file)
read_byte1(file)

Then, I tried to see again what the data type was and this time, I got this:

PS C:\Users\admin\lpthw> python ex23-2.py                                    
<class 'str'>
Afrikaans

<class 'bytes'>
b'Afrikaans\n'

So, this time the text in the file is a string data type, which I can easily convert to bytes data type. After this, I went to my main script and ran it to see if it worked now and this is what I got (again, only a part of the output):

b'Afrikaans' ---> Afrikaans
b'\\xc3\\xa1\\xc5\\xa0 \\xc3\\xa1\\xcb\\x86\\xe2\\x80\\xba\\xc3\\xa1\\xcb\\x86\\xc2\\xad\\xc3\\xa1\\xc5\\xa0\\xe2\\x80\\xba' ---> \xc3\xa1\xc5\xa0 \xc3\xa1\xcb\x86\xe2\x80\xba\xc3\xa1\xcb\x86\xc2\xad\xc3\xa1\xc5\xa0\xe2\x80\xba
b'\\xc3\\x90\\xc2\\x90\\xc3\\x92\\xc2\\xa7\\xc3\\x91\\xc2\\x81\\xc3\\x91\\xcb\\x86\\xc3\\x93\\xe2\\x84\\xa2\\xc3\\x90\\xc2\\xb0' ---> \xc3\x90\xc2\x90\xc3\x92\xc2\xa7\xc3\x91\xc2\x81\xc3\x91\xcb\x86\xc3\x93\xe2\x84\xa2\xc3\x90\xc2\xb0

Still the same thing, so I went and made the following changes to my code:

def read_byte(file): # Used only 1 variable for file.
    line = file.readline()

    if line:
        print_string(line)
        return read_byte(file)


def print_string(line): # Used only 1 variable for file.
    text = line.strip()
    byte = bytes(text, 'utf-8', 'strict') # Tried changing it to bytes data type.
    string = byte.decode('utf-8', 'strict')

    print(byte, '--->', string)

bytes = open('bytes.txt', 'rt', 1, 'utf-8', 'strict') # Opened the file in read + text mode as in the code above.

read_byte(bytes)

But I received this error:


PS C:\Users\admin\lpthw> python ex23-1.py
Traceback (most recent call last):
  File "ex23-1.py", line 18, in <module>
    read_byte(bytes)
  File "ex23-1.py", line 3, in read_byte
    line = bytes(text, 'utf-8', 'strict')
TypeError: '_io.TextIOWrapper' object is not callable

I don’t know what that means and have completely run out of ideas… I can’t understand what I am doing wrong.

KLPX · April 1, 2020, 8:40pm

Hi, I’m the guy from the two solutions thread and your output is not what you should see.
It should look like this:

Afrikaans <===> b'Afrikaans'
አማርኛ <===> b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'
Аҧсшәа <===> b'\xd0\x90\xd2\xa7\xd1\x81\xd1\x88\xd3\x99\xd0\xb0'
العربية <===> b'\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9'
Aragonés <===> b'Aragon\xc3\xa9s'
Arpetan <===> b'Arpetan'

florian · April 2, 2020, 7:26am

Hey @CodeNinja, don’t worry, it’s a simple name clash.

You are assigning the file handle to a variable bytes, thus shadowing the built-in bytes class that you are trying to call in print_string. Rename that file to byte_file or whatever and it should work.

CodeNinja · April 2, 2020, 11:30am

@KLPX Hi! The output that you showed me is encoding the string on the left and giving us bytes but I wanted my script to have bytes on the left and give us the decoded strings on the right.

@florian Thanks a lot! That worked.

KLPX · April 2, 2020, 1:11pm

@CodeNinja Ok, now I’m confused. Bytes on the left and Strings on the right was the main exercise. The output I showed you is the reversed version from study drill 3. I thought that’s what you’re trying to do, is it not?

CodeNinja · April 2, 2020, 2:18pm

@KLPX Yeah, you’re right. But the main exercise was about encoding strings and what I wanted to do was decode bytes. Perhaps I didn’t phrase that right, sorry!

Basically, I had copied the bytes into another file and wanted to decode them. Your program works with the languages.txt file but not with the bytes.txt file (neither mine). So, when I used it with my file, it gave me the same output.

This is my program:

def read_byte(file): 
    line = file.readline()

    if line:
        print_string(line)
        return read_byte(file)


def print_string(line): 
    text = line.strip()
    byte = bytes(text, 'utf-8', 'strict') # Tried changing it to bytes data type.
    string = byte.decode('utf-8', 'strict')

    print(byte, '--->', string)

file = open('languages.txt', 'rt', 1, 'utf-8', 'strict') 

read_byte(file)

And I got this output (what’s on what side doesn’t matter):

b'Afrikaans' ---> Afrikaans
b'\xc3\xa1\xc5\xa0 \xc3\xa1\xcb\x86\xe2\x80\xba\xc3\xa1\xcb\x86\xc2\xad\xc3\xa1\xc5\xa0\xe2\x80\xba' ---> áŠ áˆ›áˆáŠ›
b'\xc3\x90\xc2\x90\xc3\x92\xc2\xa7\xc3\x91\xc2\x81\xc3\x91\xcb\x86\xc3\x93\xe2\x84\xa2\xc3\x90\xc2\xb0' ---> ÐÒ§ÑÑˆÓ™Ð°

But when I used the same program with bytes.txt, I got this thing:

b'Afrikaans' ---> Afrikaans
b'\\xc3\\xa1\\xc5\\xa0 \\xc3\\xa1\\xcb\\x86\\xe2\\x80\\xba\\xc3\\xa1\\xcb\\x86\\xc2\\xad\\xc3\\xa1\\xc5\\xa0\\xe2\\x80\\xba' ---> \xc3\xa1\xc5\xa0 \xc3\xa1\xcb\x86\xe2\x80\xba\xc3\xa1\xcb\x86\xc2\xad\xc3\xa1\xc5\xa0\xe2\x80\xba
b'\\xc3\\x90\\xc2\\x90\\xc3\\x92\\xc2\\xa7\\xc3\\x91\\xc2\\x81\\xc3\\x91\\xcb\\x86\\xc3\\x93\\xe2\\x84\\xa2\\xc3\\x90\\xc2\\xb0' ---> \xc3\x90\xc2\x90\xc3\x92\xc2\xa7\xc3\x91\xc2\x81\xc3\x91\xcb\x86\xc3\x93\xe2\x84\xa2\xc3\x90\xc2\xb0

I don’t really know why Python does that. But thanks a lot for the reply!

KLPX · April 2, 2020, 4:30pm

Got it. Now I’m curious too so I tried it myself and got a similar result:

The raw line is: b'Afrikaans\n'
The decoded line is: Afrikaans

The raw line is: b'\\xe1\\x8a\\xa0\\xe1\\x88\\x9b\\xe1\\x88\\xad\\xe1\\x8a\\x9\n'
The decoded line is: \xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9

The raw line is: b'\\xd0\\x90\\xd2\\xa7\\xd1\\x81\\xd1\\x88\\xd3\\x99\\xd0\\xb0\n'
The decoded line is: \xd0\x90\xd2\xa7\xd1\x81\xd1\x88\xd3\x99\xd0\xb0

The raw line is: b'\\xd8\\xa7\\xd9\\x84\\xd8\\xb9\\xd8\\xb1\\xd8\\xa8\\xd9\\x8a\\xd8\\xa9\n'
The decoded line is: \xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9

The raw line is: b'Aragon\\xc3\\xa9s\n'
The decoded line is: Aragon\xc3\xa9s

The raw line is: b'Arpetan\n'
The decoded line is: Arpetan

The raw line is: b'Az\\xc9\\x99rbaycanca\n'
The decoded line is: Az\xc9\x99rbaycanca

The raw line is: b'Bamanankan'
The decoded line is: Bamanankan

As you can see the decode doesn’t do what I expected it to do. It just removed the byte-string (b’ ').

UPDATE: I think I get what the problem might be: when I read the lines the program escapes all \ so I get \\ which can’t be properly decoded. At least that’s what I think at the moment. You’ve figured that out already I guess but I was really blind and did not see it. Now to figure out why it does that.

KLPX · April 2, 2020, 6:05pm

I think I figured it out (thanks to stack overflow).

This is my source file (bytes.txt):

\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82
Afrikaans
\xd0\x90\xd2\xa7\xd1\x81\xd1\x88\xd3\x99\xd0\xb0
\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9
Aragon\xc3\xa9s

This is my output:

τoρνoς <===> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'
Afrikaans <===> b'Afrikaans'
Аҧсшәа <===> b'\xd0\x90\xd2\xa7\xd1\x81\xd1\x88\xd3\x99\xd0\xb0'
العربية <===> b'\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9'
Aragonés <===> b'Aragon\xc3\xa9s'

This is my code (quick and dirty):

def main(language_file):
    line = language_file.readline().encode('raw_unicode_escape')

    if line:
        print_line(line)
        return main(language_file)

def print_line(line):
    lang = line.strip()
    mystring = lang.decode()
    mybytes = mystring.encode('utf-8')
    
    print(mystring, "<===>", mybytes)

languages = open('bytes.txt', 'r', encoding='unicode_escape')

main(languages)

languages.close()

The trick is the unicode_escape and raw_unicode_escape.
Here is the link to stack overflow: https://stackoverflow.com/questions/36970312/read-xhh-escapes-from-file-as-raw-binary-in-python/36970420#36970420

I’m not sure I completely understand it, maybe @zedshaw or someone else can help us out here .

CodeNinja · April 3, 2020, 11:07am

@KLPX Yup, that worked! Well, I now need to know why unicode_escape and raw_unicode_escape work but not UTF-8… Thanks a lot!!! I will see if I can find anything.

johnbriton · January 26, 2022, 12:19am

Hi, I recently encounter the same problem in decoding bytes from file to strings. I found that there are two parameters that are very important: unicode_escape and raw_unicode_escape. I will explain their functions in my codes. The structure of my codes are the same as Zed’s in ex23:

import sys

script, input_encoding, error = sys.argv


def main(language_file, encoding, errors):
    line0 = language_file.readline()
    line = line0.encode("raw_unicode_escape")
    # encode("raw_unicode_escape") is to encode a str like "\xxx\yyy\zzz"
    # to a bytes looks same as b"\xxx\yyy\zzz", where b"" is just a symbol for bytes
    # However, in order to have a clean "\xxx-like" str
    # I first need to prevent python from automatically encode \ to \\ when reading files.

    if line:

        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)


def print_line(line, encoding, errors):
    cooked_lang = line.decode()

    print(cooked_lang)


languages = open("raw_bytes.txt", "r", encoding = "unicode_escape")
# This line is to open the file contains things like "\xxx\yyy\zzz"
# And make it ready for computer to read
# Then encode it with a encoding system called <unicode_escape>
# Which prevent python from automatically encode \ to \\ in strings.

main(languages, input_encoding, error)

Please forgive me to use my own word trying to explain these two parameters. I hope that it may help.