Hello, I am not able to do the third study drill of exercise 23. It asks us to rewrite this script using b’ ’ bytes instead of UTF-8 strings, reversing this script. What is meant by that and how to do it?
Thanks!
Hello, I am not able to do the third study drill of exercise 23. It asks us to rewrite this script using b’ ’ bytes instead of UTF-8 strings, reversing this script. What is meant by that and how to do it?
Thanks!
Have a read of the book explaining how the code works in this chapter. Try adding comments to your code so you can explain what each line is doing.
Then have a think about what you might need to do so you could use the same script with amendments to encode from a ‘bytes’ source.
Okay, understood. I will do it and update my progress here.
Okay… So after countless hours of trial and error, endless frustration, and surfing the web, I found another thread on this forum with the same questions ([Solved?] LPTHW3 Exercise 23 Extra Credit 3 -- Reverse Script - #5 by namninja). I read it all and came up with this:-
NOTE: The bytes.txt file I use here is just the encoded form of the text in the file languages.txt in UTF-8
def read_byte(file, encoding, errors):
line = bytes(file.readline())
if line:
print_string(line, 'utf-8', 'strict')
return read_byte(file, encoding, errors)
def print_string(line, encoding, errors):
byte = line.strip()
string = byte.decode(encoding, errors)
print(byte, '--->', string)
bytes = open('bytes.txt', 'r')
encoding = 'utf-8'
errors = 'strict'
read_byte(bytes, encoding, errors)
But when I ran it, I got this thing
So, I looked at my code and made a couple tweaks:-
def read_byte(file, encoding, errors):
line = file.readline() # Did not change it to bytes data type.
if line:
print_string(line, 'utf-8', 'strict')
return read_byte(file, encoding, errors)
def print_string(line, encoding, errors):
byte = line.strip()
string = byte.decode(encoding, errors)
print(byte, '--->', string)
bytes = open('bytes.txt', 'rb') # Used 'rb' instead of 'r'.
encoding = 'utf-8'
errors = 'strict'
read_byte(bytes, encoding, errors)
It worked, but my output was this (this is just a part of it as it is too large):
b"b’Afrikaans’" —> b’Afrikaans’
b"b’\xc3\xa1\xc5\xa0 \xc3\xa1\xcb\x86\xe2\x80\xba\xc3\xa1\xcb\x86\xc2\xad\xc3\xa1\xc5\xa0\xe2\x80\xba’" —> b’\xc3\xa1\xc5\xa0 \xc3\xa1\xcb\x86\xe2\x80\xba\xc3\xa1\xcb\x86\xc2\xad\xc3\xa1\xc5\xa0\xe2\x80\xba’
b"b’\xc3\x90\xc2\x90\xc3\x92\xc2\xa7\xc3\x91\xc2\x81\xc3\x91\xcb\x86\xc3\x93\xe2\x84\xa2\xc3\x90\xc2\xb0’" —> b’\xc3\x90\xc2\x90\xc3\x92\xc2\xa7\xc3\x91\xc2\x81\xc3\x91\xcb\x86\xc3\x93\xe2\x84\xa2\xc3\x90\xc2\xb0’
b"b’\xc3\x98\xc2\xa7\xc3\x99\xe2\x80\x9e\xc3\x98\xc2\xb9\xc3\x98\xc2\xb1\xc3\x98\xc2\xa8\xc3\x99\xc5\xa0\xc3\x98\xc2\xa9’" —> b’\xc3\x98\xc2\xa7\xc3\x99\xe2\x80\x9e\xc3\x98\xc2\xb9\xc3\x98\xc2\xb1\xc3\x98\xc2\xa8\xc3\x99\xc5\xa0\xc3\x98\xc2\xa9’
b"b’Aragon\xc3\x83\xc2\xa9s’" —> b’Aragon\xc3\x83\xc2\xa9s’
I got something to work but it still doesn’t decode the bytes to give me strings. What am I doing wrong here and how can I fix it???
Did you look at this thread? Search…
Ex23 Study Drill 3: Found two solutions, but is one of them better?
I looked at the thread you mentioned and copied both the solutions to see what their output was and I got the same output as mine. Does that mean I did what we were supposed to do???
I tried to go and see what was wrong with my code. I thought I get back my encoded file because python thinks it is a string data type as it always put the text in b" " and then takes it as a byte. So, I wrote a program to see if that was the case.
def read_byte(file):
text = file.seek(0)
text1 = file.readline()
print(type(text1))
print(text1)
def read_byte1(file):
text = file.seek(0)
text1 = file.readline()
line = bytes(text1, 'utf-8', 'strict')
print(type(line))
print(line)
file = open('bytes.txt')
read_byte(file)
read_byte1(file)
And this is the output I got:
PS C:\Users\admin\lpthw> python ex23-2.py
<class 'str'>
b'Afrikaans'
<class 'bytes'>
b"b'Afrikaans'\n"
So, apparantely, Python treats my file as a string rather than bytes as seen in the first function [read_bytes()] but when I convert it to bytes in the second function [read_bytes1()], it does treat them as bytes. But why does it not treat them as bytes the first time?
Okay, I tried one more thing. I saw my bytes.txt file and removed all the b’ ’ from it. I made some changes to my program(the one which checked the data type).
def read_byte(file):
text = file.seek(0)
text1 = file.readline()
print(type(text1))
print(text1)
def read_byte1(file):
text = file.seek(0)
text1 = file.readline()
line = bytes(text1, 'utf-8', 'strict')
print(type(line))
print(line)
file = open('bytes.txt', 'rt', 1, 'utf-8', 'strict') # I opened it in read + text mode and specified the other parameters.
read_byte(file)
read_byte1(file)
Then, I tried to see again what the data type was and this time, I got this:
PS C:\Users\admin\lpthw> python ex23-2.py
<class 'str'>
Afrikaans
<class 'bytes'>
b'Afrikaans\n'
So, this time the text in the file is a string data type, which I can easily convert to bytes data type. After this, I went to my main script and ran it to see if it worked now and this is what I got (again, only a part of the output):
b'Afrikaans' ---> Afrikaans
b'\\xc3\\xa1\\xc5\\xa0 \\xc3\\xa1\\xcb\\x86\\xe2\\x80\\xba\\xc3\\xa1\\xcb\\x86\\xc2\\xad\\xc3\\xa1\\xc5\\xa0\\xe2\\x80\\xba' ---> \xc3\xa1\xc5\xa0 \xc3\xa1\xcb\x86\xe2\x80\xba\xc3\xa1\xcb\x86\xc2\xad\xc3\xa1\xc5\xa0\xe2\x80\xba
b'\\xc3\\x90\\xc2\\x90\\xc3\\x92\\xc2\\xa7\\xc3\\x91\\xc2\\x81\\xc3\\x91\\xcb\\x86\\xc3\\x93\\xe2\\x84\\xa2\\xc3\\x90\\xc2\\xb0' ---> \xc3\x90\xc2\x90\xc3\x92\xc2\xa7\xc3\x91\xc2\x81\xc3\x91\xcb\x86\xc3\x93\xe2\x84\xa2\xc3\x90\xc2\xb0
Still the same thing, so I went and made the following changes to my code:
def read_byte(file): # Used only 1 variable for file.
line = file.readline()
if line:
print_string(line)
return read_byte(file)
def print_string(line): # Used only 1 variable for file.
text = line.strip()
byte = bytes(text, 'utf-8', 'strict') # Tried changing it to bytes data type.
string = byte.decode('utf-8', 'strict')
print(byte, '--->', string)
bytes = open('bytes.txt', 'rt', 1, 'utf-8', 'strict') # Opened the file in read + text mode as in the code above.
read_byte(bytes)
But I received this error:
PS C:\Users\admin\lpthw> python ex23-1.py
Traceback (most recent call last):
File "ex23-1.py", line 18, in <module>
read_byte(bytes)
File "ex23-1.py", line 3, in read_byte
line = bytes(text, 'utf-8', 'strict')
TypeError: '_io.TextIOWrapper' object is not callable
I don’t know what that means and have completely run out of ideas… I can’t understand what I am doing wrong.
Hi, I’m the guy from the two solutions thread and your output is not what you should see.
It should look like this:
Afrikaans <===> b'Afrikaans'
አማርኛ <===> b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'
Аҧсшәа <===> b'\xd0\x90\xd2\xa7\xd1\x81\xd1\x88\xd3\x99\xd0\xb0'
العربية <===> b'\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9'
Aragonés <===> b'Aragon\xc3\xa9s'
Arpetan <===> b'Arpetan'
Hey @CodeNinja, don’t worry, it’s a simple name clash.
You are assigning the file handle to a variable bytes
, thus shadowing the built-in bytes
class that you are trying to call in print_string
. Rename that file to byte_file
or whatever and it should work.
@KLPX Hi! The output that you showed me is encoding the string on the left and giving us bytes but I wanted my script to have bytes on the left and give us the decoded strings on the right.
@florian Thanks a lot! That worked.
@CodeNinja Ok, now I’m confused. Bytes on the left and Strings on the right was the main exercise. The output I showed you is the reversed version from study drill 3. I thought that’s what you’re trying to do, is it not?
@KLPX Yeah, you’re right. But the main exercise was about encoding strings and what I wanted to do was decode bytes. Perhaps I didn’t phrase that right, sorry!
Basically, I had copied the bytes into another file and wanted to decode them. Your program works with the languages.txt file but not with the bytes.txt file (neither mine). So, when I used it with my file, it gave me the same output.
This is my program:
def read_byte(file):
line = file.readline()
if line:
print_string(line)
return read_byte(file)
def print_string(line):
text = line.strip()
byte = bytes(text, 'utf-8', 'strict') # Tried changing it to bytes data type.
string = byte.decode('utf-8', 'strict')
print(byte, '--->', string)
file = open('languages.txt', 'rt', 1, 'utf-8', 'strict')
read_byte(file)
And I got this output (what’s on what side doesn’t matter):
b'Afrikaans' ---> Afrikaans
b'\xc3\xa1\xc5\xa0 \xc3\xa1\xcb\x86\xe2\x80\xba\xc3\xa1\xcb\x86\xc2\xad\xc3\xa1\xc5\xa0\xe2\x80\xba' ---> አማáˆáŠ›
b'\xc3\x90\xc2\x90\xc3\x92\xc2\xa7\xc3\x91\xc2\x81\xc3\x91\xcb\x86\xc3\x93\xe2\x84\xa2\xc3\x90\xc2\xb0' ---> ÐÒ§Ñшәа
But when I used the same program with bytes.txt, I got this thing:
b'Afrikaans' ---> Afrikaans
b'\\xc3\\xa1\\xc5\\xa0 \\xc3\\xa1\\xcb\\x86\\xe2\\x80\\xba\\xc3\\xa1\\xcb\\x86\\xc2\\xad\\xc3\\xa1\\xc5\\xa0\\xe2\\x80\\xba' ---> \xc3\xa1\xc5\xa0 \xc3\xa1\xcb\x86\xe2\x80\xba\xc3\xa1\xcb\x86\xc2\xad\xc3\xa1\xc5\xa0\xe2\x80\xba
b'\\xc3\\x90\\xc2\\x90\\xc3\\x92\\xc2\\xa7\\xc3\\x91\\xc2\\x81\\xc3\\x91\\xcb\\x86\\xc3\\x93\\xe2\\x84\\xa2\\xc3\\x90\\xc2\\xb0' ---> \xc3\x90\xc2\x90\xc3\x92\xc2\xa7\xc3\x91\xc2\x81\xc3\x91\xcb\x86\xc3\x93\xe2\x84\xa2\xc3\x90\xc2\xb0
I don’t really know why Python does that. But thanks a lot for the reply!
Got it. Now I’m curious too so I tried it myself and got a similar result:
The raw line is: b'Afrikaans\n'
The decoded line is: Afrikaans
The raw line is: b'\\xe1\\x8a\\xa0\\xe1\\x88\\x9b\\xe1\\x88\\xad\\xe1\\x8a\\x9\n'
The decoded line is: \xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9
The raw line is: b'\\xd0\\x90\\xd2\\xa7\\xd1\\x81\\xd1\\x88\\xd3\\x99\\xd0\\xb0\n'
The decoded line is: \xd0\x90\xd2\xa7\xd1\x81\xd1\x88\xd3\x99\xd0\xb0
The raw line is: b'\\xd8\\xa7\\xd9\\x84\\xd8\\xb9\\xd8\\xb1\\xd8\\xa8\\xd9\\x8a\\xd8\\xa9\n'
The decoded line is: \xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9
The raw line is: b'Aragon\\xc3\\xa9s\n'
The decoded line is: Aragon\xc3\xa9s
The raw line is: b'Arpetan\n'
The decoded line is: Arpetan
The raw line is: b'Az\\xc9\\x99rbaycanca\n'
The decoded line is: Az\xc9\x99rbaycanca
The raw line is: b'Bamanankan'
The decoded line is: Bamanankan
As you can see the decode doesn’t do what I expected it to do. It just removed the byte-string (b’ ').
UPDATE: I think I get what the problem might be: when I read the lines the program escapes all \ so I get \\ which can’t be properly decoded. At least that’s what I think at the moment. You’ve figured that out already I guess but I was really blind and did not see it. Now to figure out why it does that.
I think I figured it out (thanks to stack overflow).
This is my source file (bytes.txt):
\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82
Afrikaans
\xd0\x90\xd2\xa7\xd1\x81\xd1\x88\xd3\x99\xd0\xb0
\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9
Aragon\xc3\xa9s
This is my output:
τoρνoς <===> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'
Afrikaans <===> b'Afrikaans'
Аҧсшәа <===> b'\xd0\x90\xd2\xa7\xd1\x81\xd1\x88\xd3\x99\xd0\xb0'
العربية <===> b'\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9'
Aragonés <===> b'Aragon\xc3\xa9s'
This is my code (quick and dirty):
def main(language_file):
line = language_file.readline().encode('raw_unicode_escape')
if line:
print_line(line)
return main(language_file)
def print_line(line):
lang = line.strip()
mystring = lang.decode()
mybytes = mystring.encode('utf-8')
print(mystring, "<===>", mybytes)
languages = open('bytes.txt', 'r', encoding='unicode_escape')
main(languages)
languages.close()
The trick is the unicode_escape
and raw_unicode_escape
.
Here is the link to stack overflow: https://stackoverflow.com/questions/36970312/read-xhh-escapes-from-file-as-raw-binary-in-python/36970420#36970420
I’m not sure I completely understand it, maybe @zedshaw or someone else can help us out here .
@KLPX Yup, that worked! Well, I now need to know why unicode_escape
and raw_unicode_escape
work but not UTF-8… Thanks a lot!!! I will see if I can find anything.
Hi, I recently encounter the same problem in decoding bytes from file to strings. I found that there are two parameters that are very important: unicode_escape
and raw_unicode_escape
. I will explain their functions in my codes. The structure of my codes are the same as Zed’s in ex23:
import sys
script, input_encoding, error = sys.argv
def main(language_file, encoding, errors):
line0 = language_file.readline()
line = line0.encode("raw_unicode_escape")
# encode("raw_unicode_escape") is to encode a str like "\xxx\yyy\zzz"
# to a bytes looks same as b"\xxx\yyy\zzz", where b"" is just a symbol for bytes
# However, in order to have a clean "\xxx-like" str
# I first need to prevent python from automatically encode \ to \\ when reading files.
if line:
print_line(line, encoding, errors)
return main(language_file, encoding, errors)
def print_line(line, encoding, errors):
cooked_lang = line.decode()
print(cooked_lang)
languages = open("raw_bytes.txt", "r", encoding = "unicode_escape")
# This line is to open the file contains things like "\xxx\yyy\zzz"
# And make it ready for computer to read
# Then encode it with a encoding system called <unicode_escape>
# Which prevent python from automatically encode \ to \\ in strings.
main(languages, input_encoding, error)
Please forgive me to use my own word trying to explain these two parameters. I hope that it may help.