Ex 23, Deep Dive challenge #3: converting text strings to binary without adding escape characters

dswalen · August 6, 2020, 4:08am

I enjoyed Ex 23. I got the script pretty much right away (and commented the hell out of it to ram it home) so I moved on to the extra credit stuff. I hit a brick wall with #3 and after hours of trying to research this online, I’ve thrown in the towel and come here.

I know you can reverse the process by importing the languages.txt file as binary and then go do your encode/decode that way. But I read the exercise literally so I considered that method to be a bit of a cheat and that what reversing the process really should be was having a text file full of the binary string outputs of languages.txt and starting there.

It was relatively easy to write a small script that took languages.txt and did the encode of the languages.txt file lines and spit them out to a file which I called binaries.txt

Binaries.txt looked like this when you opened it… (this is just a snippet)

b'Afrikaans'
b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'
b'\xd0\x90\xd2\xa7\xd1\x81\xd1\x88\xd3\x99\xd0\xb0'
b'\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9'
b'Aragon\xc3\xa9s'
b'Arpetan'
b'Az\xc9\x99rbaycanca'
b'Bamanankan'
b'\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbe'

The hard part, and what has caused me to draw a blank despite numerous attempts at solving it, is getting each line to format correctly as binary.

The problem is if I do this…

binaries = open("binaries.txt", encoding="utf-8")

The opened file’s lines will come in as strings. At that point, if you try to encode a line to bytes, a whole bunch of escape characters will be added, and instead of seeing this

b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'

I end up with this…

b"b'\\xe1\\x8a\\xa0\\xe1\\x88\\x9b\\xe1\\x88\\xad\\xe1\\x8a\\x9b'"

And of course there’s no way to decode that into its expected utf-8 string text counterpart.

The same thing happens if you try to open the file with “r+b”

binaries = open("binaries.txt", "r+b")

you’ll get a bunch of escape characters added to your line.

Now, after researching this, I totally understand what’s going on here and why it’s doing that and it makes perfect sense.

But it doesn’t get me any closer to solving my problem. Surely there must be a way to read from a file what are essentially binary strings (albeit in text string format) into Python and flip them over to binary strings without corrupting the line with a bunch of escape characters. But for the life of me I haven’t found it and I’m all out of ideas (and have hit the red vino to boot).

Maybe I’m overthinking this and going with the first option at the top is the easiest way out (obviously) but now that I’ve gotten this far and spent this much time on it I really want to know the solution or at least to understand what it would be.

So if anyone can throw me a clue that would be great. Meantime I’m off to Ex24 because I’ve done everything I can with Ex23 now.

florian · August 6, 2020, 8:52am

Kudos for keeping at it!

There is, and as far as I can tell, you’re doing exactly that.

Only, your binaries.txt does not contain what you call “essentially binary strings” of the languages file: it contains string representations of Python’s bytes objects. When you read it back, \xe1 is read as four separate characters (a backslash, x, e and 1), not as the single byte that the escape sequence refers to, so the backslash gets escaped.
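
A quick way to see that difference, as a minimal sketch (the variable names raw and as_text are made up for illustration):

```python
# The real bytes for the second language in the list: 12 bytes of UTF-8.
raw = b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'

# What '{}'.format(raw) produces: the repr of the bytes object, as a plain str.
as_text = '{}'.format(raw)

print(len(raw))        # 12
print(len(as_text))    # 51: b, two quotes, and 4 characters for each of the 12 bytes
print(as_text[2:6])    # \xe1 (four separate characters)

# Encoding that text yields the bytes of '\', 'x', 'e', '1', and so on.
# The repr of the result shows every backslash doubled.
print(as_text.encode('utf-8'))
```

The last line prints the doubled-backslash form quoted in the first post, which is why it can no longer be decoded back to the language name.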

Try creating a new binary file: Make sure you’re opening the file in binary mode and write bytes objects to it. If you open the file, you should not see each line starting with b'....
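
A minimal sketch of that suggestion, assuming UTF-8 and a newline between entries. The file name binaries.bin and the three sample languages are stand-ins, not part of the exercise:

```python
import os
import tempfile

# Stand-in for a few lines of languages.txt (assumed sample data).
languages = ['Afrikaans', 'Aragonés', 'Azərbaycanca']

# Hypothetical file name; a temp directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), 'binaries.bin')

# 'wb' accepts bytes objects only, so nothing is turned into b'...' text.
with open(path, 'wb') as out:
    for lang in languages:
        out.write(lang.encode('utf-8') + b'\n')

# 'rb' hands each line back as a bytes object, ready for decode().
decoded = []
with open(path, 'rb') as binaries:
    for raw_line in binaries:
        raw_bytes = raw_line.strip()
        print(raw_bytes)
        decoded.append(raw_bytes.decode('utf-8'))

print(decoded == languages)   # True
```

The newline separator is safe for UTF-8, where multi-byte characters never contain the byte 0x0A. Other encodings such as UTF-16 would need a different way to mark line ends.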

Let us know if this helps!

florian · August 6, 2020, 9:12am

Oh, wait, did I just tell you what you already know? Shoot.

What you could do is read the whole file as strings and use exec. Try this in the REPL:

>>> s = "b'\\xe1\\x8a'"    # string representation of a bytes object
>>> stmt = "b = " + s      # the assignment statement
>>> exec(stmt)             # execute it
>>> b                      # now b is the bytes object
b'\xe1\x8a'
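
As an alternative sketch to exec, the standard library’s ast.literal_eval parses a string containing a Python literal and returns the value directly. It is swapped in here as a suggestion, not something the exercise itself uses:

```python
import ast

s = "b'\\xe1\\x8a'"        # string representation of a bytes object
b = ast.literal_eval(s)    # accepts literals only, runs no other code

print(b)                   # b'\xe1\x8a'
print(type(b))             # <class 'bytes'>
```

Because literal_eval returns the value, there is no assignment statement to build and no variable that appears as a side effect.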

dswalen · August 6, 2020, 4:31pm

That alone isn’t going to work for one reason…your code has your byte object as a string to start out with. Mine will be bytes because the binaries.txt file was opened with “r+b” which is what introduced those escape sequences in the first place.

So when your code gets here…

>>> stmt = "b = " + s      # the assignment statement

python throws an error because you are trying to combine "b = ", a string, with s, bytes. s first needs to be converted back to string before you can add it to "b = " and assign the addition to stmt. I’m trying to research how to do that right now, if it’s feasible.

florian · August 6, 2020, 5:26pm

Yes but what is stopping you from opening the file in “normal” mode so you get strings? Wouldn’t that be the logical thing to do? You have a file that contains string data, read it as such and convert if necessary. Or write the data to a binary file and read it as binary.

If you mix things up, it’ll be awkward either way.

dswalen · August 6, 2020, 5:59pm

Yes, I realized that so I did try to open the file as string. But it’s still not working.

This is the code that generates the binaries.txt file. Just so you can see where my starting point is with that binaries.txt file

# this script will take languages.txt and convert the lines to bytes objects
# and save them as string representations to a file

# import the sys module
import sys

script, input_encoding, error = sys.argv

def main(language_file, encoding, errors):

    # this will read the current line in the file
    # if there is no line - if EOF is encountered immediately
    # an empty string is returned
    line = language_file.readline()

    # if there is a line in the file and EOF has not been encountered
    if line:

        # line.strip() is called here to ensure that only the string is used
        # and any whitespace characters are removed
        next_lang = line.strip()

        # take the line in the languages.txt file and encode it to a bytes string
        raw_bytes = next_lang.encode(encoding, errors=errors)

        # write the bytes string line to binaries.txt
        binaries_file.write('{}\n'.format(raw_bytes))

        # at the end of the function, call it again while lines remain in
        # the languages.txt file
        return main(language_file, encoding, errors)

    else:

        # close the binaries.txt file
        binaries_file.close()

# defines languages.txt as the file to be opened and specifies the
# encoding Python will use as utf-8
languages = open("languages.txt", encoding="utf-8")

# opens/creates the binaries.txt file 
binaries_file = open("binaries.txt", 'w', encoding="utf-8")

main(languages, input_encoding, error)

So from there I go to the next script. What follows is just the part having to do with reading the lines as strings and trying to convert to bytes. I removed the rest of the code to focus on this part as this is what’s blocking me.

# this script will take binaries.txt created in ex23a.py and convert
# each binary line to text strings and print both to screen with the
# text strings coming first

import sys
script, input_encoding, error = sys.argv

def main(binaries_file, encoding, errors):

    line = binaries_file.readline()

    # there are two readline() commands here because the first line
    # in binaries.txt is "b'Afrikaans'" which isn't as useful for test purposes
    # since you can already read it so a second readline() is called to get to
    # the next line of data which is more useful
    line = binaries_file.readline()

    # strip out any whitespace characters
    stripped = line.strip()
    stmt = "b = " + stripped

    # print out stmt to see what we have
    # also print out the type of stmt to see that it's a string
    print(stmt)
    print(type(stmt))

    # you can't just exec(stmt) without assigning the result to something
    # so I assign it to next_bin
    next_bin = exec(stmt)

    # print out next_bin to see what we have
    # also print out the type of next_bin to see if it's bytes
    print(next_bin)
    print(type(next_bin))

# defines binaries.txt as the file to be opened in text mode ('r')
binaries = open("binaries.txt", 'r', encoding="utf-8")

main(binaries, input_encoding, error)

When I run this code, Powershell gives me this

PS C:\Users\dswal\AppData\Local\Programs\Python\Python36\Doug> python test.py utf-8 strict
b = b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'
<class 'str'>
None
<class 'NoneType'>
PS C:\Users\dswal\AppData\Local\Programs\Python\Python36\Doug>

The code seems okay until it gets here

next_bin = exec(stmt)

I can’t just run exec(stmt) on its own, because then no variable holds the result and there’s no way to move forward, so I assigned it to next_bin. But the result appears to be junk.

Even when I tried to do this through Powershell line by line it still threw an error. And I don’t understand why Python couldn’t print line properly since it was a string. The only thing I can think of is I pasted that line into Powershell rather than typing it out and maybe that introduced something?

PS C:\Users\dswal\AppData\Local\Programs\Python\Python36\Doug> python
Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> line = "b'\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba\xd0\xb0\xd1\x8f'"
>>> print(line)
b'ÐÐµÐ»Ð°ÑÑÑÐºÐ°Ñ'
>>> print("b'\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba\xd0\xb0\xd1\x8f'")
b'ÐÐµÐ»Ð°ÑÑÑÐºÐ°Ñ'
>>> print(type(line))
<class 'str'>
>>> stripped = line.strip()
>>> print(stripped)
b'ÐÐµÐ»Ð°ÑÑÑÐºÐ°Ñ'
>>> print(type(stripped))
<class 'str'>
>>> stmt = "b = " + stripped
>>> print(stmt)
b = b'ÐÐµÐ»Ð°ÑÑÑÐºÐ°Ñ'
>>> print(type(stmt))
<class 'str'>
>>> next_bin = exec(stmt)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
>>> exec(stmt)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
>>> print(stmt)
b = b'ÐÐµÐ»Ð°ÑÑÑÐºÐ°Ñ'
>>> exec(stmt)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
>>>
KeyboardInterrupt
>>> ^Z

As of now I’m fresh out of ideas.

florian · August 6, 2020, 7:19pm

Yeah well, exec is often kind of hacky, especially if you want to craft statements dynamically.
But: if you do it like this, you don’t need next_bin, the bytes object will be stored in the variable b.
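
A minimal sketch of why next_bin came out as None. exec always returns None, and the assignment it runs lands in a namespace instead. Inside a function, names created by exec are not reliably visible as local variables, so this sketch passes an explicit dictionary; the name namespace is made up for illustration:

```python
# The same kind of line the script reads from binaries.txt. The raw string
# keeps the backslashes literal, as they are in the file.
stripped = r"b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'"
stmt = "b = " + stripped

namespace = {}                   # scratch namespace for the executed code
result = exec(stmt, namespace)   # runs the assignment inside that dict

print(result)                    # None: exec never returns a value
next_bin = namespace["b"]        # the bytes object created by the statement
print(type(next_bin))            # <class 'bytes'>
print(next_bin)
```

From there, next_bin.decode("utf-8") gives back the language name.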

The problem in your PowerShell dump is that each escape sequence in the string literal was turned into a single character, and those characters are outside the ASCII range, so they can’t appear in a bytes literal. That’s why they show up as escape sequences in the original bytes objects to begin with.
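
A short sketch of the difference between typing the line as a normal string literal and reading it from the file. It uses ast.literal_eval in place of exec so the result can be printed directly, and the names pasted and from_file are made up:

```python
import ast

# Typed or pasted as a normal literal: Python turns \xd0 into the single
# character U+00D0 before the string even exists.
pasted = "b'\xd0\x91'"

# As a raw literal the backslashes survive, which matches the file contents.
from_file = r"b'\xd0\x91'"

print(len(pasted))      # 5: b, quote, two non-ASCII characters, quote
print(len(from_file))   # 11: b, quote, eight escape characters, quote

try:
    ast.literal_eval(pasted)
except SyntaxError as err:
    print("SyntaxError:", err.msg)

print(ast.literal_eval(from_file))   # b'\xd0\x91'
```

So the pasted REPL line was not the same string that readline() returns from binaries.txt, which explains why the script got further than the REPL session did.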

This is a bit of a fix that I didn’t really think about earlier. Off the top of my head I can’t think of a good way around this. There might be some nifty string magic you can do… but honestly, I think if you really need bytes objects, then you’re better off saving the file accordingly to begin with, thus avoiding the whole conversion roundtrip.
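
One possible bit of string magic, as a hedged sketch. It assumes every line has exactly the b'...' form shown in binaries.txt (single quotes, no quote characters inside the data), and it chains two standard codecs instead of using exec:

```python
# One line of binaries.txt, as read in text mode (backslashes are literal).
line = r"b'Aragon\xc3\xa9s'"

inner = line[2:-1]   # drop the leading b' and the trailing '

# 'unicode_escape' turns each \xNN into the character with that code point;
# 'latin-1' then maps code points 0-255 straight back to single bytes.
raw_bytes = inner.encode("ascii").decode("unicode_escape").encode("latin-1")

print(raw_bytes)                   # b'Aragon\xc3\xa9s'
text = raw_bytes.decode("utf-8")
print(text == "Aragon\u00e9s")     # True
```

The slicing is the fragile part: a line whose repr uses double quotes or contains an escaped quote would need extra handling.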

dswalen · August 6, 2020, 8:24pm

Probably should save the file that way. But I think I’m probably too burned out on what has become too much of a theoretical exercise at this point to want to fix it. On to ex24…

florian · August 6, 2020, 8:40pm

Wise move.

I think the only time I’ve actually used bytes objects in Python was a few years ago in this exercise. After all, high-level languages like Python exist just so we don’t need to wrestle with things like byte arrays.

zedshaw · August 10, 2020, 1:40pm

Yeah you kind of totally destroyed this exercise. Good learning experience but there’s more further down.