LPTHW - Exercise 23 (UTF-8 Error)

Aahrvenos · February 23, 2019, 7:01pm

For anyone getting errors similar to the one below:

python ex23.py utf-8 strict
Traceback (most recent call last):
  File "ex23.py", line 23, in <module>
    main(languages, input_encoding, error)
  File "ex23.py", line 6, in main
    line = language_file.readline()
  File "C:\Users\Aahrvenos\AppData\Local\Programs\Python\Python37-32\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 11-12: invalid continuation byte

The issue is that you have to actually download the text file (Ctrl-S on the web page the PDF directs you to). I had copied the text, pasted it into Notepad, and saved it as languages.txt. Not sure why that doesn’t work, but yeah.

zedshaw · February 24, 2019, 5:07pm

Ahhhhhhhh yes that would most likely save it as your computer’s encoding, which would really only work in the US and UK. Everyone else would be in trouble.
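
A minimal sketch of why that can produce the error above, assuming the editor saved the file in a legacy Windows code page such as cp1252. The sample text and the try_decode helper are hypothetical, not taken from languages.txt or ex23.py:

```python
# Hypothetical sample line; the real languages.txt content is not shown here.
sample = "Español"

# Bytes an editor might write when saving in a legacy code page (assumed: cp1252).
cp1252_bytes = sample.encode("cp1252")

# Bytes a proper UTF-8 download would contain.
utf8_bytes = sample.encode("utf-8")


def try_decode(raw, encoding="utf-8"):
    """Return the decoded text, or a short error string if decoding fails."""
    try:
        return raw.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


utf8_result = try_decode(utf8_bytes)
cp1252_result = try_decode(cp1252_bytes)

print(utf8_result)    # decodes cleanly
print(cp1252_result)  # reports a decode error instead of text
```

The same characters become different bytes under each encoding, so a strict UTF-8 decoder rejects the cp1252 version as soon as it reaches a non-ASCII character.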

Aahrvenos · February 24, 2019, 5:26pm

Now that’s a handy bit of info. I happen to live in the U.S. and so I can only assume there’s something funky going on with my computer. Either way, happy for the resolution, and the extra tidbit!

zedshaw · March 6, 2019, 10:25pm

Yes, you’re probably using something like utf-16 or some other encoding. PowerShell is weird.

Aahrvenos · March 6, 2019, 11:07pm

Gotcha

Some time ago someone on the forum recommended using Cmder to run the code. I can confirm it works flawlessly with Exercise 23.

link to Cmder: https://cmder.net/

I’ve since moved on from this exercise, but I did check to see what encoding my PC was using. I checked with the PowerShell command:

[System.Text.Encoding]::Default

… and it says it is using BodyName (iso-8859-1) and CodePage (1252), which according to Wikipedia is an 8-bit encoding. So I can also confirm: PowerShell is weird. Why it wouldn’t display with copy/paste is once again beyond me.
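
For comparison, a rough Python-side check of the same thing, sketched under the assumption that what matters for the exercise is the default Python uses when open() is called without an encoding argument:

```python
import locale
import sys

# Encoding Python falls back to for text files when open() gets no encoding=.
preferred = locale.getpreferredencoding(False)
print("preferred encoding for open():", preferred)

# Encoding the terminal stream uses, which affects how print() output is written.
print("stdout encoding:", sys.stdout.encoding)
```

If the first value is a legacy code page rather than UTF-8, passing encoding="utf-8" to open() explicitly avoids depending on it.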

zedshaw · April 5, 2019, 11:58pm

Yeah, Cmder is great. The installer is terrible, but once you get past that it works.

littleman · August 3, 2020, 7:39am

import sys

script, input_encoding, error = sys.argv

def main(language_file, encoding, errors):
    line = language_file.readline()

    if line:
        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)

def print_line(line, encoding, errors):
    next_lang = line.strip()
    raw_bytes = next_lang.encode(encoding, errors=errors)
    cooked_bytes = raw_bytes.decode(encoding, errors=errors)

    print(raw_bytes, "<===>", cooked_bytes)

languages = open("languages.txt", encoding="utf-16")

main(languages, input_encoding, error)

It gives me output with the following error. What’s the issue here?

lpthw>python ex23.py utf-16 languages.txt
Traceback (most recent call last):
  File "ex23.py", line 21, in <module>
    main(languages, input_encoding, error)
  File "ex23.py", line 6, in main
    line=language_file.readline()
  File "C:\Program Files (x86)\Python37-32\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "C:\Program Files (x86)\Python37-32\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 812-813: illegal UTF-16 surrogate

zedshaw · August 10, 2020, 1:28pm

Hi @littleman, in the future can you please start new threads instead of hijacking old ones? Your issue is different enough that it needs a separate thread. Also, do this with your code:

[code]
# your code here
[/code]

That way it’s formatted well.

Answer

You seem to have added encoding="utf-16" but it should be encoding="utf-8". Can you explain why you made that change? Does it work when you change it back?
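
A minimal sketch of the difference, assuming languages.txt is UTF-8 on disk. The file contents and the read_as helper are hypothetical stand-ins, written to a temporary directory so the example is self-contained:

```python
import os
import tempfile

# Hypothetical stand-in for languages.txt; the real file is not shown here.
original = "Afrikaans\nEspañol\n日本語\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "languages.txt")

    # Write the sample as UTF-8, which is what the downloaded file is assumed to be.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(original)

    def read_as(encoding):
        """Read the whole file with the given encoding, or return the error text."""
        try:
            with open(path, encoding=encoding, newline="") as f:
                return f.read()
        except UnicodeError as exc:
            return f"error: {exc}"

    as_utf8 = read_as("utf-8")    # matches how the file was written
    as_utf16 = read_as("utf-16")  # mismatched: fails or yields garbage

print(as_utf8 == original)
print(as_utf16 == original)
```

Only the read whose encoding matches the bytes on disk returns the original text. The mismatched read either raises a decode error, as in the traceback above, or silently produces unrelated characters.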