UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 4: invalid continuation byte

TL;DR: Open your problem file in Sublime Text and re-save it using “Save with Encoding” set to UTF-8. Alternatively, run iconv -t UTF-8//TRANSLIT -c Zip_Zhvi_SingleFamilyResidence.csv > new_file.csv

When does this error happen?

I wanted to parse the housing data from Zillow's research page. The zip-code-level time series is a great way to track single family home values.

zillow research page time series by zipcode.png

However, after downloading this data set as “Zip_Zhvi_SingleFamilyResidence.csv”, I could not simply load it into pandas.

pandas_read_csv_UnicodeDecodeError.png

This last line seemed like the clue:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 4: invalid continuation byte

Well, what format is that file?

On a Mac, we can use file -I <file_name> (on Linux the equivalent flag is lowercase: file -i <file_name>).

file_type.png

Oh, great! It’s “us-ascii”, so we just pass that encoding into pandas, right?

pandas_error_with_encoding.png

Oh, maybe I need to specify the encoding I want. WHY, PANDAS, WHY!?

pandas_error_with_encoding_again.png
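One way to double-check what file reports is to scan the raw bytes yourself. The snippet below is a minimal sketch: find_non_ascii is a hypothetical helper name, and the sample bytes are made up for illustration rather than taken from the Zillow file.

```python
def find_non_ascii(raw, limit=10):
    """Return up to `limit` (offset, byte value) pairs for bytes outside 7-bit ASCII."""
    hits = []
    for offset, value in enumerate(raw):
        if value > 0x7F:
            hits.append((offset, value))
            if len(hits) == limit:
                break
    return hits


# Made-up sample standing in for the CSV contents. For a real file,
# read the bytes with open(path, "rb").read() instead.
sample = b"City,State\nCa\xf1on City,CO\n"
for offset, value in find_non_ascii(sample):
    print(f"offset {offset}: 0x{value:02x}")
```

If this prints anything at all, the file is not pure ASCII, no matter what the file command guessed.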

Why does this error happen?

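The file is not actually pure ASCII, whatever file reported. ASCII only covers bytes 0x00 to 0x7f, so a byte like 0xf1 has to come from some other encoding; in Latin-1 (ISO-8859-1) and Windows-1252, for example, 0xf1 is “ñ”. In UTF-8, 0xf1 is only valid as the start of a multi-byte sequence, which is why the decoder complains about an invalid continuation byte. How that byte got there is anyone's guess: maybe you accidentally opened the file in Excel before opening ipython, or maybe Zillow saves in a crazy format.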
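To see the error in isolation, here is a small sketch. The byte string is a made-up example (not from the Zillow data), and treating 0xf1 as Latin-1 is an assumption about how the file was written.

```python
sample = b"Ca\xf1on City"  # made-up bytes; 0xf1 is "ñ" in Latin-1

try:
    sample.decode("utf-8")
except UnicodeDecodeError as err:
    # In UTF-8, 0xf1 announces a multi-byte sequence, but the next byte
    # (the ASCII letter "o") is not a valid continuation byte.
    print(err)

# Decoding with the (assumed) original encoding works.
text = sample.decode("latin-1")
print(text)
```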

Awesome, let’s just convert it

Let’s use the *nix program iconv to convert the file. According to the man page (man iconv), “The iconv program converts text from one encoding to another encoding.” Great!

man_iconv.png

Let’s use this.

iconv -f us-ascii -t utf-8 < Zip_Zhvi_SingleFamilyResidence.csv > new_zip_code_file.csv

iconv_failure.png

“cannot convert”

But iconv, that’s your only job… you know, unix philosophy, one program, one job done well etc etc.

Turns out that if you append “//TRANSLIT” to the target encoding, characters are transliterated when needed and possible (man page).

Solution 1 – iconv with //TRANSLIT

> iconv -t UTF-8//TRANSLIT -c Zip_Zhvi_SingleFamilyResidence.csv > new_file.csv

> mv new_file.csv Zip_Zhvi_SingleFamilyResidence.csv
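If iconv is not available, roughly the same conversion can be sketched in Python. This assumes the source file is Latin-1, which is only a guess based on the 0xf1 byte, and convert_to_utf8 is a hypothetical helper name. Unlike iconv -c, which discards characters it cannot convert, this version keeps every character.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="latin-1"):
    """Re-encode a text file to UTF-8, assuming it was written in source_encoding."""
    # Work on bytes so that line endings are passed through unchanged.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


# Self-contained demo on a made-up file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "latin1.csv"
    dst = Path(tmp) / "utf8.csv"
    src.write_bytes(b"City\nCa\xf1on City\n")
    convert_to_utf8(src, dst)
    converted = dst.read_bytes()
    print(converted.decode("utf-8"))
```

For the Zillow file, the call would be convert_to_utf8("Zip_Zhvi_SingleFamilyResidence.csv", "new_file.csv").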

Solution 2 (easier to remember) – Sublime Text

Is there a better free editor than Sublime? Be a good citizen and buy your license.

Step 1: Open your file in Sublime Text

Step 2: Save with Encoding > UTF-8

DONE!

grizz_celebrating.gif

read_csv to your heart’s desire 🙂

ipython> data = pd.read_csv("new_file.csv")
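An alternative that skips the conversion step is to tell pandas the encoding directly. This is a sketch that assumes the file is Latin-1 (a guess based on the 0xf1 byte); the in-memory sample is made up so the snippet runs on its own.

```python
import io

import pandas as pd

# Made-up CSV bytes containing a Latin-1 "ñ" (0xf1).
raw = b"RegionName,City\n81212,Ca\xf1on City\n"

# Passing encoding="latin-1" decodes 0xf1 as "ñ" instead of raising.
data = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(data["City"][0])
```

With the real file this would be pd.read_csv("Zip_Zhvi_SingleFamilyResidence.csv", encoding="latin-1"). Latin-1 maps every possible byte to some character, so this never raises a decode error, but the text will be wrong if the file was actually written in a different encoding.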
