TLDR: Convert your problem file with Sublime Text by opening the file and using “Save with Encoding” > UTF-8. Alternatively, use:
iconv -t UTF-8//TRANSLIT -c Zip_Zhvi_SingleFamilyResidence.csv > new_file.csv
When does this error happen?
I wanted to parse the housing data that Zillow publishes on its research page. Zip-code-level data is a great way to look at single-family home real estate values.
However, when I downloaded this data set as “Zip_Zhvi_SingleFamilyResidence.csv”, I could not simply load it into pandas.
The last line of the traceback seemed like the clue:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 4: invalid continuation byte
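To see what the decoder is choking on, it can help to look at the raw bytes. Here is a minimal sketch of that idea. The helper name find_non_ascii is made up for illustration, and the sample bytes are invented to mimic the error; they are not taken from the Zillow file.

```python
def find_non_ascii(data: bytes, limit: int = 10) -> list[tuple[int, int]]:
    """Return up to `limit` (offset, byte) pairs for bytes outside ASCII (> 0x7f)."""
    hits = []
    for offset, byte in enumerate(data):
        if byte > 0x7F:
            hits.append((offset, byte))
            if len(hits) >= limit:
                break
    return hits


# Made-up sample: 0xf1 is "ñ" in Latin-1, but followed by a plain letter
# it is not a valid UTF-8 sequence.
sample = b"Ca\xf1on City,81212\n"
for offset, byte in find_non_ascii(sample):
    print(f"offset {offset}: 0x{byte:02x}")
```

For a real file you would pass in the bytes from open(path, "rb").read() instead of the sample.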
Well, what format is that file?
Using a Mac, we can use file -I <file_name>
Oh, great! It’s “us-ascii”, so we just pass that encoding into pandas, right?
Oh, maybe I need to specify the encoding I want. WHY PANDAS, WHY!?
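Rather than guessing encodings one at a time, you could try a few candidates in order. This is only a sketch: first_working_encoding is a hypothetical helper, the candidate list is my assumption, and the sample bytes are made up.

```python
def first_working_encoding(data: bytes, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first encoding in `candidates` that decodes `data` without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None  # nothing in the list worked


# Made-up sample containing the same 0xf1 byte as in the error message.
sample = b"Ca\xf1on City,81212\n"
print(first_working_encoding(sample))
```

Keep in mind that latin-1 can decode any byte sequence, so getting it back only means nothing raised, not that the guess is right. Whatever you settle on can be passed to pandas through the encoding argument of read_csv.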
Why does this error happen?
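Some encoding error has occurred, maybe because you accidentally opened Excel before opening ipython, or because Zillow saves in a crazy format. Either way, a byte like 0xf1 is outside the ASCII range (0x00 to 0x7f), so the file cannot really be pure us-ascii. The file command only guesses from part of the file, so its answer can be wrong.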
Awesome, let’s just convert it
Let’s use the *nix program iconv to convert the file. According to the man page (man iconv), “The iconv program converts text from one encoding to another encoding.” Great! Let’s use this.
iconv -f us-ascii -t utf-8 < Zip_Zhvi_SingleFamilyResidence.csv > new_zip_code_file.csv
“cannot convert”
But iconv, that’s your only job… you know, Unix philosophy, one program, one job done well, etc.
Turns out that if you append “//TRANSLIT” to the target encoding, characters are transliterated when needed and possible (man page). The -c flag additionally discards any characters that still cannot be converted.
Solution 1 – iconv with //TRANSLIT
> iconv -t UTF-8//TRANSLIT -c Zip_Zhvi_SingleFamilyResidence.csv > new_file.csv
> mv new_file.csv Zip_Zhvi_SingleFamilyResidence.csv
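If you would rather stay in Python, a rough equivalent is to decode the file with a source encoding and write it back out as UTF-8. This is a swapped-in alternative, not what iconv does internally. The function name reencode_to_utf8 is hypothetical, and latin-1 as the source encoding is my assumption (0xf1 is “ñ” there), so check it against your own file. The demo uses a throwaway file with made-up contents.

```python
import os
import tempfile


def reencode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Rewrite src_path as UTF-8 at dst_path, decoding with src_encoding (an assumption)."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo on a temporary file with invented contents, not the real Zillow data.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "latin1.csv")
    dst = os.path.join(tmp, "utf8.csv")
    with open(src, "wb") as f:
        f.write(b"City,Zip\nCa\xf1on City,81212\n")
    reencode_to_utf8(src, dst)
    with open(dst, "rb") as f:
        converted = f.read()

print(converted.decode("utf-8"))
```

Unlike iconv with -c, this keeps every character instead of dropping the ones it cannot convert, but it will silently produce wrong characters if the source encoding guess is wrong.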
Solution 2 (easier to remember) – Sublime Text
Is there a better free editor than Sublime? Be a good citizen and buy your license.
Step 1: Open your file in Sublime Text
Step 2: Save with Encoding > UTF-8
DONE!
read_csv to your heart’s desire 🙂
ipython> data = pd.read_csv("new_file.csv")