Thursday, November 10, 2011

Python: Detect Charset and convert to Unicode

Converting data to Unicode via chardet

import chardet
from lxml import html

# Read the raw bytes; chardet works on undecoded data.
with open('my.xml', 'rb') as f:
    content = f.read()

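# 'encoding' is None when chardet cannot make a guess.
encoding = chardet.detect(content)['encoding']
if encoding and encoding.lower() != 'utf-8':
    # Re-encode as UTF-8 bytes; drop .encode() to keep a unicode object.
    content = content.decode(encoding, 'replace').encode('utf-8')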

Converting data to Unicode via UnicodeDammit

from BeautifulSoup import UnicodeDammit

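# The function name is illustrative; the wrapper makes `return` valid.
def decode_html(content):
    converted = UnicodeDammit(content)
    if not converted.unicode:
        # UnicodeDecodeError requires five arguments, so raise its
        # parent class ValueError with a formatted message instead.
        raise ValueError(
            "Failed to detect encoding, tried [%s]" %
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode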
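
Both snippets above rest on the same idea: find an encoding that decodes the bytes, then work with the result. As a rough sketch of that idea using only the standard library, the helper below tries a list of candidate encodings in order. The name `decode_with_fallback` and the default candidate list are assumptions made for illustration; they are not part of chardet or BeautifulSoup, and real detectors use far more evidence than a simple try/except loop.

```python
def decode_with_fallback(data, candidates=('utf-8', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the candidate order is an assumption, not a
    recommendation from chardet or BeautifulSoup.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    # latin-1 maps every byte to a code point, so this decode cannot fail.
    return data.decode('latin-1'), 'latin-1'


# Example: bytes that are not valid UTF-8 but are valid cp1252.
text, used = decode_with_fallback(b'caf\xe9')
```

Strict UTF-8 is tried first because it rejects most non-UTF-8 input, while single-byte encodings such as cp1252 accept almost anything and would hide a wrong guess.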

Credit goes to Ian Bicking and others on the lxml team

