Thursday, November 10, 2011



Python: Detect Charset and convert to Unicode


Converting data to Unicode via chardet 
http://stackoverflow.com/questions/2686709/encoding-in-python-with-lxml-complex-solution

import chardet
from lxml import html

f = open('my.xml')
content = f.read()
f.close()

encoding = chardet.detect(content)['encoding']
if encoding != 'utf-8':
    content = content.decode(encoding, 'replace').encode('utf-8')

Converting Data to Unicode via UnicodeDammit 
http://lxml.de/elementsoup.html

from BeautifulSoup import UnicodeDammit

converted = UnicodeDammit(content)
   if not converted.unicode: 
    raise UnicodeDecodeError( 
       "Failed to detect encoding, tried [%s]", 
        ', '.join(converted.triedEncodings))  
        # print converted.originalEncoding
         return converted.unicode


Credit goes to Ian Bicking and others on the lxml team

No comments: