Python: Detect Charset and convert to Unicode
Converting data to Unicode via chardet
http://stackoverflow.com/questions/2686709/encoding-in-python-with-lxml-complex-solution
import chardet
from lxml import html
f = open('my.xml')
content = f.read()
f.close()
encoding = chardet.detect(content)['encoding']
if encoding != 'utf-8':
content = content.decode(encoding, 'replace').encode('utf-8')
Converting Data to Unicode via UnicodeDammit
http://lxml.de/elementsoup.htmlfrom BeautifulSoup import UnicodeDammit
converted = UnicodeDammit(content)
if not converted.unicode:
raise UnicodeDecodeError(
"Failed to detect encoding, tried [%s]",
', '.join(converted.triedEncodings))
# print converted.originalEncoding
return converted.unicode
Credit goes to Ian Bicking and others on the lxml team
No comments:
Post a Comment