Unicode problem with SQLAlchemy

Dav*_*gac 8 python unicode encoding sqlalchemy character-encoding

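I know I have a problem with a Unicode conversion, but I'm not sure where it is happening.

I'm extracting data about a recent European trip from a directory of HTML files. Some of the location names contain non-ASCII characters (such as é, ô, ü). I'm using regular expressions to pull the data out of a string representation of each file.

If I print the locations as I find them, they print with the correct characters, so the encoding must be fine: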

Le Pré-Saint-Gervais, France
Hôtel-de-Ville, France

I'm storing the data in a SQLite table using SQLAlchemy:

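from sqlalchemy import create_engine, Column, Integer, Date, Time, Unicode, String, Float
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()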
class Point(Base):
    __tablename__ = 'points'

    id = Column(Integer, primary_key=True)
    pdate = Column(Date)
    ptime = Column(Time)
    location = Column(Unicode(32))
    weather = Column(String(16))
    high = Column(Float)
    low = Column(Float)
    lat = Column(String(16))
    lon = Column(String(16))
    image = Column(String(64))
    caption = Column(String(64))

    def __init__(self, filename, pdate, ptime, location, weather, high, low, lat, lon, image, caption):
        self.filename = filename
        self.pdate = pdate
        self.ptime = ptime
        self.location = location
        self.weather = weather
        self.high = high
        self.low = low
        self.lat = lat
        self.lon = lon
        self.image = image
        self.caption = caption

    def __repr__(self):
        return "<Point('%s','%s','%s')>" % (self.filename, self.pdate, self.ptime)

engine = create_engine('sqlite:///:memory:', echo=False)
Base.metadata.create_all(engine)
Session = sessionmaker(bind = engine)
session = Session()

I loop through the files and insert the data from each one into the database:

for filename in filelist:

    # open the file and extract the information using regex such as:
    location_re = re.compile("<h2>(.*)</h2>",re.M)
    # extract other data

    newpoint = Point(filename, pdate, ptime, location, weather, high, low, lat, lon, image, caption)
    session.add(newpoint)
    session.commit()

I see the following warning on every insert:

/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/default.py:230: SAWarning: Unicode type received non-unicode bind param value 'Spitalfields, United Kingdom'
  param.append(processors[key](compiled_params[key]))

When I try to do anything with the table, such as:

session.query(Point).all()

I get:

Traceback (most recent call last):
  File "./extract_trips.py", line 131, in <module>
    session.query(Point).all()
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/orm/query.py", line 1193, in all
    return list(self)
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/orm/query.py", line 1341, in instances
    fetch = cursor.fetchall()
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/base.py", line 1642, in fetchall
    self.connection._handle_dbapi_exception(e, None, None, self.cursor, self.context)
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception
    raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
sqlalchemy.exc.OperationalError: (OperationalError) Could not decode to UTF-8 column 'points_location' with text 'Le Pré-Saint-Gervais, France' None None

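I would like to be able to store the location names correctly and then read them back with the original characters intact. Any help would be much appreciated.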

Dav*_*gac 11

I found this article, which helped explain my trouble:

http://www.amk.ca/python/howto/unicode#reading-and-writing-unicode-data

I was able to get the desired results by using the 'codecs' module and changing my program as follows:

When opening the file:

import codecs
infile = codecs.open(filename, 'r', encoding='iso-8859-1')

When printing the location:

print location.encode('ISO-8859-1')

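I can now query and manipulate the data in the table without the earlier error. I just have to specify the encoding when I output the text.

(I still don't entirely understand how this works, so I guess it's time to learn more about Python's Unicode handling...)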
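
A minimal sketch of why this works, written for Python 3 (the code in this thread is Python 2, where `codecs.open` plays the role of the built-in `open(..., encoding=...)`). Decoding at the file boundary means the regex match is already a text string, so the database layer never sees raw ISO-8859-1 bytes. The helper name `extract_location` and the sample HTML are hypothetical; the ISO-8859-1 encoding and the regex are taken from the posts above.

```python
import os
import re
import tempfile

LOCATION_RE = re.compile(r"<h2>(.*)</h2>", re.M)

def extract_location(path, encoding="iso-8859-1"):
    """Decode the file while reading it, then pull the <h2> text out."""
    with open(path, "r", encoding=encoding) as infile:
        html = infile.read()  # text (code points), not bytes
    match = LOCATION_RE.search(html)
    return match.group(1) if match else None

# Hypothetical sample file, written to disk as ISO-8859-1 bytes.
sample = "<html><h2>Le Pré-Saint-Gervais, France</h2></html>"
fd, sample_path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("iso-8859-1"))

location = extract_location(sample_path)
os.remove(sample_path)

# Encode only at the output boundary, using whatever the destination expects.
utf8_bytes = location.encode("utf-8")
```

The design choice is to decode once on input and encode once on output, keeping everything in between as text.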


wor*_*ad3 7

For the Unicode columns, try using a column type of Unicode rather than String:

Base = declarative_base()
class Point(Base):
    __tablename__ = 'points'

    id = Column(Integer, primary_key=True)
    pdate = Column(Date)
    ptime = Column(Time)
    location = Column(Unicode(32))
    weather = Column(String(16))
    high = Column(Float)
    low = Column(Float)
    lat = Column(String(16))
    lon = Column(String(16))
    image = Column(String(64))
    caption = Column(String(64))

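EDIT: In response to the comment:

If you're getting warnings about Unicode encodings, there are two things you can try:

  1. Convert your location to unicode. This means creating your Point like this:

    newpoint = Point(filename, pdate, ptime, unicode(location), weather, high, low, lat, lon, image, caption)

    The unicode conversion produces a unicode string whether it is passed a string or a unicode string, so you don't need to worry about what you pass in.

  2. If that doesn't solve the encoding issues, try calling encode on your unicode objects. This means using code like:

    newpoint = Point(filename, pdate, ptime, unicode(location).encode('utf-8'), weather, high, low, lat, lon, image, caption)

    This step probably won't be necessary, but what it essentially does is convert a unicode object from Unicode code points to a specific byte representation (in this case, utf-8). I would expect SQLAlchemy to do this for you when you pass in unicode objects, but it may not.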
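
The two steps above can be sketched as follows, assuming Python 3, where `bytes.decode()` stands in for Python 2's `unicode()` and `str` is the Unicode type. One caveat: in Python 2, `unicode(location)` with no encoding argument decodes as ASCII, so a byte string containing a character such as `ô` raises `UnicodeDecodeError` unless the source encoding is passed explicitly. The sample bytes and the ISO-8859-1 encoding are assumptions for illustration.

```python
# Bytes as they might come out of an ISO-8859-1 HTML file (assumed sample).
raw = b"H\xf4tel-de-Ville, France"

# Decoding with the ASCII codec fails on the non-ASCII byte.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Step 1: bytes -> code points, naming the source encoding explicitly.
location = raw.decode("iso-8859-1")

# Step 2: code points -> a specific byte representation (UTF-8 here).
utf8_bytes = location.encode("utf-8")

print(ascii_ok)
print(location)
print(len(location), len(utf8_bytes))  # UTF-8 needs two bytes for the accented letter
```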


ral*_*nja 7

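From sqlalchemy.org

See section 0.4.2

added new flag to String and create_engine(), assert_unicode=(True|False|'warn'|None). Defaults to False or None on create_engine() and String, 'warn' on the Unicode type. When True, results in all unicode conversion operations raising an exception when a non-unicode bytestring is passed as a bind parameter. 'warn' results in a warning. It is strongly advised that all unicode-aware applications make proper use of Python unicode objects (i.e. u'hello' and not 'hello') so that data round trips accurately.

I think you are trying to input a non-unicode bytestring. Perhaps this will put you on the right track? Some form of conversion is needed; compare 'hello' and u'hello'.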
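
To make the comparison concrete, here is a hedged sketch that uses the standard-library `sqlite3` module on Python 3 rather than SQLAlchemy 0.5 (an assumption made so the example is self-contained). It stores ISO-8859-1 bytes as if they were text, which should trigger the same kind of "Could not decode to UTF-8" error on read, and then shows that a decoded string round-trips. The table layout and the helper name `read_back` are hypothetical.

```python
import sqlite3

raw = b"Le Pr\xe9-Saint-Gervais, France"  # ISO-8859-1 bytes, not valid UTF-8

def read_back(value):
    """Insert one value into a TEXT column and fetch it again."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE points (location TEXT)")
        # CAST makes SQLite treat the bytes as text without validating them,
        # which mimics passing a byte string where text is expected.
        conn.execute("INSERT INTO points VALUES (CAST(? AS TEXT))", (value,))
        return conn.execute("SELECT location FROM points").fetchone()[0]
    finally:
        conn.close()

# Undecoded bytes: the read is expected to fail.
try:
    read_back(raw)
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

# Decoded text: the value comes back unchanged.
good = read_back(raw.decode("iso-8859-1"))

print(error)
print(good)
```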

Cheers