Google App Engine: Using BigQuery on the Datastore?

Question

All*_* D. 15 google-app-engine google-bigquery

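I have a GAE datastore kind containing several hundred thousand objects. I want to run several involved queries on it (including counting queries). BigQuery seems well suited to this.

Is there currently an easy way to query a live App Engine datastore with BigQuery?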

Answer 1

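You can't run BigQuery directly on DataStore entities, but you can write a Mapper pipeline that reads entities out of the DataStore, writes them to CSV in Google Cloud Storage, and then ingests those files into BigQuery. You can even automate the process. Here's an example that uses the Mapper API classes for just the DataStore-to-CSV step: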

import re
import time
import logging  # needed for the logging.info() calls below
from datetime import datetime
import urllib
import httplib2
import pickle

from google.appengine.ext import blobstore
from google.appengine.ext import db
from google.appengine.ext import webapp

from google.appengine.ext.webapp.util import run_wsgi_app
from google.appengine.ext.webapp import blobstore_handlers
from google.appengine.ext.webapp import util
from google.appengine.ext.webapp import template

from mapreduce.lib import files
from google.appengine.api import taskqueue
from google.appengine.api import users

from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from mapreduce import operation as op

from apiclient.discovery import build
from google.appengine.api import memcache
from oauth2client.appengine import AppAssertionCredentials


#Number of shards to use in the Mapper pipeline
SHARDS = 20

# Name of the project's Google Cloud Storage Bucket
GS_BUCKET = 'your bucket'

# DataStore Model
class YourEntity(db.Expando):
  field1 = db.StringProperty() # etc, etc

ENTITY_KIND = 'main.YourEntity'


class MapReduceStart(webapp.RequestHandler):
  """Handler that provides link for user to start MapReduce pipeline.
  """
  def get(self):
    pipeline = IteratorPipeline(ENTITY_KIND)
    pipeline.start()
    path = pipeline.base_path + "/status?root=" + pipeline.pipeline_id
    logging.info('Redirecting to: %s' % path)
    self.redirect(path)


class IteratorPipeline(base_handler.PipelineBase):
  """ A pipeline that iterates through datastore
  """
  def run(self, entity_type):
    output = yield mapreduce_pipeline.MapperPipeline(
      "DataStore_to_Google_Storage_Pipeline",
      "main.datastore_map",
      "mapreduce.input_readers.DatastoreInputReader",
      output_writer_spec="mapreduce.output_writers.FileOutputWriter",
      params={
          "input_reader":{
              "entity_kind": entity_type,
              },
          "output_writer":{
              "filesystem": "gs",
              "gs_bucket_name": GS_BUCKET,
              "output_sharding":"none",
              }
          },
          shards=SHARDS)


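def datastore_map(entity):
  # Called once for each entity read by DatastoreInputReader.
  props = GetPropsFor(entity)
  data = db.to_dict(entity)
  # Note: double quotes or newlines inside a value are not escaped here.
  result = ','.join(['"%s"' % str(data.get(k)) for k in props])
  yield('%s\n' % result)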


def GetPropsFor(entity_or_kind):
  if (isinstance(entity_or_kind, basestring)):
    kind = entity_or_kind
  else:
    kind = entity_or_kind.kind()
  cls = globals().get(kind)
  return cls.properties()


application = webapp.WSGIApplication(
                                     [('/start', MapReduceStart)],
                                     debug=True)

def main():
  run_wsgi_app(application)

if __name__ == "__main__":
  main()
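
The mapper above wraps each value in double quotes by hand, so a value that itself contains a double quote or a newline would produce a malformed CSV row. A minimal sketch of a safer serializer follows, using the standard `csv` module. The helper name `entity_to_csv_row` and the sample data are hypothetical, and the sketch is written for Python 3 (`io.StringIO`), whereas the answer's code targets the Python 2 App Engine runtime, so it would need adapting there.

```python
import csv
import io


def entity_to_csv_row(data, props):
    """Serialize one entity dict to a single CSV line, in `props` order."""
    buf = io.StringIO()
    # QUOTE_ALL quotes every field; embedded quotes are doubled automatically.
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    # Missing properties become empty fields rather than the string "None".
    writer.writerow(["" if data.get(k) is None else data.get(k) for k in props])
    return buf.getvalue()


row = entity_to_csv_row(
    {"field1": 'say "hi", ok', "field2": 3},
    ["field1", "field2", "field3"],
)
print(row)
```

Inside `datastore_map`, the returned line could be yielded in place of the hand-built string.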


If you append `yield CloudStorageToBigQuery(output)` to the end of the IteratorPipeline class's run method, you can pass the resulting CSV file handles on to a BigQuery ingestion pipeline, like this:

# BigQuery API Settings (module level, so build_job_data can see them too)
SCOPE = 'https://www.googleapis.com/auth/bigquery'
PROJECT_ID = 'Some_ProjectXXXX'
DATASET_ID = 'Some_DATASET'

# Create a new API service for interacting with BigQuery
credentials = AppAssertionCredentials(scope=SCOPE)
http = credentials.authorize(httplib2.Http())
bigquery_service = build("bigquery", "v2", http=http)


class CloudStorageToBigQuery(base_handler.PipelineBase):
  """A Pipeline that kicks off a BigQuery ingestion job.
  """
  def run(self, output):
    jobs = bigquery_service.jobs()
    table_name = 'datastore_dump_%s' % datetime.utcnow().strftime(
        '%m%d%Y_%H%M%S')
    # Convert Files API paths (/gs/bucket/object) to gs:// URIs
    files = [str(f.replace('/gs/', 'gs://')) for f in output]
    result = jobs.insert(projectId=PROJECT_ID,
                         body=build_job_data(table_name, files)).execute()
    logging.info(result)

def build_job_data(table_name, files):
  return {"projectId": PROJECT_ID,
          "configuration":{
              "load": {
                  "sourceUris": files,
                  "schema":{
                      # put your schema here: `fields` is not defined in this
                      # snippet; supply a list of {"name": ..., "type": ...} dicts
                      "fields": fields
                      },
                  "destinationTable":{
                      "projectId": PROJECT_ID,
                      "datasetId": DATASET_ID,
                      "tableId": table_name,
                      },
                  }
              }
          }
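
Since `build_job_data` leaves the schema as a placeholder, here is a minimal sketch of the same load-job body with the schema passed in explicitly. The function names `to_gs_uri` and `build_load_job`, the single `field1` column, and the output path are hypothetical; the `"sourceFormat": "CSV"` entry is an addition not present in the answer (CSV is the format the mapper writes).

```python
def to_gs_uri(path):
    """Convert a Files API path like '/gs/bucket/obj' to 'gs://bucket/obj'."""
    prefix = "/gs/"
    if path.startswith(prefix):
        return "gs://" + path[len(prefix):]
    return path


def build_load_job(project_id, dataset_id, table_name, files, fields):
    """Build the body of a BigQuery load job for CSV files in Cloud Storage."""
    return {
        "configuration": {
            "load": {
                "sourceFormat": "CSV",
                "sourceUris": [to_gs_uri(f) for f in files],
                "schema": {"fields": fields},
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_name,
                },
            }
        }
    }


# Hypothetical schema matching the single property of YourEntity.
fields = [{"name": "field1", "type": "STRING"}]
job = build_load_job("Some_ProjectXXXX", "Some_DATASET", "datastore_dump",
                     ["/gs/your bucket/output-0"], fields)
print(job["configuration"]["load"]["sourceUris"])
```

Passing the project, dataset, and schema as arguments avoids relying on module-level names, which is where the snippet above is easiest to break.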


Answer 2

dan*_*mux 7

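With the new (as of September 2013) streaming inserts API, you can import records from your app into BigQuery.

The data is available in BigQuery immediately, so this should satisfy your live requirement.

While this question is now a bit old, this may be an easier solution for anyone who stumbles across it.

At the moment, though, getting this to work from the local dev server is patchy at best.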
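
As a rough illustration of the streaming approach, the sketch below builds the request body for the BigQuery v2 `tabledata.insertAll` method, where each row is wrapped as `{"insertId": ..., "json": {...}}`. The helper name `build_insert_all_body`, the sample records, and the table name in the comment are hypothetical; the actual API call is shown only as a comment because it needs an authorized service object like the one built in Answer 1.

```python
import uuid


def build_insert_all_body(records):
    """Wrap plain dicts in the row envelope used by tabledata.insertAll."""
    return {
        "rows": [
            # insertId lets BigQuery de-duplicate retried rows on a best-effort basis.
            {"insertId": str(uuid.uuid4()), "json": record}
            for record in records
        ]
    }


body = build_insert_all_body([{"field1": "a"}, {"field1": "b"}])
print(len(body["rows"]))

# With an authorized service object (not executed here):
# bigquery_service.tabledata().insertAll(
#     projectId=PROJECT_ID, datasetId=DATASET_ID, tableId="your_table",
#     body=body).execute()
```

Rows would typically be sent at the point where the app writes the entity to the datastore, so the two stay roughly in step.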

Answer 3

Rya*_*oyd 5

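We're running a Trusted Tester program for moving from Datastore to BigQuery in two simple operations:

Back up the datastore using Datastore Admin's backup functionality
Import the backup directly into BigQuery

It automatically takes care of the schema for you.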
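
The answer does not say how the import is invoked. As an assumption based on BigQuery's later public documentation, it can be expressed as a load job whose `sourceFormat` is `DATASTORE_BACKUP` and whose source URI points at the backup's `.backup_info` file, with no schema supplied. The function name `build_backup_load_job` and the example URI below are hypothetical.

```python
def build_backup_load_job(project_id, dataset_id, table_name, backup_info_uri):
    """Build a load-job body that imports a Datastore backup (assumed format)."""
    return {
        "configuration": {
            "load": {
                # No "schema" key: the schema is taken from the backup itself.
                "sourceFormat": "DATASTORE_BACKUP",
                "sourceUris": [backup_info_uri],
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_name,
                },
            }
        }
    }


# Hypothetical path to the backup_info file written by Datastore Admin.
backup_job = build_backup_load_job(
    "Some_ProjectXXXX", "Some_DATASET", "YourEntity",
    "gs://your bucket/backup.YourEntity.backup_info")
print(backup_job["configuration"]["load"]["sourceFormat"])
```

The body would be submitted the same way as the CSV load job in Answer 1, through `jobs().insert(...)`.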

More info (and how to apply): https://docs.google.com/a/google.com/spreadsheet/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
