使用Python从PowerPivot模型中提取原始数据

aka*_*lar 12 python excel ssas ado powerpivot

当我不得不使用Python从PowerPivot模型中读取一些数据时,看起来像是一项微不足道的任务变成了一场真正的噩梦.我相信我在过去的几天里对此进行了很好的研究,但现在我遇到了一堵砖墙,并希望得到Python/SSAS/ADO社区的一些帮助.

基本上,我想要的是以编程方式访问存储在PowerPivot模型中的原始数据 - 我的想法是通过下面列出的方法之一连接到底层的PowerPivot(即MS Analysis Services)引擎,列出模型中包含的表,然后使用简单的DAX查询(类似EVALUATE (table_name))从每个表中提取原始数据.容易腻,对吗?好吧,也许不是.

0.一些背景资料

如您所见,我尝试了几种不同的方法.我会尝试尽可能仔细地记录所有内容,以便那些不熟悉PowerPivot功能的人会很清楚我想做什么.

首先,有关编程访问Analysis Services引擎的一些背景知识(它说2005 SQL Server,但它应该仍然适用):SQL Server数据挖掘可编程性用于Analysis Services连接的数据提供程序.

我将在下面的示例中使用的示例Excel/PowerPivot文件可以在此处找到:Microsoft PowerPivot for Excel 2010和Excel 2013示例中的PowerPivot.

另请注意,我使用的是Excel 2010,因此我的一些代码是特定于版本的.例如,wb.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection应该是wb.Model.DataModelConnection.ModelConnection.ADOConnection,如果你使用Excel 2013.

我将在整个问题中使用的连接字符串基于以下信息:使用C#连接到PowerPivot引擎.另外,一些方法显然需要在数据检索之前对PowerPivot模型进行某种初始化.请参见此处:从VBA自动执行PowerPivot Refresh操作.

最后,这里有几个链接显示这应该是可以实现的(但请注意,这些链接主要是指C#,而不是Python):

1.使用ADOMD

import clr
clr.AddReference("Microsoft.AnalysisServices.AdomdClient")
import Microsoft.AnalysisServices.AdomdClient as ADOMD
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = ADOMD.AdomdConnection(ConnString)
Connection.Open()
Run Code Online (Sandbox Code Playgroud)

在这里,似乎问题是PowerPivot模型尚未初始化:

AdomdConnectionException: A connection cannot be made. Ensure that the server is running.
Run Code Online (Sandbox Code Playgroud)

2.使用AMO

import clr
clr.AddReference("Microsoft.AnalysisServices")
import Microsoft.AnalysisServices as AMO
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = AMO.Server()
Connection.Connect(ConnString)
Run Code Online (Sandbox Code Playgroud)

同样的故事,"服务器没有运行":

ConnectionException: A connection cannot be made. Ensure that the server is running.
Run Code Online (Sandbox Code Playgroud)

请注意,AMO在技术上不用于查询数据,但我将其作为连接PowerPivot模型的潜在方式之一.

3.使用ADO.NET

import clr
clr.AddReference("System.Data")
import System.Data.OleDb as ADONET
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = ADONET.OleDbConnection()
Connection.ConnectionString = ConnString
Connection.Open()
Run Code Online (Sandbox Code Playgroud)

这类似于使用python或ironpython访问mssql的最简单方法什么?.不幸的是,这也行不通:

OleDbException: OLE DB error: OLE DB or ODBC error: The following system error occurred:
The requested name is valid, but no data of the requested type was found.
Run Code Online (Sandbox Code Playgroud)

4.通过adodbapi模块使用ADO

import adodbapi
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = adodbapi.connect(ConnString)
Run Code Online (Sandbox Code Playgroud)

类似于Python和MS Access VBA之间的OLEDB/ODBC的相反工作方式.我得到的错误是:

OperationalError: (com_error(-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB
Provider for SQL Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The
following system error occurred:  The requested name is valid, but no data of the requested
type was found...
Run Code Online (Sandbox Code Playgroud)

这与上面的ADO.NET基本相同.

5.通过Excel/win32com模块使用ADO

from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
Connection = Workbook.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection
Recordset = Dispatch('ADODB.Recordset')

Query = "EVALUATE(dbo_DimDate)" #sample DAX query
Recordset.Open(Query, Connection)
Run Code Online (Sandbox Code Playgroud)

这种方法的想法来自这篇使用VBA的博客文章:使用VBA 将Power Pivot中的表格或DAX查询导出为CSV.请注意,此方法使用显式刷新命令来初始化模型(即"服务器").这是错误消息:

com_error: (-2147352567, 'Exception occurred.', (0, u'ADODB.Recordset', u'Arguments are of
the wrong type, are out of acceptable range, or are in conflict with one another.',
u'C:\\Windows\\HELP\\ADO270.CHM', 1240641, -2146825287), None)
Run Code Online (Sandbox Code Playgroud)

但是,看来ADO连接已经建立:

  • type(Connection) 回报 instance
  • print(Connection) 回报 Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member Mode=Error;Subqueries=2;Optimize Response=3;Cell Error Mode=TextValue

似乎问题在于创建ADODB.Recordset对象.

6.通过Excel/win32com使用ADO,直接使用ADODB.Connection

from win32com.client import Dispatch
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnString)
Run Code Online (Sandbox Code Playgroud)

类似于Python中的Connection to Access [复制]Win32平台中使用ADO的查询访问(Python配方).不幸的是,Python吐出的错误与上面两个例子中的相同:

com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The following system
error occurred:  The requested name is valid, but no data of the requested type was found.
..', None, 0, -2147467259), None)
Run Code Online (Sandbox Code Playgroud)

7.通过Excel/win32com使用ADO,直接使用ADODB.Connection和模型刷新

from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
ConnStringInternal = "Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=
                     Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX
                     Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member
                     Mode=Error;Optimize Response=3;Cell Error Mode=TextValue"

Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnStringInternal)
Run Code Online (Sandbox Code Playgroud)

我希望我可以初始化Excel实例,然后初始化PowerPivot模型,然后使用Excel用于嵌入式PowerPivot数据的内部连接字符串创建连接(类似于如何将powerpivot数据作为表复制到excel工作簿中? - 请注意,连接字符串与我在其他地方使用的字符串不同.不幸的是,这不起作用,我的猜测是Python在一个单独的实例中启动ADODB.Connection进程(因为当我执行最后三行而没有先初始化Excel时,我得到相同的错误消息,等等):

com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'Either the user, ****** (masked), does not have access
to the Microsoft_SQLServer_AnalysisServices database, or the database does not exist.',
None, 0, -2147467259), None)
Run Code Online (Sandbox Code Playgroud)

aka*_*lar 9

瞧,我终于设法解决了这个问题 - 事实证明使用Python访问Power Pivot数据确实是可能的!下面是我所做的简短回顾 - 你可以在这里找到更详细的描述:分析服务(SSAS).注意:代码已经过优化,无论是效率还是优雅.

  • 安装Microsoft Power BI Desktop(免费提供Analysis Services服务器,因此无需昂贵的SQL Server许可证 - 但是,如果您拥有适当的许可证,同样的方法显然也可以使用).
  • 首先创建msmdsrv.ini设置文件,然后从ABF文件(使用AMO.NET)恢复数据库,然后使用ADOMD.NET提取数据,从而启动AS引擎.

这是演示AS引擎+ AMO.NET部分的Python代码:

import psutil, subprocess, random, os, zipfile, shutil, clr, sys, pandas

def initialSetup(pathPowerBI):
    sys.path.append(pathPowerBI)

    #required Analysis Services assemblies
    clr.AddReference("Microsoft.PowerBI.Amo.Core")
    clr.AddReference("Microsoft.PowerBI.Amo")     
    clr.AddReference("Microsoft.PowerBI.AdomdClient")

    global AMO, ADOMD
    import Microsoft.AnalysisServices as AMO
    import Microsoft.AnalysisServices.AdomdClient as ADOMD

def restorePowerPivot(excelName, pathTarget, port, pathPowerBI):   
    #create random folder
    os.chdir(pathTarget)
    folder = os.getcwd()+str(random.randrange(10**6, 10**7))
    os.mkdir(folder)

    #extract PowerPivot model (abf backup)
    archive = zipfile.ZipFile(excelName)
    for member in archive.namelist():
        if ".data" in member:
            filename = os.path.basename(member)
            abfname = os.path.join(folder, filename) + ".abf"
            source = archive.open(member)
            target = file(os.path.join(folder, abfname), 'wb')
            shutil.copyfileobj(source, target)
            del target
    archive.close()

    #start the cmd.exe process to get its PID
    listPIDpre = [proc for proc in psutil.process_iter()]
    process = subprocess.Popen('cmd.exe /k', stdin=subprocess.PIPE)
    listPIDpost = [proc for proc in psutil.process_iter()]
    pid = [proc for proc in listPIDpost if proc not in listPIDpre if "cmd.exe" in str(proc)][0]
    pid = str(pid).split("=")[1].split(",")[0]

    #msmdsrv.ini
    msmdsrvText = '''<ConfigurationSettings>
       <DataDir>{0}</DataDir>
       <TempDir>{0}</TempDir>
       <LogDir>{0}</LogDir>
       <BackupDir>{0}</BackupDir>
       <DeploymentMode>2</DeploymentMode>
       <RecoveryModel>1</RecoveryModel>
       <DisklessModeRequested>0</DisklessModeRequested>
       <CleanDataFolderOnStartup>1</CleanDataFolderOnStartup>
       <AutoSetDefaultInitialCatalog>1</AutoSetDefaultInitialCatalog>
       <Network>
          <Requests>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
          </Requests>
          <Responses>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
             <CompressionLevel>9</CompressionLevel>
          </Responses>
          <ListenOnlyOnLocalConnections>1</ListenOnlyOnLocalConnections>
       </Network>
       <Port>{1}</Port>
       <PrivateProcess>{2}</PrivateProcess>
       <InstanceVisible>0</InstanceVisible>
       <Language>1033</Language>
       <Debug>
          <CallStackInError>0</CallStackInError>
       </Debug>
       <Log>
          <Exception>
             <CrashReportsFolder>{0}</CrashReportsFolder>
          </Exception>
          <FlightRecorder>
             <Enabled>0</Enabled>
          </FlightRecorder>
       </Log>
       <AllowedBrowsingFolders>{0}</AllowedBrowsingFolders>
       <ResourceGovernance>
          <GovernIMBIScheduler>0</GovernIMBIScheduler>
       </ResourceGovernance>
       <Feature>
          <ManagedCodeEnabled>1</ManagedCodeEnabled>
       </Feature>
       <VertiPaq>
          <EnableDisklessTMImageSave>0</EnableDisklessTMImageSave>
          <EnableProcessingSimplifiedLocks>1</EnableProcessingSimplifiedLocks>
       </VertiPaq>
    </ConfigurationSettings>'''

    #save ini file to disk, fill it with required parameters
    msmdsrvini = open(folder+"\\msmdsrv.ini", "w")
    msmdsrvText = msmdsrvText.format(folder, port, pid) #{0},{1},{2}
    msmdsrvini.write(msmdsrvText)
    msmdsrvini.close()

    #run AS engine inside the cmd.exe process
    initString = "\"{0}\\msmdsrv.exe\" -c -s \"{1}\""
    initString = initString.format(pathPowerBI.replace("/","\\"),folder)
    process.stdin.write(initString + " \n")

    #connect to the AS instance from Python
    AMOServer = AMO.Server()
    AMOServer.Connect("localhost:{0}".format(port))

    #restore database from PowerPivot abf backup, disconnect
    AMORestoreInfo = AMO.RestoreInfo(os.path.join(folder, abfname))
    AMOServer.Restore(AMORestoreInfo)
    AMOServer.Disconnect()

    return process
Run Code Online (Sandbox Code Playgroud)

而数据提取部分:

def runQuery(query, port, flag):
    #ADOMD assembly
    ADOMDConn = ADOMD.AdomdConnection("Data Source=localhost:{0}".format(port))
    ADOMDConn.Open()
    ADOMDCommand = ADOMDConn.CreateCommand() 
    ADOMDCommand.CommandText = query

    #read data in via AdomdDataReader object
    DataReader = ADOMDCommand.ExecuteReader()

    #get metadata, number of columns
    SchemaTable = DataReader.GetSchemaTable()
    numCol = SchemaTable.Rows.Count #same as DataReader.FieldCount

    #get column names
    columnNames = []
    for i in range(numCol):
        columnNames.append(str(SchemaTable.Rows[i][0]))

    #fill with data
    data = []
    while DataReader.Read()==True:
        row = []
        for j in range(numCol):
            try:
                row.append(DataReader[j].ToString())
            except:
                row.append(DataReader[j])
        data.append(row)
    df = pandas.DataFrame(data)
    df.columns = columnNames 

    if flag==0:
        DataReader.Close()
        ADOMDConn.Close()

        return df     
    else:   
        #metadata table
        metadataColumnNames = []
        for j in range(SchemaTable.Columns.Count):
            metadataColumnNames.append(SchemaTable.Columns[j].ToString())
        metadata = []
        for i in range(numCol):
            row = []
            for j in range(SchemaTable.Columns.Count):
                try:
                    row.append(SchemaTable.Rows[i][j].ToString())
                except:
                    row.append(SchemaTable.Rows[i][j])
            metadata.append(row)
        metadf = pandas.DataFrame(metadata)
        metadf.columns = metadataColumnNames

        DataReader.Close()
        ADOMDConn.Close()

        return df, metadf
Run Code Online (Sandbox Code Playgroud)

然后通过以下方式提取原始数据:

pathPowerBI = "C:/Program Files/Microsoft Power BI Desktop/bin"
initialSetup(pathPowerBI)
session = restorePowerPivot("D:/Downloads/PowerPivotTutorialSample.xlsx", "D:/", 60000, pathPowerBI)
df, metadf = runQuery("EVALUATE dbo_DimProduct", 60000, 1)
endSession(session)
Run Code Online (Sandbox Code Playgroud)