Algorithm/coding help for a PySpark Markov model

Question

nam*_*don 5 python algorithm machine-learning apache-spark pyspark

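I need some help getting my head around designing an (efficient) Markov chain in Spark (via Python). I've written it as best I could, but the code I came up with doesn't scale. Basically, for the various map stages I wrote custom functions. They work fine for sequences of a couple of thousand, but once we get into the 20,000+ range (and I have some up to 800k) things slow to a crawl.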

For those of you not familiar with Markov models, this is the gist of it.

This is my data. At this point I've got the actual data (no header) in an RDD.

ID, SEQ
500, HNL, LNH, MLH, HML

We look at the sequences as tuples:

(HNL, LNH), (LNH,MLH), etc..

I need to get to this point, where I return a dictionary (for each row of data) that I then serialize and store in an in-memory database.

{500:
    {HNLLNH : 0.333},
    {LNHMLH : 0.333},
    {MLHHML : 0.333},
    {LNHHNL : 0.000},
    etc..
}

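So in essence, each sequence is combined with the next one (HNL, LNH become "HNLLNH"). Then, for all possible transitions (combinations of sequences), we count their occurrences and divide by the total number of transitions (3 in this case) to get their frequency of occurrence.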

There are 3 transitions above, and one of them is HNLLNH. So for HNLLNH, 1/3 = 0.333.

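As a side note, and I'm not sure if it's relevant, the values for each position in a sequence are limited: first position (H/M/L), second position (M/L), third position (H/M/L).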

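What my code previously did was collect() the RDD and map it a couple of times using functions I wrote. Those functions first turned the string into a list, then merged list[1] with list[2], then list[2] with list[3], then list[3] with list[4], and so on, so I ended up with something like this: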

[HNLLNH],[LNHMLH],[MHLHML], etc..

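The next function then created a dictionary out of that list, using each list item as a key, counted the total occurrences of that key in the full list, and divided by len(list) to get the frequency. I then wrapped that dictionary in another dictionary, along with its ID number (resulting in the second code block above).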

As I said, this worked well for smallish sequences, but not so well for lists with a length of 100k+.
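One likely contributor to the slowdown, assuming the counting step calls list.count() once per element as described above, is that each call rescans the whole list, so the work grows quadratically with the number of transitions. The sketch below (Python 3, with the hypothetical helper names `frequencies_quadratic` and `frequencies_linear`) contrasts that approach with a single-pass tally; it is an illustration, not a measurement of the original code.

```python
from collections import Counter

def frequencies_quadratic(transitions):
    # Mirrors the approach described above: list.count() rescans the whole
    # list for every element, so the total cost grows as O(n^2).
    total = len(transitions)
    return {t: transitions.count(t) / total for t in transitions}

def frequencies_linear(transitions):
    # Counter tallies every transition in one pass, so the cost is O(n).
    total = len(transitions)
    return {t: c / total for t, c in Counter(transitions).items()}

# Sample row from the question, turned into consecutive pairs.
seq = ["HNL", "LNH", "MLH", "HML"]
transitions = list(zip(seq, seq[1:]))
```

Both functions return the same dictionary for the same input; only the amount of work differs as the list grows.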

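Also, keep in mind this is just one row of data. I have to perform this operation on anywhere from 10-20k rows of data, with each row varying in length between 500 and 800,000 sequences.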

Any suggestions on how I can write PySpark code (using the API map/reduce/agg/etc. functions) to do this efficiently?

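EDIT: Code as follows. It probably makes sense to start from the bottom. Please keep in mind I'm learning this (Python and Spark) as I go, and I don't do this for a living, so my coding standards are not great.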

def f(x):
    # Custom RDD map function
    # Combines two separate transactions
    # into a single transition state

    cust_id = x[0]
    trans = ','.join(x[1])
    y = trans.split(",")
    s = ''
    for i in range(len(y)-1):
        s= s + str(y[i] + str(y[i+1]))+","
    return str(cust_id+','+s[:-1])

def g(x):
    # Custom RDD map function
    # Calculates the transition state probabilities
    # by adding up state-transition occurrences
    # and dividing by total transitions
    cust_id=str(x.split(",")[0])
    trans = x.split(",")[1:]
    temp_list=[]
    middle = int((len(trans[0])+1)/2)
    for i in trans:
        temp_list.append( (''.join(i)[:middle], ''.join(i)[middle:]) )

    state_trans = {}
    for i in temp_list:
        state_trans[i] = temp_list.count(i)/(len(temp_list))

    my_dict = {}
    my_dict[cust_id]=state_trans
    return my_dict


def gen_tsm_dict_spark(lines):
    # Takes RDD/string input with format CUST_ID(or)PROFILE_ID,SEQ,SEQ,SEQ....
    # Returns RDD of dict with CUST_ID and tsm per customer
    #  i.e.  {cust_id : { ('NLN', 'LNN') : 0.33, ('HPN', 'NPN') : 0.66}

    # creates a tuple ([cust/profile_id], [SEQ,SEQ,SEQ])
    cust_trans = lines.map(lambda s: (s.split(",")[0],s.split(",")[1:]))

    with_seq = cust_trans.map(f)

    full_tsm_dict = with_seq.map(g)

    return full_tsm_dict


def main():
    # my_rdd (input lines) and db_insert (DB helper) are defined elsewhere
    result = gen_tsm_dict_spark(my_rdd)

    # Insert into DB
    for x in result.collect():
        for k, v in x.items():
            db_insert(k, v)

Answer 1

zer*_*323 2

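You can try something like the code below. It depends heavily on toolz, but if you prefer to avoid external dependencies you can easily replace it with some standard Python libraries.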

from __future__ import division
from collections import Counter
from itertools import product
from toolz.curried import sliding_window, map, pipe, concat
from toolz.dicttoolz import merge

# Generate all possible transitions 
defaults = sc.broadcast(dict(map(
    lambda x: ("".join(concat(x)), 0.0), 
    product(product("HNL", "NL", "HNL"), repeat=2))))

rdd = sc.parallelize(["500, HNL, LNH, NLH, HNL", "600, HNN, NNN, NNN, HNN, LNH"])

def process(line):
    """
    >>> process("000, HHH, LLL, NNN")
    ('000', {'LLLNNN': 0.5, 'HHHLLL': 0.5})
    """
    bits = line.split(", ")
    transactions = bits[1:]
    n = len(transactions) - 1
    frequencies = pipe(
        sliding_window(2, transactions), # Get all transitions
        map(lambda p: "".join(p)), # Joins strings
        Counter, # Count 
        lambda cnt: {k: v / n for (k, v) in cnt.items()} # Get frequencies
    )
    return bits[0], frequencies

def store_partition(iter):
    for (k, v) in iter:
        db_insert(k, merge([defaults.value, v]))

rdd.map(process).foreachPartition(store_partition)

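For readers who want to drop the toolz dependency, here is a minimal standard-library sketch of the same per-line logic, assuming Python 3. `zip(xs, xs[1:])` stands in for `sliding_window(2, xs)`, and a copy-then-update stands in for `merge`. The helper name `with_defaults` is hypothetical; the alphabets are copied from the code above.

```python
from collections import Counter
from itertools import product

# All possible two-state transitions, each defaulting to 0.0.
# The alphabets "HNL", "NL", "HNL" are copied from the answer's code.
states = ["".join(s) for s in product("HNL", "NL", "HNL")]
defaults = {a + b: 0.0 for a, b in product(states, repeat=2)}

def process(line):
    bits = line.split(", ")
    transactions = bits[1:]
    n = len(transactions) - 1
    # zip(xs, xs[1:]) yields consecutive pairs, like sliding_window(2, xs).
    counts = Counter(a + b for a, b in zip(transactions, transactions[1:]))
    return bits[0], {k: v / n for k, v in counts.items()}

def with_defaults(frequencies):
    # Stand-in for toolz merge([defaults, frequencies]): later keys win.
    merged = dict(defaults)
    merged.update(frequencies)
    return merged
```

In a Spark job, `process` could be passed to `rdd.map` exactly as in the toolz version, with `defaults` wrapped in a broadcast variable.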

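Since you know all possible transitions, I would recommend using a sparse representation and ignoring the zeros. Moreover, you can replace the dictionaries with sparse vectors to reduce the memory footprint.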
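One way to sketch the sparse-vector idea in plain Python (assuming Python 3) is to fix an ordering of all possible transitions and keep only the non-zero entries as parallel index and value lists. The names `index` and `to_sparse` are hypothetical. The resulting `(size, indices, values)` triple follows the argument order of `pyspark.ml.linalg.SparseVector`, so it should be usable with that class, though this sketch does not import PySpark.

```python
from itertools import product

# Fixed ordering of all possible transitions; the alphabets are copied
# from the answer's code.
states = ["".join(s) for s in product("HNL", "NL", "HNL")]
index = {a + b: i for i, (a, b) in enumerate(product(states, repeat=2))}

def to_sparse(frequencies):
    # Keep only non-zero entries, sorted by their position in the ordering.
    pairs = sorted((index[k], v) for k, v in frequencies.items() if v != 0.0)
    indices = [i for i, _ in pairs]
    values = [v for _, v in pairs]
    return len(index), indices, values

size, indices, values = to_sparse({"HNLLNH": 0.5, "LNHHNL": 0.5})
```

With 324 possible transitions and only a handful observed per row, storing two short lists per row is much smaller than storing a full dictionary padded with zeros.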
