用动态规划实现文本对齐

Question

用动态规划实现文本对齐

ber*_*zie 20 algorithm dynamic-programming python-3.x

我想了解动态规划的概念,通过对MIT OCW课程在这里.关于OCW视频的解释很棒,但我觉得在我将解释实现到代码之前我并不理解它.在实施时,我会参考这里的讲义中的一些注释,特别是注释的第3页.

问题是,我不知道如何将一些数学符号转换为代码.这是我实施的解决方案的一部分(并认为它是正确实现的):

import math

paragraph = "Some long lorem ipsum text."
words = paragraph.split(" ")

# Count total length for all strings in a list of strings.
# This function will be used by the badness function below.
def total_length(str_arr):
    total = 0

    for string in str_arr:
        total = total + len(string)

    total = total + len(str_arr) # spaces
    return total

# Calculate the badness score for a word.
# str_arr is assumed be send as word[i:j] as in the notes
# we don't make i and j as argument since it will require
# global vars then.
def badness(str_arr, page_width):
    line_len = total_length(str_arr)
    if line_len > page_width:
        return float('nan') 
    else:
        return math.pow(page_width - line_len, 3)

Run Code Online (Sandbox Code Playgroud)

现在我不理解的部分是在讲义中的第3点到第5点.我实际上不明白,也不知道从哪里开始实现这些.到目前为止,我已经尝试迭代单词列表,并计算每个所谓的行尾的坏处,如下所示:

def justifier(str_arr, page_width):
    paragraph = str_arr
    par_len = len(paragraph)
    result = [] # stores each line as list of strings
    for i in range(0, par_len):
        if i == (par_len - 1):
            result.append(paragraph)
        else:
            dag = [badness(paragraph[i:j], page_width) + justifier(paragraph[j:], page_width) for j in range(i + 1, par_len + 1)] 
            # Should I do a min(dag), get the index, and declares it as end of line?

Run Code Online (Sandbox Code Playgroud)

但是,我不知道如何继续这个功能,说实话,我不明白这一行:

dag = [badness(paragraph[i:j], page_width) + justifier(paragraph[j:], page_width) for j in range(i + 1, par_len + 1)]

Run Code Online (Sandbox Code Playgroud)

以及如何我将回到justifier作为一个int(因为我已经决定来存储返回值result,这是一个名单,我应该从那里另一个功能和递归？如果有任何递归呢？

你能告诉我下一步该做什么,并解释这是如何动态编程的吗？我真的看不出递归的位置,以及子问题是什么.

谢谢你.

Answer 1

Joo*_*wan 21

如果您无法理解动态编程本身的核心思想,请参考以下内容:

动态编程本质上牺牲了时间复杂度的空间复杂度(但是与您节省的时间相比,您使用的额外空间通常非常少,如果正确实现,动态编程总是值得的).您可以随时存储每个递归调用的值(例如,在数组或字典中),这样当您在递归树的另一个分支中遇到相同的递归调用时,可以避免第二次计算.

而且不,你不必须使用递归.以下是我正在使用循环的问题的实现.我非常密切地关注了由AlexSilva链接的TextAlignment.pdf.希望你觉得这很有帮助.



def length(wordLengths, i, j):
    return sum(wordLengths[i- 1:j]) + j - i + 1


def breakLine(text, L):
    # wl = lengths of words
    wl = [len(word) for word in text.split()]

    # n = number of words in the text
    n = len(wl)    

    # total badness of a text l1 ... li
    m = dict()
    # initialization
    m[0] = 0    

    # auxiliary array
    s = dict()

    # the actual algorithm
    for i in range(1, n + 1):
        sums = dict()
        k = i
        while (length(wl, k, i) <= L and k > 0):
            sums[(L - length(wl, k, i))**3 + m[k - 1]] = k
            k -= 1
        m[i] = min(sums)
        s[i] = sums[min(sums)]

    # actually do the splitting by working backwords
    line = 1
    while n > 1:
        print("line " + str(line) + ": " + str(s[n]) + "->" + str(n))
        n = s[n] - 1
        line += 1

Run Code Online (Sandbox Code Playgroud)

Answer 2

Suu*_*hgi 7

对于任何人在此仍然有兴趣：关键是从文本的末尾向后移动（如提到这里）。如果这样做，您只需比较已经记忆的元素。

说，words是要包装的字符串列表textwidth。然后，在讲义中，任务减少为三行代码：

import numpy as np

textwidth = 80

DP = [0]*(len(words)+1)

for i in range(len(words)-1,-1,-1):
    DP[i] = np.min([DP[j] + badness(words[i:j],textwidth) for j in range(i+1,len(words)+1)])

Run Code Online (Sandbox Code Playgroud)

带有：

def badness(line,textwidth):

    # Number of gaps
    length_line = len(line) - 1

    for word in line:
        length_line += len(word)

    if length_line > textwidth: return float('inf')

    return ( textwidth - length_line )**3

Run Code Online (Sandbox Code Playgroud)

他提到，可以添加第二个列表来跟踪中断位置。您可以通过将代码更改为：

DP = [0]*(len(words)+1)
breaks = [0]*(len(words)+1)

for i in range(len(words)-1,-1,-1):
    temp = [DP[j] + badness(words[i:j],args.textwidth) for j in range(i+1,len(words)+1)]

    index = np.argmin(temp)

    # Index plus position in upper list
    breaks[i] = index + i + 1
    DP[i] = temp[index]

Run Code Online (Sandbox Code Playgroud)

要恢复文本，只需使用中断位置列表：

def reconstruct_text(words,breaks):                                                                                                                

    lines = []
    linebreaks = []

    i = 0 
    while True:

        linebreaks.append(breaks[i])
        i = breaks[i]

        if i == len(words):
            linebreaks.append(0)
            break

    for i in range( len(linebreaks) ):
        lines.append( ' '.join( words[ linebreaks[i-1] : linebreaks[i] ] ).strip() )

    return lines

Run Code Online (Sandbox Code Playgroud)

结果：（text = reconstruct_text(words,breaks)）

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy
eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam
voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet
clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit
amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed
diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet
clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

Run Code Online (Sandbox Code Playgroud)

可能会想添加一些空格。这是非常棘手的（因为可能会提出各种美学规则），但是天真的尝试可能是：

import re

def spacing(text,textwidth,maxspace=4):

    for i in range(len(text)):

        length_line = len(text[i])

        if length_line < textwidth:

            status_length = length_line
            whitespaces_remain = textwidth - status_length
            Nwhitespaces = text[i].count(' ')

            # If whitespaces (to add) per whitespace exeeds
            # maxspace, don't do anything.
            if whitespaces_remain/Nwhitespaces > maxspace-1:pass
            else:
                text[i] = text[i].replace(' ',' '*( 1 + int(whitespaces_remain/Nwhitespaces)) )
                status_length = len(text[i])

                # Periods have highest priority for whitespace insertion
                periods = text[i].split('.')

                # Can we add a whitespace behind each period?
                if len(periods) - 1 + status_length <= textwidth:
                    text[i] = '. '.join(periods).strip()

                status_length = len(text[i])
                whitespaces_remain = textwidth - status_length
                Nwords = len(text[i].split())
                Ngaps = Nwords - 1

                if whitespaces_remain != 0:factor = Ngaps / whitespaces_remain

                # List of whitespaces in line i
                gaps = re.findall('\s+', text[i])

                temp = text[i].split()
                for k in range(Ngaps):
                    temp[k] = ''.join([temp[k],gaps[k]])

                for j in range(whitespaces_remain):
                    if status_length >= textwidth:pass
                    else:
                        replace = temp[int(factor*j)]
                        replace = ''.join([replace, " "])
                        temp[int(factor*j)] = replace

                text[i] = ''.join(temp)

    return text

Run Code Online (Sandbox Code Playgroud)

是什么让您：（text = spacing(text,textwidth)）

Lorem  ipsum  dolor  sit  amet, consetetur  sadipscing  elitr,  sed  diam nonumy
eirmod  tempor  invidunt  ut labore  et  dolore  magna aliquyam  erat,  sed diam
voluptua.   At  vero eos  et accusam  et justo  duo dolores  et ea  rebum.  Stet
clita  kasd  gubergren,  no  sea  takimata sanctus  est  Lorem  ipsum  dolor sit
amet.   Lorem  ipsum  dolor  sit amet,  consetetur  sadipscing  elitr,  sed diam
nonumy  eirmod  tempor invidunt  ut labore  et dolore  magna aliquyam  erat, sed
diam  voluptua.  At vero eos et accusam et  justo duo dolores et ea rebum.  Stet
clita  kasd gubergren, no sea  takimata sanctus est Lorem  ipsum dolor sit amet.

Run Code Online (Sandbox Code Playgroud)

归档时间：	12 年，6 月前
查看次数：	15774 次
最近记录：	7 年，2 月前