Parsing 'ul' and 'ol' tags

Ani*_*ari 11 ruby algorithm ruby-on-rails nokogiri ruby-on-rails-4

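I have to handle deeply nested ul, ol, and li tags. I need to produce the same view as in the browser. I want to achieve the following example in a PDF file: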

 text = "
<body>
    <ol>
        <li>One</li>
        <li>Two

            <ol>
                <li>Inner One</li>
                <li>inner Two

                    <ul>
                        <li>hey

                            <ol>
                                <li>hiiiiiiiii</li>
                                <li>why</li>
                                <li>hiiiiiiiii</li>
                            </ol>
                        </li>
                        <li>aniket </li>
                    </li>
                </ul>
                <li>sup </li>
                <li>there </li>
            </ol>
            <li>hey </li>
            <li>Three</li>
        </li>
    </ol>
    <ol>
        <li>Introduction</li>
        <ol>
            <li>Introduction</li>
        </ol>
        <li>Description</li>
        <li>Observation</li>
        <li>Results</li>
        <li>Summary</li>
    </ol>
    <ul>
        <li>Introduction</li>
        <li>Description

            <ul>
                <li>Observation

                    <ul>
                        <li>Results

                            <ul>
                                <li>Summary</li>
                            </ul>
                        </li>
                    </ul>
                </li>
            </ul>
        </li>
        <li>Overview</li>
    </ul>
</body>"

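I have to use Prawn for this task, but Prawn does not support HTML tags. So I came up with a solution using Nokogiri: I parse the markup and later remove the tags with gsub. I have written the following solution for part of the content above, but the problem is that the ul and ol nesting can vary.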

require 'nokogiri'

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{}" },
    3 => ->(index) { "#{}" },
    4 => ->(index) { "#{}" }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "" },
    3 => ->(_) { "" },
    4 => ->(_) { "" },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ul][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.fragment(text)

doc.search('ol').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul').each do |group|
  ul_rule(group, deepness: 1)
end


  puts doc.inner_text


1. One
2. Two

1. Inner One
2. inner Two

• hey

1. hiiiiiiiii
2. why
3. hiiiiiiiii


• aniket 


3. sup 
4. there 

3. hey 
4. Three



1. Introduction

1. Introduction

2. Description
3. Observation
4. Results
5. Summary



• Introduction
• Description

• Observation

• Results

• Summary






• Overview

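Questions

1) How do I handle spacing (indentation) when processing ul and ol tags?
2) How do I handle deep nesting, when an li sits inside a ul or an ol that is itself nested?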

stw*_*ert 5

I've come up with a solution that handles multiple indentation levels with configurable numbering rules per level:

require 'nokogiri'
ROMANS = %w[i ii iii iv v vi vii viii ix]

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{('a'..'z').to_a[index]}. " },
    3 => ->(index) { "#{ROMANS.to_a[index]}. " },
    4 => ->(index) { "#{ROMANS.to_a[index].upcase}. " }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "\u25E6 " },
    3 => ->(_) { "* " },
    4 => ->(_) { "- " },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ul][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.fragment(text)

doc.search('ol:root').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul:root').each do |group|
  ul_rule(group, deepness: 1)
end

You can then remove the tags or use doc.inner_text, depending on your environment.
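One thing to watch in the RULES table above: ROMANS holds only nine entries and the alphabet 26, so a longer list gets a marker without a number (or a NoMethodError from `nil.upcase` at level 4), and a fifth nesting level has no rule at all. Here is a minimal sketch of markers that don't run out. `roman`, `letter`, `SAFE_OL_RULES` and `ol_prefix` are hypothetical names that are not part of the code above, and reusing the deepest rule for deeper levels is just one possible choice.

```ruby
# Value => glyph pairs, largest first, for building lowercase Roman numerals.
ROMAN_VALUES = {
  1000 => 'm', 900 => 'cm', 500 => 'd', 400 => 'cd',
  100 => 'c', 90 => 'xc', 50 => 'l', 40 => 'xl',
  10 => 'x', 9 => 'ix', 5 => 'v', 4 => 'iv', 1 => 'i'
}.freeze

# Converts a positive integer to a lowercase Roman numeral.
def roman(number)
  ROMAN_VALUES.reduce(+'') do |result, (value, glyph)|
    count, number = number.divmod(value)
    result << glyph * count
  end
end

# Zero-based index to a letter label: 0 => "a", 25 => "z", 26 => "aa".
def letter(index)
  label = 'a'.dup
  index.times { label.succ! }
  label
end

SAFE_OL_RULES = {
  1 => ->(index) { "#{index + 1}. " },
  2 => ->(index) { "#{letter(index)}. " },
  3 => ->(index) { "#{roman(index + 1)}. " },
  4 => ->(index) { "#{roman(index + 1).upcase}. " }
}.freeze

# Assumption: levels deeper than the table reuse the deepest defined rule.
def ol_prefix(deepness, index)
  rule = SAFE_OL_RULES.fetch(deepness) { SAFE_OL_RULES[SAFE_OL_RULES.keys.max] }
  rule.call(index)
end

puts ol_prefix(3, 9)  # tenth item at level 3
puts ol_prefix(2, 26) # twenty-seventh item at level 2
puts ol_prefix(6, 3)  # level 6 falls back to the level 4 rule
```

In ol_rule, the lookup `RULES[:ol][deepness].call(i)` could then be replaced by something like `ol_prefix(deepness, i)`.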

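Two caveats, though:

  1. Your entry selector must be chosen carefully. I used your snippet verbatim, without a root element, so I had to use ul:root / ol:root. Perhaps "body > ol" works for your situation too. Another option may be to select every ol / ul and then walk them, keeping only those that have no list ancestor.
  2. Using your example verbatim, Nokogiri does not handle the last two list items of the first ol group ("hey", "Three") very well. By the time Nokogiri has parsed the markup, those elements have already "left" their ol tree and been placed in the root tree.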
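The "select every list, then keep only those without a list ancestor" idea from the first caveat could look roughly like this. It is a sketch, not tested against your document: `top_level_lists` and `nested_in_list?` are hypothetical helpers that rely only on each node responding to `name` and `parent`, which Nokogiri nodes do. `FakeNode` is a stand-in so the snippet runs without Nokogiri.

```ruby
LIST_TAGS = %w[ol ul].freeze

# True if any ancestor of the node is an ol or ul.
# The respond_to? guard stops the walk at objects without #parent
# (for example a document object).
def nested_in_list?(node)
  current = node.respond_to?(:parent) ? node.parent : nil
  while current
    return true if LIST_TAGS.include?(current.name)
    current = current.respond_to?(:parent) ? current.parent : nil
  end
  false
end

# Keeps only the lists that are not nested inside another list.
def top_level_lists(lists)
  lists.reject { |list| nested_in_list?(list) }
end

# Stand-in tree: body > ol > li > ul
FakeNode = Struct.new(:name, :parent)
body  = FakeNode.new('body', nil)
outer = FakeNode.new('ol', body)
item  = FakeNode.new('li', outer)
inner = FakeNode.new('ul', item)

puts top_level_lists([outer, inner]).map(&:name).inspect
```

With a parsed document you would presumably pass `doc.search('ol, ul')` and then dispatch each result to ol_rule or ul_rule based on its `name`.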

Current output:

  1. One
  2. Two
      a. Inner One
      b. inner Two
        ◦ hey
        ◦ hey
      3. hey
      4. hey
  hey
  Three

  1. Introduction
    a. Introduction
  2. Description
  3. Observation
  4. Results
  5. Summary

  • Introduction
  • Description
      ◦ Observation
          * Results
              - Summary
  • Overview
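On the spacing question: the prefixes built by the rule lambdas carry only the marker, not any indentation. If the indentation should follow the nesting depth regardless of how the source HTML is formatted, one option is to wrap each rule so that it also emits the indent. A minimal sketch, where `indent_rules`, `indented` and `SAMPLE_RULES` are hypothetical names and two spaces per level is an arbitrary choice:

```ruby
# Wraps a single rule so that it prepends (deepness - 1) indent units.
def indented(rule, deepness, unit: '  ')
  ->(index) { (unit * (deepness - 1)) + rule.call(index) }
end

# Applies the wrapper to every rule in a { kind => { deepness => rule } } table.
def indent_rules(rules, unit: '  ')
  rules.transform_values do |levels|
    levels.to_h { |deepness, rule| [deepness, indented(rule, deepness, unit: unit)] }
  end
end

# Shortened stand-in for a rule table shaped like RULES.
SAMPLE_RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{('a'..'z').to_a[index]}. " }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "\u25E6 " }
  }
}.freeze

INDENTED_RULES = indent_rules(SAMPLE_RULES)

puts INDENTED_RULES[:ol][1].call(0) + 'One'
puts INDENTED_RULES[:ol][2].call(1) + 'inner Two'
puts INDENTED_RULES[:ul][2].call(0) + 'hey'
```

Whitespace between tags in the source HTML would likely still show up in doc.inner_text, so it may need to be stripped separately before the indented prefixes line up.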