Goo*_*bot -2 php regex recursion preg-match xml-parsing
I have a natural language processing parse tree such as:
(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
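I want to store it in an associative array, but PHP has no built-in function for this, since NLP is usually done in Python.
So I have to parse the opening and closing parentheses myself to build an associative array that mirrors the tree structure. I can think of two options.
I think the first approach is non-standard, and the regex pattern may break in complex cases.
Can you suggest a reliable approach?
The associative array can take any form, since it is not hard to manipulate (I only need to loop over it), but it could look like: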
Array (
    [0] => word => ROOT, tag => S, children => Array (
        [0] => word => I, tag => NP, children => Array()
        [1] => word => ROOT, tag => VP, children => Array (
            [0] => word => ROOT, tag => VP, children => Array ( .... )
            [1] => word => ROOT, tag => PP, children => Array ( .... )
        )
    )
)
or it could be:
Array (
    [0] => Array([0] => S, [1] => Array (
        [0] => Array([0] => NP, [1] => 'I') // child array is replaced by a string
        [1] => Array([0] => VP, [1] => Array (
            [0] => Array([0] => VP, [1] => Array ( .... ))
            [1] => Array([0] => PP, [1] => Array ( .... ))
        ))
    ))
)
Use a lexer or parser generator such as flex or bison, or simply write your own lexer by hand; this answer has some useful information on that.
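As a rough illustration of the hand-written route, the sketch below splits a bracketed string into tokens with `preg_match_all` and checks that the parentheses balance before any tree is built. The function names `tokenize` and `isBalanced` are made up for this example, and the pattern `[^\s()]+` is an assumption: it treats any run of characters other than whitespace and parentheses as one token, so tags containing punctuation would be kept whole.

```php
<?php
// Hypothetical helper: split a bracketed parse string into "(", ")" and word tokens.
function tokenize(string $input): array
{
    // Assumption: a token is "(", ")" or any run of non-space, non-paren characters.
    preg_match_all('/\(|\)|[^\s()]+/', $input, $matches);
    return $matches[0];
}

// Hypothetical helper: true when every ")" closes an earlier "(" and none stay open.
function isBalanced(array $tokens): bool
{
    $depth = 0;
    foreach ($tokens as $token) {
        if ($token === '(') {
            $depth++;
        } elseif ($token === ')') {
            if (--$depth < 0) {
                return false; // a ")" with no matching "("
            }
        }
    }
    return $depth === 0;
}

$tokens = tokenize('(NP (Det an) (N elephant))');
echo implode(' ', $tokens), PHP_EOL;          // ( NP ( Det an ) ( N elephant ) )
var_dump(isBalanced($tokens));                 // bool(true)
var_dump(isBalanced(tokenize('(NP (Det an)'))); // bool(false)
```

Rejecting unbalanced input up front is one way to avoid building a silently truncated tree from a malformed string.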
Here is a quick and dirty proof-of-concept snippet in PHP that outputs a nested array in the second form from the question.
<?php
// Sample parse tree in bracketed notation.
$data = <<<EOL
(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
EOL;

$lexer = new Lexer($data);
$array = buildTree($lexer);
print_r($array);

// Consume tokens until the matching ")" or the end of input,
// recursing whenever a "(" opens a subtree.
function buildTree($lexer)
{
    $subtrees = [];
    $markers = [];
    while (($token = $lexer->nextToken()) !== false) {
        if ($token === '(') {
            $subtrees[] = buildTree($lexer);
        } elseif ($token === ')') {
            return buildNode($markers, $subtrees);
        } else {
            // Plain word: either a tag or a leaf word.
            $markers[] = $token;
        }
    }
    // End of input: return whatever was collected at the top level.
    return buildNode($markers, $subtrees);
}

// Shape a node: [tag, subtrees] for inner nodes, the list of subtrees
// when there is no tag, or the plain markers (e.g. [tag, word]) for leaves.
function buildNode($markers, $subtrees)
{
    if (count($markers) && count($subtrees)) {
        return [$markers[0], $subtrees];
    } elseif (count($subtrees)) {
        return $subtrees;
    } else {
        return $markers;
    }
}

class Lexer
{
    private $matches;
    private $index = -1;

    public function __construct($data)
    {
        // A token is a run of word characters, "(" or ")".
        preg_match_all('/\w+|\(|\)/', $data, $matches);
        $this->matches = $matches[0];
    }

    // Return the next token, or false when the input is exhausted.
    public function nextToken()
    {
        $index = ++$this->index;
        if (!isset($this->matches[$index])) {
            return false;
        }
        return $this->matches[$index];
    }
}
Output:
Array
(
  [0] => Array
  (
    [0] => S
    [1] => Array
    (
      [0] => Array
      (
        [0] => NP
        [1] => I
      )
      [1] => Array
      (
        [0] => VP
        [1] => Array
        (
          [0] => Array
          (
            [0] => VP
            [1] => Array
            (
              [0] => Array
              (
                [0] => V
                [1] => shot
              )
              [1] => Array
              (
                [0] => NP
                [1] => Array
                (
                  [0] => Array
                  (
                    [0] => Det
                    [1] => an
                  )
                  [1] => Array
                  (
                    [0] => N
                    [1] => elephant
                  )
                )
              )
            )
          )
          [1] => Array
          (
            [0] => PP
            [1] => Array
            (
              [0] => Array
              (
                [0] => P
                [1] => in
              )
              [1] => Array
              (
                [0] => NP
                [1] => Array
                (
                  [0] => Array
                  (
                    [0] => Det
                    [1] => my
                  )
                  [1] => Array
                  (
                    [0] => N
                    [1] => pajamas
                  )
                )
              )
            )
          )
        )
      )
    )
  )
)
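The nested pairs above can be reshaped into the `tag` / `word` / `children` layout sketched in the question, should that form be easier to work with. What follows is only a sketch: `toAssoc` is a hypothetical name, the input is assumed to have exactly the shape printed above (each node is `[tag, word]` for a leaf or `[tag, [child, ...]]` otherwise), and using `null` as the `word` of an inner node is a choice made for this example rather than something the question specifies.

```php
<?php
// Hypothetical converter from [tag, word] / [tag, [children...]] pairs
// to ['tag' => ..., 'word' => ..., 'children' => [...]].
function toAssoc(array $node): array
{
    $tag = $node[0];
    $rest = $node[1] ?? null;

    if (is_array($rest)) {
        // Inner node: convert every child recursively.
        return [
            'tag' => $tag,
            'word' => null,
            'children' => array_map('toAssoc', $rest),
        ];
    }
    // Leaf node: the second element is the word itself.
    return ['tag' => $tag, 'word' => $rest, 'children' => []];
}

// A small sample in the same shape as the output above.
$np = ['NP', [['Det', 'an'], ['N', 'elephant']]];
print_r(toAssoc($np));
```

For the full result of `buildTree`, the root node is the first element of the outer array, so the call would be `toAssoc($array[0])`.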
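Since the question mentions looping over the result, here is one possible way to walk the nested pairs and collect the words in sentence order. `collectWords` is a hypothetical helper, and it assumes the `[tag, word]` / `[tag, [children]]` shape shown in the output.

```php
<?php
// Hypothetical helper: gather leaf words from left to right.
function collectWords(array $node): array
{
    $rest = $node[1] ?? null;
    if (!is_array($rest)) {
        // Leaf: [tag, word]; a tag with no word contributes nothing.
        return $rest === null ? [] : [$rest];
    }
    $words = [];
    foreach ($rest as $child) {
        // Append the words found under each child, preserving order.
        $words = array_merge($words, collectWords($child));
    }
    return $words;
}

// Tree literal in the same shape as the parser output.
$tree = ['S', [
    ['NP', 'I'],
    ['VP', [
        ['VP', [['V', 'shot'], ['NP', [['Det', 'an'], ['N', 'elephant']]]]],
        ['PP', [['P', 'in'], ['NP', [['Det', 'my'], ['N', 'pajamas']]]]],
    ]],
]];
echo implode(' ', collectWords($tree)), PHP_EOL; // I shot an elephant in my pajamas
```

The same recursive pattern (handle the leaf case, otherwise iterate over the children) applies to any other traversal of this structure, such as counting tags or searching for a phrase type.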