Parsing an HTTP directory listing

Voj*_*ech 7 delphi delphi-xe

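Good day! I am using Delphi XE with Indy TIdHTTP. Using the Get method I retrieve a remote directory listing, and I need to parse it: get the list of files with their sizes and timestamps, and distinguish files from subdirectories. Is there a good routine for this? Thanks in advance! Vojtech

Here is a sample: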

<head>
  <title>127.0.0.1 - /</title>
</head>
<body>
  <H1>127.0.0.1 - /</H1><hr>
<pre>      
  Mittwoch, 30. März 2011    12:01        &lt;dir&gt; <A HREF="/SubDir/">SubDir</A><br />
  Mittwoch, 9. Februar 2005    17:14          113 <A HREF="/file.txt">file.txt</A><br />
</pre>
<hr>
</body>

Cos*_*und 7

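Given the code sample, I think the quickest way to parse it is this:

  • Identify the <pre>...</pre> block that contains all the listing lines. That should be easy.
  • Put everything between <pre> and </pre> into a TStringList. Each line is a file or a folder, in a very simple format.
  • Extract the link from each line and, as needed, the date, time and size. This is best done with a regular expression (you have Delphi XE, so regex support is built in).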
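
The three steps above could be sketched roughly as follows. This is a minimal sketch under assumptions, not a tested routine: it assumes every row looks exactly like the two rows in the question's sample (date text, time, then a size or `&lt;dir&gt;`, then a single anchor) and that each row sits on its own line. The names `TListingEntry`, `TryParseListingLine` and `ParseListing` are hypothetical. The unit is called `RegularExpressions` in Delphi XE; later versions name it `System.RegularExpressions`.

```pascal
uses
  SysUtils, Classes, RegularExpressions;

type
  // Hypothetical record, used only for this illustration
  TListingEntry = record
    Name: string;
    Href: string;
    DateText: string;  // left unparsed, e.g. 'Mittwoch, 30. März 2011'
    TimeText: string;  // e.g. '12:01'
    Size: Int64;       // -1 for directories
    IsDir: Boolean;
  end;

const
  // Groups: 1 = date text, 2 = time, 3 = size or &lt;dir&gt;, 4 = href, 5 = link text
  LinePattern =
    '^\s*(.+?)\s+(\d{1,2}:\d{2})\s+(&lt;dir&gt;|\d+)\s+' +
    '<A HREF="([^"]*)">([^<]*)</A>';

// Step 3: pull the fields out of one listing row.
function TryParseListingLine(const Line: string; out Entry: TListingEntry): Boolean;
var
  M: TMatch;
begin
  M := TRegEx.Match(Line, LinePattern, [roIgnoreCase]);
  Result := M.Success;
  if not Result then
    Exit;
  Entry.DateText := M.Groups[1].Value;
  Entry.TimeText := M.Groups[2].Value;
  Entry.IsDir := SameText(M.Groups[3].Value, '&lt;dir&gt;');
  if Entry.IsDir then
    Entry.Size := -1
  else
    Entry.Size := StrToInt64(M.Groups[3].Value);
  Entry.Href := M.Groups[4].Value;
  Entry.Name := M.Groups[5].Value;
end;

procedure ParseListing(const Html: string; Entries: TStrings);
var
  Block: TMatch;
  Lines: TStringList;
  Entry: TListingEntry;
  I: Integer;
begin
  // Step 1: isolate the <pre>...</pre> block (roSingleLine lets '.' span line breaks)
  Block := TRegEx.Match(Html, '<pre>(.*?)</pre>', [roIgnoreCase, roSingleLine]);
  if not Block.Success then
    Exit;
  Lines := TStringList.Create;
  try
    // Step 2: one listing row per line; rows that do not match are skipped
    Lines.Text := Block.Groups[1].Value;
    for I := 0 to Lines.Count - 1 do
      if TryParseListingLine(Lines[I], Entry) then
        Entries.Add(Format('%s | %s %s | %d',
          [Entry.Href, Entry.DateText, Entry.TimeText, Entry.Size]));
  finally
    Lines.Free;
  end;
end;
```

Calling `ParseListing(IdHTTP1.Get(Url), Memo1.Lines)` would then add one line per file or folder (`IdHTTP1`, `Url` and `Memo1` are placeholders here). Turning the date text into a `TDateTime` is left out on purpose, because the sample uses localized German day and month names.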

  • +1. It's easy when there is a sample. Aren't you glad I asked? :) (6 upvotes)

kob*_*bik 7

This should give you a good start and an idea of how to use the DOM:

uses
  MSHTML,
  ActiveX,
  ComObj;

procedure DocumentFromString(Document: IHTMLDocument2; const S: WideString);
var
  v: OleVariant;
begin
  v := VarArrayCreate([0, 0], varVariant);
  v[0] := S;
  Document.Write(PSafeArray(TVarData(v).VArray));
  Document.Close;
end;

function StripMultipleChar(const S: string; const C: Char): string;
begin
  Result := S;
  while Pos(C + C, Result) <> 0 do
    Result := StringReplace(Result, C + C, C, [rfReplaceAll]);
end;

procedure TForm1.Button1Click(Sender: TObject);
var
  Document: IHTMLDocument2;
  Elements: IHTMLElementCollection;
  Element: IHTMLElement;
  I: Integer;
  Line: string;
begin
  Document := CreateComObject(CLASS_HTMLDocument) as IHTMLDocument2;
  DocumentFromString(Document, '<head>...'); // your HTML here

  Elements := Document.all.tags('A') as IHTMLElementCollection;
  for I := 0 to Elements.length - 1 do
  begin
    Element := Elements.item(I, '') as IHTMLElement;
    Memo1.Lines.Add('A HREF=' + Element.getAttribute('HREF', 2));
    Memo1.Lines.Add('A innerText=' + Element.innerText);

    // getAdjacentText('beforeBegin') returns the text immediately before the element
    Line := (Element as IHTMLElement2).getAdjacentText('beforeBegin');

    // Line => "Mittwoch, 30. März 2011 12:01 <dir>" OR:
    // Line => "Mittwoch, 9. Februar 2005 17:14 113"...
    // I don't know what the actual delimiter is:
    // it could be [space] or [tab], so we need to normalize the Line.
    // If it is tabs, parsing is easier, because the timestamps also contain spaces

    Line := Trim(Line);
    Line := StripMultipleChar(Line, #32); // collapse runs of spaces into one space
    Line := StripMultipleChar(Line, #9);  // collapse runs of tabs into one tab

    // TODO: ParseLine (from right to left)

    Memo1.Lines.Add(Line);
    Memo1.Lines.Add('-------------');
  end;
end;

Output:

A HREF=/SubDir/
A innerText=SubDir
Mittwoch, 30. März 2011 12:01 <dir>
-------------
A HREF=/file.txt
A innerText=file.txt
Mittwoch, 9. Februar 2005 17:14 113
-------------

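Edit:
I have changed the StripMultipleChar implementation to a simpler one. I believe the previous version was better optimized for speed, but since the lines are very short, the performance difference is negligible.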
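
The ParseLine step left as a TODO in the code above might look something like the sketch below. It is only an illustration under assumptions: the normalized line is delimited by single spaces or tabs, the last token is either the size or `<dir>`, the token before it is the time, and everything to the left is the date text. `ParseLine`, `CutLastToken` and their parameters are hypothetical names, not part of the original code.

```pascal
uses
  SysUtils;

// Hypothetical helper: removes the last space- or tab-delimited token
// from S and returns it; S keeps whatever was to the left of it.
function CutLastToken(var S: string): string;
var
  P: Integer;
begin
  S := TrimRight(S);
  P := LastDelimiter(' '#9, S);        // 0 when no delimiter is left
  Result := Copy(S, P + 1, MaxInt);
  S := TrimRight(Copy(S, 1, P - 1));
end;

// Parses a normalized line from right to left:
//   <date text> <time> <size or '<dir>'>
function ParseLine(const Line: string; out DateText, TimeText: string;
  out Size: Int64; out IsDir: Boolean): Boolean;
var
  Rest, Token: string;
begin
  Rest := Trim(Line);
  Token := CutLastToken(Rest);         // size or '<dir>'
  IsDir := SameText(Token, '<dir>');
  if IsDir then
  begin
    Size := -1;                        // directories carry no size
    Result := True;
  end
  else
    Result := TryStrToInt64(Token, Size);
  TimeText := CutLastToken(Rest);      // e.g. '17:14'
  DateText := Rest;                    // e.g. 'Mittwoch, 9. Februar 2005'
  Result := Result and (TimeText <> '') and (DateText <> '');
end;
```

Working from the right avoids having to know how many spaces the date text contains, since only the last two tokens have a fixed meaning. The date text is returned as-is; converting it to a `TDateTime` would need locale-aware handling of the day and month names.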