获取文档中的所有链接

Question

获取文档中的所有链接

drz*_*aus 10 google-docs google-apps-script

鉴于Google文档/云端硬盘中的"普通文档"(例如段落,列表,表格)包含分散在整个内容中的外部链接,您如何使用Google Apps脚本编制链接列表？

具体来说,我想通过在每个URL中搜索oldText来更新文档中所有损坏的链接,并将其替换为每个URL中的newText,而不是文本.

我不认为开发文档的替换文本部分是我需要的 - 我是否需要扫描文档的每个元素？我可以编辑AsText并使用html正则表达式吗？例子将不胜感激.

Answer 1

Mog*_*dad 16

这只是痛苦的!代码可作为要点的一部分.

^{_{是的,我不能拼写.}}

getAllLinks

这是一个实用程序函数,它扫描文档中的所有LinkUrls,并将它们返回到一个数组中.

/**
 * Get an array of all LinkUrls in the document. The function is
 * recursive, and if no element is provided, it will default to
 * the active document's Body element.
 *
 * @param {Element} element The document element to operate on. 
 * .
 * @returns {Array}         Array of objects, vis
 *                              {element,
 *                               startOffset,
 *                               endOffsetInclusive, 
 *                               url}
 */
function getAllLinks(element) {
  var links = [];
  element = element || DocumentApp.getActiveDocument().getBody();

  if (element.getType() === DocumentApp.ElementType.TEXT) {
    var textObj = element.editAsText();
    var text = element.getText();
    var inUrl = false;
    for (var ch=0; ch < text.length; ch++) {
      var url = textObj.getLinkUrl(ch);
      if (url != null) {
        if (!inUrl) {
          // We are now!
          inUrl = true;
          var curUrl = {};
          curUrl.element = element;
          curUrl.url = String( url ); // grab a copy
          curUrl.startOffset = ch;
        }
        else {
          curUrl.endOffsetInclusive = ch;
        }          
      }
      else {
        if (inUrl) {
          // Not any more, we're not.
          inUrl = false;
          links.push(curUrl);  // add to links
          curUrl = {};
        }
      }
    }
  }
  else {
    var numChildren = element.getNumChildren();
    for (var i=0; i<numChildren; i++) {
      links = links.concat(getAllLinks(element.getChild(i)));
    }
  }

  return links;
}

Run Code Online (Sandbox Code Playgroud)

findAndReplaceLinks

该实用程序构建于getAllLinks执行查找和替换功能的基础上.

/**
 * Replace all or part of UrlLinks in the document.
 *
 * @param {String} searchPattern    the regex pattern to search for 
 * @param {String} replacement      the text to use as replacement
 *
 * @returns {Number}                number of Urls changed 
 */
function findAndReplaceLinks(searchPattern,replacement) {
  var links = getAllLinks();
  var numChanged = 0;

  for (var l=0; l<links.length; l++) {
    var link = links[l];
    if (link.url.match(searchPattern)) {
      // This link needs to be changed
      var newUrl = link.url.replace(searchPattern,replacement);
      link.element.setLinkUrl(link.startOffset, link.endOffsetInclusive, newUrl);
      numChanged++
    }
  }
  return numChanged;
}

Run Code Online (Sandbox Code Playgroud)

演示用户界面

为了演示这些实用程序的使用,这里有几个UI扩展:

function onOpen() {
  // Add a menu with some items, some separators, and a sub-menu.
  DocumentApp.getUi().createMenu('Utils')
      .addItem('List Links', 'sidebarLinks')
      .addItem('Replace Link Text', 'searchReplaceLinks')
      .addToUi();
}

function searchReplaceLinks() {
  var ui = DocumentApp.getUi();
  var app = UiApp.createApplication()
                 .setWidth(250)
                 .setHeight(100)
                 .setTitle('Change Url text');
  var form = app.createFormPanel();
  var flow = app.createFlowPanel();
  flow.add(app.createLabel("Find: "));
  flow.add(app.createTextBox().setName("searchPattern"));
  flow.add(app.createLabel("Replace: "));
  flow.add(app.createTextBox().setName("replacement"));
  var handler = app.createServerHandler('myClickHandler');
  flow.add(app.createSubmitButton("Submit").addClickHandler(handler));
  form.add(flow);
  app.add(form);
  ui.showDialog(app);
}

// ClickHandler to close dialog
function myClickHandler(e) {
  var app = UiApp.getActiveApplication();

  app.close();
  return app;
}

function doPost(e) {
  var numChanged = findAndReplaceLinks(e.parameter.searchPattern,e.parameter.replacement);
  var ui = DocumentApp.getUi();
  var app = UiApp.createApplication();

  sidebarLinks(); // Update list

  var result = DocumentApp.getUi().alert(
      'Results',
      "Changed "+numChanged+" urls.",
      DocumentApp.getUi().ButtonSet.OK);
}


/**
 * Shows a custom HTML user interface in a sidebar in the Google Docs editor.
 */
function sidebarLinks() {
  var links = getAllLinks();
  var sidebar = HtmlService
          .createHtmlOutput()
          .setTitle('URL Links')
          .setWidth(350 /* pixels */);

  // Display list of links, url only.
  for (var l=0; l<links.length; l++) {
    var link = links[l];
    sidebar.append('<p>'+link.url);
  }

  DocumentApp.getUi().showSidebar(sidebar);
}

Run Code Online (Sandbox Code Playgroud)

你在哪里运行这段代码？在 Chrome 开发者控制台中？如果是这样，当运行“getAllLinks”时，我会收到错误“Uncaught SyntaxError: Unexpected token **”。另外，在谷歌文档中我看不到“Utils”选项卡，如上图所示 (2认同)
@JoannaMarietti 哦...我第一次听说脚本编辑器。凉爽的！谢谢 (2认同)

Answer 2

Yuv*_*val 5

对于您的第一个问题，我提供了另一个简短的答案，该问题涉及遍历文档正文中的所有链接。此说明性代码返回当前文档主体中的链接的平面数组，其中每个链接由一个对象表示，该对象具有指向文本元素（text），包含该元素的段落元素或列表项元素（paragraph），偏移索引的条目在出现链接的文本（startOffset）和URL本身（url）中。希望您会发现轻松满足自己的需求。

它使用该getTextAttributeIndices()方法，而不是遍历文本的每个字符，因此可以预期比以前编写的答案执行得更快。

编辑：自最初发布此答案以来，我几次修改了该功能。现在，它还（1）包含endOffsetInclusive每个链接的属性（请注意，它可以null用于扩展到text元素末尾的链接-在这种情况下，可以link.text.length-1代替使用）；（2）在文档的所有部分中查找链接，而不仅仅是正文，并且（3）包含section和isFirstPageSection属性以指示链接的位置；（4）接受参数mergeAdjacent，当将其设置为true时，对于链接到同一URL的连续文本，该参数将仅返回单个链接条目（例如，如果部分文本的样式与另一部分）。

为了在所有部分下都包含链接iterateSections()，引入了一个新的实用程序功能。

/**
 * Returns a flat array of links which appear in the active document's body. 
 * Each link is represented by a simple Javascript object with the following 
 * keys:
 *   - "section": {ContainerElement} the document section in which the link is
 *     found. 
 *   - "isFirstPageSection": {Boolean} whether the given section is a first-page
 *     header/footer section.
 *   - "paragraph": {ContainerElement} contains a reference to the Paragraph 
 *     or ListItem element in which the link is found.
 *   - "text": the Text element in which the link is found.
 *   - "startOffset": {Number} the position (offset) in the link text begins.
 *   - "endOffsetInclusive": the position of the last character of the link
 *      text, or null if the link extends to the end of the text element.
 *   - "url": the URL of the link.
 *
 * @param {boolean} mergeAdjacent Whether consecutive links which carry 
 *     different attributes (for any reason) should be returned as a single 
 *     entry.
 * 
 * @returns {Array} the aforementioned flat array of links.
 */
function getAllLinks(mergeAdjacent) {
  var links = [];

  var doc = DocumentApp.getActiveDocument();


  iterateSections(doc, function(section, sectionIndex, isFirstPageSection) {
    if (!("getParagraphs" in section)) {
      // as we're using some undocumented API, adding this to avoid cryptic
      // messages upon possible API changes.
      throw new Error("An API change has caused this script to stop " + 
                      "working.\n" +
                      "Section #" + sectionIndex + " of type " + 
                      section.getType() + " has no .getParagraphs() method. " +
        "Stopping script.");
    }

    section.getParagraphs().forEach(function(par) { 
      // skip empty paragraphs
      if (par.getNumChildren() == 0) {
        return;
      }

      // go over all text elements in paragraph / list-item
      for (var el=par.getChild(0); el!=null; el=el.getNextSibling()) {
        if (el.getType() != DocumentApp.ElementType.TEXT) {
          continue;
        }

        // go over all styling segments in text element
        var attributeIndices = el.getTextAttributeIndices();
        var lastLink = null;
        attributeIndices.forEach(function(startOffset, i, attributeIndices) { 
          var url = el.getLinkUrl(startOffset);

          if (url != null) {
            // we hit a link
            var endOffsetInclusive = (i+1 < attributeIndices.length? 
                                      attributeIndices[i+1]-1 : null);

            // check if this and the last found link are continuous
            if (mergeAdjacent && lastLink != null && lastLink.url == url && 
                  lastLink.endOffsetInclusive == startOffset - 1) {
              // this and the previous style segment are continuous
              lastLink.endOffsetInclusive = endOffsetInclusive;
              return;
            }

            lastLink = {
              "section": section,
              "isFirstPageSection": isFirstPageSection,
              "paragraph": par,
              "textEl": el,
              "startOffset": startOffset,
              "endOffsetInclusive": endOffsetInclusive,
              "url": url
            };

            links.push(lastLink);
          }        
        });
      }
    });
  });


  return links;
}

/**
 * Calls the given function for each section of the document (body, header, 
 * etc.). Sections are children of the DocumentElement object.
 *
 * @param {Document} doc The Document object (such as the one obtained via
 *     a call to DocumentApp.getActiveDocument()) with the sections to iterate
 *     over.
 * @param {Function} func A callback function which will be called, for each
 *     section, with the following arguments (in order):
 *       - {ContainerElement} section - the section element
 *       - {Number} sectionIndex - the child index of the section, such that
 *         doc.getBody().getParent().getChild(sectionIndex) == section.
 *       - {Boolean} isFirstPageSection - whether the section is a first-page
 *         header/footer section.
 */
function iterateSections(doc, func) {
  // get the DocumentElement interface to iterate over all sections
  // this bit is undocumented API
  var docEl = doc.getBody().getParent();

  var regularHeaderSectionIndex = (doc.getHeader() == null? -1 : 
                                   docEl.getChildIndex(doc.getHeader()));
  var regularFooterSectionIndex = (doc.getFooter() == null? -1 : 
                                   docEl.getChildIndex(doc.getFooter()));

  for (var i=0; i<docEl.getNumChildren(); ++i) {
    var section = docEl.getChild(i);

    var sectionType = section.getType();
    var uniqueSectionName;
    var isFirstPageSection = (
      i != regularHeaderSectionIndex &&
      i != regularFooterSectionIndex && 
      (sectionType == DocumentApp.ElementType.HEADER_SECTION ||
       sectionType == DocumentApp.ElementType.FOOTER_SECTION));

    func(section, i, isFirstPageSection);
  }
}

Run Code Online (Sandbox Code Playgroud)

归档时间：	12 年，3 月前
查看次数：	8751 次
最近记录：	6 年，8 月前