ColdFusion provides a great deal of functionality in the form of tags, but some of the more powerful features require a bit of programming knowledge, such as regular expressions, XML functions, and the recursive function construct. This tutorial will show how to use a regular expression to find all occurrences of a given element in a string, an XML function to extract HTML attributes, and a recursive function (one that calls itself until its work is done) to streamline the code. In this case we'll be searching for all the links within a page, but the same technique could be used to find email addresses, HTML tags, phone numbers, or other formatted pieces of text.

Getting a Web Page as a String

The function we'll create later takes a string as an argument, and in this case the string will be a web page grabbed off the internet so we can search for a given pattern within it. This requires the <cfhttp> tag to fetch the web content and pull it into a string. The code is simple:

<cfhttp method="get" url="#someurl#"></cfhttp>
<cfset thePage = cfhttp.FileContent>

The variable "someurl" has to contain a valid web address, and when the two lines of code run we'll have the resulting HTML source code contained within the variable "thePage".

Create a ColdFusion page called getLinks.cfm for the example. We could hard-code the URL, as shown above, but instead we'll provide a form that lets the user enter a URL, and put our <cfhttp> tag within a <cfif> statement that only executes when the form is submitted:

<h1>Get URLs from a Web Page</h1>
<p>Enter an HTTP address to extract URLs from:</p>
<form id="form1" name="form1" method="post" action="">
  <input type="text" name="url" />
  <input type="submit" name="Submit" value="Submit" />
</form>
<cfif isdefined("form.submit")>
  <cfhttp method="get" url="#form.url#"></cfhttp>
  <cfset thePage = cfhttp.FileContent>
  <cfdump var="#thePage#">
</cfif>

The <cfdump> tag is there as a quick test before we proceed. If you save and browse to the page now, you can type in a web address, click the Submit button, and view the resulting HTML code. This is our starting point -- we have the HTML code. Now we need to examine it to extract all the links in the page.

Creating the getAllLinksFromPage Function

The function will work like this -- pass in a string, receive an array of URL strings in return. The function actually takes two arguments, but the second is used strictly by the recursion itself: on the first call the function creates the array, and as it finds each URL within the string it appends it to the array and passes the remainder of the string -- along with the array -- back to the function again. When no URLs are left in the string, the array is complete and is returned to the original caller. Here is the function, commented inline:

<cffunction name="getAllLinksFromPage" returntype="array">
  <cfargument name="thePage" type="string">
  <cfargument name="linkArray" type="array" default="#ArrayNew(1)#">
  <!--- Set up some local variables --->
  <cfset var sLenPos = "">
  <cfset var atag = "">
  <cfset var link = "">
  <cfset var linkXml = "">
  <cfset var linkXmlAttributes = "">
  <!--- Use a regular expression to find all links (<a> tags) on the page --->
  <cfif REFindNoCase("<a[\s\S]*?</a>",thePage)>
    <!--- If we find a link, have the regular expression function return the match position and length --->
    <cfset sLenPos=REFindNoCase("<a[\s\S]*?</a>",thePage,1,true) />
    <cfset atag = mid(thePage, sLenPos.pos[1], sLenPos.len[1]) />
    <cftry>
      <!--- Use XML function xmlparse to grab attributes for <a> tag --->
      <cfset linkXml = xmlparse(atag)>
      <cfset linkXmlAttributes = linkXml["a"]['xmlattributes']>
      <cfif structkeyexists(linkXmlAttributes,"href")>
        <!--- Put the href attribute into a variable --->
        <cfset link = linkXmlAttributes["href"]>
        <!--- Append the link to the array --->
        <cfset arrayappend(linkArray,link)/>
      </cfif>
      <cfcatch>
        <!--- xmlparse throws on malformed markup; skip this link and continue --->
      </cfcatch>
    </cftry>
    <!--- Test the remaining string AFTER the match for more links --->
    <cfif REFindNoCase("<a[\s\S]*?</a>",thePage,sLenPos.pos[1] + sLenPos.len[1])>
      <!--- Found another link -- pass the remainder of the page back to the function to extract more links --->
      <cfset linkArray = getAllLinksFromPage(Mid(thePage,sLenPos.pos[1] + sLenPos.len[1], len(thePage)), linkArray)>
    </cfif>
  </cfif>
  <!--- Finished -- all links are now in linkArray --->
  <cfreturn linkArray />
</cffunction>

A few words of explanation -- recursion is a useful technique, but not the only way to accomplish the task. I used it here for demonstration purposes, but it is efficient in this case and works well. The XMLParse function was used simply to illustrate its use in the context of grabbing tag attributes; we could have used a regular expression here as well.
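For comparison, here is a minimal sketch of that regular-expression alternative, which could replace the xmlparse call inside the <cftry> block. Note the assumption it makes that xmlparse does not: the href value must be enclosed in quotes.

<!--- Capture the quoted href value with a subexpression instead of xmlparse --->
<cfset hrefMatch = REFindNoCase('href\s*=\s*["'']([^"'']*)["'']', atag, 1, true)>
<cfif arrayLen(hrefMatch.pos) GTE 2 AND hrefMatch.pos[2] GT 0>
  <cfset link = Mid(atag, hrefMatch.pos[2], hrefMatch.len[2])>
  <cfset arrayappend(linkArray, link)>
</cfif>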

To utilize the function, we can now remove the test <cfdump> tag and replace it with a function call, followed by another <cfdump> to dump the contents of the array.

<h1>Get URLs from a Web Page</h1>
<p>Enter an HTTP address to extract URLs from:</p>
<form id="form1" name="form1" method="post" action="">
  <input type="text" name="url" />
  <input type="submit" name="Submit" value="Submit" />
</form>
<cfif isdefined("form.submit")>
  <cfhttp method="get" url="#form.url#"></cfhttp>
  <cfset thePage = cfhttp.FileContent>
  <cfset theLinks = getAllLinksFromPage(thePage)>
  <cfdump var="#theLinks#">
</cfif>
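
Rather than dumping the raw array, you could also loop over it and render each link. A minimal sketch -- keep in mind that extracted href values may be relative paths, so they will not always be clickable as-is:

<cfoutput>
  <ul>
    <cfloop from="1" to="#arrayLen(theLinks)#" index="i">
      <li><a href="#theLinks[i]#">#HTMLEditFormat(theLinks[i])#</a></li>
    </cfloop>
  </ul>
</cfoutput>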

The resulting page can now be run against any URL. The following is the result of running it against the Community MX home page:


Figure 1: Links shown from www.communitymx.com

This technique has a lot of uses. I use it on the CMXTraneous blog to extract URLs from trackbacks and test them for spam. Because it is done programmatically, you can use it in any situation where you need to extract links -- or any other pattern -- from an HTML or XML page.
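For example, to pull email addresses instead of links, only the pattern changes. Here is a sketch that uses a simple loop rather than recursion; the pattern is a deliberately loose approximation of an email address, and thePage is assumed to hold the text being searched:

<cfset emailPattern = "[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}">
<cfset emailArray = ArrayNew(1)>
<cfset startPos = 1>
<cfloop condition="startPos LTE Len(thePage)">
  <cfset match = REFindNoCase(emailPattern, thePage, startPos, true)>
  <cfif match.pos[1] EQ 0><cfbreak></cfif>
  <cfset arrayappend(emailArray, Mid(thePage, match.pos[1], match.len[1]))>
  <!--- Continue searching after the end of this match --->
  <cfset startPos = match.pos[1] + match.len[1]>
</cfloop>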

Conclusion

This article has shown how to create a function to extract URLs from a string, demonstrating several techniques along the way: <cfhttp> to pull a web page into a string, regular expressions to extract tags from the string, xmlparse to extract attributes from a tag, and a recursive function to parse the string until all items have been extracted.