ColdFusion provides a great deal of functionality in the form of tags, but some of the more powerful features require a bit of programming knowledge, such as regular expressions, XML functions, and the recursive function construct. This tutorial will show how to use a regular expression to find all occurrences of a given element in a string, an XML function to extract HTML attributes, and a recursive function (one that calls itself until its work is done) to streamline the code. In this case we'll be searching for all the links within a page, but the same technique could be used to find email addresses, HTML tags, phone numbers, or other formatted pieces of text.

Getting a Web Page as a String

The function we'll create later takes a string as an argument, and in this case the string will be a web page grabbed off the internet so we can search for a given pattern within it. This requires the <cfhttp> tag to fetch the web content and pull it into a string. The code is simple:

<cfhttp method="get" url="#someurl#"></cfhttp>
<cfset thePage = cfhttp.FileContent>

The variable "someurl" has to contain a valid web address, and when the two lines of code run we'll have the resulting HTML source code contained within the variable "thePage".

Create a ColdFusion page called getLinks.cfm for the example. We could hard-code the URL, as shown above, but instead we'll provide a form that lets the user enter a URL, and put our <cfhttp> tag within a <cfif> statement that only executes when the form is submitted:

<h1>Get URLs from a Web Page</h1>
<p>Enter an HTTP address to extract URLs from:</p>
<form id="form1" name="form1" method="post" action="">
  <input type="text" name="url" />
  <input type="submit" name="Submit" value="Submit" />
</form>
<cfif isdefined("form.submit")>
  <cfhttp method="get" url="#form.url#"></cfhttp>
  <cfset thePage = cfhttp.FileContent>
  <cfdump var="#thePage#">
</cfif>

The <cfdump> tag is there as a quick test before we proceed. If you save and browse to the page now, you can type in a web address, click the Submit button, and view the resulting HTML code. This is our starting point -- we have the HTML code. Now we need to examine it to extract all the links in the page.

Creating the getAllLinksFromPage Function

The function will work like this -- pass in a string, receive an array of URL strings in return. The function actually takes two arguments, but the second is used strictly by the recursion itself: on the first call the function creates the array, and as it finds each URL within the string it appends it to the array and passes the remainder of the string -- along with the array -- back to the function again. When no URLs are left in the string, the array is complete and is returned to the original caller. Here is the function, commented inline:

<cffunction name="getAllLinksFromPage" returntype="array">
  <cfargument name="thePage" type="string">
  <cfargument name="linkArray" type="array" default="#ArrayNew(1)#">
  <!--- Set up some local variables --->
  <cfset var sLenPos = "">
  <cfset var atag = "">
  <cfset var link = "">
  <cfset var linkXml = "">
  <cfset var linkXmlAttributes = "">
  <!--- Use a regular expression to find all links (<a> tags) on the page --->
  <cfif REFindNoCase("<a[\s\S]*?</a>",thePage)>
    <!--- If we find a link, have the regular expression function return the match position and length --->
    <cfset sLenPos=REFindNoCase("<a[\s\S]*?</a>",thePage,1,true) />
    <cfset atag = mid(thePage, sLenPos.pos[1], sLenPos.len[1]) />
    <cftry>
      <!--- Use XML function xmlparse to grab attributes for <a> tag --->
      <cfset linkXml = xmlparse(atag)>
      <cfset linkXmlAttributes = linkXml["a"]['xmlattributes']>
      <cfif structkeyexists(linkXmlAttributes,"href")>
        <!--- Put the href attribute into a variable --->
        <cfset link = linkXmlAttributes["href"]>
        <!--- Append the link to the array --->
        <cfset arrayappend(linkArray,link)/>
      </cfif>
      <cfcatch>
        <!--- xmlparse throws on malformed markup; skip this link and continue --->
      </cfcatch>
    </cftry>
    <!--- Test the remaining string AFTER the match for more links --->
    <cfif REFindNoCase("<a[\s\S]*?</a>",thePage,sLenPos.pos[1] + sLenPos.len[1])>
      <!--- Found another link -- pass the remainder of the page back to the function to extract more links --->
      <cfset linkArray = getAllLinksFromPage(Mid(thePage,sLenPos.pos[1] + sLenPos.len[1], len(thePage)), linkArray)>
    </cfif>
  </cfif>
  <!--- Finished -- all links are now in linkArray --->
  <cfreturn linkArray />
</cffunction>

A few words of explanation -- recursion is a useful technique, but not the only way to accomplish the task. I used it here for demonstration purposes, but it is efficient in this case and works well. The XMLParse function was used simply to illustrate its use in the context of grabbing tag attributes; we could have used a regular expression here as well.
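For comparison, here is a minimal sketch of that regular-expression alternative, which could replace the xmlparse call inside the <cftry> block. Note the assumption it makes that xmlparse does not: the href value must be enclosed in quotes.

<!--- Capture the quoted href value with a subexpression instead of xmlparse --->
<cfset hrefMatch = REFindNoCase('href\s*=\s*["'']([^"'']*)["'']', atag, 1, true)>
<cfif arrayLen(hrefMatch.pos) GTE 2 AND hrefMatch.pos[2] GT 0>
  <cfset link = Mid(atag, hrefMatch.pos[2], hrefMatch.len[2])>
  <cfset arrayappend(linkArray, link)>
</cfif>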

To utilize the function, we can now remove the test <cfdump> tag and replace it with a function call, followed by another <cfdump> to dump the contents of the array.

<h1>Get URLs from a Web Page</h1>
<p>Enter an HTTP address to extract URLs from:</p>
<form id="form1" name="form1" method="post" action="">
  <input type="text" name="url" />
  <input type="submit" name="Submit" value="Submit" />
</form>
<cfif isdefined("form.submit")>
  <cfhttp method="get" url="#form.url#"></cfhttp>
  <cfset thePage = cfhttp.FileContent>
  <cfset theLinks = getAllLinksFromPage(thePage)>
  <cfdump var="#theLinks#">
</cfif>
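
Rather than dumping the raw array, you could also loop over it and render each link. A minimal sketch -- keep in mind that extracted href values may be relative paths, so they will not always be clickable as-is:

<cfoutput>
  <ul>
    <cfloop from="1" to="#arrayLen(theLinks)#" index="i">
      <li><a href="#theLinks[i]#">#HTMLEditFormat(theLinks[i])#</a></li>
    </cfloop>
  </ul>
</cfoutput>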

The resulting page can now be run against any URL. The following is the result of running it against the Community MX home page:


Figure 1: Links shown from www.communitymx.com

This technique has a lot of uses. I use it on the CMXTraneous blog to extract URLs from trackbacks and test them for spam. Because it is done programmatically, you can use it in any situation where you need to extract links -- or any other pattern -- from an HTML or XML page.
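For example, to pull email addresses instead of links, only the pattern changes. Here is a sketch that uses a simple loop rather than recursion; the pattern is a deliberately loose approximation of an email address, and thePage is assumed to hold the text being searched:

<cfset emailPattern = "[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}">
<cfset emailArray = ArrayNew(1)>
<cfset startPos = 1>
<cfloop condition="startPos LTE Len(thePage)">
  <cfset match = REFindNoCase(emailPattern, thePage, startPos, true)>
  <cfif match.pos[1] EQ 0><cfbreak></cfif>
  <cfset arrayappend(emailArray, Mid(thePage, match.pos[1], match.len[1]))>
  <!--- Continue searching after the end of this match --->
  <cfset startPos = match.pos[1] + match.len[1]>
</cfloop>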

Conclusion

This article has shown how to create a function to extract URLs from a string, demonstrating several techniques along the way: <cfhttp> to pull a web page into a string, regular expressions to extract tags from the string, xmlparse to extract attributes from a tag, and a recursive function to parse the string until all items have been extracted.