Feeding a Query

How to retrieve news and save it into a query object

 

Cláudio Alexandre da Costa Dias

 

Introduction

Recently, we have noticed the large number of sites sharing news about all kinds of subjects. This news is usually available in HTML format and, sometimes, in XML (RDF/RSS) format, to be read with specific readers.

When news is in RDF/RSS format (RDF being a W3C standard), it's pretty simple to write ColdFusion code that handles this XML and creates a query object for later display.

However, if news is only available in HTML format, some extra effort is needed to extract the data and arrange it into a query object.

To fully understand this tutorial, you should know the RDF/RSS standard, be reasonably comfortable with regular expressions, and be familiar with the ColdFusion MX XML functions.

 

Creating a Query Object from RDF/RSS data

First of all, let's take a look at the simplest case: retrieving news from an RDF/RSS document.

We will use Ben Forta's blog page in our examples: http://www.forta.com/blog

 

Figure 1 : Ben Forta's Blog

 

As we can see, there are many news items on the right side of the screen. They are also available in RDF/RSS format at http://www.forta.com/blog/rss.cfm?mode=full.

This link gives us:

 

Figure 2 : RSS feed from the blog's page

 

In other words, an XML document like any other. A complete description of the RDF/RSS standard can be found at http://www.w3.org/RDF.

Let's build the query object from this XML document. The CFML code for all the templates used can be found at the bottom of this tutorial.

To write CFML code that does this job, we'll follow these steps:

 

Retrieving XML

We use the <cfhttp> tag to retrieve the XML news document:

 
   <cfhttp url="http://www.forta.com/blog/rss.cfm?mode=full" method="GET">
 

The results of this request are stored in the cfhttp.fileContent variable, so this variable contains the news as an XML string. Outputting it would give something like Figure 2.
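Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch and not part of the original templates: the 10-second timeout is an arbitrary assumption, and the check relies on cfhttp.statusCode starting with the HTTP status number (for example "200 OK").

```cfml
<!--- Retrieve the feed; give up after 10 seconds (assumed value) --->
<cfhttp url="http://www.forta.com/blog/rss.cfm?mode=full" method="GET" timeout="10">

<!--- cfhttp.statusCode holds a string such as "200 OK" --->
<cfif left(cfhttp.statusCode, 3) NEQ "200">
  <cfoutput>Could not retrieve the feed: #htmlEditFormat(cfhttp.statusCode)#</cfoutput>
  <cfabort>
</cfif>

<!--- At this point cfhttp.fileContent contains the XML string --->
<cfoutput>#len(cfhttp.fileContent)# characters received.</cfoutput>
```

Stopping early here keeps a failed request from being reported later as invalid XML.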

 

Converting an XML string to an XML object

ColdFusion MX introduces a new data type: the XML object. Using the XMLparse() function, we can convert an XML string into an XML object.

 
<cftry>
  <cfset xDoc = XMLparse(cfhttp.fileContent)>
  <cfcatch>
     Invalid RDF/RSS !
     <cfabort>
  </cfcatch>
</cftry>
 

We use try...catch to guard against malformed XML. Dumping the xDoc variable:

 

Figure 3 : xDoc XML object view
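To get a feel for the structure shown in the dump, the following sketch parses a small hand-written RSS string (an invented sample, not Ben Forta's feed) and reads a few values from it through the XmlRoot, XmlName, XmlChildren and XmlText properties. The variable names sampleXml and xSample are used only for this illustration.

```cfml
<!--- Invented sample feed, used only for illustration --->
<cfset sampleXml = '<rss version="0.91"><channel><title>Sample feed</title>'
                 & '<item><title>First item</title><link>http://example.com/1</link>'
                 & '<description>First description</description></item>'
                 & '</channel></rss>'>
<cfset xSample = XMLparse(sampleXml)>

<cfoutput>
  <!--- Name of the root element --->
  Root element: #xSample.XmlRoot.XmlName#<br>
  <!--- Number of child elements directly under channel --->
  Children of channel: #arrayLen(xSample.XmlRoot.channel.XmlChildren)#<br>
  <!--- Text inside the title of the first item --->
  Title of first item: #xSample.XmlRoot.channel.item.title.XmlText#
</cfoutput>
```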

 

Identifying RSS version and searching for items

Once we have our XML object, xDoc, let's identify which RSS standard it follows. To do this, we use the name of the XML root element, xDoc.XmlRoot.XmlName.

To retrieve the items (the news themselves) from the XML object, we use the XMLsearch() function. It takes an XPath expression, searches the XML document, and returns an array of the XML nodes that match the search criteria.

 
<cfswitch expression="#xDoc.XmlRoot.XmlName#">
  <cfcase value="rdf:RDF"><!--- Version 1.x --->
     <cfset arrItems = XMLSearch(xDoc, '/rdf:RDF/:item')>
  </cfcase>
  <cfcase value="rss"><!--- Version 0.9x --->
     <cfset arrItems = XMLSearch(xDoc, '/rss/channel/item')>
  </cfcase>
</cfswitch>
 

Each array element contains an <item></item> XML node, which in turn contains the title, description and link elements. We can see the arrItems array in the next figure:

 

 

Figure 4 : arrItems array view
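The root element name alone does not distinguish RSS 0.9x from RSS 2.0, since both use <rss> as the root. As a sketch (using an invented sample string rather than a real feed), the version attribute of the root can be read through XmlAttributes, while the same XPath expression still finds the items. The names sampleXml, xSample, rssVersion and arrSample are assumptions made for this example.

```cfml
<!--- Invented RSS 2.0 sample, used only for illustration --->
<cfset sampleXml = '<rss version="2.0"><channel><title>Sample feed</title>'
                 & '<item><title>First</title><link>http://example.com/1</link>'
                 & '<description>One</description></item>'
                 & '<item><title>Second</title><link>http://example.com/2</link>'
                 & '<description>Two</description></item>'
                 & '</channel></rss>'>
<cfset xSample = XMLparse(sampleXml)>

<!--- Read the version attribute, if the root element has one --->
<cfset rssVersion = "unknown">
<cfif structKeyExists(xSample.XmlRoot.XmlAttributes, "version")>
  <cfset rssVersion = xSample.XmlRoot.XmlAttributes.version>
</cfif>

<!--- Same XPath expression as the "rss" case in the switch --->
<cfset arrSample = XMLSearch(xSample, "/rss/channel/item")>

<cfoutput>
  RSS version: #rssVersion#<br>
  <!--- Each array element is one item node --->
  <cfloop index="i" from="1" to="#arrayLen(arrSample)#">
    Item #i#: #arrSample[i].title.XmlText#<br>
  </cfloop>
</cfoutput>
```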

 

Creating query object

Now we have the news inside the elements of the arrItems array. First, we create the query object, q_rss:

 
<cfset q_rss = queryNew("title, link, description")>
 

Looping over the array elements, we get, for each item, the text inside its title, description and link elements.

 
<cfset n = arrayLen(arrItems)>
<!--- Loop over found items, populating query object --->
<cfloop index="i" from="1" to="#n#">
  <cfset queryAddRow(q_rss)>
  <cfset querySetCell(q_rss, "title", arrItems[i].title.xmlText,i)>
  <cfset querySetCell(q_rss, "link", arrItems[i].link.xmlText,i)>
  <cfset querySetCell(q_rss, "description", arrItems[i].description.xmlText,i)>
</cfloop>
 

Then, dumping q_rss:

Figure 5 : q_rss query, final display
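Once the news is in a query object, it can be displayed like any other query. The sketch below builds a one-row query with the same columns as q_rss, filled with invented sample data, and renders it as an HTML list. The name q_demo and the list markup are assumptions made for this example.

```cfml
<!--- Build a one-row query with the same columns as q_rss (sample data) --->
<cfset q_demo = queryNew("title,link,description")>
<cfset queryAddRow(q_demo)>
<cfset querySetCell(q_demo, "title", "Sample headline")>
<cfset querySetCell(q_demo, "link", "http://example.com/1")>
<cfset querySetCell(q_demo, "description", "Sample description text.")>

<!--- Loop over the query rows and render each one as a list item --->
<ul>
<cfoutput query="q_demo">
  <li>
    <a href="#q_demo.link#">#htmlEditFormat(q_demo.title)#</a><br>
    #q_demo.description#
  </li>
</cfoutput>
</ul>
```

The title goes through htmlEditFormat() so that stray markup characters do not break the page. The description is output as is, on the assumption that feeds often carry HTML there.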

Creating a Query Object from an HTML news page

As we have seen, it's fairly simple to create a query object from an RDF/RSS document. However, what if the RDF/RSS news document is not available? In other words, what if the news is only available in HTML format?

The steps we'll follow are essentially the same. But, as we don't have an XML object, we won't be able to use the XMLsearch() function to retrieve items, so we have to search for them with another tool. How about regular expressions? They are quite helpful when searching for patterns.

Let's start working:

 

Retrieving HTML

We use the <cfhttp> tag to retrieve the HTML news page:

 
   <cfhttp url="http://www.forta.com/blog" method="GET">
 

The results of this request are stored in the cfhttp.fileContent variable, so this variable contains the news as an HTML string. This string is then stored in the sDoc variable.

 
<cfset sDoc = cfhttp.fileContent>
 

Creating Regular Expression

The hardest part of our job is to build a regular expression that matches the HTML of Ben Forta's news text. We highly recommend using a regular expression tester, which lets you test expressions as you write them.

At the bottom of this tutorial, an HTML application, REtest.htm, is provided. It will help you when creating regular expressions.

Using it, we arrive at the following regular expression:

 
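<!--- Concatenate the pieces so that no line break or indentation ends up inside the pattern --->
<cfset regExp = '<font color="336633"><b>([\s\S]*?)</b></font>[\s\S]*?'
              & '<font size="-1">([\s\S]*?)</font>[\s\S]*?'
              & '<a href="(index\.cfm\?mode=e&entry=[0-9]*?)">'>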
 

The subexpressions (the terms inside parentheses) capture the title, description and link of each item. Note that REtest.htm provides links to each subexpression as well as to the next occurrence of the search pattern.

Figure 6 : REtest.htm

 

Creating query object

First, we create the query object, q_rss:

 
<cfset q_rss = queryNew("title, link, description")>
 

We now use the REfindNoCase() function to search the sDoc text for the regular expression defined above. Note that the function call is nested in a loop, which tests the function's result through the start variable.

As seen before, the title, description and link subexpressions appear in this order. Therefore, they correspond to positions 2, 3 and 4 of the pos and len arrays (position 1 holds the whole match). These arrays are keys of the stResult structure returned by the REfindNoCase() function.

 
<cfset start = 1>
<cfloop condition="#start#">
  <cfset stResult = REfindNoCase(regExp,sDoc,start,"Yes")>
  <cfif stResult.pos[1]>
     <cfset queryAddRow(q_rss)>
     <cfset querySetCell(q_rss,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
     <cfset querySetCell(q_rss,"link",mid(sDoc,stResult.pos[4],stResult.len[4]))>
     <cfset querySetCell(q_rss,"description", 
       mid(sDoc,stResult.pos[3],stResult.len[3]))>
  </cfif>
  <cfset start = stResult.pos[1] + stResult.len[1]>
</cfloop>
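To see how the pos and len arrays line up with the subexpressions, here is a sketch on a much simpler pattern and an invented one-line sample string. It has two subexpressions, so they land in positions 2 and 3, with position 1 holding the whole match. The names sText and st are used only for this illustration.

```cfml
<!--- Invented sample text, used only for illustration --->
<cfset sText = 'Item: <b>First title</b> <a href="one.cfm">more</a>'>

<!--- Fourth argument "Yes" asks for the pos and len arrays instead of a single position --->
<cfset st = REfindNoCase('<b>(.*?)</b>.*?<a href="([^"]*)">', sText, 1, "Yes")>

<!--- pos[1] is 0 when nothing matched --->
<cfif st.pos[1]>
  <cfoutput>
    Whole match: #htmlEditFormat(mid(sText, st.pos[1], st.len[1]))#<br>
    Subexpression 1: #mid(sText, st.pos[2], st.len[2])#<br>
    Subexpression 2: #mid(sText, st.pos[3], st.len[3])#
  </cfoutput>
</cfif>
```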
 

Checking results:

Figure 7 : q_rss query final display

 

The result is the same as the one obtained from the RSS feed.

 

CFML coding

 

rss2query.cfm

<!--- Retrieve RSS data from Ben Forta's blog--->
<cfhttp url="http://www.forta.com/blog/rss.cfm?mode=full" method="GET">
 
<!--- Try to parse XML string into XML object  --->
<cftry>
  <cfset xDoc = XMLparse(cfhttp.fileContent)>
  <cfcatch>
     Invalid RDF/RSS !
     <cfabort>
  </cfcatch>
</cftry>
 
<!--- Define RSS version and search for items --->
<cfswitch expression="#xDoc.XmlRoot.XmlName#">
  <cfcase value="rdf:RDF"><!--- Version 1.x --->
     <cfset arrItems = XMLSearch(xDoc, '/rdf:RDF/:item')>
  </cfcase>
  <cfcase value="rss"><!--- Version 0.9x --->
     <cfset arrItems = XMLSearch(xDoc, '/rss/channel/item')>
  </cfcase>
</cfswitch>
 
<!--- Create the query object --->
<cfset q_rss = queryNew("title, link, description")>
 
<cfset n = arrayLen(arrItems)>
<!--- Loop over found items, populating query object --->
<cfloop index="i" from="1" to="#n#">
  <cfset queryAddRow(q_rss)>
  <cfset querySetCell(q_rss, "title", arrItems[i].title.xmlText,i)>
  <cfset querySetCell(q_rss, "link", arrItems[i].link.xmlText,i)>
  <cfset querySetCell(q_rss, "description", arrItems[i].description.xmlText,i)>
</cfloop>
 
<!--- Display results --->
<cfdump var="#q_rss#" label="RSS feed">
 

html2query.cfm

<!--- Retrieve HTML data from Ben Forta's blog--->
<cfhttp url="http://www.forta.com/blog" method="GET">
<cfset sDoc = trim(cfhttp.fileContent)>
 
<!--- Define the regular expression to be used --->
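<!--- Concatenate the pieces so that no line break or indentation ends up inside the pattern --->
<cfset regExp = '<font color="336633"><b>([\s\S]*?)</b></font>[\s\S]*?'
              & '<font size="-1">([\s\S]*?)</font>[\s\S]*?'
              & '<a href="(index\.cfm\?mode=e&entry=[0-9]*?)">'>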
 
<!--- Create the query object --->
<cfset q_rss = queryNew("title, link, description")>
 
<cfset start = 1>
<cfloop condition="#start#">
  <cfset stResult = REfindNoCase(regExp,sDoc,start,"Yes")>
  <cfif stResult.pos[1]>
     <cfset queryAddRow(q_rss)>
     <cfset querySetCell(q_rss,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
     <cfset querySetCell(q_rss,"link",mid(sDoc,stResult.pos[4],stResult.len[4]))>
     <cfset querySetCell(q_rss,"description", mid(sDoc,stResult.pos[3],stResult.len[3]))>
  </cfif>
  <cfset start = stResult.pos[1] + stResult.len[1]>
</cfloop>
     
<!--- Display results --->
<cfdump var="#q_rss#" label="HTML feed">

 

REtest.htm

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- 
   Authors Claudio Dias - claudio-alexandre@uol.com.br
           Anderson Mise - kenji@vardump.com
   Date    Jun-2003
-->
 
<html>
<head>
   <title>Regular Expressions Tester</title>
 
   <style type="text/css">
   	body { 
		font-family: verdana,arial,sans-serif;
		font-size: 10px;
		color: black; 
     		background-color : #ffe; 
	        }
	a { 
		color: #444;
		text-decoration : none; 
	       }
  	textarea { 
		font-family : verdana,arial,sans-serif;
		font-size : 10px;
		color : #666; 
		border : 1px solid #666;
		padding : 1px;
		width:100%;
		 }
   </style>
 
   <script language="JavaScript" type="text/javascript">
   function createRE() {
     try {
        if (document.REform.cbxCase.checked) 
           re = new RegExp (document.REform.re.value,'g')
        else
           re = new RegExp (document.REform.re.value,'gi');
     }
     catch(er) {
        document.REform.textFound.value = "[Invalid Regular Expression]";
     }
   }
   
   function REgetSubExp(n) {
     try {
        if ((arrFound) && (arrFound.length > n)) 
           document.REform.textFound.value = arrFound[n];
     }
     catch(er) {
        document.REform.textFound.value = "[Invalid Subexpression]";
     }
   }
 
   function nextMatch() {
     subExpDiv = document.getElementById("subExp");
     subExpDiv.innerHTML = "";
     try {
        if (re) {
           var text = document.REform.text.value;
           arrFound = re.exec(text);
           if (arrFound) {
              document.REform.textFound.value = arrFound[0];
              for (i=1;i<arrFound.length;i++)
                subExpDiv.innerHTML += '&nbsp;&nbsp;<a href="javascript:REgetSubExp('+ 
                   i +')">$'+ i +'</a>';
           }
           else document.REform.textFound.value = "";
        }
        else document.REform.textFound.value = "";
     }
     catch(er) {
        document.REform.textFound.value = "[Invalid Regular Expression]";
     }
   }
   </script>
</head>
 
<body>
<form name="REform">
   <label>Text to search
     <textarea rows="15" id="text" name="text" onKeyUp="createRE()"></textarea>
   </label>
   <label>Regular Expression
     <textarea rows="3" id="re" name="re" onKeyUp="createRE()"></textarea>
   </label>
   <label>
     <input type="checkbox" id="cbxCase" name="cbxCase" value="1" onClick="createRE()">
     Case sensitive
   </label>
   <div id="actions">
   	<a href="javascript:createRE();nextMatch();">Search Results</a>
   	<div id="subExp" style="display:inline;"></div>
	&nbsp;-&nbsp;
   	<a href="javascript:nextMatch()">Next occurrence</a>
   </div>
   <textarea id="textFound" rows="15" name="textFound" readonly></textarea>
</form>
</body>
</html> 

 


 

XPath : a language for addressing parts of an XML document. See http://www.w3.org/TR/xpath


 

Cláudio Alexandre da Costa Dias, claudio-alexandre @ uol.com.br

Senior Engineer, EMBRAER, Brazil

ColdFusion user since 1997

Responsible for ColdFusion MX usage and propagation in EMBRAER's Flight-Test Division

About This Tutorial
Author: Claudio Dias
Skill Level: Advanced 
 
 
 
Platforms Tested: CFMX
Submission Date: May 25, 2004
Last Update Date: June 05, 2009
Discuss This Tutorial
  • Hi, I'm not clear on your explanation below Figure 1. Where did you get the path http://www.forta.com/blog/rss.cfm?mode=full? By the way, what you did was great! I had heard about RSS and feeding a query, but wasn't sure how until I read your tutorial. Apart from this path, everything is easily understood. Alec

  • Thanks for your tutorial. I modified it to update into a database and it works like a champ for .91 RSS feeds. My question is how to handle RSS 2.0 feeds. For some reason it seems to fail on the XMLparse function... Any suggestions?? Many thanks again!

  • Matt, thank you for your comments!

  • this guy adds the most useful (and clean) tuts i've seen on here. props.

  • Working with cfmx7 the regular expression fails to work unless you put it all on one line in html2query.cfm.
