Discovering Search Terms

More trawling through old code I had written brought this one to the surface. One of the requirements of the system I was working on was to intercept a 404 (Page Not Found) response, determine whether the referrer was a search engine (e.g. Google) and, if so, redirect to a search page with the search term. Intercepting the 404 was easily done with an HTTP module…

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Web;

namespace DemoApplication
{
    public class SearchEngineRedirectModule : IHttpModule
    {
        HttpApplication _context;

        public void Dispose()
        {
            if (_context != null)
                _context.EndRequest -= new EventHandler(_context_EndRequest);
        }

        public void Init(HttpApplication context)
        {
            _context = context;
            _context.EndRequest += new EventHandler(_context_EndRequest);
        }

        void _context_EndRequest(object sender, EventArgs e)
        {
            // Redirect only when the page was not found and the referrer yielded a search term.
            string searchTerm = null;
            if (HttpContext.Current.Response.StatusCode == 404
                && (searchTerm = DiscoverSearchTerm(HttpContext.Current.Request.UrlReferrer)) != null)
            {
                HttpContext.Current.Response.Redirect("~/Search.aspx?q=" + searchTerm);
            }
        }

        public string DiscoverSearchTerm(Uri url)
        {
            …
        }
    }
}

Implementing DiscoverSearchTerm isn’t that difficult either. We just have to look at search engine statistics to see which engines are most popular, then examine the URL each one produces when performing a search. Luckily for us, most are quite similar in that they use a very simple format that has the search term as a parameter in the query string. The search engines I analysed were Live, MSN, Yahoo, AOL, Google and Ask. The search term parameter of these engines was named either “p”, “q” or “query”.
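
To make the shape of these URLs concrete, here is a small sketch of the two parts of a referrer that matter: the host and the query string. The sample URLs and the class name are illustrative only, written from memory of each engine’s format, so check them against real referrers before relying on them.

```csharp
using System;

public static class ReferrerShapeDemo
{
    public static void Main()
    {
        // Illustrative referrers only; the search terms are made up.
        var referrers = new[]
        {
            new Uri("http://www.google.com/search?hl=en&q=http+module"),
            new Uri("http://search.yahoo.com/search?p=http+module")
        };

        foreach (var referrer in referrers)
        {
            // Host identifies the engine; Query carries the search term
            // (still URL-encoded, with a leading question mark).
            Console.WriteLine("{0} -> {1}", referrer.Host, referrer.Query);
        }
    }
}
```

Note that Uri.Query keeps the leading “?” and leaves the value encoded, which is why the method has to strip the question mark itself.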

Now, all we have to do is filter for all the requests that came from a search engine, find the search term parameter and return its value…

public string DiscoverSearchTerm(Uri url)
{
    string searchTerm = null;
    // Dots are escaped so that they only match a literal "." in the host name.
    var engine = new Regex(@"(search\.(live|msn|yahoo|aol)\.com)|(google\.(com|ca|de|(co\.(nz|uk))))|(ask\.com)");
    if (url != null && engine.IsMatch(url.Host))
    {
        var queryString = url.Query;
        // Remove the question mark from the front and add an ampersand to the end for pattern matching.
        if (queryString.StartsWith("?")) queryString = queryString.Substring(1);
        if (!queryString.EndsWith("&")) queryString += "&";
        var queryValues = new Dictionary<string, string>();
        var r = new Regex(
            @"(?<name>[^=&]+)=(?<value>[^&]+)&",
            RegexOptions.IgnoreCase | RegexOptions.Compiled
        );
        string[] queryParams = { "q", "p", "query" };
        foreach (Match match in r.Matches(queryString))
        {
            var param = match.Groups["name"].Value;
            // Keep only the first value of each candidate parameter; Add would throw on a repeated key.
            if (queryParams.Contains(param) && !queryValues.ContainsKey(param))
                queryValues.Add(param, match.Groups["value"].Value);
        }
        // The returned value is still URL-encoded, exactly as it appeared in the referrer.
        if (queryValues.Count > 0)
            searchTerm = queryValues.Values.First();
    }
    return searchTerm;
}

The above code uses two regular expressions, one to filter for a search engine and the other to separate the query string. Once it’s decided that the URL is a search engine’s, it creates a collection of query string parameters that could be search parameters and returns the first one.

Unfortunately, there wasn’t enough time in the iteration for me to properly match each search engine with its correct query parameter. In practice, though, the search parameter usually appears early in the query string, so it’s fairly safe to assume that the first match is correct.
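
Had there been time, one way to pair each engine with its own parameter would be a small lookup table keyed by host. The following is only a sketch and not the code that shipped: the class name, the table and the parameter assigned to each engine are my assumptions and should be verified against real referrers. It also swaps the hand-rolled regular expression for HttpUtility.ParseQueryString, which decodes the value as well.

```csharp
using System;
using System.Collections.Generic;
using System.Web;

public static class EngineAwareSearchTerm
{
    // Assumed host fragment -> parameter name pairs; verify before use.
    static readonly Dictionary<string, string> ParameterByHost =
        new Dictionary<string, string>
        {
            { "search.live.com", "q" },
            { "search.msn.com", "q" },
            { "search.yahoo.com", "p" },
            { "search.aol.com", "query" },
            { "google.", "q" },
            { "ask.com", "q" }
        };

    public static string Discover(Uri referrer)
    {
        if (referrer == null) return null;

        foreach (var pair in ParameterByHost)
        {
            // Loose substring match on the host, much like the original pattern.
            if (referrer.Host.IndexOf(pair.Key, StringComparison.OrdinalIgnoreCase) < 0)
                continue;

            // ParseQueryString copes with the leading '?' and URL-decodes values.
            var value = HttpUtility.ParseQueryString(referrer.Query)[pair.Value];
            return string.IsNullOrEmpty(value) ? null : value;
        }
        return null;
    }
}
```

Because the value comes back decoded here, it would need to go through HttpUtility.UrlEncode before being appended to the redirect URL.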

Posted on 15 October, 2008, in Dev Stuff.
