So you have an email list that has everything from valid, well-formed email addresses to useless gibberish that someone thought would get them through the form without hassle. Obviously, the best way to maintain an email list is to catch the badly formed email addresses before they occur -- through JavaScript or server-side validation. That does not always catch the culprits though. Most of these scripts use regular expressions, which catch the badly formed addresses but do nothing with bad addresses. Also, you may have inherited a list or maintained a list that did not validate addresses. This article will show a couple of SQL Server user-defined functions that you can use to manually go through your list and find/fix/delete the bad addresses.

The Workflow

Basically what we will be doing is using the SQL Server Management Studio to view the records in your database table, and manually make corrections if possible using two different functions, and then delete or make inactive the records that do not qualify as properly formed emails. You can use the old Enterprise Manager or Query Analyzer if you like, also, but I'll be describing the workflow in the Management Studio. The first function will check an email address for invalid characters, and returns a 1 if the email address is invalid. The second function checks the domain suffix (the .com or .net part) against a list of known suffixes and returns a 1 if the email address doesn't match. If you have never worked with functions in SQL Server before, they work similar to functions in any programming language -- you define the function and then call it in your program. In SQL Server, you can call functions from your SQL statement, making it very useful indeed. For example, a typical use of a SQL Server function would be like this:

SELECT myFunction(somefield) as NewField FROM myTable

In this case, the return value of the function is used as a new field in the select statement. We will do something a little different -- we'll use the function in the WHERE clause of the SQL, since the function returns a 1 or a 0 (basically signifying true or false.)

SELECT somefield FROM myTable WHERE myFunction(somefield) = 1

In this case, the function will return a 1 or a 0 and the SELECT statement will only return those records that return a value of 1.

Functions in the SQL Server Management Studio are listed under Programmability > Functions Scalar-valued Functions:


Figure 1: SQL Server Management studio functions

To add a new one, just right-click on the Scalar-valued Functions folder icon and choose New Scalar-valued Function. We'll add two new functions before continuing.

The emailFilter() Function

The first function will act as a filter, filtering emails with bad characters. The function body is as follows:

CREATE FUNCTION [dbo].[emailFilter]
(@email varchar(128))
RETURNS int AS
BEGIN
SELECT @email = ISNULL(@email, '')
declare @flag int
set @flag = CHARINDEX(' ',LTRIM(RTRIM(@email)))
+ CHARINDEX('.@',@email)
+ CHARINDEX('@@',@email)
+ CHARINDEX('..',@email)
+ charindex('<', @email)
+ charindex('>', @email)
+ charindex('[', @email)
+ charindex(']', @email)
+ charindex('(', @email)
+ charindex(')', @email)
+ charindex(',', @email)
+ charindex(';', @email)
+ charindex(':', @email)
+ charindex('\', @email)
+ charindex('"', @email)

if LEFT(LTRIM(@email),1) = '@'set @flag = @flag + 1
if RIGHT(RTRIM(@email),1) = '.' set @flag = @flag + 1
if CHARINDEX('.',@email,CHARINDEX('@',@email)) - CHARINDEX('@',@email) <= 1 set @flag = @flag + 1
if LEN(LTRIM(RTRIM(@email))) - LEN(REPLACE(LTRIM(RTRIM(@email)),'@','')) <> 1 set @flag = @flag + 1
if CHARINDEX('.',REVERSE(LTRIM(RTRIM(@email)))) < 3 set @flag = @flag + 1
if @email = '' set @flag = @flag + 1
if @email = 'null' set @flag = @flag + 1
if @email NOT LIKE '%@%.%' set @flag = @flag + 1
if @email IS NULL set @flag = @flag + 1
if @email LIKE '%@%@%' set @flag = @flag + 1

if @flag = 0 return 0
return 1
END

Basically, the function creates a flag variable and adds a number to it every time it reaches a bad part of the email address. This is not as effective as a regular expression, but is fast and effective in SQL where regular expression support is not usually available. Create this function in your database by running the statement in the query window.

The fnCheckDomainSuffix() Function

The next function is a little simpler, as it checks a table of known domain suffixes. We'll create the table in a minute.

CREATE FUNCTION [dbo].[fnCheckDomainSuffix]
(@field varchar(255))
RETURNS int AS
BEGIN
DECLARE @match int
SET @match = 0
SELECT @match = COUNT(*) FROM aaa_domains
WHERE @field LIKE '%' + domainext

return CASE WHEN @match = 0 THEN 1 ELSE 0 END
END

Again, we'll return a 1 if the email address does not match and a 0 if it does match, but we'll use a variable called @match to hold the position of the match within the email address. If the value is 0, no match was found and we'll return 1.

The following is the table we will be matching. You can run this in the query window to create the table and populate it with some sample domain suffixes.

CREATE TABLE [dbo].[_domain_suffixes](
[domainext] [varchar](255) NULL,
[description] [varchar](255) NULL
) ON [PRIMARY]

INSERT INTO _domain_suffixes (domainext,description) VALUES ('.biz','United States Business site ' )
INSERT INTO _domain_suffixes (domainext,description) VALUES ('.coop','Business coops organizations' )
INSERT INTO _domain_suffixes (domainext,description) VALUES ('.com','A top-level domain name used for commercial Internet sites in the United States' )
INSERT INTO _domain_suffixes (domainext,description) VALUES ('.info','United States information site' )
INSERT INTO _domain_suffixes (domainext,description) VALUES ('.jobs','Job related sites' )
INSERT INTO _domain_suffixes (domainext,description) VALUES ('.museum','Worldwide museums' )
INSERT INTO _domain_suffixes (domainext,description) VALUES ('.name','Individuals, families' )

The data to populate it completely with 268 different top-level domains is in the zip package that accompanies the article.

Some Sample Data

To test this out, we'll need some sample data, including some good addresses and some bad ones. The following will create a sample table and populate it for use in this exercise.

CREATE TABLE [dbo].[tblUsers](
[userID] [int] IDENTITY(1,1) NOT NULL,
[FirstName] [varchar](20) NULL,
[LastName] [varchar](20) NULL,
[Email] [varchar](128) NULL,
[Active] [bit] NULL
) ON [PRIMARY]

GO

INSERT tblUsers (firstname, lastname, email, active) VALUES ('Tom','Muck','tomfake@hotmail.com',1)
INSERT tblUsers (firstname, lastname, email, active) VALUES ('Jack','Splat','jack@splat.com',1)
INSERT tblUsers (firstname, lastname, email, active) VALUES ('Fred','Flintstone','fred@flintstones.com',1)
INSERT tblUsers (firstname, lastname, email, active) VALUES ('Jimbo','Bingo','yadayada@yada.yada',1)
INSERT tblUsers (firstname, lastname, email, active) VALUES ('adf','adf','asdf',1)
INSERT tblUsers (firstname, lastname, email, active) VALUES ('Biff','Bopper','biff@biffbopper',1)
INSERT tblUsers (firstname, lastname, email, active) VALUES ('Jacob','Jacobson','jakethesnake@@hotmail.com',1)
INSERT tblUsers (firstname, lastname, email, active) VALUES ('Slim','Typo','slimtypo@hotmail.comm',1)
INSERT tblUsers (firstname, lastname, email, active) VALUES ('Slender','Slim','null',1)
INSERT tblUsers (firstname, lastname, email, active) VALUES ('Yolanda','Yellow','x',1)

This gives you a small table of users with some invalid addresses. Obviously, with a list this small you could correct it manually, but if you have 1000 records you will find the following useful.

Using the Functions

With the functions and sample data in place, open a new query window in SQL and type the following statement:

SELECT * FROM tblUsers
WHERE dbo.emailFilter(email) = 1

That gives you the following table:

5 adf adf asdf
6 Biff Bopper biff@biffbopper
7 Jacob Jacobson jakethesnake@@hotmail.com
9 Slender Slim null
10 Yolanda Yellow x

Now, you can see the bad email addresses plainly. If you ran this on a table with a few thousand records, only the bad records will show. Now you can make some data cleanup manually. The first record is obviously bad -- we can either delete it or mark it invalid. We'll save that for later, though. The next record can possibly be corrected -- many people leave out the ".com" when they type their address by mistake. You can try to add the .com in there and see what happens. The next address has two @@. This is easy -- remove one. The last two again are bad data. We'll do something with these in a minute.

The second function is used the same way. We'll see some overlap in the bad data found, but that's ok. Run this statement:

SELECT * FROM tblUsers
WHERE dbo.fnCheckDomainSuffix(email) = 1

Now, we get results like this:

4 Jimbo Bingo yadayada@yada.yada
5 adf adf asdf
6 Biff Bopper biff@biffbopper
8 Slim Typo slimtypo@hotmail.comm
9 Slender Slim null
10 Yolanda Yellow x

Some of the bad records from the previous query are seen here, because they fail on this function too, but the new ones are different. The first, obviously an attempt to pass your email validation if you have it, is a well-formed email address but does not have a valid domain or top-level domain suffix. Obviously this needs to be deleted. The 4th record is probably a type -- ".comm" is probably a mistake and the user meant ".com". We can correct this one.

After correcting the addresses, we can run the query again, only this time use both functions:

SELECT * FROM tblUsers
WHERE dbo.emailFilter(email) = 1 OR dbo.fnCheckDomainSuffix(email) = 1

Now we get this:

4 Jimbo Bingo yadayada@yada.yada
5 adf adf asdf
9 Slender Slim null
10 Yolanda Yellow x

All bad, with no way to correct. If we want to delete these records, we can change the query to a DELETE query:

DELETE FROM tblUsers
WHERE dbo.emailFilter(email) = 1 OR dbo.fnCheckDomainSuffix(email) = 1

Alternatively, if we have a column where a user is marked active or inactive, we can set that value to "inactive", or in our table, a 0.

UPDATE tblUsers
SET active = 0
WHERE (dbo.emailFilter(Email) = 1) OR
(dbo.fnCheckDomainSuffix(Email) = 1)

Now, we have a much cleaner list and any email that bounces back to us is more likely to be an old or bad email address.

Conclusion

You can't prevent people from making mistakes or putting bad data into your web forms, but you can make some small efforts to correct data when it does happen. This article showed one way using SQL Server functions.