Bayesian Filter

This is a simple Bayesian filter that allows you to have multiple categories. Most filters only allow two categories (such as Spam and Not Spam). Once trained, this will allow you to calculate the probability that a phrase belongs to a category.

Download

Click Here for the Source Code. It is saved here as a ".txt" file, but you will want to save it as a ".php" file.

Description

This is intended to be extremely simple to use. Once you set the defines and global variable at the top of the script, you create your Bayesian directory. Inside this directory, you will have a text file for each category. The category files simply contain example phrases that belong to the category.
Warning! Do not overtrain the filter with thousands of example phrases; it will run extremely slowly. Instead, start with a minimal set of examples and then add only the phrases that it gets wrong.
Once you have a folder set up, create the filter object in your script with: $my_filter=new Bayesian_Filter("bayesian_directory_name");
Once the filter is created, get the category of a phrase with: $cat=$my_filter->probable_category("The phrase to check.");
That is all for common usage. There are examples of advanced usage in the script's comments.

Windows Users Note

This script was written for Linux and also works on Unix. I have not tested it on Windows. The compatibility issue comes from its direct calls to the operating system. By default, they are:

  • grep: Return the lines of a file/stdin that contain a supplied keyword.
  • wc -l: Return the number of lines in a file/stdin.
  • echo: Print a value to stdout.
If Windows has equivalents to these commands, this script should work. I do not have a Windows machine, so I cannot test it.
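
To make the job of these commands concrete, here is a minimal sketch in Python (not the script's PHP) of what grep and wc -l compute. The helper names grep_lines and count_lines are hypothetical, and the matching is a plain substring test, roughly what `grep -F` does; the script's actual grep options are not shown here, so treat that as an assumption.

```python
def grep_lines(lines, keyword):
    """Return the lines that contain keyword as a plain substring (like `grep -F`)."""
    return [line for line in lines if keyword in line]


def count_lines(lines):
    """Return the number of lines (like `wc -l` on newline-terminated input)."""
    return len(lines)


sample = ["Abraham Lincoln", "John Lennon", "29-A Lincoln Center"]
matches = grep_lines(sample, "Lincoln")
print(count_lines(matches), "of", count_lines(sample))  # prints: 2 of 3
```

Any platform that can filter lines by keyword and count them can supply the same two numbers, which is all the frequency calculation needs.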

How Does It Work?

Assume you have the following three category files:

  • Name (4 lines):
    • Abraham Lincoln
    • John Lennon
    • Douglas Adams
    • Shaun Wagner
  • Address (3 lines):
    • 1600 Pennsylvania Ave.
    • One Microsoft Way
    • 29-A Lincoln Center
  • CSZ (6 lines):
    • Washington, DC 20500
    • Redmond, WA 98052-6399
    • Kansas City, MO 64154
    • 29 Palms, CA
    • Charleston, SC 29401
    • Charleston, SC 29407
Notice that the files do not necessarily contain the same number of lines. This script takes that into account when it calculates probabilities.
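
As a rough illustration of the directory layout, the following Python sketch (not the script's PHP) writes the three category files above into a temporary directory, reads them back, and counts the lines in each. The ".txt" file names and the load_categories helper are assumptions made for this example; the script's own naming rules are set by its defines.

```python
import tempfile
from pathlib import Path


def load_categories(directory):
    """Map each category file's stem to its non-empty lines (hypothetical helper)."""
    categories = {}
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        categories[path.stem] = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return categories


# The three category files from the example; the file names are assumed.
SAMPLES = {
    "Names": ["Abraham Lincoln", "John Lennon", "Douglas Adams", "Shaun Wagner"],
    "Addresses": ["1600 Pennsylvania Ave.", "One Microsoft Way", "29-A Lincoln Center"],
    "CSZ": [
        "Washington, DC 20500",
        "Redmond, WA 98052-6399",
        "Kansas City, MO 64154",
        "29 Palms, CA",
        "Charleston, SC 29401",
        "Charleston, SC 29407",
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    for name, lines in SAMPLES.items():
        Path(tmp, name + ".txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    categories = load_categories(tmp)

# Per-category line counts are the denominators used in the frequencies below.
line_counts = {name: len(lines) for name, lines in categories.items()}
total_lines = sum(line_counts.values())
print(line_counts, total_lines)
```

The unequal counts (4, 3, and 6 lines, 13 in total) are why each frequency is divided by its own category's line count rather than by a shared total.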
Now, we have a filter that can classify a phrase as a person's name, a street address, or a combination of city, state, and zip code. Let's start with a simple example: classify John Wagner.

  • Break phrase into words: "John" and "Wagner"
    • Get Bayesian Probability for "John"
      • Calculate Frequencies: Frequency is the number of lines in which the word occurs divided by the number of lines total.
        • Is in 1/4 of Names.
        • Is in 0/3 of Addresses.
        • Is in 0/6 of CSZ.
        • Is in 1/13 of all samples.
      • Calculate Bayesian Probability: Bayesian Probability is the frequency for the category divided by the frequency for all lines. (The result is a relative score, so it can be greater than 1.)
        • Names: (1/4)/(1/13) = 3.25
        • Addresses: (0/3)/(1/13) = 0
        • CSZ: (0/6)/(1/13) = 0
    • Get Bayesian Probability for "Wagner"
      • You can see that it appears only once, in Names, so its values are identical to the Bayesian probabilities for "John".
  • Average the Bayesian probabilities for each word: Since both words had the same values, the average is identical to the probabilities for either word.
  • Return the category with the highest probability: Names is the highest value, so it is returned.
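
The per-word arithmetic above can be sketched in Python (not the script's PHP) as follows. The function names, the lowercase alphanumeric tokenization, whole-word matching, and the choice to score unseen words as 0 are all assumptions for illustration; they are not taken from the script.

```python
import re

# The three example category files, one list entry per line.
CATEGORIES = {
    "Names": ["Abraham Lincoln", "John Lennon", "Douglas Adams", "Shaun Wagner"],
    "Addresses": ["1600 Pennsylvania Ave.", "One Microsoft Way", "29-A Lincoln Center"],
    "CSZ": [
        "Washington, DC 20500",
        "Redmond, WA 98052-6399",
        "Kansas City, MO 64154",
        "29 Palms, CA",
        "Charleston, SC 29401",
        "Charleston, SC 29407",
    ],
}


def words(text):
    """Split text into lowercase alphanumeric tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def frequency(word, lines):
    """Fraction of lines that contain the word as a whole token."""
    hits = sum(1 for line in lines if word.lower() in words(line))
    return hits / len(lines)


def word_scores(word):
    """Category frequency divided by overall frequency, for every category."""
    all_lines = [line for lines in CATEGORIES.values() for line in lines]
    overall = frequency(word, all_lines)
    if overall == 0:
        return {name: 0.0 for name in CATEGORIES}  # unseen word: assumed to score 0
    return {name: frequency(word, lines) / overall for name, lines in CATEGORIES.items()}


john = word_scores("John")
print({name: round(score, 2) for name, score in john.items()})
# prints: {'Names': 3.25, 'Addresses': 0.0, 'CSZ': 0.0}
```

Calling word_scores("Wagner") gives the same values, which is why the average for "John Wagner" equals the score of either word.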
Next, try something a little more complicated: 29 Lennon Way. We can quickly see that it is a street address. How about the filter?
  • Check the words "29", "Lennon", and "Way".
    • 29:
      • Names: (0/4)/(2/13) = 0
      • Addresses: (1/3)/(2/13) = 2.17
      • CSZ: (1/6)/(2/13) = 1.08
    • Lennon:
      • Names: (1/4)/(1/13) = 3.25
      • Addresses: (0/3)/(1/13) = 0
      • CSZ: (0/6)/(1/13) = 0
    • Way:
      • Names: (0/4)/(1/13) = 0
      • Addresses: (1/3)/(1/13) = 4.33
      • CSZ: (0/6)/(1/13) = 0
  • Average the probabilities:
    • Names: (0+3.25+0)/3 = 1.08
    • Addresses: (2.17+0+4.33)/3 = 2.17
    • CSZ: (1.08+0+0)/3 = 0.36
  • Return the highest category: Addresses
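
The full procedure (score each word, average per category, return the highest) can be sketched in Python (not the script's PHP) as below. The names tokens, score, and classify are hypothetical, and whole-word matching on lowercase alphanumeric tokens is an assumption chosen because it reproduces the 2/13 count for "29" in this walkthrough.

```python
import re

# The three example category files, one list entry per line.
CATEGORIES = {
    "Names": ["Abraham Lincoln", "John Lennon", "Douglas Adams", "Shaun Wagner"],
    "Addresses": ["1600 Pennsylvania Ave.", "One Microsoft Way", "29-A Lincoln Center"],
    "CSZ": [
        "Washington, DC 20500",
        "Redmond, WA 98052-6399",
        "Kansas City, MO 64154",
        "29 Palms, CA",
        "Charleston, SC 29401",
        "Charleston, SC 29407",
    ],
}
ALL_LINES = [line for lines in CATEGORIES.values() for line in lines]


def tokens(text):
    """Split text into lowercase alphanumeric tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(word, category):
    """Category frequency divided by overall frequency for one word."""
    def freq(lines):
        return sum(word in tokens(line) for line in lines) / len(lines)

    overall = freq(ALL_LINES)
    return freq(CATEGORIES[category]) / overall if overall else 0.0


def classify(phrase):
    """Average the per-word scores in each category and return the best one."""
    phrase_words = tokens(phrase)
    if not phrase_words:
        raise ValueError("phrase contains no words")
    averages = {
        name: sum(score(word, name) for word in phrase_words) / len(phrase_words)
        for name in CATEGORIES
    }
    return max(averages, key=averages.get), averages


best, averages = classify("29 Lennon Way")
print(best, {name: round(value, 2) for name, value in averages.items()})
# prints: Addresses {'Names': 1.08, 'Addresses': 2.17, 'CSZ': 0.36}
```

Averaging lets the two address-leaning words outweigh the single strong name match on "Lennon".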
You can see that this would fail on the name "Charles Washington": it would be classified as a CSZ, with Charleston and Washington being the only matches. Adding Charles Washington to the names list would help, but it may not cure the problem completely. That is why this is all about PROBABILITIES, not ABSOLUTES.
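
The effect of retraining can be sketched in Python (not the script's PHP) as follows. Since "Charles" only matches "Charleston" as a substring, the hypothetical whole_word flag below lets you compare both matching styles; which one the script actually uses depends on its grep call, so both are assumptions here.

```python
import re

# The three example category files before any retraining.
BASE = {
    "Names": ["Abraham Lincoln", "John Lennon", "Douglas Adams", "Shaun Wagner"],
    "Addresses": ["1600 Pennsylvania Ave.", "One Microsoft Way", "29-A Lincoln Center"],
    "CSZ": [
        "Washington, DC 20500",
        "Redmond, WA 98052-6399",
        "Kansas City, MO 64154",
        "29 Palms, CA",
        "Charleston, SC 29401",
        "Charleston, SC 29407",
    ],
}


def contains(word, line, whole_word):
    """Whole-token match, or plain substring match when whole_word is False."""
    if whole_word:
        return word in re.findall(r"[a-z0-9]+", line.lower())
    return word in line.lower()


def classify(phrase, categories, whole_word=True):
    """Return the category with the highest average per-word score."""
    all_lines = [line for lines in categories.values() for line in lines]
    phrase_words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not phrase_words:
        raise ValueError("phrase contains no words")

    def freq(word, lines):
        return sum(contains(word, line, whole_word) for line in lines) / len(lines)

    averages = {}
    for name, lines in categories.items():
        total = 0.0
        for word in phrase_words:
            overall = freq(word, all_lines)
            total += freq(word, lines) / overall if overall else 0.0
        averages[name] = total / len(phrase_words)
    return max(averages, key=averages.get)


# Retrain by adding the misclassified phrase to the Names examples.
retrained = dict(BASE, Names=BASE["Names"] + ["Charles Washington"])

before = classify("Charles Washington", BASE)
after = classify("Charles Washington", retrained)
after_substring = classify("Charles Washington", retrained, whole_word=False)
print(before, after, after_substring)  # prints: CSZ Names CSZ
```

With whole-word matching the added example is enough to flip the result to Names, while with substring matching the two Charleston lines still tip the average toward CSZ, a case where one extra example does not cure the problem.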