Mask

This is a function used to mask an integer ID with a key. It is a symmetric function. If you mask the masked ID with the same key, you will get the original ID back. I use this a lot because I work with large data sets and send small portions of it to researchers. I don't want the researchers to be able to join a data set from one request with a data set for another request. So, I mask the identifiers with a different key for each request. If I need to get the original identifiers back, I just mask the masked identifiers.

The Code

I put the code right up front for those who want it. There are many versions of this function because I use it in multiple environments. I have tested all versions to make sure that they have the same output, meaning that if you get a masked ID from one function, you will get the exact same masked ID from other version of the function.

This is the PHP version of this function:

	function mask($id, $key)
	{
		$id = ''.intval($id);
		$key = ''.intval($key);
		$mask = '';
		while(strlen($id)>strlen($key)) $key = $key.$key;
		for($i=0; $i<strlen($id); $i++)
		{
			$ci = $id{$i};
			$ck = $key{$i};
			$cm = 2*$ck-$ci;
			while($cm<0) $cm+= 10;
			while($cm>9) $cm-= 10;
			$mask.= $cm;
		}
		return $mask;
	}

This is the MySQL/MariaDB version of this function:

	create function mask(id bigint, ky bigint)
		returns bigint(20) deterministic
	begin
		declare cid char(255);
		declare cky char(255);
		declare cmk char(255);
		declare i int;
		declare iid int;
		declare iky int;
		declare imk int;
		set cid = cast(id as char);
		set cky = cast(ky as char(255));
		while(length(cid)>length(cky)) do
			set cky = concat(cky,cky);
		end while;
		set i=0;
		set cmk = '';
		while(i<char_length(cid)) do
			set i = i+1;
			set iid = cast(substring(cid,i,1) as int);
			set iky = cast(substring(cky,i,1) as int);
			set imk = 2*iky-iid;
			while(imk<0) do set imk=imk+10; end while;
			while(imk>9) do set imk=imk-10; end while;
			set cmk = concat(cmk, imk);
		end while;
		return cast(cmk as unsigned);
	end

Usage

Assume you have a huge set of data. Let's say it is the pay history for all employees at a large company and you need to keep that information private. You have been asked to provide the information for a set of the employees to a researcher doing an audit. Every employee has a social security number, but you don't want to give those to the researcher. So, you assign the researcher a key. Let's use the key 42. You use the mask function on every SSN to turn it into a masked SSN. For example, mask(123456789, 42) produces 725038169.

That is all good. You masked the SSN numbers so the researcher can't figure out who the employees are. But, it turns out that they need gender for the employees. You are asked to add that data to the researcher's data set. You need the original SSN to look up the new information. Simple. Use mask(725038169, 42) produces 123456789.

Now, a year passes and the researcher is doing another study. You don't want the researcher to be able to merge the previous data set with the new one. Just use a new key. Let's use 9669 for this study. You will see that mask(123456789, 9669) produces 709436509. There is no way that the researcher will know that 709436509 is the same employee as 725038169.

The Theory

Biwise XOR is a common method for encrypting binary data. It is fast and it guarantees a one-to-one relationship between the original data and the encrypted or masked data. I could use it to encrypt an integer, but the result will likely be a series of unprintable ASCII characters. What I want is a digit-wise decimal XOR function.

Digit-wise means that I work on each digit, one at a time. I pick off the digits by first converting the integer to a string. Then, I grab the characters by index in the string. Every digit in the ID is masked using a digit in the key. The result becomes a digit in the mask.

Decimal XOR simply means that I have a symmetric function. If A XOR B = C, then C XOR B = A. In binary, this works well. There is actually no such thing as XOR for decimal because XOR is a binary function. So, I had to invent decimal XOR.

I used the mathematical function for XOR. Given two binary values, A and B, A XOR B = (2×B-A)%2. I tried it with decimal numbers. I obviously had to change the modulus to 10 for decimal. It still holds. For two decimal values, A and B, A XOR B = (2×B-A)%10.

It is important to note that modulus has two very different ways to handle negative numbers. A simple way to implement the mod function is to take the absolute value of negative numbers. Then, -1%10 = 1. I don't want that implementation. I want -1%10 = 9. To force the implementation that I require, I wrote it as a while loop. In reality, I only need an if statement because the maximum negative value is -9.

You've read it. You can't unread it.