There are a few scenarios where you may want to generate your own unique sequence identifiers. You may be using a database or big data store that does not offer an auto-increment sequence feature. Or maybe you just want a short memorable alpha numeric order number capable of high volumes of unique values.
Microsoft’s Azure Table Storage is one big table database that does not support automatic sequence primary keys. In Azure you define your own partition keys and row keys and you are responsible for creating a way for those ids to be unique. Creating good unique keys for Azure is beyond the scope of this article, as the best Azure partition key and row key is rarely a meaningless generated identity. But there are some scenarios in which you have no other meaningful option. I am using this to generate memorable confirmation numbers.
There are a few tools already available to help generate sequence identifiers as numeric values and syncronize various nodes generating those ids into a datastore that does not support sequence identifiers natively; Notably SnowMaker for .Net applications to Azure, and twitter snowflake for Scala to Cassandra. These do not support generated identifiers based on strings with various character sets. I have included .Net code to generage string identifiers in Candor.
You may wonder why would you need different character sets? It comes down to having a shorter memorable value. How many characters does it take to represent a 64 bit number? In base 10 math (digits 0-9) this is 19 digits. If you are using a case insensitive alpha numeric character set for 36 base math, that only takes 14 characters. For a case sensitive alpha numeric character set (62 characters), a 12 character string.
Coming at it from another angle, how many unique values can you fit in a memorable 10 characters? In base 10 math (0-9) that is 10 billion values (including zero). In base 36 math (characters a-z, 0-9) that is 3,656,158,440,062,980 values (3 Quadrillion). In base 62 math (characters a-z, A-Z, 0-9) that is 839,299,365,868,340,000 values (839 Quadrillion).
What happens when you hit the limit of that character set in 10 digits? Well if your type was an Int64 number, you’re done as in the value either overflows or starts counting up from the lowest negative value it supports. Or maybe by then you can convert to an Int128 number? In the case of the alpha numeric character set, you are storing values in a string column of the database anyway. So just change your ID column from varchar(10) to varchar(11) or varchar(12); whatever gives you a few more years worth of IDs. I am working in Azure table storage these days, so I don’t even specify a column length making it a non issue. That Azure ID column will grow until the max value of an Azure string property is hit; That is 64kb or about 32,765 characters long! I hope you change something before your unique ID’s need to be that long!
With a string based identifier, most databases have no tangible length limit. So you will never be in danger of reaching the final allowed value as you would with a 32, 64, or 128 bit number. The limit with string incrementing is the value you increment by at any given time, which is really a limitation of the .net number system and not the string identifier. You can represent more than a 128 bit number as a string.
So far I have done this in 45 lines of C# code. Maybe you can make it shorter, but here is the concept.
First, you need to know something about the character set. How do you want to treat upper and lower case? Is alpha numeric (a-z, 0-9) a set of 36 characters or 62 characters? If a string being incremented is all lower case do you increment potentially with mixed case or leave it lower case? If it is mixed case, do you ignore that casing and replace each letter with the same case, or replace incremented characters with a specific case? There are lots more questions you may have, and I may not have choosen the ideal defaults.
The first assumption I made is that you want to increment the rightmost characters first and work your way left. Then after all characters are at the highest character position, you add a character to the left side and roll all other characters over to the lowest value character. This is how we do normal base 10 math with digits 0-9, which seems like a good convention to follow. This does not match ordering in file systems which do leftmost comparisons with mixed alpha and numeric values. File systems sort files in other meaningful ways to help normal users, but there is no possible algorithm to decide what the single next incremented value would be with that file system sorting algorithm; since there are multiple possible ways to increment a file name.
Another assumption is the case to use when the caller wants case insensitive incrementing with a source string that is mixed case. It counts the number of upper and lower characters and any changed characters during the increment and makes them whatever case most characters are, defaulting to upper when upper and lower case have the same count.
var chars = source.ToCharArray().ToList(); //string being incremented var characters = charSet.Characters; //potential characters to increment with if (ignoreCase && charSet.IsCaseSensitive) { //change 'characters' to an upper or lower case version var lowerCount = chars.Where(char.IsLower).Count(); var upperCount = chars.Where(char.IsUpper).Count(); if (lowerCount > upperCount) characters = charSet.Characters.Where(value => !char.IsLetter(value) || char.IsLower(value)).ToList(); else characters = charSet.Characters.Where(value => !char.IsLetter(value) || char.IsUpper(value)).ToList(); }
Once we know how many characters we are working with, that tells us what base of math we are using. Base 10 is what most people can add and subtract in their head. In the case of insensitive alpha numeric we are dealing with 36 unique characters, meaning 36 characters per digit, or in other words base 36 math.
var mathBase = ignoreCase ? charSet.CaseInsensitiveLength : charSet.CaseSensitiveLength;
How do you add a number in base 10 with characters 0-9? That is the question I asked myself before writing the most efficient mechanism I could to increment a string numerically. Young kids count on their hands; which is very inefficient the larger value you add by.
Also since most people think in base 10 it is a different answer than you may first think. Most people just add in columns and then deal with carry over by incrementing the next left column by one if the current column is more than 9. But we are potentially dealing with base 10 in your head and base 36 or base 62 math within the algorithm. But this technique is very close to what we need to do.
We do need to work our way from the left to the right and bring carry over with us as we increment the current position over the highest character value. So this means we start with a loop. Conceptually, I first tried with starting at index 0 and going up, and then I tried starting with index of the last position and counting down. neither works due to the next steps which need to do math based on that index. So I started with position 1 and incremented up, which works out later. The exit condition of the loop is not a simple iteration over the character positions because we may also need to add potentially many new positions. So we increment until there is no remaining value count that hasn’t been put into a character translation and until there is no remaining carry over from a prior position increment.
for (var position = 1; remain > 0 || carryOver > 0 ; position++) { ... }
The first thing inside the loop is to make sure the string character array being incremented is long enough to include the current position of the loop. It adds a whitespace character to the beginning if it is not long enough.
if (chars.Count < position) { chars.Insert(0, ' '); }
Now for the most complicated part of the algorithm, a two part calculation. The first line gets the ‘positionBase’. Think of it in base 10 math for a second. In the first whole number position, each character increment counts as an increase of 1. The same applies in any base math. The second position to the left in base 10 increments by 10, which happens to be the base math you are working with. So this code just uses the mathBase value for all positions to the left of the first, and 1 for the first position.
The second line gets the count to be incremented in this position only. Think back to base 10 math when you add in columns left to right. At the moment we only care about how far to increment that position. To do this we use a combination of division and a modulus (remainder of a division), but don’t forget the carry over. We are going to call the increment of this position the ‘posCount’.
var positionBase = position == 1 ? 1 : Math.Pow(mathBase, position - 1); var posCount = ((int)((remain / positionBase) % mathBase)) + carryOver;
To increment the current position character we have three possibilities. First, no change to the position in which case we do nothing. The second is the amount to increment is the same as the base math or number of unique characters in the set. In this case, we set the carry over to one and reduce the remaining amount to increment accordingly.
if (posCount == mathBase) { //no character change, pass along carry over remain -= ((posCount - carryOver) * (int)positionBase); carryOver = 1; }
The most common case is we increment the position by a something between 1 and the number of possible characters in the set (the ‘mathBase’). This is relatively simple as you just check the index of the current value character in the character set and increment from there using wrapping to the beginning of the list as necassary. If the wrap occurs you set the carry over to 1. Also, decrease the remaining value to increment accordingly. remember this remaining value is a base 10 number, with a potentially different base character set.
This logic also does a few things related to case sensitivity options.
else if (posCount > 0) { var posChar = chars[chars.Count - position]; var posCharIndex = characters.IndexOf(posChar); if (ignoreCase && posCharIndex == -1 && char.IsLetter(posChar)) { //Character removed when changing to an upper or lower case version, so get the equivalent case-insensitive character posChar = char.IsLower(posChar) ? char.ToUpper(posChar) : char.ToLower(posChar); posCharIndex = characters.IndexOf(posChar); } //if whitespace char, leave posCharIndex at -1 for replacement if (posCharIndex == -1 && !char.IsWhiteSpace(posChar)) throw new ArgumentException( "The source string contains characters not available in the specified lexical increment character set."); chars[chars.Count - position] = characters[((posCharIndex + posCount) % mathBase)]; carryOver = posCharIndex + posCount < characters.Count ? 0 : 1; remain = Math.Max(0, remain - posCount * (int)positionBase); }
In Candor, this is implemented as an extension method to String.
public static String LexicalAdd(this String source, LexicalCharacterSet charSet, Boolean ignoreCase, Int32 count) { var chars = source.ToCharArray().ToList(); if (count < 0) throw new ArgumentOutOfRangeException("count", count, "Only positive numbers can be added lexically at this time."); var characters = charSet.Characters; if (ignoreCase && charSet.IsCaseSensitive) { //change 'characters' to an upper or lower case version var lowerCount = chars.Where(char.IsLower).Count(); var upperCount = chars.Where(char.IsUpper).Count(); if (lowerCount > upperCount) characters = charSet.Characters.Where(value => !char.IsLetter(value) || char.IsLower(value)).ToList(); else characters = charSet.Characters.Where(value => !char.IsLetter(value) || char.IsUpper(value)).ToList(); } var mathBase = ignoreCase ? charSet.CaseInsensitiveLength : charSet.CaseSensitiveLength; Int32 remain = count, carryOver = 0; //position is counting from the right most character, since we are treating these characters as a number for (var position = 1; remain > 0 || carryOver > 0 ; position++) { if (chars.Count < position) { chars.Insert(0, ' '); } var positionBase = position == 1 ? 1 : Math.Pow(mathBase, position - 1); var posCount = ((int)((remain / positionBase) % mathBase)) + carryOver; if (posCount == mathBase) { //no character change, pass along carry over remain -= ((posCount - carryOver) * (int)positionBase); carryOver = 1; } else if (posCount > 0) { var posChar = chars[chars.Count - position]; var posCharIndex = characters.IndexOf(posChar); if (ignoreCase && posCharIndex == -1 && char.IsLetter(posChar)) { //Character removed when changing to an upper or lower case version, so get the equivalent case-insensitive character posChar = char.IsLower(posChar) ? char.ToUpper(posChar) : char.ToLower(posChar); posCharIndex = characters.IndexOf(posChar); } //if whitespace char, leave posCharIndex at -1 for replacement if (posCharIndex == -1 && !char.IsWhiteSpace(posChar)) throw new ArgumentException( "The source string contains characters not available in the specified lexical increment character set."); chars[chars.Count - position] = characters[((posCharIndex + posCount) % mathBase)]; carryOver = posCharIndex + posCount < characters.Count ? 0 : 1; remain = Math.Max(0, remain - posCount * (int)positionBase); } } return String.Concat(chars); }
This depends on class LexicalCharacterSet to define what is in a character set, and attributes about that set.
public class LexicalCharacterSet { private readonly IList<char> _characters; private readonly Int32 _caseInsensitiveLength; private readonly string _name; public IList<char> Characters { get { return _characters; } } public Int32 CaseInsensitiveLength { get { return _caseInsensitiveLength; } } public Int32 CaseSensitiveLength { get { return _characters.Count; } } public Boolean IsCaseSensitive { get { return CaseInsensitiveLength < CaseSensitiveLength; } } public String Name { get { return _name; } } public LexicalCharacterSet(String name, Int32 caseInsensitiveLength, IList<char> chars) { _name = name; _characters = new ReadOnlyCollection<char>(chars); _caseInsensitiveLength = caseInsensitiveLength; } public LexicalCharacterSet(String name, Int32 caseInsensitiveLength, IEnumerable<char> chars) { _name = name; _characters = new ReadOnlyCollection<char>(new List<char>(chars)); _caseInsensitiveLength = caseInsensitiveLength; } public char FindNext(char after, bool ignoreCase) { var index = _characters.IndexOf(after); for (var c = index + 1; c <= _characters.Count - 1; c++) { if (!_characters.ToString(CultureInfo.InvariantCulture) .Equals(after.ToString(CultureInfo.InvariantCulture), ignoreCase ? StringComparison.InvariantCultureIgnoreCase : StringComparison.InvariantCulture)) return _characters; } if (ignoreCase && char.IsLetter(after)) { for (var c = 0; c < index; c++) { if (!char.IsLetter(_characters)) return _characters; if (char.IsLower(after) && char.IsLower(_characters)) return _characters; if (char.IsUpper(after) && char.IsUpper(_characters)) return _characters; } } return _characters[0]; } }
Tatham Oddie, July 2011: Released: SnowMaker – a unique id generator for Azure (or any other cloud hosting environment)
Released: SnowMaker – a unique id generator for Azure (or any other cloud hosting environment)
Josh Twist, Nov 2010: Synchronizing Multiple Nodes in Windows Azure
http://msdn.microsoft.com/en-us/magazine/gg309174.aspx
Wikipedia: Lexicographical Order
http://wikipedia.org/wiki/Lexicographical_order
Wikipedia: Shortlex Order
http://en.wikipedia.org/wiki/Shortlex_order
1 Comment