My ongoing side project involves using data techniques to match people across data sources. Matching is a useful and surprisingly subtle operation: given data sets of sizes n and m, comparing every record in one set against every record in the other is an (n*m) operation. In a very good world, for example when one set is sorted or indexed so that each lookup works like a binary search, a comparison of sets can be an (n*C*log2(m)) operation.
The log2(m) function is very nice because the logarithm increases very slowly.
Log2( one million ) ~= 20
Log2( one billion ) ~= 30
Log2( one trillion ) ~= 40
There will always be a constant C because of overhead, but the difference in scale is what matters: a (million*million) operation will break our computers, while a (million*20) operation will not.
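To make the difference concrete, here is a minimal Python sketch of both approaches. It is my own illustration, not code from the project: the function names are hypothetical, and it assumes each record has already been reduced to a single comparable key with no duplicates in the second set. The fast version sorts the second set once (a one-time cost of roughly m*log2(m)) and then does one binary search per record in the first set.

```python
from bisect import bisect_left


def match_brute_force(left, right):
    """Compare every key in left with every key in right: exactly n*m comparisons."""
    matches = []
    comparisons = 0
    for a in left:
        for b in right:
            comparisons += 1
            if a == b:
                matches.append(a)
    return matches, comparisons


def match_binary_search(left, right):
    """Sort right once, then binary-search it for each key in left.

    Each lookup inspects about log2(m) elements, so the matching loop
    is on the order of n*log2(m) rather than n*m.
    """
    sorted_right = sorted(right)  # one-time cost, roughly m*log2(m)
    matches = []
    for a in left:
        i = bisect_left(sorted_right, a)
        if i < len(sorted_right) and sorted_right[i] == a:
            matches.append(a)
    return matches


if __name__ == "__main__":
    left = [3, 7, 42, 100]        # hypothetical record keys
    right = list(range(50))       # hypothetical record keys, no duplicates
    print(match_brute_force(left, right))
    print(match_binary_search(left, right))
```

Both functions return the same matches; only the amount of work differs. Real people-matching is rarely an exact key comparison, so treat this as a picture of the scaling argument rather than a matching algorithm.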
Factorials are also important in data matching. Factorials come up when we are trying to match various combinations of data points. Here is an excellent tutorial to understand the use of factorials in matching.
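As a small illustration of why factorials matter here, the sketch below counts two things that can come up when matching combinations of data points. These are examples I chose for illustration, with hypothetical function names: the number of ways to pair n records one-to-one with n other records is n!, and the number of ways to choose k fields out of n to compare on is n!/(k!*(n-k)!).

```python
import math


def one_to_one_pairings(n):
    """Number of distinct one-to-one pairings between two sets of n records: n!"""
    return math.factorial(n)


def field_combinations(total_fields, chosen):
    """Number of ways to choose `chosen` fields out of `total_fields`.

    Equal to total_fields! / (chosen! * (total_fields - chosen)!).
    Requires Python 3.8+ for math.comb.
    """
    return math.comb(total_fields, chosen)


if __name__ == "__main__":
    for n in (3, 5, 10, 20):
        print(n, one_to_one_pairings(n))
    print(field_combinations(5, 2))
```

Unlike the logarithm, the factorial grows extremely fast, which is why trying every possible combination stops being practical even for small inputs.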