I was trying to explain to someone why some sorting algorithms are faster than others. While there are many factors in the speed of a sort, we often talk about the number of comparisons needed. So the angle of attack I chose was to try and explain how organizing the data in certain ways allows you to avoid ever having to compare certain pairings of items because you effectively already know the result. What I said is that this was all a direct or indirect outcome of the Transitive Property.
The transitive property is this thing we learn about in early math education. It doesn't come up so often in conversation, but in computer science the transitive property provides the reason that we can sort a set of items without having to compare each pair. Getting one's head around the details of exactly how various algorithms do this--and what the limits are--turned out to be tougher than I thought. So I wrote a little bit about it.
Counting Comparisons
Let's imagine you tell me that you like Apples more than Bananas, and you like Bananas more than Carrots. It would be a waste of a time for people to ask you if you liked Apples better than Carrots. Because as long as you don't consider any two foods to be equal, and unless you've got some cognitive dissonance going on, we already know your answer:
Apples > Bananas Bananas > Carrots ...therefore... Apples > Carrots
So if you could only be asked two foods at once, then assuming you don't have "cognitive dissonance" and all your answers are logically consistent...then how many pairs of foods would you have to be asked about before the full ordering was given? Is two enough?
Well, let's look at all the possibilites for how you might have answered the two questions:
(I) Apples > Bananas Bananas > Carrots (II) Apples < Bananas Bananas > Carrots (III) Apples > Bananas Bananas < Carrots (IV) Apples < Bananas Bananas < Carrots
There are four different ways you could have answered these two questions. However, we know from the mathematical law of permutations that there are six different orderings of three items, which we can calculate by using the factorial
3 * 2 * 1
. Which did we leave out?1. Apples > Bananas > Carrots 2. Apples > Carrots > Bananas 3. Bananas > Apples > Carrots 4. Bananas > Carrots > Apples 5. Carrots > Apples > Bananas 6. Carrots > Bananas > Apples
We see that:
(I) implies (1) (II) implies either (3) or (4) (III) implies either (2) or (5) (IV) implies (6)
It might be more illustrative to rewrite the Q & A so the signs are lining up in
a consistent way:
(I) Apples > Bananas Bananas > Carrots (II) Bananas > Apples Bananas > Carrots (III) Apples > Bananas Carrots > Bananas (IV) Bananas > Apples Carrots > Bananas
We picked Bananas to be the item we are trying to get the "transitive benefit" from. It is the thing that appears in both comparisons. The two cases where we don't get that full benefit are the ones that don't have Bananas on a different side of the comparison, so it isn't the "bridge". We know the transitive property in forms like:
"if A > B and B > C, then A > C" ...and... "if A < B and B < C, then A < C"
But no one ever mentions the converses:
"if A > B and C > B, then either A > C or A < C" ...and... "if A < B and C < B, then either A > C or A < C"
In other words, we could only use the relationship of A to B and C to B to learn anything about the relationship between A and C in half the outcomes. In the other half, we learned nothing and are left to compare them directly.
So if we were lucky, and used bananas as the bridge in our two questions, and Bananas happen to be right in the middle of the set...then just two questions was enough to sort it. But 4 times out of 6 Bananas are not in the middle, and we're going to need a third question comparing Apples and Carrots to finish the sort.
Can a Sort be "Psychic"?
Imagine an algorithm that could automatically pick whatever element is in the middle of the set. So if it knew you had scenarios (3) or (5) it would have asked "Do you like Bananas or Apples better?" and then "Do you like Carrots or Apples better?" If it knew you had scenarios (2) or (4) it would have asked "Do you like Bananas or Carrots better?" and then "Do you like Carrots or Apples better?"
Such a psychic sorting algorithms could be very fast indeed! If you have a list of N items, ordered as:
Apples > Bananas > Carrots > Dates > Eggplant
...it is theoretically possible to sort those with only N-1 comparisons:
Q: "Do you like Apples or Bananas better?"A: "Apples."
Q: "Do you like Bananas or Carrots better?"A: "Bananas."
Q: "Do you like Carrots or Dates better?"A: "Carrots."
Q: "Do you like Dates or Eggplant better?"A: "Dates."
Transitive property for the win! Only four questions...and that's the only configuration of those five items that could possibly apply. Sort is done.
But if we tried to use those questions on:
Bananas > Dates > Eggplant > Carrots > Apples
...it wouldn't be quite so easy.
Q: "Do you like Apples or Bananas better?"A: "Bananas."
Q: "Do you like Bananas or Carrots better?"A: "Bananas."
Q: "Do you like Carrots or Dates better?"A: "Dates."
Q: "Do you like Dates or Eggplant better?"A: "Dates."
The three relationships we didn't learn about after the above line of questioning are how Apples relate to Carrots, how Carrots relate to Eggplant, or how Bananas relate to Dates. It's easy to see how three more questions could sort that out, impossible to imagine how just one could do it. If we break our brain a little we might be able to conceive of a scenario where two questions might give us a full order if we are lucky.
The scenario where we're going to get maximum transitive power is going to be one where an existing element bridges us, and we luck out with the bridged answer. Let's try and deduce the Banana-Date relationship by lucking out with the other questions.
Q: "Do you like Apples or Carrots better?"A: "Carrots"
- therefore C > A
- cannot combine with existing knowledge that C > D
- cannot combine with existing knowledge that B > A
- combine with existing knowledge that B > C so B > A
But we already knew that B > A. :-/ Next question...
Q: "Do you like Carrots or Eggplant better?"A: "Eggplant"
- therefore E > C
- cannot combine with existing knowledge that D > C
- cannot combine with existing knowledge that B > C
- combine with existing knowledge that D > E so D > C
Once again something we already knew...that D > C. Okay, try again...
Q: "Do you like Bananas or Dates better?"A: "Bananas"
- therefore B > D
- cannot combine with existing knowledge that B > A
- cannot combine with existing knowledge that B > C
- cannot combine with existing knowledge that B > A
- combine with existing knowledge that D > C so B > C
- combine with existing knowledge that D > E so B > E
Finally a new thing is learned: that B > E. So here the only question with transitive leverage is the Banana vs. Date one...teaching us something about the relationships between Bananas and Eggplant. Could that finding give us a tricky next question that allowed us to deduce the C to A and C to E relationship? I don't see one.
Lower Bounding Sort
This might lead us to ask how many comparisons might you have to do in a pathological sorting case of five elements, and still have a lower bound for a correct sorting algorithm? I'll tell you that it's a published number you can look up, and it is 7:
As I dug deeper I found this question isn't as simple to know the answer to as I might have thought. The table only goes up to proving that 15 elements minimally sorts with 42 comparisons. I wondered how close a heavily peer-optimized sorting algorithm would come to these numbers on some random data, and wrote this little C++ program to test it:
#include <ctime>
#include <vector>
#include <iostream>
#include <iomanip>
#include <algorithm>
using namespace std;
int comparisons;
// http://oeis.org/A036604
int minimums[] = {
0, 0, 1, 3, 5, 7, 10, 13, 16,
19, 22, 26, 30, 34, 38, 42
};
#define NELEMS(a) (sizeof(a) / sizeof(a[0]))
bool comparator(const int& lhs, const int& rhs)
{
comparisons++;
return lhs < rhs;
}
int getRandom() {
return rand() % 500;
}
int main() {
for (int n = 1; n < NELEMS(minimums); n++) {
vector<int> items(n, 0);
generate(items.begin(), items.end(), getRandom);
comparisons = 0;
sort(items.begin(), items.end(), comparator);
cout <<
"For " << setw(2) << n << " items: " <<
setw(2) << comparisons << " compares " <<
"(minimum is " << setw(2) << minimums[n] <<
")\n";
}
}
The resulting table shows that the comparisons drift farther and farther from the minimum as the number of elements increases:
For 1 items: 0 compares (minimum is 0) For 2 items: 1 compares (minimum is 1) For 3 items: 4 compares (minimum is 3) For 4 items: 6 compares (minimum is 5) For 5 items: 11 compares (minimum is 7) For 6 items: 16 compares (minimum is 10) For 7 items: 18 compares (minimum is 13) For 8 items: 23 compares (minimum is 16) For 9 items: 29 compares (minimum is 19) For 10 items: 24 compares (minimum is 22) For 11 items: 31 compares (minimum is 26) For 12 items: 26 compares (minimum is 30) For 13 items: 47 compares (minimum is 34) For 14 items: 52 compares (minimum is 38) For 15 items: 71 compares (minimum is 42)
The input data was not chosen to be pathological. Yet still, it appears to quickly drift farther from the minimum as the number of elements increase.
Would it be possible to design a sort that never exceeded the minimum number of comparisons? Yes, although given that the minimum number was only calculated to 15 items in a math paper... it would probably be quite computationally intensive to generate such a sorter. It would be worthwhile only if comparisons were enormously costly relative to the rest of the operations in the system.
One case where it might be worthwhile to seek a minimum would be when sorting slow/costly human decisions. (Though presumably about things more important than preferences of fruits and vegetables!)