If you download and unzip the archive, you get about 3000 files in 79 folders, each file containing a sysex-dump of a DX7 voice bank consisting of 32 voices. That's a lot of voices to sift through!
After listening to these voices for a while, you get the impression that they all sound the same. There are many reasons for that. One of the more prominent ones is that most DX7 sounds fall into one of three categories:
- E-Piano
- Brassy
- Other
In order to test that proposition (and to avoid having to check the same voice over and over again), I wrote a small tool to help me identify duplicate voices in a collection of banks. You can point it at a folder and it will traverse all sub-folders and read all *.syx files that it finds along the way. At the same time it culls all init and null voices. Init voices are those that you get when you initialize a voice slot for editing on a DX7. Null voices are those that are just a stream of 0s. Most likely they originate from unused slots in a software librarian.
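The traversal and culling step could look roughly like the following. This is a minimal sketch, not the actual tool: it assumes the usual 4104-byte layout of a 32-voice bulk dump (6 header bytes, 32 packed voices of 128 bytes each, a checksum and the closing F7) and silently skips anything else. The names `split_bank`, `is_null` and `collect_voices` are made up for illustration, and the init voice is passed in as a template by the caller rather than hard-coded here.

```python
from pathlib import Path
from typing import List, Optional

VOICE_SIZE = 128       # bytes per packed voice in a bulk dump
VOICES_PER_BANK = 32
HEADER_LEN = 6         # F0 43 0n 09 20 00
PARAM_BYTES = 118      # packed voice without the 10-byte name
BANK_SIZE = HEADER_LEN + VOICE_SIZE * VOICES_PER_BANK + 2  # + checksum + F7


def split_bank(data: bytes) -> List[bytes]:
    """Return the 32 packed voices of a bulk dump, or [] if it is not one."""
    looks_ok = (
        len(data) == BANK_SIZE
        and data[0] == 0xF0 and data[1] == 0x43
        and data[3] == 0x09 and data[-1] == 0xF7
    )
    if not looks_ok:
        return []
    body = data[HEADER_LEN:HEADER_LEN + VOICE_SIZE * VOICES_PER_BANK]
    return [body[i:i + VOICE_SIZE] for i in range(0, len(body), VOICE_SIZE)]


def is_null(voice: bytes) -> bool:
    """A null voice is nothing but zero bytes."""
    return not any(voice)


def collect_voices(root, init_voice: Optional[bytes] = None) -> List[bytes]:
    """Walk root recursively and return all non-null, non-init voices.

    init_voice is a caller-supplied template; only its parameter bytes
    (not the name) are compared.
    """
    voices = []
    # Note: on case-sensitive file systems this pattern misses "*.SYX".
    for path in sorted(Path(root).rglob("*.syx")):
        for voice in split_bank(path.read_bytes()):
            if is_null(voice):
                continue
            if init_voice is not None and voice[:PARAM_BYTES] == init_voice[:PARAM_BYTES]:
                continue
            voices.append(voice)
    return voices
```

The checksum is not verified in this sketch; a stricter reader would probably want to do that as well.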
The voices are then "sanitized", i.e. the voice parameter values are clamped to their legal ranges. This is necessary because voice authors sometimes watermark their creations by specifying larger-than-legal values for certain parameters. Because such values cannot be specified ordinarily, a voice that carries them must have been copied (as in "pilfered"). Or so. In any case, the DX7 will simply clamp a value to the legal range (e.g. 0-99 for an operator's output level) when reading such a voice, so we'll do the same. A sanitized voice (118 bytes = 128 minus 10 for the voice name, which we ignore) is then interpreted as a binary number (118*8 = 944 bits) and inserted into a sorted list. This way, duplicates flock together and can easily be identified. Note that a "voice" in this sense is simply a certain configuration of (legal) voice parameter values. It is possible that many of these configurations sound exactly the same (e.g. if the parameters of the first two "stacks" of algorithm 5 are swapped). But they are different configurations and hence treated as different voices.
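To make the sorting idea concrete, here is a small sketch. It treats the first 118 bytes of a packed voice as one big-endian integer and groups voices whose keys are equal. The function names are invented for this example, and `sanitize` is deliberately simplistic: it takes a hypothetical, caller-supplied table that maps a byte offset to its maximum legal value and only handles parameters that occupy a whole byte. Parameters that share a byte as packed bit fields would need extra masking that is not shown.

```python
from itertools import groupby

PARAM_BYTES = 118  # packed voice minus the 10-byte name


def sanitize(voice: bytes, limits: dict) -> bytes:
    """Clamp whole-byte parameters to their maximum legal value.

    limits maps byte offset -> maximum value (a hypothetical table that
    the caller has to provide; it is not spelled out here).
    """
    out = bytearray(voice)
    for offset, maximum in limits.items():
        out[offset] = min(out[offset], maximum)
    return bytes(out)


def sort_key(voice: bytes) -> int:
    """Interpret the 118 parameter bytes as one 944-bit integer."""
    return int.from_bytes(voice[:PARAM_BYTES], "big")


def group_duplicates(voices):
    """Sort by key so that equal voices end up next to each other."""
    ordered = sorted(voices, key=sort_key)
    return [list(group) for _, group in groupby(ordered, key=sort_key)]
```

Because the name bytes are cut off before the key is built, two voices that differ only in their names land in the same group. Comparing the 118-byte slices directly would sort the same way; the integer is just the literal reading of "one big binary number".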
If we do that on Bobby Blues' DX7_AllTheWeb archive, we find that there are in total 95936 (non-null, non-init) voices in 2998 banks. Of these nearly 96000 voices, 26761 are distinct, and 11937 of those are unique, i.e. they appear exactly once. Thus, on average, a single voice appears about 3.6 times (95936 / 26761), or more precisely (because 11937 of them appear exactly once), the ones that are not unique appear on average 5.7 times. Indeed, some voices appear as many as 92 times across various banks! There are also several duplicate banks, i.e. two or more banks that contain exactly the same voices (though not necessarily in the same order). But there is also a sizable number of unique banks, i.e. banks that contain only unique voices.
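For the record, these figures can be derived from nothing more than a count of how often each voice key occurs. The following is a sketch with made-up names (`duplicate_stats` and the dictionary keys are not taken from the actual tool):

```python
from collections import Counter


def duplicate_stats(keys):
    """Summarise how often each voice key occurs in a collection."""
    counts = Counter(keys)
    total = sum(counts.values())          # all voices, duplicates included
    distinct = len(counts)                # different configurations
    unique = sum(1 for n in counts.values() if n == 1)  # appear exactly once
    repeated = distinct - unique          # appear at least twice
    return {
        "total": total,
        "distinct": distinct,
        "unique": unique,
        "mean_copies": total / distinct if distinct else 0.0,
        "mean_copies_of_repeated": (total - unique) / repeated if repeated else 0.0,
        "max_copies": max(counts.values(), default=0),
    }
```

With the numbers quoted above, the two averages work out to 95936 / 26761 ≈ 3.58 and (95936 - 11937) / (26761 - 11937) = 83999 / 14824 ≈ 5.67.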
This raises the question of how few banks you need to keep in order to still have every distinct voice. That is an instance of the set cover problem, where you want the smallest number of subsets covering the whole set. You can compute the exact number(s) using a branch-and-bound algorithm, but that is too much of a hassle. You get a good approximation with a greedy approach: repeatedly pick the set (bank) that contains the largest number of not yet covered elements (voices, in our case), and then delete those voices from all other banks that contain them. If a bank loses its last voice, it is empty and can be deleted. This algorithm assumes, of course, that we don't rearrange voices (move them between banks to fill them up). We only want to eliminate complete banks that we don't need because all the voices therein are already contained in other banks.
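The greedy approximation described here fits in a few lines. The sketch below assumes that each bank has already been reduced to a collection of voice keys (any hashable value will do); the function name and the dictionary input are choices made for this example only.

```python
def greedy_cover(banks):
    """Pick banks greedily until every voice is covered.

    banks maps a bank name to an iterable of voice keys.
    Returns the names of the banks to keep, in the order they were picked.
    """
    remaining = {name: set(voices) for name, voices in banks.items()}
    chosen = []
    while True:
        # Banks that have lost their last voice are no longer needed.
        remaining = {name: v for name, v in remaining.items() if v}
        if not remaining:
            break
        # Take the bank with the most voices that are not covered yet.
        best = max(remaining, key=lambda name: len(remaining[name]))
        covered = remaining.pop(best)
        chosen.append(best)
        # Delete the newly covered voices from all other banks.
        for voices in remaining.values():
            voices -= covered
    return chosen
```

For example, `greedy_cover({"A": [1, 2, 3], "B": [2, 3], "C": [3, 4], "D": [1, 2, 3]})` keeps `A` and `C`: after `A` is picked, `B` and `D` are empty and only voice 4 is left in `C`. The result is not guaranteed to be the smallest possible cover, which is the price for skipping branch-and-bound.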
Still, there are 26761 distinct voices, and if you want to audition them all they should keep you busy for a while. Well, at least you shouldn't expect too many duplicates anymore...