Data clustering


Clustering is an important technique for grouping data. Some examples include the following.

Customer/market segmentation
Graphics such as computer vision

Clustering is based on some idea of what it means for two entities to be "equal" or, in most cases, "almost equal" in some sense.

We will look at the following.

Clustering and visualization in general
Exact clustering (when possible)
Approximate clustering

Humans have a unique ability to abstract and recognize patterns and make abstract inferences from those recognized patterns.

Human brains are built for complex abstraction. What does that mean exactly?

In abstract art, something is taken away and something remains; one then needs to interpret what is meant or intended. To abstract is to take away from the essentials and thereby to ignore certain differences.

For most purposes, an abstraction is looking at similarities and ignoring differences. The similarity is what is the same. The difference is what is different.

In simple terms, abstraction is looking at similarities and ignoring differences.

Abstraction arises from a recognition of similarities between certain objects, situations, or processes in the real world, and the decision to concentrate on these similarities, and to ignore for the time being the differences.

Structured programming

Abstraction is the key to higher level intelligence. That is why so many questions are of the form, "What is the primary similarity and difference between ...".

Much of computer science, and of programming languages in particular, involves looking at patterns in text and making abstractions.

How many triangles do you see?

Do you see the triangle?

Kanizsa Triangle

There is no triangle! Your brain makes the triangle that you see.

Abstraction involves looking at similarities and differences and filling in missing details - sometimes appropriately, sometimes inappropriately.

How many triangles do you see now? Do they exist?

Kanizsa Triangle

To many, the triangle that is seen but does not exist appears brighter than the surrounding area.

This type of illusion was discovered/created by Gaetano Kanizsa, who popularized such illusions, in part through his 1976 Scientific American article on the subject (though he had been working on such ideas for many years before that article).

The triangle is still seen when just dots are present at the corners.

Kanizsa Triangle

Can a triangle that does not exist be "whiter than white"? White is white, right?

Here is a Necker cube. Which corner is nearest to the viewer?

Do you see the Necker cube now? It does not exist except in your mind.

In programming terms, to abstract is to replace one or more parts of a program with a name that refers to the replaced parts (thus hiding the details). Here are some programming constructs that are used for abstraction.

constants and variables
procedures/functions with parameters
modules
objects and classes
... and many other concepts ...

Much of the work that I assign for programming language assignments follows this general pattern. Here is the setup.

Provide example code and explanations for concept A.
Provide example code and explanations for concept B.

Here is the requirement.

Do something that combines concepts A and B into C.

By understanding A and B, including a lot of textual abstracting, one can then write/construct a program to do C.

Consider the following pattern (which could go on more in an extended example).

Two times 0 is 0. Two times 1 is 2. Two times 2 is 4. Two times 3 is 6. Two times 4 is 8. Two times 5 is 10. Two times 6 is 12. Two times 7 is 14. Two times 8 is 16. Two times 9 is 18.

A good programmer would immediately visually recognize the pattern and, if asked, could write a simple program such as the following to output that pattern. Here is the C code.

#include <stdio.h>

int main() {
	for (int i1=0; i1 <= 9; i1++) {
		printf("Two times %d is %d.\n",i1,2*i1);
		}
	return 0;
	}

Here is the output of the C code.

Two times 0 is 0.
Two times 1 is 2.
Two times 2 is 4.
Two times 3 is 6.
Two times 4 is 8.
Two times 5 is 10.
Two times 6 is 12.
Two times 7 is 14.
Two times 8 is 16.
Two times 9 is 18.

A not-so-good programmer would not see the pattern and might attempt the following.

Here is the C code.

#include <stdio.h>

int main() {
	printf("Two times 0 is 0.\n");
	printf("Two times 1 is 2.\n");
	printf("Two times 2 is 4.\n");
	printf("Two times 3 is 6.\n");
	printf("Two times 4 is 8.\n");
	printf("Two times 5 is 10.\n");
	printf("Two times 6 is 12.\n");
	printf("Two times 7 is 14.\n");
	printf("Two times 8 is 16.\n");
	printf("Two times 9 is 18.\n");
	return 0;
	}

Here is the output of the C code.

Two times 0 is 0.
Two times 1 is 2.
Two times 2 is 4.
Two times 3 is 6.
Two times 4 is 8.
Two times 5 is 10.
Two times 6 is 12.
Two times 7 is 14.
Two times 8 is 16.
Two times 9 is 18.

In both cases, the program has the same output.

But the first program has less redundancy (repetition) and is considered the better program.

In general, of two programs that produce the same effect, the smaller one is considered the better program.

Specifically, the better program is smaller but also minimizes any non-computer-checked redundancy.

That means that any parameter or concept that is important in the program and that could be changed should be changeable in one and only one place.

Note: If the computer is checking the redundancy (and not a human), then that redundancy is not necessarily bad. (Backup systems are redundant but useful.) And there is redundancy in a program that cannot be avoided (e.g., variables with the same name - when they represent the same or different memory locations).

Bloom's taxonomy of educational objectives has a foundation of knowledge and remembering and a goal of abstract problem solving - evaluation and creating.

Think of music, playing a musical instrument, and creating a score of music to play. In computer terms, the musical analogy is as follows.

Using a computer (program) is listening to music.
Writing a program from a design (including pseudo-code) is playing an instrument, such as the C programming instrument, the Java programming instrument, etc.
Designing and creating pseudo-code to solve a problem is creating a score of music to be played.

How does one learn music (after listening to music for a while)?

An approach often used in computer science with a beginning programmer is the following.

Think of some music to which you like to listen.
Now create a musical score for that music.
Now play that musical score on the musical instrument that you are here to learn.

Will that work well? How about the following.

Think of some music to which you like to listen.
Here is a musical score (design, pseudo-code, etc.) that represents that music.
You will now learn how to play your instrument (write a program in a language) using that musical score (design and pseudo-code) for the music to which you like to listen.

Humans can easily visualize 2D or 3D in graphics but higher dimensions are harder to visualize.

In data science, one often learns concepts using examples in 2D or 3D and then generalizes via abstraction to many more dimensions.

Working in 2D or 3D can thus help one understand the method that then generalizes to higher dimensions.
