Let us look again at the evolution of the brain. A concept model is a model that fits a large number of entities. It presumably has to be recorded by the same kind of hardware that records a normal image-model. There must also be a connection between a concept model and every particular model it covers.
As the level of conceptualization increases (e.g. from "apple" to "fruit"), the structure becomes very complex. It becomes even more complex when it evolves from "fruit" to "food". In theory, an evolutionary process could produce such a structure, but the increase in complexity is so large that it is difficult to believe it could arise without specialized hardware.
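The growth described above can be made concrete with a small toy sketch. It is only an illustration under stated assumptions, not a model of the brain: the class name Model, the tree layout, and the sample entities are all hypothetical, and each concept is assumed to reach its particular models through the intermediate concepts below it.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """Hypothetical node: a particular model (no children) or a concept model."""
    name: str
    covers: list["Model"] = field(default_factory=list)  # models directly covered

    def particulars(self) -> list[str]:
        """Names of all particular (leaf) models reachable from this model."""
        if not self.covers:
            return [self.name]
        found: list[str] = []
        for child in self.covers:
            found.extend(child.particulars())
        return found

    def link_count(self) -> int:
        """Total number of connections that sit below this model."""
        return sum(1 + child.link_count() for child in self.covers)


# Assumed sample data: three levels of conceptualization.
apple = Model("apple", [Model("apple #1"), Model("apple #2")])
pear = Model("pear", [Model("pear #1")])
fruit = Model("fruit", [apple, pear])
bread = Model("bread", [Model("bread #1")])
food = Model("food", [fruit, bread])

for concept in (apple, fruit, food):
    print(concept.name, len(concept.particulars()), concept.link_count())
```

In this toy layout every step up ("apple", then "fruit", then "food") adds to both the number of particular models covered and the number of connections that must be maintained, which is the kind of growth the argument relies on.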
Level 2 is very close to level 3 but, as we see, no animal has been able to reach level 3. Even the most advanced animals, such as dolphins, show no tendency towards level 3.
The first drawings on cave walls have been dated to about 150,000 years ago. Such drawings must have been produced from long-range image-models. But such drawings are of no use without explanations (symbolic messages), because the same drawing can be associated with many situations. It is fair to assume that, by that time, primitive human beings were able to use a symbolic model for communication (a primitive language).
One idea is that the growing capacity of the brain to make long-range image-models also supported the making of symbolic models. On the basis of MDT, this idea cannot be supported.
Indeed, the drawings made by children aged 5 to 12 are rather primitive. At that age, children have very few long-range models. Yet they are able to make and operate symbolic models, including languages used to communicate with computers.
Thus, it seems that long-range image-models are not necessary for making symbolic models. This also supports the idea that symbolic models are made by special hardware.
The case for the existence of specialized hardware rests on the following:
There is an image-model and an associated label-model (a word). The word has a definition (based on other words). It is clear that there must be hardware to record the image-model and other (associated) hardware to record the definition. At level 4, the image-model no longer exists.
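The relation between the label-model, its definition, and the image-model can be sketched as a small data structure. This is a minimal sketch under stated assumptions: the names LabelModel and is_purely_symbolic and the two sample words are hypothetical and serve only to show how a label can exist with or without an image-model behind it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabelModel:
    """Hypothetical record: a word, its definition, and an optional image-model."""
    word: str
    definition: tuple[str, ...]        # the definition is built from other words
    image_model: Optional[str] = None  # None when no image-model exists


def is_purely_symbolic(label: LabelModel) -> bool:
    """True when the label is carried by its definition alone (the level 4 case)."""
    return label.image_model is None


# Assumed sample entries, chosen only for illustration.
apple = LabelModel("apple", ("round", "edible", "fruit"), image_model="image of an apple")
justice = LabelModel("justice", ("fair", "treatment", "under", "rules"))

print(is_purely_symbolic(apple))    # False: an image-model backs the word
print(is_purely_symbolic(justice))  # True: only the definition remains
```

The two fields are kept separate to mirror the claim that the image-model and the definition are recorded by different, associated hardware.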
If this new hardware had been built by evolution, it is difficult to understand why there are no intermediate stages. Dolphins, which are considered the most advanced animals, show no tendency to build symbolic models.