
Implementation of a multi-threaded game engine architecture. Bottleneck of von Neumann architecture

Introduction. Computer technology is developing at a rapid pace. Computing devices are becoming more powerful, smaller, and more convenient, but recently increasing device performance has become a big problem. In 1965, Gordon Moore (one of the founders of Intel) came to the conclusion that "the number of transistors placed on an integrated circuit chip doubles every 24 months."

The first developments in the field of multiprocessor systems began in the 1970s. For a long time, the performance of conventional single-core processors was raised by increasing the clock frequency (up to 80% of performance was determined by the clock frequency alone) while simultaneously increasing the number of transistors on the chip. Fundamental laws of physics stopped this process: chips began to overheat, and feature sizes began to approach the size of silicon atoms. All these factors led to the following:

  • leakage currents increased, and with them heat dissipation and power consumption;
  • the processor became much "faster" than memory, and performance suffered from the latency of accessing RAM and loading data into the cache;
  • the so-called "von Neumann bottleneck" appeared: the single channel between processor and memory limits performance regardless of how fast the processor itself is.

Multiprocessor systems (one way to solve the problem) were not widely used, since they required expensive and difficult-to-manufacture multiprocessor motherboards. So performance was raised in other ways. The concept of multithreading - the simultaneous processing of several instruction streams - proved effective.

Hyper-Threading Technology (HTT) is a "super-threading" technology that allows a processor to execute multiple program threads on a single core. According to many experts, it was HTT that became the prerequisite for the creation of multi-core processors. A processor executing several program threads at the same time is exhibiting thread-level parallelism (TLP).

To unlock the potential of a multi-core processor, an executable program must use all the computing cores, which is not always achievable. Old serial programs that can use only one core will not run any faster on a new generation of processors, so programmers are increasingly involved in the development of new microprocessors.

1. General concepts

Architecture in a broad sense is a description of a complex system consisting of many elements.

As semiconductor structures (microcircuits) evolve, the principles of processor construction, the number of elements they include, and the way their interaction is organized change constantly. CPUs with the same basic structural principles are said to be processors of the same architecture, and those principles themselves are called the processor architecture (or microarchitecture).

The microprocessor (or processor) is the main component of a computer. It processes information, executes programs, and controls other devices in the system. The power of the processor determines how fast programs will run.

The core is the basis of any microprocessor. It consists of millions of transistors located on a silicon chip. The microprocessor is divided into special cells called general-purpose registers (GPRs). The work of the processor essentially consists of fetching instructions and data from memory in a certain sequence and executing them. In addition, to increase the speed of the PC, the microprocessor is equipped with internal cache memory - the processor's internal memory, used as a buffer against interruptions in communication with RAM.

The Intel processors used in IBM-compatible PCs have more than a thousand instructions and belong to the class of processors with an extended instruction set - CISC processors (CISC: Complex Instruction Set Computing).

1.1 High performance computing. Parallelism

The pace of development of computer technology is easy to trace: from ENIAC (the first general-purpose electronic digital computer), with a performance of several thousand operations per second, to the Tianhe-2 supercomputer (1000 trillion floating-point operations per second). This means the speed of computing has grown a trillion times over 60 years. Creating high-performance computing systems is one of the hardest scientific and technical problems. While the raw speed of the hardware has grown only a few million times, the overall speed of computing has grown a trillion times. The effect is achieved through parallelism at every stage of computation. Parallel computing requires finding a rational distribution of memory, reliable ways of transferring information, and coordination of the computational processes.

1.2 Symmetric multiprocessing

Symmetric Multiprocessing (SMP) is a multiprocessor-system architecture in which several processors have access to a common memory. It is a very common architecture and has been widely used in recent times.

With SMP, several processors work simultaneously in one computer, each on its own task. An SMP system with a high-quality operating system distributes tasks rationally between the processors, ensuring an even load on each. A problem arises with memory access, however, since even uniprocessor systems need a relatively long time for it, and in SMP access to RAM is serialized: first one processor, then the next.

Because of the features listed above, SMP systems are used almost exclusively in science, industry, and business, and only rarely in ordinary offices. Besides the high cost of the hardware, such systems need very expensive, high-quality software that supports multi-threaded execution of tasks. Ordinary programs (games, text editors) cannot work effectively on SMP systems, since they do not provide that degree of parallelism. Moreover, adapting a program for an SMP system makes it extremely inefficient on single-processor systems, which forces developers to create several versions of the same program for different systems. An exception is, for example, ABLETON LIVE (a program for creating music and preparing DJ sets), which supports multiprocessor systems. That said, a regular program run on a multiprocessor system will still run a little faster than on a single processor, because hardware interrupts (suspensions of the program for kernel processing) can execute on another, free processor.

An SMP system (like any other system based on parallel computing) places increased demands on memory-bus bandwidth. This often limits the number of processors in a system: modern SMP systems work effectively with up to 16 processors.

Since the processors share memory, it has to be used rationally and the data kept coordinated. In a multiprocessor system, several caches work with one shared memory resource. Cache coherence is the property of caches that ensures the integrity of data stored in individual caches for a shared resource. It is a special case of memory coherence, in which several cores have access to a common memory (ubiquitous in modern multi-core systems). In general terms, the picture is as follows: the same block of data can be loaded into different caches, where the data is processed differently.

If no change notifications are used, an error will occur. Cache coherence resolves such conflicts and keeps the data in the caches consistent.
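To make the cost of coherence traffic concrete, here is a small C++ sketch (the structure names and iteration counts are arbitrary, and exact timings depend on the machine): two threads increment two counters that either share one cache line or are padded onto separate lines. On most multi-core machines the padded version is noticeably faster, because every write to a shared line invalidates the other core's copy of it.

    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>

    // Two counters on the SAME cache line: every write by one core
    // invalidates the other core's copy of the line (coherence traffic).
    struct Shared { std::atomic<long> a{0}; std::atomic<long> b{0}; };

    // The same counters forced onto SEPARATE cache lines.
    struct Padded { alignas(64) std::atomic<long> a{0}; alignas(64) std::atomic<long> b{0}; };

    template <typename Counters>
    long long time_ms() {
        Counters c;
        auto start = std::chrono::steady_clock::now();
        std::thread t1([&] { for (int i = 0; i < 20'000'000; ++i) ++c.a; });
        std::thread t2([&] { for (int i = 0; i < 20'000'000; ++i) ++c.b; });
        t1.join(); t2.join();
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    }

    int main() {
        std::cout << "same cache line:      " << time_ms<Shared>() << " ms\n";
        std::cout << "separate cache lines: " << time_ms<Padded>() << " ms\n";
    }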

SMP systems form a subgroup of MIMD (multiple instruction, multiple data) in Flynn's classification of computing systems (Michael Flynn, professor at Stanford University, co-founder of Palyn Associates). By this classification, almost all varieties of parallel systems can be assigned to MIMD.

Multiprocessor systems are divided into types based on how memory is used: multiprocessors (systems with shared memory) and multicomputers (systems with separate memory). Shared data used in parallel computing requires synchronization. Data synchronization is one of the most important problems, and its solution in the development of multiprocessor and multi-core systems - and of the software they need - is a priority for engineers and programmers. Data can also be shared while being physically distributed; this approach is called non-uniform memory access (NUMA).

These systems include:

  • systems where only the individual processor caches are used to represent data (cache-only memory architecture);
  • systems with cache coherence between the local caches of different processors (cache-coherent NUMA);
  • systems providing shared access to the individual memory of processors without hardware-level cache coherence (non-cache-coherent NUMA).

Using distributed shared memory simplifies the problem of building multiprocessor systems, but significantly increases the complexity of parallel programming.

1.3 Simultaneous multithreading

Given all the above disadvantages of symmetric multiprocessing, it makes sense to develop other ways of improving performance. If you analyze the operation of each individual transistor in a processor, you notice a very interesting fact: most computational operations involve far from all of the processor's components (according to recent research, about 30% of all transistors). So if the processor is performing, say, a simple arithmetic operation, most of it sits idle and could be used for other calculations. For example, if the processor is currently performing floating-point operations, an integer arithmetic operation can be loaded into the free part. To increase the load on the processor, one can implement speculative (or advance) execution of operations, which requires a great complication of the processor's hardware logic. If, instead, the program pre-defines threads (sequences of instructions) that can execute independently of each other, the task is significantly simplified (this method is easily implemented at the hardware level). This idea, which belongs to Dean Tullsen (developed by him in 1995 at the University of Washington), is called simultaneous multithreading (SMT). It was later developed by Intel under the name hyper-threading: one processor executing many threads is perceived by the Windows operating system as multiple processors. Using this technology again requires an appropriate level of software. The maximum effect from multithreading technology is about 30%.

1.4 Multi-core

Multithreading technology is, in a sense, the software-level counterpart of multi-core. A further increase in performance, as always, requires changes in the processor hardware. But complicating systems and architectures is not always effective; there is an opposite principle: "everything ingenious is simple!" Indeed, to increase processor performance it is not at all necessary to raise its clock frequency or complicate the logical and hardware components - it is enough to rationalize and refine the existing technology. This approach is very profitable: there is no need to solve the problem of growing processor heat dissipation or to develop new, expensive equipment for chip production. The approach was implemented as multi-core technology - several computing cores on a single chip. If you take the original processor and compare the performance gains from the various enhancements, it becomes clear that multi-core technology is the best option.

If we compare symmetric multiprocessor and multi-core architectures, they turn out to be almost identical. The cache memory of the cores can be multi-level (local and shared, with data from RAM loadable directly into the level-2 cache). Given the advantages of the multi-core processor architecture, manufacturers focus on it. The technology turned out to be quite cheap to implement and universal, which made it possible to bring it to the mass market. In addition, this architecture has made its own adjustment to Moore's law: "the number of computing cores in a processor will double every 18 months."

If you look at the modern computer hardware market, you can see that devices with four- and eight-core processors dominate, and processor manufacturers say that processors with hundreds of cores will soon reach the market. As has been said repeatedly, the full potential of a multi-core architecture is revealed only with high-quality software. The computer hardware and software industries are thus very closely related.

But as new frequency peaks were conquered, raising the frequency further became harder, since it increased the TDP of processors. So developers began to grow processors "in width" instead - adding cores - and the concept of multi-core arose.

Literally 6-7 years ago, multi-core processors were practically unheard of. Multi-core processors from IBM, for instance, existed earlier, but the first dual-core processor for desktop computers appeared only in 2005, and it was called the Pentium D. Also in 2005, AMD released the dual-core Opteron, but for server systems.

In this article we will not dwell on the historical facts, but will discuss modern multi-core processors as one of the characteristics of a CPU. And most importantly, we need to figure out what multi-core gives in terms of performance for the processor and for you and me.

Increased performance with multi-core

The principle of increasing processor performance with several cores is to split the execution of threads (various tasks) across multiple cores. In fact, almost every process running on your system has multiple threads.

Let me note right away that the operating system can create many threads virtually and execute them seemingly all at once, even if the processor is physically single-core. This is what implements Windows multitasking (for example, listening to music while typing).


Let's take an antivirus program as an example. One thread will scan the computer, the other will update the virus database (we have simplified everything in order to convey the general idea).

And consider what will happen in two different cases:

a) Single-core processor. Since two threads run at the same time, we must create for the user the (visual) illusion of simultaneous execution. The operating system does this cleverly: it switches between the execution of the two threads (the switches are nearly instantaneous, and the time slices are measured in milliseconds). That is, the system "performs" a bit of the update, then abruptly switches to scanning, then back to updating. For you and me, it looks as if the two tasks are running simultaneously. But what is lost? Performance, of course. So let's look at the second option.

b) Multi-core processor. In this case, the switching does not occur. The system sends each thread to a separate core, which lets us get rid of the performance-killing switching from thread to thread (let's idealize the situation). The two threads run truly simultaneously; this is the principle of multi-core and multi-threading. Ultimately, we will scan and update much faster on a multi-core processor than on a single-core one. But there is a catch: not all programs support multi-core, and not every program can be optimized this way. In reality, things are far from as perfect as described. But every day, developers create more and more programs whose code is well optimized for execution on multi-core processors.
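A minimal C++ sketch of this two-thread scenario (the function names and workloads are invented stand-ins for real scanning and updating): on a multi-core machine the OS can place each thread on its own core, while on a single core it would time-slice between them.

    #include <iostream>
    #include <thread>

    // Invented stand-ins for the two antivirus tasks from the text;
    // both are compute-bound so each one really occupies a core.
    volatile unsigned long scan_work = 0;    // volatile: keep the loops from
    volatile unsigned long update_work = 0;  // being optimized away

    void scan_files()      { for (unsigned long i = 0; i < 200'000'000; ++i) scan_work += i; }
    void update_database() { for (unsigned long i = 0; i < 200'000'000; ++i) update_work ^= i; }

    int main() {
        std::thread scanner(scan_files);      // thread 1: "scanning"
        std::thread updater(update_database); // thread 2: "updating"
        scanner.join();
        updater.join();
        std::cout << "scan and update finished\n";
    }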

Are multi-core processors necessary? Everyday practicality

When choosing a processor for a computer (namely, when thinking about the number of cores), you should determine the main types of tasks it will perform.

To improve your knowledge of computer hardware, you can read the material about processor sockets.

Dual-core processors can be called the starting point, since there is no sense in returning to single-core solutions. But dual-core processors differ. It may be not the "freshest" Celeron, or it may be a Core i3 on Ivy Bridge; likewise with AMD - a Sempron or a Phenom II. Naturally, their performance will differ greatly because of their other parameters, so you need to look at everything comprehensively and weigh core count against the other processor characteristics.

For example, the Core i3 on Ivy Bridge has Hyper-Threading technology, which lets it process 4 threads simultaneously (the operating system sees 4 logical cores instead of 2 physical ones). The same Celeron cannot boast of that.
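You can see how many logical cores the OS exposes directly from standard C++; this is a minimal query, and the number printed depends on the machine (it may also be 0 if the value is unknown):

    #include <iostream>
    #include <thread>

    int main() {
        // On a dual-core CPU with Hyper-Threading this typically prints 4;
        // on a comparable chip without HT it prints 2.
        std::cout << std::thread::hardware_concurrency()
                  << " logical cores visible to the OS\n";
    }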

But let's return to thinking about the tasks at hand. If a computer is needed for office work and Internet surfing, a dual-core processor is plenty.

When it comes to gaming performance, you need 4 cores or more to be comfortable in most games. But here the catch pops up: not all games are optimized for 4-core processors, and those that are, are not optimized as efficiently as we would like. Still, in principle, a 4-core processor is now the optimal solution for games.


Today, 8-core AMD processors are redundant for games: it is the number of cores that is excessive, while per-core performance is not up to par. But they have other advantages. Those same 8 cores help a lot in tasks with a heavy, high-quality multi-threaded load, such as video rendering (encoding) or server computing. Such tasks call for 6, 8, or more cores. And soon games will be able to properly load 8 or more cores, so in the future everything looks quite rosy.

Do not forget that there are still many tasks that create a single-threaded load. So ask yourself the question: do I need this 8-core processor or not?

Summing up a little, I would like to note once again that the advantages of multi-core appear during "heavy" multi-threaded computational work. If you do not play games with exorbitant requirements and do not do specific kinds of work that require serious computing power, then spending money on expensive multi-core processors simply makes no sense.

Having dealt with the theory of multithreading, let's consider a practical example: the Pentium 4. Already at the development stage of this processor, Intel engineers continued working on increasing its performance without changing the software interface. Five simple methods were considered:

  • increasing the clock frequency;
  • placing two processors on one chip;
  • introducing new functional blocks;
  • extending the pipeline;
  • using multithreading.

The most obvious way to improve performance is to increase the clock speed without changing anything else. As a rule, each subsequent processor model has a slightly higher clock speed than the previous one. Unfortunately, with a head-on increase in clock speed developers face two problems: increased power consumption (relevant for laptops and other battery-powered computing devices) and overheating (which requires more efficient heat sinks).

The second method - placing two processors on one chip - is relatively simple, but it roughly doubles the area the chip occupies. If each processor has its own cache, the number of chips per wafer is halved, which doubles the production cost. If both processors share one cache, the large increase in area can be avoided, but another problem arises: the cache per processor is halved, and this inevitably affects performance. Furthermore, while professional server applications can take full advantage of multiple processors, internal parallelism in conventional desktop programs is far less developed.

Introducing new functional blocks is also not difficult, but it is important to strike a balance. What is the point of a dozen ALUs if the chip cannot issue commands to the pipeline fast enough to keep all those blocks loaded?

A pipeline with more stages, capable of dividing tasks into smaller segments and processing them in shorter time slices, on the one hand increases performance, and on the other amplifies the negative consequences of branch misprediction, cache misses, interrupts, and other events that disrupt the normal flow of instructions through the processor. Moreover, to fully realize the capabilities of an extended pipeline the clock frequency must be raised, and this, as we know, leads to increased power consumption and heat dissipation.

Finally, there is multithreading. The advantage of this technology is that introducing an additional program thread brings into use hardware resources that would otherwise sit idle. Experimental studies by Intel developers showed that a 5% increase in chip area for a multithreading implementation gives a performance increase of 25% in many applications. The Xeon of 2002 was the first Intel processor to support multithreading; later, starting at 3.06 GHz, multithreading was introduced into the Pentium 4 line. Intel calls its implementation of multithreading in the Pentium 4 hyper-threading.

* The perennial question: what should you pay attention to when choosing a processor, so as not to be mistaken?

Our goal in this article is to describe all the factors that affect processor performance, as well as its other operating characteristics.

It is probably no secret to anyone that the processor is the main computing unit of a computer - you could even say its most important part.

It handles almost all the processes and tasks that occur in the computer: watching videos, listening to music, Internet surfing, writing to and reading from memory, processing 3D and video, games, and much more.

Therefore, choosing a central processor (CPU) should be treated very carefully. You may decide to install a powerful video card with a processor that does not match its level. In that case, the processor will not reveal the video card's potential, which will slow it down. The processor will be fully loaded, literally boiling, while the video card waits its turn, working at 60-70% of its capability.

That is why, when building a balanced computer, one should not neglect the processor in favor of a powerful video card. The processor must be powerful enough to unlock the video card's potential, otherwise it is simply money thrown away.

Intel vs. AMD

* an eternal race

Intel has huge human resources and almost inexhaustible finances. Many innovations in the semiconductor industry and new technologies come from this company. Intel's processors and developments are, on average, 1-1.5 years ahead of those of AMD's engineers. But as you know, you have to pay for the opportunity to own the most modern technologies.

Intel's processor pricing is based on the number of cores and the amount of cache, but also on the "freshness" of the architecture, performance per clock and per watt, and the chip's process technology. The meaning of the cache size, the "subtleties of the process technology," and other important processor characteristics are considered below. For technologies such as an unlocked frequency multiplier, you will also have to pay extra.

AMD, unlike Intel, strives for the affordability of its processors for the end consumer and for a sensible pricing policy.

One might even say that AMD is a "people's brand": in its price lists you will find what you need at a very attractive price. Usually, about a year after a new technology appears from Intel, an analogous technology appears from AMD. If you are not chasing maximum performance and pay more attention to the price tag than to the presence of advanced technologies, then AMD's products are just for you.

AMD's pricing policy is based more on the number of cores and very little on the amount of cache or the presence of architectural improvements. In some cases you have to pay a little extra for a third-level cache (the Phenom has an L3 cache; the Athlon makes do with only L2). But sometimes AMD spoils its fans with the ability to unlock cheaper processors into more expensive ones - unlocking cores or cache, upgrading an Athlon into a Phenom. This is possible thanks to the modular architecture: in some cheaper models AMD simply disables (in software) some of the more expensive on-chip blocks.

The cores themselves remain practically unchanged; only their number differs (true for processors of 2006-2011). Thanks to the modularity of its processors, the company does an excellent job of selling rejected chips which, with some blocks disabled, become processors of a less productive line.

The company worked for many years on a completely new architecture codenamed Bulldozer, but at its release in 2011 the new processors did not show their best performance. AMD blamed operating systems, saying they did not understand the architectural features of its paired cores and "other multithreading."

According to company representatives, one should wait for special fixes and patches to feel the full performance of these processors. However, at the beginning of 2012, company representatives postponed the release of the update supporting the Bulldozer architecture to the second half of the year.

Processor frequency, number of cores, multithreading.

In the days of the Pentium 4 and earlier, CPU frequency was the main performance factor when choosing a processor.

This is not surprising, because processor architectures were specially designed to reach high frequencies; this was especially evident in the Pentium 4's NetBurst architecture. But high frequency was not effective with the long pipeline that the architecture used. Even an Athlon XP at 2 GHz outperformed a Pentium 4 at 2.4 GHz. So it was pure marketing. After this error, Intel recognized its mistakes and returned to the good side: it began working not on the frequency component but on performance per clock. The NetBurst architecture had to be abandoned.

What does multi-core give us?

A quad-core 2.4 GHz processor in multi-threaded applications would theoretically be roughly equivalent to a single-core 9.6 GHz processor or a dual-core 4.8 GHz processor. But that is only in theory. In practice, on the contrary, two dual-core processors on a two-socket motherboard will be faster than one quad-core processor at the same operating frequency: bus speed limits and memory latencies make themselves felt.

* subject to the same architectures and the amount of cache memory

Multi-core makes it possible to execute instructions and calculations in parts. For example, suppose you need to perform three arithmetic operations. The first two are executed on separate processor cores and the results are written to cache, where the next operation can be performed on them by any free core. The system is very flexible, but without proper optimization it may not work. That is why optimization for multi-core, in the processor architecture and in the OS environment, is so important.
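Here is a toy C++ sketch of that idea using std::async; the three operations are arbitrary examples. The first two run concurrently on whichever cores are free, and the third combines their results once both are ready:

    #include <future>
    #include <iostream>

    int main() {
        // Operations 1 and 2 run in parallel on free cores...
        auto a = std::async(std::launch::async, [] { return 2 * 3; });
        auto b = std::async(std::launch::async, [] { return 4 + 5; });
        // ...and operation 3 combines their results.
        int result = a.get() + b.get();
        std::cout << result << '\n';  // prints 15
    }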

Applications that "love" and use multithreading: archivers, video players and encoders, antiviruses, defragmenters, graphics editors, browsers, Flash.

The "fans" of multithreading also include operating systems such as Windows 7 and Windows Vista, as well as many operating systems based on the Linux kernel, which run noticeably faster on a multi-core processor.

For most games, a dual-core processor at a high frequency is often quite enough. Now, however, more and more games are coming out "sharpened" for multithreading. Take sandbox games like GTA 4 or Prototype: on a dual-core processor clocked below 2.6 GHz you do not feel comfortable - the frame rate drops below 30 frames per second. Although in this case the most likely cause is the "weak" optimization of the games, lack of time, or the "indirect" hands of those who ported the games from consoles to PC.

When buying a new processor for games, you should now pay attention to processors with 4 or more cores. But still, do not neglect dual-core processors from the "upper category": in some games they sometimes do better than some multi-core ones.

Processor cache.

Cache memory is a dedicated area of the processor chip in which intermediate data between the processor cores, RAM, and other buses is processed and stored.

It runs at a very high clock speed (usually the frequency of the processor itself), has very high bandwidth, and the processor cores work with it directly (starting from L1).

When the cache is too small, the processor can sit idle in time-consuming tasks, waiting for new data to arrive in the cache. Cache memory also stores frequently repeated data that can be quickly retrieved when needed, without extra calculations and without forcing the processor to spend time on them again.

Performance is also helped when the cache memory is shared, so all cores can use its data equally. This opens additional opportunities for multi-threaded optimization.

This technique is now used for the level-3 cache. Intel used to have processors with a shared level-2 cache (C2D E7***, E8***), which is how this way of raising multithreaded performance appeared.

When overclocking, the cache memory can become a weak point, preventing the processor from being overclocked beyond its maximum error-free operating frequency. On the other hand, its advantage is that it runs at the same frequency as the overclocked processor.

In general, the larger the cache, the faster the CPU. In which applications, exactly?

Cache memory is actively used in all applications that work with a lot of floating-point data, instructions, and threads. Cache is very popular with archivers, video encoders, antiviruses, graphics editors, etc.

Games are also favorable to large caches. Especially strategies, auto sims, RPGs, sandboxes, and all games with many small details, particles, geometry elements, information flows, and physical effects.

Cache memory plays a very important role in unlocking the potential of systems with two or more video cards. After all, some share of the load falls on the interaction of the processor cores with each other and on feeding the streams of several video chips. In this case the organization of the cache matters, and a large level-3 cache is very useful.

Cache memory is always equipped with protection against possible errors (ECC); when errors are detected, they are corrected. This is very important, because a small error in the cache during processing can turn into a giant, continuous error that brings down the whole system.

Proprietary technologies.

Hyper-Threading (HT)

The technology was first applied in Pentium 4 processors, but it did not always work correctly and often slowed the processor down more than it sped it up. The reason was a too-long pipeline and an immature branch prediction system. The technology is used by Intel and has no direct analogues yet, unless you count what AMD's engineers implemented in the Bulldozer architecture.

The principle of the system is that two computation threads are allocated to each physical core instead of one. That is, if you have a 4-core processor with HT (Core i7), you have 8 virtual threads.

The performance gain is achieved because data can enter the pipeline in its middle, not only at its beginning. If some processor units capable of performing the required action are idle, they receive a task for execution. The gain is not the same as from real physical cores, but comparable (~50-75%, depending on the type of application). In rare cases HT affects performance negatively in some applications. This is caused by poor optimization of the application for the technology: it does not understand that some threads are "virtual" and lacks limiters to load the threads evenly.

TurboBoost is a very useful technology that raises the frequency of the most heavily used processor cores depending on their load. It is very useful when an application cannot use all 4 cores and loads only one or two: their frequency then increases, partially compensating for performance. AMD's analogue of this technology is Turbo Core.

SSE, 3DNow! instructions. These instruction sets are designed to speed up the processor in multimedia calculations (video, music, 2D/3D graphics, etc.), as well as to speed up programs such as archivers and image and video editors (provided those programs support the instructions).

3DNow! is a fairly old AMD technology that contains additional instructions for processing multimedia content, beyond the first version of SSE.

* Namely, stream processing of single-precision real numbers.

Having the newest version is a big plus: with proper software optimization, the processor performs certain tasks more efficiently. AMD processors' instruction sets bear similar names, but differ slightly.

* Example - SSE 4.1 (Intel) - SSE 4A (AMD).

Moreover, these instruction sets are not identical; they are analogues with slight differences.

Cool'n'Quiet, SpeedStep, CoolCore, Enhanced Halt State (C1E), etc.

These technologies reduce the processor frequency at low load by lowering the multiplier and core voltage, disabling part of the cache, and so on. This lets the processor heat up much less, consume less energy, and make less noise. If power is needed, the processor returns to its normal state in a split second. In standard BIOS settings they are almost always enabled; if desired, they can be turned off to reduce possible "freezes" when switching states in 3D games.

Some of these technologies control fan speeds in the system. For example, if the processor does not need enhanced heat removal and is not loaded, the processor fan speed is reduced (AMD Cool'n'Quiet, Intel SpeedStep).

Intel Virtualization Technology and AMD Virtualization.

These hardware technologies allow several operating systems to run at once, with the help of special programs, without significant loss of performance. They are also used for the correct operation of servers, which often have more than one OS installed.

Execute Disable Bit and No eXecute Bit are technologies designed to protect a computer from virus attacks and software bugs that can crash the system through a buffer overflow.

Intel 64, AMD64, EM64T - this technology allows the processor to work both in an OS with a 32-bit architecture and in an OS with a 64-bit one. For the average user, a 64-bit system differs in that it can use more than 3.25 GB of RAM. On 32-bit systems, using more RAM is not possible because of the limited amount of addressable memory*.

Most applications with a 32-bit architecture can be run on a system with a 64-bit OS.

* What can you do - back in 1985 no one could even imagine such gigantic, by the standards of that time, amounts of RAM.

Additionally.

A few words about the process technology.

This point deserves close attention. The thinner the process technology, the less energy the processor consumes and, as a result, the less it heats up. Among other things, it also has a higher safety margin for overclocking.

The thinner the process technology, the more you can "pack" into the chip (and not only that), increasing the processor's capabilities. Heat dissipation and power consumption also decrease proportionally, thanks to lower current leakage and a smaller core area. One might expect that with each new generation of the same architecture on a new process technology power consumption would fall, but it actually grows: manufacturers are pushing toward even greater performance and stepping over the heat-dissipation line of the previous generation, because the growth in transistor count outpaces the shrink of the process technology.

The video core built into the processor.

If you do not need an integrated video core, you should not buy a processor with one. You will only get worse heat removal, extra heat (not always), worse overclocking potential (not always), and overpaid money.

Moreover, the cores built into the processor are suitable only for loading the OS, surfing the Internet, and watching video (and even then, not at every quality).

Market trends are still changing, and the opportunity to buy a powerful Intel processor without a video core comes up less and less often. The policy of forced bundling of the built-in video core appeared with the Intel processors codenamed Sandy Bridge, whose main innovation was a built-in core on the same manufacturing process. The video core sits on the same die as the processor, not separately as in previous generations of Intel processors. For those who do not use it, there are disadvantages: some overpayment for the processor and the displacement of the heat source relative to the center of the heat-spreader cover. However, there are also pluses. An otherwise unused video core can be employed for very fast video encoding using Quick Sync technology, coupled with special software that supports it. In the future, Intel promises to expand the horizons of using the built-in video core for parallel computing.

Sockets for processors. Platform lifespans.


Intel pursues a harsh policy with its platforms. The lifespan of each (the start and end dates of processor sales for it) usually does not exceed 1.5-2 years. In addition, the company has several platforms developing in parallel.

AMD has the opposite, compatibility-oriented policy. All future-generation processors supporting DDR3 fit its AM3 platform. Even when the platform moves to AM3+ and beyond, either new processors for AM3 will be released, or the new processors will be compatible with old motherboards, making a wallet-friendly upgrade possible by changing only the processor (without changing the motherboard, RAM, etc.) and reflashing the motherboard. The only nuances of incompatibility may arise when the memory type changes, since a different memory controller built into the processor is then required; compatibility is therefore limited and not supported by all motherboards. But overall, for a thrifty user, or one not used to replacing the platform completely every two years, the choice of processor manufacturer is clear: AMD.

CPU cooling.

As standard, a BOXed processor comes with a cooler that will simply do the job. It is a piece of aluminum with a not very large dissipation area. Efficient coolers based on heat pipes with plates attached to them are designed for highly efficient heat removal. If you do not want to hear excess fan noise, consider buying an alternative, more efficient heat-pipe cooler, or a closed- or open-loop liquid cooling system. Such cooling systems additionally make overclocking possible.

Conclusion.

We have covered all the important aspects that affect processor performance. To recap, here is what to look out for:

  • Manufacturer
  • Processor architecture
  • Process technology
  • Clock frequency
  • Number of cores
  • Cache size and type
  • Support for technologies and instructions
  • Cooling quality

We hope this material helps you understand and choose the processor that meets your expectations.


Implementing a multi-threaded game engine architecture


With the advent of multi-core processors, the need arose to create game engines based on a parallel architecture. Using all the processors in the system - both the graphics processor (GPU) and the central processor (CPU) - opens far more possibilities than a single-threaded, GPU-only engine. For example, extra CPU cores can be used to improve the visuals by increasing the number of physics objects in the game, as well as to achieve more realistic character behavior through advanced artificial intelligence (AI).
Let us consider the features of implementing a multi-threaded game engine architecture.

1. Introduction

1.1. Review

The multi-threaded architecture of a game engine lets you use the capabilities of all the platform's processors to the maximum. It involves the parallel execution of different functional blocks on all available processors. Implementing such a scheme, however, is not easy. The individual elements of a game engine often interact with each other, and simultaneous execution can lead to errors. To handle such scenarios, the engine provides special data-synchronization mechanisms that avoid possible locks. It also implements parallel data-synchronization methods that keep execution time to a minimum.

To follow the material presented, you need a good understanding of modern methods of creating computer games and of multithreading support for game engines, or of improving application performance in general.

2. State of parallel execution

Parallel execution state is a key concept of multithreading. Only by dividing the game engine into separate systems, each operating in its own mode and practically not interacting with the rest of the engine, can you achieve the greatest efficiency of parallel computing and reduce the time spent on synchronization. It is not possible to completely isolate the individual parts of the engine and exclude all shared resources. However, for operations such as getting the position or orientation of objects, individual systems can use local copies of the data rather than shared resources, which minimizes data dependencies between different parts of the engine. Notifications of changes to shared data made by an individual system are passed to a state manager, which queues them. This is called the messaging mode: when tasks are finished, the engine systems are notified of the changes and update their internal data accordingly. This mechanism significantly reduces synchronization time and the systems' dependence on each other.

2.1 Run states

For the execution state manager to work efficiently, it is recommended to synchronize operations to a specific clock tick, which allows all systems to work simultaneously. The clock rate does not have to correspond to the frame rate, and the duration of a clock cycle need not be fixed: it can be chosen so that one cycle corresponds to the time needed to transmit one frame (whatever its size). In other words, the frequency or the duration of the cycles is determined by the specific implementation of the state manager. Figure 1 shows the "free" stepping mode of operation, which does not require all systems to complete an operation on the same clock tick. The mode in which all systems complete their operations within one clock cycle is called "hard" stepping mode; it is shown schematically in Figure 2.


Figure 1. Execution status in free stepping mode

2.1.1. Free stepping mode
In free stepping mode, all systems operate continuously for a predetermined period of time required to complete the next portion of calculations. However, the name "free" should not be taken literally: the systems are not synchronized at arbitrary moments in time; they are only "free" to choose the number of clock cycles they need for the next stage.
Typically, in this mode a simple state-change notification to the state manager is not enough: the updated data must be transmitted along with it. This is because a system that has changed shared data may still be running when another system that wants that data is ready to update. More memory is then required, since more copies of the data must be kept, so the "free" mode cannot be considered a universal solution for all cases.
2.1.2. Hard stepping mode
In this mode, the tasks of all systems complete within a single clock cycle. This mechanism is easier to implement and does not require transmitting updated data along with the notification: if necessary, one system can simply request new values from another (at the end of the execution cycle, of course).
In hard mode you can also implement a pseudo-free stepping mode by distributing calculations across different steps. In particular, this can be needed for AI calculations, where an initial "common goal" is computed in the first cycle and then gradually refined at the following stages. (A minimal sketch of hard stepping follows Figure 2.)


Figure 2. Execution status in hard stepping mode
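As an illustration, here is a minimal C++20 sketch of hard stepping (the system count, tick count, and all names are invented): a barrier keeps every "system" thread from starting tick N+1 until all of them have finished tick N.

    #include <barrier>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const int systems = 3;   // e.g. physics, AI, graphics
        const int ticks   = 4;
        // In hard stepping mode no system may start tick N+1 until every
        // system has finished tick N; a barrier models this clock.
        std::barrier sync(systems);

        std::vector<std::jthread> pool;
        for (int s = 0; s < systems; ++s)
            pool.emplace_back([&, s] {
                for (int t = 0; t < ticks; ++t) {
                    std::printf("system %d finished tick %d\n", s, t);
                    sync.arrive_and_wait();   // wait for the other systems
                }
            });
        // std::jthread joins automatically when the vector is destroyed.
    }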

2.2. Data synchronization

Modification of shared data by several systems can lead to conflicting changes. In that case, the messaging system must provide an algorithm for choosing the correct final value. There are two main approaches, based on the following criteria.
  • Time: The final value is the last change made.
  • Priority: The final value is the change made by the system with the highest priority. If the priority of the systems is the same, you can also take into account the timing of the changes.
All obsolete data (by either criterion) can simply be overwritten or dropped from the notification queue.
Since the final value can depend on the order in which changes are applied, it can be very difficult to use relative values for shared data. In such cases, absolute values should be used: when updating local data, systems then simply replace old values with new ones. The optimal solution is to choose absolute or relative values depending on the specific situation. For example, shared data such as position and orientation should use absolute values, because the order of changes matters for them. Relative values can be used, for example, for a particle system, since all information about the particles is stored only within it.
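A small C++ sketch of conflict resolution by the "time" criterion, assuming absolute values (the structure and field names are illustrative): only the latest change to an object survives, so the arrival order of older changes no longer matters.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    struct PositionChange {
        int   object_id;
        float x, y, z;     // absolute coordinates, not deltas
        long  timestamp;   // used to pick the winner of a conflict
    };

    // Pick the final value by the "time" criterion; assumes a non-empty queue.
    PositionChange resolve(const std::vector<PositionChange>& queue) {
        return *std::max_element(queue.begin(), queue.end(),
            [](const PositionChange& a, const PositionChange& b) {
                return a.timestamp < b.timestamp;
            });
    }

    int main() {
        std::vector<PositionChange> queue = {
            {7, 1.f, 0.f, 0.f, 100},   // older change, will be discarded
            {7, 5.f, 2.f, 0.f, 140},   // latest change wins
        };
        PositionChange winner = resolve(queue);
        std::cout << "object 7 ends at x=" << winner.x << '\n';   // x=5
    }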

3. Engine

When developing the engine, the focus is on the flexibility needed for further extending its functionality. This lets it be optimized for use under particular constraints (for example, memory).
The engine can be conditionally divided into two parts: the framework and the managers. The framework (see section 3.1) includes the parts of the game that are replicated at runtime, that is, exist in multiple instances, as well as the elements involved in executing the main game loop. The managers (see section 3.2) are singleton objects responsible for executing the logical part of the game.
Below is a diagram of the game engine.


Figure 3. General architecture of the engine

Note that the functional game modules, or systems, are not part of the engine. The engine only unites them, acting as a connecting element. This modular organization makes it possible to load and unload systems as needed.

The engine and the systems interact through interfaces, which are implemented so as to give the engine access to the systems' functions and the systems access to the engine's managers.
A detailed diagram of the engine is provided in Appendix A, “Engine Diagram”.

In fact, all systems are independent of each other (see section 2, "State of parallel execution"), which means they can execute actions in parallel without affecting the operation of the other systems. However, any change to data entails certain difficulties, since the systems then have to interact. Information exchange between systems is needed in the following cases:

  • to inform another system about a change in shared data (for example, the position or orientation of objects);
  • to perform functions not available within the given system (for example, the AI system calls on the system that computes the geometric or physical properties of an object to perform a ray-intersection test).
In the first case, the state manager described in the previous section can be used to manage the exchange of information. (See Section 3.2.2, “State Manager” for more information about the state manager.)
In the second case, it is necessary to implement a special mechanism that will allow you to provide services from one system for use by another. A full description of this mechanism is provided in Section 3.2.3, Service Manager.

3.1. Framework

The framework serves to combine all the elements of the engine. The engine is initialized within it, with the exception of the managers, which are instantiated globally. The scene information is also stored in the framework. For greater flexibility, the scene is implemented as a so-called universal scene, which contains universal objects - containers that unite the various functional parts of the scene. See section 3.1.2 for details.
The main game loop is also implemented in the framework. Schematically, it can be represented as follows.


Figure 4. Main game loop

The engine runs in a windowed environment, so the first step of the game loop is to process all pending OS window messages; otherwise the engine will not respond to OS messages. In the second step, the scheduler assigns tasks using the task manager; this process is detailed in section 3.1.1 below. After that, the state manager (see section 3.2.2) sends information about the changes made to the engine systems they may affect. In the last step, depending on the execution status, the framework determines whether to terminate the engine or continue, for example by moving to the next scene. Information about the state of the engine is stored in the environment manager; see section 3.2.4 for details.
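A compilable C++ sketch of those four steps (every name here is an invented stand-in, not the engine's actual API):

    #include <iostream>

    // Minimal stand-ins for the engine pieces named in the text.
    struct Scheduler          { void issueTasks()        { /* hand tasks to the task manager */ } };
    struct StateManager       { void distributeChanges() { /* notify affected systems */ } };
    struct EnvironmentManager {
        int framesLeft = 3;                       // toy stop condition
        bool shouldContinue() { return framesLeft-- > 0; }
    };

    void processWindowMessages() { /* pump pending OS window messages */ }

    int main() {
        Scheduler scheduler;
        StateManager stateManager;
        EnvironmentManager environment;

        // The four steps of the main game loop (Figure 4).
        while (environment.shouldContinue()) {    // 4. continue or terminate
            processWindowMessages();              // 1. handle OS messages
            scheduler.issueTasks();               // 2. scheduler assigns tasks
            stateManager.distributeChanges();     // 3. state manager sends change notices
        }
        std::cout << "engine shut down\n";
    }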

3.1.1. Scheduler
The scheduler generates an execution reference clock at a specified frequency. If the execution mode requires the next operation to start immediately after the previous one completes, without waiting for the end of a clock cycle, the frequency can be unbounded.
On a clock signal, the scheduler, with the help of the task manager, puts the systems into execution mode. In free stepping mode (section 2.1.1), the scheduler polls the systems to determine how many clock cycles they will need to complete their task. Based on the poll, it determines which systems are ready to run and which will finish their work in a given cycle. The scheduler can change the number of cycles if some system needs more time. In hard stepping mode (section 2.1.2), all systems start and finish execution on the same clock tick, so the scheduler waits for all of them to finish.
3.1.2. Universal Scene and Objects
The universal scene and objects are containers for functionality implemented in other systems. They serve solely for interaction with the engine and perform no other functions. However, they can be extended to use the functionality available in other systems. This allows loose coupling: the universal scene and objects can use the properties of other systems without being tied to them. It is this property that removes the systems' dependence on each other and lets them work simultaneously.
The diagram below shows the expansion of the universal scene and object.


Figure 5. Expanding the universal scene and object

Consider how the extensions work using the following example. Suppose a universal scene is extended to use graphical, physical, and other properties. The "graphical" part of the extension would then be responsible for initializing the display, and its "physical" part for implementing the physical laws for rigid bodies, such as gravity. Scenes contain objects, so a universal scene will also include multiple universal objects. Universal objects can likewise be extended to use graphical, physical, and other properties: drawing an object on the screen is implemented by the graphics extension's functions, and calculating the interaction of rigid bodies by the physics extension's.
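A C++ sketch of this container idea (the interface and class names are illustrative): the universal object simply forwards to whatever system extensions have been attached to it.

    #include <memory>
    #include <vector>

    // Implemented by each system that extends a universal object.
    struct ISystemObject {
        virtual ~ISystemObject() = default;
        virtual void update(float dt) = 0;
    };

    struct GraphicsObject : ISystemObject { void update(float) override { /* draw */ } };
    struct PhysicsObject  : ISystemObject { void update(float) override { /* integrate forces */ } };

    // The universal object is only a container of extensions.
    struct UniversalObject {
        std::vector<std::unique_ptr<ISystemObject>> extensions;
        void update(float dt) { for (auto& e : extensions) e->update(dt); }
    };

    int main() {
        UniversalObject ball;
        ball.extensions.push_back(std::make_unique<GraphicsObject>());
        ball.extensions.push_back(std::make_unique<PhysicsObject>());
        ball.update(1.0f / 60.0f);   // each extension updates independently
    }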

A detailed diagram of the interaction of the engine and systems is given in Appendix B, "The diagram of the interaction of the engine and systems."
Note that the universal scene and the universal object are responsible for registering all of their "extensions" with the state manager, so that every extension can be notified of changes made by other extensions (that is, by other systems). An example is a graphics extension registered to receive notifications of the position and orientation changes made by the physics extension.
For detailed information about system components, see Section 5.2, System Components.

3.2. Managers

Managers control the operation of the engine. They are singleton objects, meaning only one instance of each manager type exists; duplicating manager resources would inevitably lead to redundancy and hurt performance. Managers are also responsible for implementing the functions common to all systems.
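A minimal C++ sketch of the singleton discipline the managers follow (a Meyers singleton; the class name is just an example):

    // One shared instance per manager type, created on first use.
    class TaskManager {
    public:
        static TaskManager& instance() {
            static TaskManager single;   // constructed once; thread-safe since C++11
            return single;
        }
        TaskManager(const TaskManager&) = delete;            // no copies,
        TaskManager& operator=(const TaskManager&) = delete; // no duplication
    private:
        TaskManager() = default;
    };

    int main() {
        TaskManager& tm = TaskManager::instance();  // the same object everywhere
        (void)tm;
    }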
3.2.1. Task Manager
The task manager is responsible for managing system tasks in the thread pool. To ensure optimal n-fold scaling and to prevent the assignment of redundant threads (avoiding unnecessary task-switching overhead in the operating system), the thread pool creates one thread per processor.

The scheduler gives the task manager the list of tasks to execute, as well as information about which tasks to wait for before proceeding. It receives this information from the various systems. Each system gets only one task to execute; this approach is called functional decomposition. However, for data processing, each such task can be divided into an arbitrary number of subtasks (data decomposition); a sketch of data decomposition follows Figure 6.
Below is an example of the distribution of tasks between threads for a quad-core system.


Figure 6. Example of a thread pool used by the task manager
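A C++ sketch of data decomposition (the "physics" data and the task itself are invented): one system task - summing a large array - is split into one subtask per hardware thread, and the partial results are then combined.

    #include <algorithm>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<float> masses(1'000'000, 1.0f);          // toy "physics" data
        unsigned parts = std::max(1u, std::thread::hardware_concurrency());

        // Split the single task into one subtask per hardware thread.
        std::vector<std::future<double>> subtasks;
        size_t chunk = masses.size() / parts;
        for (unsigned p = 0; p < parts; ++p) {
            auto first = masses.begin() + p * chunk;
            auto last  = (p + 1 == parts) ? masses.end() : first + chunk;
            subtasks.push_back(std::async(std::launch::async,
                [first, last] { return std::accumulate(first, last, 0.0); }));
        }

        // Combine the partial results.
        double total = 0.0;
        for (auto& f : subtasks) total += f.get();
        std::cout << "total mass: " << total << '\n';
    }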

Besides processing scheduler requests for access to the main tasks, the task manager can work in initialization mode: it sequentially polls the systems from each thread so that they can initialize the local data stores they need for work.
Tips for implementing a task manager are given in Appendix D, Tips for Implementing Tasks.

3.2.2. State Manager
The state manager is part of the messaging mechanism. It tracks changes and sends notifications about them to all systems those changes may affect. To avoid sending unnecessary notifications, the state manager stores information about which systems to notify in each case. The mechanism is implemented using the Observer pattern (see Appendix C, Observer (Design Pattern)). In short, the pattern uses an "observer" that watches a subject for any changes, with a change controller acting as an intermediary between them.

The mechanism works as follows:
1. The observer tells the change controller (or state manager) which subjects it wants to watch for changes.
2. The subject notifies the controller of all its changes.
3. At the framework's signal, the controller notifies the observer about the subject's changes.
4. The observer sends the subject a request for the updated data.
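A compact C++ sketch of these four steps (all names are illustrative): the controller queues change notifications and distributes them only at the framework's signal, after which observers receive the subject's current values.

    #include <functional>
    #include <iostream>
    #include <vector>

    // The change controller (state manager) mediates between subject and observer.
    struct ChangeController {
        std::vector<std::function<void()>> pending;          // queued notifications
        void recordChange(std::function<void()> notify)      // step 2
            { pending.push_back(std::move(notify)); }
        void distribute()                                    // step 3, on framework signal
            { for (auto& n : pending) n(); pending.clear(); }
    };

    struct PhysicsSubject {
        float x = 0;
        ChangeController* ctrl = nullptr;
        std::vector<std::function<void(float)>> observers;   // step 1: registration
        void move(float nx) {
            x = nx;
            ctrl->recordChange([this] {                      // defer the notification
                for (auto& o : observers) o(x);              // step 4: deliver current value
            });
        }
    };

    int main() {
        ChangeController ctrl;
        PhysicsSubject body;
        body.ctrl = &ctrl;
        body.observers.push_back([](float x) { std::cout << "graphics sees x=" << x << '\n'; });
        body.move(5.0f);     // the subject reports its change
        ctrl.distribute();   // framework clock: controller notifies observers
    }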

In free step mode (see Section 2.1.1), the implementation of this mechanism is somewhat more complicated. First, the updated data has to be sent along with the change notification: polling is not applicable in this mode, because if the system responsible for the changes has not yet finished executing when the request arrives, it cannot provide the updated data. Second, if a system is not yet ready to receive changes at the end of the tick, the state manager has to hold the changed data until all systems registered to receive it are ready.

The framework provides two state managers for this purpose: one for handling changes at the scene level and one at the object level. Scene and object messages are usually independent of each other, so using two separate managers avoids processing unnecessary data. But if the scene needs to take the state of an object into account, it can register to receive notifications of that object's changes.

To avoid unnecessary synchronization, the state manager keeps a separate queue of change notifications for each thread created by the task manager, so no synchronization is required when accessing a queue. Section 2.2 describes a method that can be used to merge the queues after execution.
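A sketch of such per-thread queues (illustrative names): each task writes only to the queue of the thread it runs on, so pushing needs no lock, and the queues are merged from a single thread after all tasks have finished.

    #include <cstddef>
    #include <vector>

    struct Notification { /* subject, change type, new data... */ };

    struct NotificationQueues {
        explicit NotificationQueues(std::size_t threadCount)
            : perThread(threadCount) {}

        // Called from worker threads; each thread touches only its own queue.
        void Push(std::size_t threadIndex, const Notification& n) {
            perThread[threadIndex].push_back(n);
        }

        // Called from one thread after all tasks have completed.
        std::vector<Notification> Merge() {
            std::vector<Notification> all;
            for (auto& q : perThread) {
                all.insert(all.end(), q.begin(), q.end());
                q.clear();
            }
            return all;
        }

        std::vector<std::vector<Notification>> perThread;
    };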


Figure 7. Notification of internal changes to a universal object

Change notifications do not have to be sent sequentially; they can be distributed in parallel. When performing its task, a system works with all of its objects. For example, as physical objects interact with each other, the physics system controls their movement, calculates collisions, applies new forces, and so on. When receiving notifications, a system object does not interact with other objects of its own system; it interacts with the generic-object extensions associated with it. This means that generic objects are independent of each other and can be updated simultaneously. This approach does not rule out edge cases that must be taken into account during synchronization, but it makes parallel execution possible where it seemed that only sequential processing would work.

3.2.3. Service Manager
The Service Manager provides systems with access to features of other systems that would otherwise be unavailable to them. It is important to understand that functions are accessed through interfaces and not directly. Information about system interfaces is also stored in the service manager.
To avoid interdependencies between systems, each of them is given only a small set of services. Moreover, whether a particular service may be used is determined not by the system itself but by the service manager.


Figure 8. Service manager example

The service manager has another function: it gives systems access to the properties of other systems. Properties are system-specific values that are not passed through the messaging system, such as the screen resolution in the graphics system or the gravity value in the physics system. The service manager gives systems access to such data but does not let them change it directly: it puts property changes in a special queue and publishes them only during sequential execution. Note that access to another system's properties is rarely required and should not be abused. For example, you may need it to toggle wireframe mode in the graphics system from the console window, or to change the screen resolution at the player's request from the user interface. This facility is mainly used to set parameters that do not change from frame to frame.
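A possible sketch of the deferred property-change queue (illustrative names; the real manager would also validate which system may change which property):

    #include <functional>
    #include <mutex>
    #include <vector>

    // Property changes are queued during parallel execution and
    // published only in the serial phase of the frame.
    class PropertyChangeQueue {
    public:
        void Request(std::function<void()> apply) {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push_back(std::move(apply));
        }

        // Called once per frame, from a single thread.
        void Publish() {
            for (auto& apply : pending_) apply();
            pending_.clear();
        }

    private:
        std::mutex mutex_;
        std::vector<std::function<void()>> pending_;
    };

    // Usage sketch: the UI requests a resolution change in the graphics system.
    // queue.Request([&graphics] { graphics.SetResolution(1920, 1080); });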

3.2.4. Environment manager
The environment manager provides the runtime environment for the engine. Its functions can be conditionally divided into the following groups (a minimal sketch follows the list).
  • Variables: names and values of common variables used by all parts of the engine. Variable values are typically set when a scene is loaded or from certain user settings; the engine and the various systems can read them by sending a request.
  • Execution: execution data, such as the completion of a scene or of program execution. These parameters can be set and queried both by the systems themselves and by the engine.
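A minimal sketch of the variable store, with illustrative names and a plain string map standing in for whatever typed storage a real engine would use:

    #include <map>
    #include <string>

    class EnvironmentManager {
    public:
        // Values are typically set while loading a scene or from user settings.
        void SetVariable(const std::string& name, const std::string& value) {
            variables_[name] = value;
        }

        // The engine and the systems read values by sending a request.
        std::string GetVariable(const std::string& name) const {
            auto it = variables_.find(name);
            return it != variables_.end() ? it->second : std::string();
        }

    private:
        std::map<std::string, std::string> variables_;
    };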
3.2.5. Platform Manager
The platform manager implements an abstraction over operating system calls and also provides functionality beyond simple abstraction. The advantage of this approach is that several typical operations are encapsulated within a single call, so they do not have to be implemented separately for each caller, overloading it with the details of OS calls.
Consider, as an example, the platform manager call that loads a system's dynamic library. It not only loads the library, but also obtains the function entry points and calls the library's initialization function. The manager also keeps the library handle and unloads the library when the engine terminates.
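A sketch of such a call using the POSIX dynamic-loading API (on Windows the same steps would use LoadLibrary, GetProcAddress, and FreeLibrary; the CreateSystem entry point is a hypothetical name for the function a system module would export):

    #include <dlfcn.h>  // POSIX; link with -ldl
    #include <stdexcept>
    #include <string>

    using CreateSystemFn = void* (*)();  // hypothetical entry point of a system module

    class PlatformManager {
    public:
        void* LoadSystemLibrary(const std::string& path) {
            handle_ = dlopen(path.c_str(), RTLD_NOW);
            if (!handle_) throw std::runtime_error("cannot load " + path);

            // Obtain the entry point and call the module's initialization.
            auto create = reinterpret_cast<CreateSystemFn>(dlsym(handle_, "CreateSystem"));
            if (!create) throw std::runtime_error("no CreateSystem in " + path);
            return create();
        }

        // The manager keeps the handle and unloads the library on shutdown.
        ~PlatformManager() {
            if (handle_) dlclose(handle_);
        }

    private:
        void* handle_ = nullptr;
    };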

The platform manager is also responsible for providing information about the processor, such as the supported SIMD instructions, and for initializing a particular mode of operation for processes. Systems cannot use any query functions other than those the manager provides.

4. Interfaces

Interfaces are the means of interaction between the framework, the managers, and the systems. The framework and the managers are part of the engine, so they can interact with each other directly. Systems, however, do not belong to the engine, and they all perform different functions, which creates the need for a single, uniform way of interacting with them. Since systems cannot communicate with the managers directly, the managers must provide another way to access them. Not all manager functions should be open to systems, though; some are available only to the framework.

Interfaces define a set of functions required to use a standard access method. This saves the framework from having to know the implementation details of specific systems, since it can only interact with them through a specific set of calls.

4.1. Subject and Observer Interfaces

The main purpose of the subject and observer interfaces is to register which observers should receive notifications about which subjects, and to deliver those notifications. Attaching and detaching an observer are standard operations included in every implementation of the subject interface.
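In C++, these two interfaces might look roughly as follows; the names and the bit-mask representation of change types are illustrative assumptions, not the article's prescribed layout:

    #include <cstdint>

    using ChangeMask = std::uint32_t;  // bit flags describing change types

    struct ISubject;

    struct IObserver {
        virtual ~IObserver() = default;
        // Called when an observed subject reports changes of the given types.
        virtual void ChangeOccurred(ISubject* subject, ChangeMask changes) = 0;
    };

    struct ISubject {
        virtual ~ISubject() = default;
        // Standard operations included in every implementation of the interface.
        virtual void Attach(IObserver* observer, ChangeMask interest) = 0;
        virtual void Detach(IObserver* observer) = 0;
        virtual void PostChanges(ChangeMask changes) = 0;  // notify attached observers
    };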

4.2. Manager interfaces

Managers, despite being Singleton objects, are directly accessible only to the framework. Other systems can only access managers through interfaces that represent only a subset of their overall functionality. After initialization, the interface is passed to the system, which uses it to work with certain functions of the manager.
There is no single interface for all managers. Each of them has its own separate interface.

4.3. System interfaces

For the framework to access system components, it needs interfaces; without them, support for each new system would have to be implemented in the engine separately.
Each system includes four components, so there are four corresponding interfaces: system, scene, object, and task (for a detailed description, see Section 5, "Systems"). Interfaces are the means of accessing these components. System interfaces allow scenes to be created and deleted. Scene interfaces, in turn, allow objects to be created and destroyed, and expose a request for the system's main task. The task interface is used primarily by the task manager when assigning tasks to the thread pool.
Since the scene and the object, as parts of a system, must interact with each other and with the universal scene and object to which they are attached, their interfaces are also built on the subject and observer interfaces.
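A rough sketch of the four interfaces; the names are illustrative, and in the real framework the scene and object interfaces would additionally derive from the subject and observer interfaces of Section 4.1:

    struct ISystemObject {
        virtual ~ISystemObject() = default;
    };

    struct ISystemTask {
        virtual ~ISystemTask() = default;
        virtual void Update(float deltaTime) = 0;  // invoked by the task manager
    };

    struct ISystemScene {
        virtual ~ISystemScene() = default;
        virtual ISystemObject* CreateObject() = 0;
        virtual void DestroyObject(ISystemObject* object) = 0;
        virtual ISystemTask* GetSystemTask() = 0;  // the scene's main task
    };

    struct ISystem {
        virtual ~ISystem() = default;
        virtual const char* GetSystemType() const = 0;  // e.g. "Graphics", "Physics"
        virtual ISystemScene* CreateScene() = 0;
        virtual void DestroyScene(ISystemScene* scene) = 0;
    };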

4.4. Change interfaces

These interfaces serve to transfer data between systems. All systems that make changes of a particular type must implement the corresponding change interface. Geometry is an example: the geometry interface includes methods for getting the position, orientation, and scale of an element. Any system that modifies geometry must implement this interface, so that access to the changed data requires no knowledge of the system that changed it.
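For instance, a geometry change interface might be sketched like this (names and array-based signatures are illustrative):

    // Implemented by any system that modifies an element's geometry, so
    // that readers need no knowledge of the system that made the change.
    struct IGeometryChange {
        virtual ~IGeometryChange() = default;
        virtual void GetPosition(float out[3]) const = 0;
        virtual void GetOrientation(float out[4]) const = 0;  // quaternion
        virtual void GetScale(float out[3]) const = 0;
    };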

5. Systems

Systems are the part of the engine that implements the game functionality. They perform all the basic tasks without which the engine would make no sense. Interaction between the engine and the systems is done through interfaces (see Section 4.3, "System Interfaces"), so as not to overload the engine with information about the various system types. Interfaces also make it much easier to add a new system, because the engine does not need to take all the implementation details into account.

5.1. Types

Engine systems can be loosely divided into several pre-defined categories, corresponding to standard game components. For example: geometry, graphics, physics (collision of rigid bodies), sound, input processing, AI and animation.
Systems with non-standard functions belong to a separate category. It is important to understand that any system that modifies the data of a particular category must be aware of that category's interface, as the engine does not provide such information.

5.2. System Components

For each system, several components need to be implemented: system, scene, object, and task. All of these components interact with various parts of the engine.
The diagram below depicts the interactions between the various components.


Figure 9. System components

A detailed diagram of the connections between the systems of the engine is given in Appendix B, "Diagram of the Interaction Between the Engine and Systems."

5.2.1. System
The "system" component, or simply the system, is responsible for initializing system resources, which will practically not change during the operation of the engine. For example, the graphics system parses the addresses of resources to determine their location and speed up loading when using the resource. It also sets the screen resolution.
The system is the main entry point for the framework. It provides information about itself (such as system type) as well as methods for creating and deleting scenes.
5.2.2. Scene
The scene component, or system scene, is responsible for managing the resources related to the current scene. The universal scene uses system scenes to extend its functionality with their features. An example is a physics scene that creates a new game world and sets the gravitational forces in it when the scene is initialized.
Scenes provide methods for creating and destroying objects, as well as a "task" component for processing the scene and a method for accessing it.
5.2.3. Object
The object component, or system object, belongs to a scene and is usually associated with what the user sees on the screen. The generic object uses the system object to extend its functionality, exposing the system object's properties as if they were its own.
An example would be the geometric, graphical, and physical extension of a generic object to display a wooden beam on the screen. The geometric properties include the object's position, orientation, and scale. To display it, the graphics system uses a mesh, and the physics system endows it with rigid-body properties to calculate its interactions with other bodies and the gravitational forces acting on it.

In certain cases, the system object needs to take into account changes to the generic object or one of its extensions. For this purpose, you can create a special link that will allow you to track the changes made.

5.2.4. Task
The task component, or system task, is used to process the scene. The task receives a command to update the scene from the task manager. This is a signal to run system functions on scene objects.
The execution of a task can be divided into subtasks, which are likewise distributed across additional threads with the help of the task manager. This is a convenient way to scale the engine across multiple processors. This approach is called data decomposition.
Information about object changes made while updating the scene's tasks is passed to the state manager. See Section 3.2.2 for details on the state manager.
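A simplified sketch of data decomposition, with plain threads standing in for the task manager's worker pool; the scene data here is just an array of positions updated in independent ranges:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // A subtask processes its own range of scene objects.
    void UpdateRange(std::vector<float>& positions, std::size_t begin,
                     std::size_t end, float dt) {
        for (std::size_t i = begin; i < end; ++i)
            positions[i] += 1.0f * dt;  // e.g. integrate a constant velocity
    }

    // The system task splits the scene update into one subtask per worker.
    void UpdateScene(std::vector<float>& positions, float dt, unsigned workers) {
        std::vector<std::thread> pool;
        std::size_t chunk = positions.size() / workers + 1;  // workers > 0
        for (unsigned w = 0; w < workers; ++w) {
            std::size_t begin = w * chunk;
            std::size_t end = std::min(positions.size(), begin + chunk);
            if (begin < end)
                pool.emplace_back(UpdateRange, std::ref(positions), begin, end, dt);
        }
        for (std::thread& t : pool) t.join();
    }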

6. Combining all components

All the elements described above are interconnected and are part of one whole. The operation of the engine can be conditionally divided into several stages, described in the following sections.

6.1. Initialization phase

The work of the engine begins with the initialization of managers and the framework.
  • The framework calls the scene loader.
  • Having determined which systems the scene will use, the loader calls the platform manager to load the appropriate modules.
  • The platform manager loads the appropriate modules and passes them to the interface manager, then calls them to create a new system.
  • The module returns to the loader a pointer to the system instance that implements the system interface.
  • The service manager registers all the services that the system module provides.


Figure 10. Initialization of managers and systems of the engine

6.2. Scene loading stage

Control is returned to the loader, which loads the scene.
  • The loader creates a universal scene. To instantiate system scenes, it calls the system interfaces, extending the functionality of the generic scene.
  • A generic scene defines what data each system scene can change and what changes it should be notified about.
  • After matching the scenes that make certain changes and want to be notified about them, the generic scene passes this information to the state manager.
  • For each scene object, the loader creates a generic object, then determines which systems will extend the generic object. Correspondence between system objects is determined according to the same scheme that is used for scenes. It is also passed to the state manager.
  • Using the resulting scene interfaces, the loader creates instances of system objects and uses them to extend generic objects.
  • The scheduler asks the scene interfaces for information about their main tasks in order to pass this information to the task manager during execution.


Figure 11. Initialization of the universal scene and object

6.3. Stage of the game cycle

  • The platform manager is used to process window messages and other elements necessary for the current platform to work.
  • Control then passes to the scheduler, which waits until the current clock tick ends before continuing.
  • At the end of a tick in free step mode, the scheduler checks which tasks have finished. All tasks that are ready to execute (that is, whose dependencies have completed) are handed to the task manager.
  • The scheduler determines which tasks will complete in the current tick and waits for them to finish.
  • In hard step mode, these operations are repeated every tick. The scheduler hands all tasks to the manager and waits for them to complete.
6.3.1. Task completion
Control passes to the task manager.
  • It forms a queue of all received tasks, then, as free threads appear, it starts executing them. (The process of executing tasks differs depending on the systems. Systems can work with only one task or process several tasks from the queue at the same time, thus realizing parallel execution.)
  • During execution, tasks can work with the entire scene or only with certain objects, changing their internal data.
  • Systems must be notified of any changes to shared data (such as position or orientation). Therefore, while a task executes, the system scene or object reports its changes to the observer; the role of the observer here is actually played by the change controller, which is part of the state manager.
  • The change controller queues up change notifications for further processing. It ignores changes that do not affect the given observer.
  • To use certain services, a task calls the service manager. The service manager also makes it possible to change the properties of other systems that are not passed through the messaging mechanism (for example, the input system changes the screen resolution, a property of the graphics system).
  • Tasks can also call the environment manager to obtain environment variables and to change the execution state (suspension of execution, transition to the next scene, etc.).


Figure 12. Task manager and tasks

6.3.2. Data update
After all the tasks of the current cycle have been completed, the main game loop calls the state manager to start the data update phase.
  • The state manager calls each of its change controllers in turn to distribute accumulated notifications. The controller checks which observers to send change notifications for each of the subjects.
  • It then calls the desired observer and notifies it of the change (the notification also includes a pointer to the subject's interface). In free step mode, the observer receives the changed data from the change controller, but in hard step mode it must request the data from the subject itself.
  • Typically, the observers interested in receiving system object change notifications are other system objects associated with the same generic object. This allows you to split the process of making changes into several tasks that can be performed in parallel. To simplify the synchronization process, you can combine all related generic object extensions in one task.
6.3.3. Execution check and exit
The final stage of the game loop is checking the runtime state. There are several such states: running, pause, next scene, exit, and so on. If the "running" state is set, the next iteration of the loop begins. The "exit" state means the loop finishes, resources are released, and the application terminates.
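Putting the stages of this section together, the skeleton of the loop might look like the following sketch; the stub types stand in for the real managers, and all names are illustrative (the stub environment returns "exit" immediately so the sample terminates):

    enum class RunState { Running, Paused, NextScene, Exit };

    // Stubs standing in for the engine components described above.
    struct Platform    { void ProcessMessages() {} };
    struct Scheduler   { void Execute() {} };            // hand tasks to the task manager, wait
    struct StateMgr    { void DistributeChanges() {} };  // data-update phase
    struct Environment { RunState GetRunState() const { return RunState::Exit; } };

    int main() {
        Platform platform;
        Scheduler scheduler;
        StateMgr stateManager;
        Environment environment;

        RunState state = RunState::Running;
        while (state != RunState::Exit) {
            platform.ProcessMessages();        // window/OS messages (Section 6.3)
            scheduler.Execute();               // task execution (Section 6.3.1)
            stateManager.DistributeChanges();  // data update (Section 6.3.2)
            state = environment.GetRunState(); // execution check (Section 6.3.3)
        }
        // "Exit": release resources and leave the application.
        return 0;
    }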

7. Conclusion

The main idea of this article is given in Section 2, "Concurrent Execution Status". Functional decomposition and data decomposition make it possible not only to multithread the engine, but also to keep it scalable to an even larger number of cores in the future. To eliminate synchronization overhead while keeping data up to date, use state managers in addition to the messaging mechanism.

The Observer pattern underlies the messaging mechanism. It is important to understand well how it works in order to choose the best way to implement it for the engine. In essence, it is a mechanism for interaction between different systems that keeps shared data synchronized.

Task management plays an important role in the distribution of workloads. Appendix D provides tips for creating an effective task manager for a game engine.

As you can see, multithreading a game engine is achievable thanks to a well-defined structure and a message exchange mechanism. With them, you can take full advantage of the performance of modern and future processors.

Appendix A. Engine Schematic

Processing is started from the main game loop (see Figure 4, “Main Game Loop”).


Appendix B. Diagram of the Interaction Between the Engine and Systems


Appendix C. Observer (design pattern)

The Observer pattern is described in detail in Design Patterns: Elements of Reusable Object-Oriented Software by E. Gamma, R. Helm, R. Johnson, and J. Vlissides, first published by Addison-Wesley in 1995.

The main idea of the pattern is as follows: if some elements need to be notified about changes in other elements, they do not have to scan the list of all possible changes looking for the data they need. The pattern defines a subject and an observer, which are used to send change notifications. The observer tracks any changes to the subject, while the change controller acts as an intermediary between the two. The following diagram illustrates this relationship.


Figure 13. Pattern "Observer"

The process of using this pattern is described below; a code sketch follows the list.

  1. The observer registers with the change controller the subject whose changes it wants to track.
  2. The change controller is actually an observer itself: instead of the real observer, it registers itself with the subject. It also keeps its own list of registered observers and subjects.
  3. The subject adds the observer (that is, the change controller) to the list of observers that want to be notified of its changes. Sometimes a change type is additionally specified, indicating which changes the observer is interested in; this helps streamline the sending of change notifications.
  4. When its data or state changes, the subject notifies the observer through a callback mechanism and passes information about the changed types.
  5. The change controller queues the change notifications and waits for the signal to distribute them among the objects and systems.
  6. During distribution, the change controller notifies the real observers.
  7. The observers request the changed data or state from the subject (or receive it together with the notifications).
  8. Before an observer is deleted, or when it no longer needs notifications about a subject, it unregisters from that subject in the change controller.
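The steps above can be sketched in C++ as follows; the names are illustrative, and the queue and registration lists are deliberately simplified compared with a real change controller:

    #include <algorithm>
    #include <utility>
    #include <vector>

    struct Subject;

    struct Observer {
        virtual ~Observer() = default;
        virtual void OnChange(Subject& subject) = 0;
    };

    struct Subject {
        std::vector<Observer*> observers;  // in practice, the change controller
        void Attach(Observer* o) { observers.push_back(o); }
        void Detach(Observer* o) {
            observers.erase(std::remove(observers.begin(), observers.end(), o),
                            observers.end());
        }
        void NotifyChanged() {  // step 4: callback to the registered observers
            for (Observer* o : observers) o->OnChange(*this);
        }
    };

    struct ChangeController : Observer {
        std::vector<std::pair<Observer*, Subject*>> registrations;
        std::vector<Subject*> queued;

        // Steps 1-3: register the real observer, attach ourselves to the subject.
        void Register(Observer* realObserver, Subject& subject) {
            registrations.emplace_back(realObserver, &subject);
            subject.Attach(this);
        }
        // Step 5: the subject's callback lands here and is queued.
        void OnChange(Subject& s) override { queued.push_back(&s); }
        // Steps 6-7: on the framework's signal, notify the real observers.
        void DistributeQueuedChanges() {
            for (Subject* s : queued)
                for (auto& reg : registrations)
                    if (reg.second == s) reg.first->OnChange(*s);
            queued.clear();
        }
    };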
Appendix D. Tips for Implementing Tasks

There are many ways to implement task distribution, but it is best to keep the number of worker threads equal to the number of logical processors available on the platform. Avoid tying tasks to a specific thread: the execution times of different systems' tasks do not always coincide, which can lead to an uneven load distribution between worker threads and reduce efficiency. To simplify this process, use task management libraries such as
