Design principles
ARM stands for Advanced RISC Machine, an architecture family used to build processors, and ARMv6 is one of its revisions. The architecture was developed to address the constraints that have long shaped ARM processor design: each new generation must keep pace with an evolving market, and parameters such as speed, performance, power, and cost are the determining factors in its design and development. A key measure of success for the ARM architecture has been performance per unit of power (MIPS/Watt), which is central to future applications (Goodacre & Sloss, 2005). As computing and communication continue to converge in many consumer markets, the architecture has to provide these combined capabilities, and ARMv6 was developed with wireless, networking, and automotive entertainment applications in mind. The architecture also brings improved memory management, which is reflected in the effort to raise processor performance: with ARMv6, the average instruction-fetch and data latencies are reduced significantly, so the processor spends less time waiting for instructions or for cache misses to be serviced. The improved memory management is reported to raise performance by roughly 30%. Another consideration in developing this architecture was the need to support multiprocessor systems, which share data efficiently through shared memory, together with multimedia support (Zhang et al., 2009).
Instruction set architecture (ISA)
The Thumb instruction set architecture is supported by this architecture. A newer addition to this ISA is Thumb-2 technology, which introduces 32-bit instructions alongside the established 16-bit instruction set inherited from earlier versions of the ARM architecture.
The 16-bit instructions are the same as the Thumb instructions used before the introduction of Thumb-2. Thumb instructions are either 16 or 32 bits long, are aligned on a two-byte boundary, and 16-bit and 32-bit instructions can be freely intermixed. Most 16-bit instructions can access only the eight general-purpose low registers, R0-R7; a small number can also access the high registers, R8-R15. ARM and Thumb instructions are designed to interwork freely. In the base ARMv6 architecture, only the original 16-bit Thumb instruction set is supported; the 32-bit Thumb-2 encodings were introduced later with ARMv6T2 (Cormie, 2002).
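As a rough illustration of the interworking convention (a minimal sketch, not drawn from the cited sources): when control is transferred with an interworking branch such as BX, bit 0 of the target address selects the instruction set, so a code pointer with bit 0 set refers to Thumb code. The helper name below is hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: report which instruction set a code pointer targets.
 * On ARM, interworking branches (BX/BLX) use bit 0 of the target address:
 * bit 0 set -> Thumb state, bit 0 clear -> ARM state. The compiler and
 * linker normally manage this automatically for C code. */
static const char *instruction_set_of(void (*fn)(void))
{
    return ((uintptr_t)fn & 1u) ? "Thumb" : "ARM";
}

static void example(void) { /* built as ARM or Thumb depending on flags */ }

int main(void)
{
    printf("example() is %s code\n", instruction_set_of(&example));
    return 0;
}
```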
Instruction categories
There are several categories of instructions in the ARMv6 architecture, each covering a different class of operation performed on operands.
The first category is branch instructions, which transfer control to another address. One example is the plain branch to a target, with a range of roughly ±2 KB; another is the branch used to call a subroutine, with a range of roughly ±16 MB. A second category is data-processing instructions, which perform the basic operations on data. Standard data-processing instructions share a common format, with some variations; other examples in this category include multiply instructions, packing and unpacking instructions, and shift instructions. A third category is status register access instructions, which read and write the status registers that record the state of the processor.
Examples of such instructions are MRS and MSR, which read and write the status registers: they move the application program status register (APSR) into a general-purpose register and back again. The condition flags in the APSR are set by executing data-processing instructions, and in normal circumstances they are used to control the execution of conditional branch instructions.
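As a hedged sketch of how MRS moves the status register into a general-purpose register (GCC-style inline assembly, ARM target assumed; the function names are illustrative):

```c
#include <stdint.h>

/* Read the current program status register (CPSR, which contains the APSR
 * condition flags N, Z, C, V) into a general-purpose register using MRS.
 * An ARM target and a GCC-compatible compiler are assumed. */
static inline uint32_t read_cpsr(void)
{
    uint32_t value;
    __asm__ volatile("mrs %0, cpsr" : "=r"(value));
    return value;
}

/* Example: test the Z (zero) flag, bit 30 of the CPSR. */
static inline int zero_flag_set(void)
{
    return (read_cpsr() >> 30) & 1u;
}
```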
How the processor works
The processor supports both the ARM and Thumb instruction sets. It also supports Jazelle technology, which enables Java bytecodes to be executed directly, as well as SIMD and DSP instructions that operate on 16-bit and 8-bit data values.
In this architecture there are eight pipeline stages. The first is the fetch 1 stage (Fe1), in which the address is sent out and the instruction is received. Next is the fetch 2 stage (Fe2), in which branch prediction is performed. After the two fetch stages comes the decode (De) stage, where instructions are decoded, followed by the issue (Iss) stage, where registers are read and instructions are issued. Then come the shift (Sh) stage, where shift operations are carried out, the ALU stage, where arithmetic and logic operations are performed, and the saturate (Sat) stage, where results are saturated. The last stage is the write-back (WB) stage, in which results are written back to the registers.
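The stage ordering can be summarized in code; the following is only an illustrative model of the sequence described above (the abbreviations follow ARM's documentation), not a simulator of the actual core.

```c
#include <stdio.h>

/* Illustrative model of the ARMv6 (ARM11) integer pipeline ordering:
 * Fe1 -> Fe2 -> De -> Iss -> Sh -> ALU -> Sat -> WB. */
enum pipeline_stage {
    FE1,  /* fetch 1: address sent, instruction received   */
    FE2,  /* fetch 2: branch prediction                     */
    DE,   /* decode                                         */
    ISS,  /* issue: registers read, instruction issued      */
    SH,   /* shift: barrel-shifter operations               */
    ALU,  /* arithmetic/logic operation                     */
    SAT,  /* saturation of results                          */
    WB,   /* write-back to the register file                */
    NUM_STAGES
};

static const char *stage_names[NUM_STAGES] = {
    "Fe1", "Fe2", "De", "Iss", "Sh", "ALU", "Sat", "WB"
};

int main(void)
{
    /* Walk one instruction through the stages in order. */
    for (int s = FE1; s < NUM_STAGES; ++s)
        printf("cycle %d: %s\n", s + 1, stage_names[s]);
    return 0;
}
```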
In this architecture there is no separate shift instruction feeding the shift pipeline stage; instead, each data-processing instruction provides a field specifying how one operand is to be shifted. A barrel shifter performs these shifts and rotations of the operand. This is also why the status register keeps separate flags for the shifter's carry-out and for arithmetic overflow.
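For example, a shifted operand is folded into a single data-processing instruction rather than requiring a separate shift; a brief sketch (the instruction shown in the comment is typical compiler output, not a guaranteed encoding):

```c
#include <stdint.h>

/* The shifted operand is encoded in the data-processing instruction itself,
 * so an ARM compiler can typically emit a single instruction such as
 *     ADD r0, r0, r1, LSL #2
 * for the expression below, using the barrel shifter in the shift stage. */
static inline uint32_t add_scaled(uint32_t a, uint32_t b)
{
    return a + (b << 2);
}
```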
The processor executes instructions by passing them through the pipeline stages. Take an LDR/STR operation as an example. The instruction first enters the fetch 1 stage, where its address is sent out and the instruction is received, and then moves to the fetch 2 stage, where branch prediction takes place. It is then decoded in the decode stage, and in the issue stage its registers are read and it is issued to the execution stages. The memory address is formed, the value to be written back is produced, the saturation stage is passed through, and finally the base register is written back. The exact stages traversed after issue depend on the operation being executed.
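The base-register write-back mentioned above corresponds to post- or pre-indexed addressing; a hedged C sketch follows (the instruction in the comment is representative of what a compiler may emit, not a guaranteed encoding):

```c
#include <stdint.h>

/* Summing an array with a post-incremented pointer maps naturally onto a
 * post-indexed load such as
 *     LDR r3, [r1], #4
 * where the loaded value goes to r3 and the updated base address (r1 + 4)
 * is written back to the base register by the same instruction. */
static uint32_t sum_words(const uint32_t *p, unsigned n)
{
    uint32_t total = 0;
    while (n--)
        total += *p++;   /* load value, then advance the base pointer */
    return total;
}
```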
Advantages of ARMv6
This architecture supports a degree of parallelism in instruction execution, which makes execution faster. One way this is evident is when an instruction increments a register while accessing memory: it uses the memory-access pathway and the arithmetic pathway at the same time. If the memory-access pathway stalls, for instance because another instruction is accessing memory, the arithmetic instruction can still proceed, since it does not depend on the memory access; this frees the ALU and saturation stages for other instructions. The architecture also supports 8-bit and 16-bit SIMD arithmetic, including operations on four 8-bit or two 16-bit values, add and subtract instructions that operate on these lanes in parallel, and selection, packing, and unpacking operations.
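A hedged illustration of the packed 8-bit SIMD arithmetic (UADD8 adds four byte lanes in parallel); an ARMv6 target with GCC-style inline assembly is assumed:

```c
#include <stdint.h>

/* Add four unsigned 8-bit lanes packed into 32-bit registers using the
 * ARMv6 SIMD instruction UADD8; each byte lane is added independently,
 * with no carry between lanes. ARMv6 target assumed. */
static inline uint32_t add_u8x4(uint32_t a, uint32_t b)
{
    uint32_t result;
    __asm__ volatile("uadd8 %0, %1, %2" : "=r"(result) : "r"(a), "r"(b));
    return result;
}

/* Example: 0x01020304 + 0x10203040 -> 0x11223344 (four independent sums). */
```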
The buses used in this architecture are 64 bits wide or wider, which allows the core to sustain throughput equivalent to, or better than, that of a 64-bit machine, yet without the power or area overhead of a full 64-bit CPU. Bus usage is more efficient because of the improvements in the architecture's memory management, and power is saved because fewer memory accesses mean less bus activity.
The cache hardware is virtually indexed and physically tagged: the cache index is derived from the virtual address while the tags are based on the physical address. One benefit of this arrangement is that cache lookups are faster, because the translation look-aside buffer (TLB) translation can proceed in parallel with the index lookup instead of sitting on the critical path of every access.
The input/output subsystem of the architecture is handled through memory-mapped accesses governed by specific access rules rather than by a separate class of instructions. The memory used by the I/O subsystem conforms to the Strongly-ordered or Device memory types. One example is FIFO access, where consecutive reads or writes to the same location add or remove successive values. Interrupt controllers are another example: an access to an interrupt controller register can act as an interrupt acknowledge and therefore changes the state of the controller itself. Memory controller configuration registers, which set up the timing of regions of normal memory, and memory-mapped peripherals, whose state changes when particular memory locations are accessed, are handled under the same rules (Carbone, 2005).
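A hedged sketch of a FIFO-style device access follows (the register address is a placeholder; on ARMv6 the page containing it would be mapped as Device or Strongly-ordered memory so that each access actually reaches the peripheral, in order):

```c
#include <stdint.h>

/* Placeholder address for a memory-mapped FIFO data register; the real
 * address and its Device/Strongly-ordered mapping come from the platform's
 * memory map and page tables. */
#define FIFO_DATA_REG  ((volatile uint32_t *)0x10000000u)

/* Each iteration performs a separate access to the same address; for a FIFO
 * this drains consecutive values, which is why the region must not be
 * treated as Normal (cacheable, merge-able) memory. */
static void fifo_read_burst(uint32_t *dst, unsigned n)
{
    while (n--)
        *dst++ = *FIFO_DATA_REG;
}
```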
There is also the Shareable attribute, which indicates whether Normal or Device memory is private to a single processor or may be accessed by several processors or other bus-master resources, such as intelligent peripherals with DMA capability (York, 2002).
In managing input/output devices there is a need for Strongly-ordered memory, because accesses must occur in order relative to whatever has happened earlier and later in program order. Strongly-ordered memory is always assumed to be shareable. For input/output devices that are shared by several processors, the Device memory attribute is used instead.
Successes of ARMv6 architecture
The architecture has proved successful, as seen in the many benefits achieved by its implementation. Memory behaviour has been improved, and with it the effective processing power, particularly for platform-dependent applications in which the operating system must manage a constantly changing set of tasks. Average instruction-fetch times and data latency are reduced, so the processor spends minimal time fetching instructions from memory.
Memory management in ARMv6, as in architecture development generally, addresses two concerns: translating virtual addresses to physical addresses, and protecting different processes and tasks at the required privilege levels. ARMv6 follows a load-store architecture, in which instructions can operate only on data held in the core's registers; load and store instructions are used to move data between memory and the register file (Shojania & Li, 2009).
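A short sketch of what the load-store model means in practice (the assembly in the comment is typical compiler output, shown only for illustration):

```c
#include <stdint.h>

/* In a load-store architecture the ALU never operates on memory directly:
 * data must first be loaded into core registers, operated on, and stored
 * back, roughly
 *     LDR r2, [r0]        ; load first operand into a register
 *     LDR r3, [r1]        ; load second operand
 *     ADD r2, r2, r3      ; data-processing works only on registers
 *     STR r2, [r0]        ; store the result back to memory
 */
static void add_in_place(uint32_t *dst, const uint32_t *src)
{
    *dst = *dst + *src;
}
```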
A multi-level memory system is part of the normal system design hierarchy, and with such a hierarchy ARMv6 tends to run faster. The level 1 memory system used by ARMv6 operates with no wait states, which in practice limits how much memory can be supported at the processor's core clock speed. Larger, higher-performing systems can support bigger caches that do introduce some wait states but still offer low latency, thanks to the developments in the ARMv6 architecture.
ARMv6 is a progression and enhancement of the cores introduced in earlier ARM architectures, which added cached cores with an MMU. ARMv6 advances this by completely defining the L1 memory system, while leaving the lower levels of the memory hierarchy, and how they interact, less tightly defined.
With the L1 memory completely defined, the L1 memory is synchronized with the core. Beyond this level, different domains are introduced into the design, and synchronization of memory becomes dependent on the implementation (Klein et al., 2009).
Another success of the ARMv6 architecture is the complete specification of the L1 memory, the cache most tightly coupled to the processor, achieved through the ARM Virtual Memory System Architecture v6 (VMSAv6). VMSAv6 also specifies Tightly-Coupled Memory (TCM) and a DMA system, and implementations may provide any combination of these; software-visible registers allow the resources actually present to be identified. ARMv6 also defines memory hierarchy and ordering rules that guarantee correctness across additional levels of cache, for both single-processor and multiprocessor systems, giving a complete architectural definition without constraining the implementation (Blanqui et al., 2011).
The success of ARMv6 is also seen in its support for physically tagged caches, which reduces the software overhead of context switches. This feature can cut processor power consumption by around 20%, because the operating system no longer needs to flush the caches on a context switch.
ARMv6 was also developed to reduce the need for cache cleaning and invalidation on a context switch. The L1 cache can be organized as a Harvard system, with separate instruction and data caches, or as a single unified (von Neumann) cache. The TCM is a region of scratchpad memory implemented alongside the L1 cache, and the L1 DMA subsystem is designed so that transfers to and from the TCM can take place in the background.
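As a hedged example of how software might make use of the TCM (the section name ".dtcm" is an assumption and must match the project's linker script; the attribute syntax is GCC-specific):

```c
#include <stdint.h>

/* Place a frequently accessed buffer into data TCM by assigning it to a
 * dedicated linker section; ".dtcm" is a hypothetical section name that the
 * linker script would need to locate in the TCM address range. A DMA engine
 * could then stream data into this buffer in the background while the core
 * works on another buffer. */
static uint32_t dtcm_buffer[256] __attribute__((section(".dtcm")));

uint32_t *tcm_working_buffer(void)
{
    return dtcm_buffer;
}
```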
Another advantage of this architecture lies in exception and interrupt handling, where several enhancements have been made. Interrupt latency is improved: a low-latency interrupt mode allows certain features to be modified or switched off, enabled through the FI bit in CP15 register 1, the CPU control register. This lets designers choose efficiently between performance and latency, or support both in the same design. One example is making Load Multiple and Store Multiple instructions interruptible where low latency is important; in normal circumstances these instructions run to completion.
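A hedged sketch of enabling the low-latency interrupt mode follows (privileged code; the FI bit position, taken here as bit 21 of the CP15 c1 Control Register, is an assumption that should be checked against the specific core's Technical Reference Manual):

```c
#include <stdint.h>

/* Assumed position of the FI (low interrupt latency) bit in the CP15 c1
 * Control Register; verify against the core's Technical Reference Manual. */
#define CP15_CTRL_FI_BIT  (1u << 21)

/* Read-modify-write the CP15 Control Register (MRC/MCR p15, 0, <Rd>, c1,
 * c0, 0). This must run in a privileged mode on an ARMv6 core. */
static inline void enable_low_latency_interrupt_mode(void)
{
    uint32_t ctrl;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(ctrl));
    ctrl |= CP15_CTRL_FI_BIT;
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(ctrl));
}
```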
Conclusion
In conclusion, this architecture brings many benefits. ARMv6 represents a substantial improvement, particularly in its instruction set and memory management. As user needs change, better memory and interrupt management have become necessary, and this design architecture makes them possible.
References
Blanqui, F., Helmstetter, C., Joloboff, V., Monin, J. F., & Shi, X. (2011). Designing a CPU model: from a pseudo-formal document to fast code. arXiv preprint arXiv:1109.4351.
Carbone, J. (2005). A SMP RTOS for the ARM MPCore multiprocessor. ARM Information Quarterly, 4(3), 64-67.
Cormie, D. (2002). The ARM11 microarchitecture. ARM Ltd. White Paper.
Goodacre, J., & Sloss, A. N. (2005). Parallelism and the ARM instruction set architecture. Computer, 38(7), 42-50.
Klein, G., Elphinstone, K., Heiser, G., Andronick, J., Cock, D., Derrin, P., & Winwood, S. (2009, October). seL4: Formal verification of an OS kernel. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (pp. 207-220). ACM.
Shojania, H., & Li, B. (2009, June). Random network coding on the iPhone: fact or fiction? In Proceedings of the 18th international workshop on Network and operating systems support for digital audio and video (pp. 37-42). ACM.
York, R. (2002). Benchmarking in context: Dhrystone. White Paper. ARM Ltd., Cambridge, UK.
Zhang, L., Wen, S., Wang, R., & Zhang, G. (2009, May). A system architecture design scheme of the secure chip based on SoC. In 2009 International Workshop on Intelligent Systems and Applications (ISA 2009) (pp. 1-4). IEEE.