General-Purpose Registers

Cortex-M3 Basics

Joseph Yiu , in The Definitive Guide to the ARM Cortex-M3 (Second Edition), 2010

3.1 Registers

As we've seen, the Cortex™-M3 processor has registers R0 through R15 and a number of special registers. R0 through R12 are general purpose, but some of the 16-bit Thumb® instructions can only access R0 through R7 (low registers), whereas 32-bit Thumb-2 instructions can access all these registers. Special registers have predefined functions and can only be accessed by special register access instructions.

3.1.1 General Purpose Registers R0 through R7

The R0 through R7 general purpose registers are also called low registers. They can be accessed by all 16-bit Thumb instructions and all 32-bit Thumb-2 instructions. They are all 32 bits; the reset value is unpredictable.

3.1.2 General Purpose Registers R8 through R12

The R8 through R12 registers are also called high registers. They are accessible by all Thumb-2 instructions but not by all 16-bit Thumb instructions. These registers are all 32 bits; the reset value is unpredictable (see Figure 3.1).

FIGURE 3.1. Registers in the Cortex-M3.

3.1.3 Stack Pointer R13

R13 is the stack pointer (SP). In the Cortex-M3 processor, there are two SPs. This duality allows two separate stack memories to be set up. When using the register name R13, you can only access the current SP; the other one is inaccessible unless you use the special instructions move to special register from general-purpose register (MSR) and move special register to general-purpose register (MRS). The two SPs are as follows:

Main Stack Pointer (MSP) or SP_main in ARM documentation: This is the default SP; it is used by the operating system (OS) kernel, exception handlers, and all application code that requires privileged access.

Process Stack Pointer (PSP) or SP_process in ARM documentation: This is used by the base-level application code (when not running an exception handler).

Stack PUSH and POP

Stack is a memory usage model. It is simply part of the system memory, and a pointer register (inside the processor) is used to make it work as a first-in/last-out buffer. The common use of a stack is to save register contents before some data processing and then restore those contents from the stack after the processing task is done.

FIGURE 3.2. Basic Concept of Stack Memory.

When doing PUSH and POP operations, the pointer register, normally called the stack pointer, is adjusted automatically to prevent subsequent stack operations from corrupting previously stacked data. More details on stack operations are provided in a later part of this chapter.

It is not necessary to use both SPs. Simple applications can rely purely on the MSP. The SPs are used for accessing stack memory processes such as PUSH and POP.

In the Cortex-M3, the instructions for accessing stack memory are PUSH and POP. The assembly language syntax is as follows (text after each semicolon [;] is a comment):

PUSH   {R0}   ; R13 = R13 - 4, then Memory[R13] = R0

POP   {R0}   ; R0 = Memory[R13], then R13 = R13 + 4

The Cortex-M3 uses a full-descending stack arrangement. (More detail on this subject can be found in the "Stack Memory Operations" section of this chapter.) Therefore, the SP decrements when new data is stored in the stack. PUSH and POP are usually used to save register contents to stack memory at the start of a subroutine and then restore the registers from stack at the end of the subroutine. You can PUSH or POP multiple registers in one instruction:

subroutine_1

  PUSH   {R0-R7, R12, R14} ; Save registers

  ...   ; Do your processing

  POP   {R0-R7, R12, R14} ; Restore registers

  BX   R14   ; Return to calling function
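The PUSH and POP behavior described above can be modeled in a few lines of Python. This is only a sketch of the full-descending scheme, not processor-accurate code: the class name `DescendingStack`, the dictionary used as memory, and the example addresses are all hypothetical. It assumes the register list is given in ascending order, so that the lowest-numbered register ends up at the lowest address.

```python
class DescendingStack:
    """Toy model of a full-descending stack: SP points at the last word pushed."""

    def __init__(self, initial_sp: int):
        self.sp = initial_sp & ~0x3  # bits 0 and 1 of SP always read as zero
        self.memory = {}             # hypothetical sparse memory: address -> 32-bit word

    def push(self, regs: dict, names: list) -> None:
        # Walk the list backwards so the first-listed register ends at the lowest address.
        for name in reversed(names):
            self.sp -= 4                       # R13 = R13 - 4
            self.memory[self.sp] = regs[name]  # Memory[R13] = Rn

    def pop(self, regs: dict, names: list) -> None:
        for name in names:
            regs[name] = self.memory.pop(self.sp)  # Rn = Memory[R13]
            self.sp += 4                           # R13 = R13 + 4


# Save three registers, clobber two of them, then restore (example values are made up).
regs = {"R0": 0x11, "R1": 0x22, "R14": 0x08000101}
stack = DescendingStack(0x20001000)
saved = ["R0", "R1", "R14"]
stack.push(regs, saved)
sp_after_push = stack.sp       # three words below the starting SP
regs["R0"] = regs["R1"] = 0    # the "processing" step overwrites registers
stack.pop(regs, saved)
```

After the final `pop`, the SP is back at its starting value and the saved register contents are restored, which is the property a subroutine prologue and epilogue rely on.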

Instead of using R13, you can use SP (for stack pointer) in your program code. It means the same thing. Inside program code, both the MSP and the PSP can be called R13/SP. However, you can access a particular one using special register access instructions (MRS/MSR).

The MSP, also called SP_main in ARM documentation, is the default SP after power-up; it is used by kernel code and exception handlers. The PSP, or SP_process in ARM documentation, is typically used by thread processes in systems with an embedded OS running.

Because register PUSH and POP operations are always word aligned (their addresses must be 0x0, 0x4, 0x8, ...), the SP/R13 bit 0 and bit 1 are hardwired to 0 and always read as zero (RAZ).

3.1.4 Link Register R14

R14 is the link register (LR). Inside an assembly program, you can write it as either R14 or LR. LR is used to store the return program counter (PC) when a subroutine or function is called, for example, when you're using the branch and link (BL) instruction:

main   ; Main program

  ...

  BL function1 ; Call function1 using Branch with Link instruction.

  ; PC = function1 and

  ; LR = the next instruction in main

  ...

function1

  ...   ; Program code for function 1

  BX LR   ; Return

Despite the fact that bit 0 of the PC is always 0 (because instructions are word aligned or half word aligned), the LR bit 0 is readable and writable. This is because in the Thumb instruction set, bit 0 is often used to indicate ARM/Thumb states. To allow the Thumb-2 program for the Cortex-M3 to work with other ARM processors that support the Thumb-2 technology, this least significant bit (LSB) is writable and readable.

3.1.5 Program Counter R15

R15 is the PC. You can access it in assembler code by either R15 or PC. Because of the pipelined nature of the Cortex-M3 processor, when you read this register, you will find that the value is different than the location of the executing instruction, normally by 4. For example:

0x1000 :   MOV   R0, PC   ; R0 = 0x1004

In other instructions like literal load (reading of a memory location related to current PC value), the effective value of PC might not be instruction address plus 4 due to alignment in address calculation. But the PC value is still at least 2 bytes ahead of the instruction address during execution.

Writing to the PC will cause a branch (but LR does not get updated). Because an instruction address must be half word aligned, the LSB (bit 0) of the PC read value is always 0. However, in branching, either by writing to PC or using branch instructions, the LSB of the target address should be set to 1 because it is used to indicate the Thumb state operations. If it is 0, it can imply trying to switch to the ARM state and will result in a fault exception in the Cortex-M3.
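The two PC rules above (a read returns the instruction address plus 4, and a branch target needs bit 0 set) can be illustrated with a small Python model. This is a hedged sketch, not a description of the hardware: `read_pc`, `branch_to`, and the `UsageFault` class are hypothetical names, and the model covers only the register-based branch case where the target value carries the Thumb bit.

```python
class UsageFault(Exception):
    """Stand-in for the fault exception raised on an attempted switch to ARM state."""


def read_pc(instr_addr: int) -> int:
    # Reading PC normally yields the executing instruction's address plus 4.
    return instr_addr + 4


def branch_to(target: int) -> int:
    """Return the new PC for a branch whose target value carries the Thumb bit."""
    if target & 1 == 0:
        # Bit 0 clear would request ARM state, which the Cortex-M3 does not support.
        raise UsageFault(f"target {target:#x} has bit 0 clear")
    return target & ~1  # the PC itself always reads back with bit 0 as zero


new_pc = branch_to(0x08000101)  # Thumb bit set, so the branch is accepted
```

The example address 0x08000101 is arbitrary; the point is that the value written has bit 0 set while the resulting PC does not.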

URL: https://www.sciencedirect.com/science/article/pii/B9781856179638000065

INTRODUCTION TO THE ARM INSTRUCTION SET

ANDREW N. SLOSS , ... CHRIS WRIGHT , in ARM System Developer's Guide, 2004

3.5 PROGRAM STATUS REGISTER INSTRUCTIONS

The ARM instruction set provides two instructions to directly control a program status register (psr). The MRS instruction transfers the contents of either the cpsr or spsr into a register; in the reverse direction, the MSR instruction transfers the contents of a register into the cpsr or spsr. Together these instructions are used to read and write the cpsr and spsr.

In the syntax you can see a label called fields. This can be any combination of control (c), extension (x), status (s), and flags (f). These fields relate to particular byte regions in a psr, as shown in Figure 3.9.

Figure 3.9. psr byte fields.

MRS   copy program status register to a general-purpose register      Rd = psr
MSR   move a general-purpose register to a program status register    psr[field] = Rm
MSR   move an immediate value to a program status register            psr[field] = immediate

The c field controls the interrupt masks, Thumb state, and processor mode. Example 3.26 shows how to enable IRQ interrupts by clearing the I mask. This operation involves using both the MRS and MSR instructions to read from and then write to the cpsr.

EXAMPLE 3.26

The MRS first copies the cpsr into register r1. The BIC instruction clears bit 7 of r1. Register r1 is then copied back into the cpsr, which enables IRQ interrupts. You can see from this example that this code preserves all the other settings in the cpsr and only modifies the I bit in the control field.

This example is in SVC mode. In user mode you can read all cpsr bits, but you can only update the condition flag field f.
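The read-modify-write sequence that Example 3.26 describes can be sketched as plain bit manipulation in Python. The sketch assumes the usual psr byte layout (c = bits 7:0, x = bits 15:8, s = bits 23:16, f = bits 31:24); the function names `msr` and `enable_irq` and the sample cpsr value are hypothetical, and privilege checks are not modeled.

```python
# Byte regions selected by each psr field letter (assumed layout, see lead-in).
FIELD_MASKS = {"c": 0x000000FF, "x": 0x0000FF00, "s": 0x00FF0000, "f": 0xFF000000}
I_BIT = 1 << 7  # IRQ mask bit, located in the control field


def msr(psr: int, value: int, fields: str) -> int:
    """Model of psr[field] = value: only the selected byte regions change."""
    mask = 0
    for letter in fields:
        mask |= FIELD_MASKS[letter]
    return (psr & ~mask & 0xFFFFFFFF) | (value & mask)


def enable_irq(cpsr: int) -> int:
    r1 = cpsr                 # MRS step: copy the cpsr into r1
    r1 &= ~I_BIT              # BIC step: clear bit 7 of r1
    return msr(cpsr, r1, "c") # MSR step: write r1 back to the control field


cpsr_before = 0x600000D3      # made-up value: some flags set, I and F masks set
cpsr_after = enable_irq(cpsr_before)
```

Only bit 7 differs between `cpsr_before` and `cpsr_after`, which mirrors the point that the sequence preserves every other cpsr setting.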

3.5.1 COPROCESSOR INSTRUCTIONS

Coprocessor instructions are used to extend the instruction set. A coprocessor can either provide additional computation capability or be used to control the memory subsystem including caches and memory management. The coprocessor instructions include data processing, register transfer, and memory transfer instructions. We will provide only a short overview since these instructions are coprocessor specific. Note that these instructions are only used by cores with a coprocessor.

CDP       coprocessor data processing: perform an operation in a coprocessor
MRC MCR   coprocessor register transfer: move data to/from coprocessor registers
LDC STC   coprocessor memory transfer: load and store blocks of memory to/from a coprocessor

In the syntax of the coprocessor instructions, the cp field represents the coprocessor number between p0 and p15. The opcode fields describe the operation to take place on the coprocessor. The Cn, Cm, and Cd fields describe registers within the coprocessor. The coprocessor operations and registers depend on the specific coprocessor you are using. Coprocessor 15 (CP15) is reserved for system control purposes, such as memory management, write buffer control, cache control, and identification registers.

EXAMPLE 3.27

This example shows a CP15 register being copied into a general-purpose register.

Here CP15 register-0 contains the processor identification number. This register is copied into the general-purpose register r10.

3.5.2 COPROCESSOR 15 INSTRUCTION SYNTAX

CP15 configures the processor core and has a set of dedicated registers to store configuration information, as shown in Example 3.27. A value written into a register sets a configuration attribute, for example, switching on the cache.

CP15 is called the system control coprocessor. Both MRC and MCR instructions are used to read and write to CP15, where register Rd is the core destination register, Cn is the primary register, Cm is the secondary register, and opcode2 is a secondary register modifier. You may occasionally hear secondary registers called "extended registers."

As an example, here is the instruction to move the contents of CP15 control register c1 into register r1 of the processor core:

We use a shorthand notation for CP15 reference that makes referring to configuration registers easier to follow. The reference notation uses the following format:

The first term, CP15, defines it as coprocessor 15. The second term, after the separating colon, is the primary register. The primary register X can have a value between 0 and 15. The third term is the secondary or extended register. The secondary register Y can have a value between 0 and 15. The last term, opcode2, is an instruction modifier and can have a value between 0 and 7. Some operations may also use a nonzero value w of opcode1. We write these as CP15:w:cX:cY:Z.
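As an illustration of the shorthand, the following Python helper builds a reference string from the four terms and checks the stated ranges. It is a sketch under assumptions: the helper name `cp15_ref` is hypothetical, the plain form `CP15:cX:cY:Z` is inferred from the description of the terms because the format line itself is not reproduced here, and the 0 to 7 range for opcode1 is assumed to match that of opcode2.

```python
def cp15_ref(primary: int, secondary: int, opcode2: int, opcode1: int = 0) -> str:
    """Build the shorthand CP15 reference for a configuration register."""
    if not 0 <= primary <= 15:
        raise ValueError("primary register X must be between 0 and 15")
    if not 0 <= secondary <= 15:
        raise ValueError("secondary register Y must be between 0 and 15")
    if not 0 <= opcode2 <= 7:
        raise ValueError("opcode2 Z must be between 0 and 7")
    if not 0 <= opcode1 <= 7:
        raise ValueError("opcode1 w is assumed to be between 0 and 7")

    terms = ["CP15"]
    if opcode1 != 0:
        terms.append(str(opcode1))  # w appears only when it is nonzero
    terms += [f"c{primary}", f"c{secondary}", str(opcode2)]
    return ":".join(terms)


control_ref = cp15_ref(1, 0, 0)               # primary register c1
extended_ref = cp15_ref(7, 5, 0, opcode1=1)   # arbitrary example with nonzero opcode1
```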

URL: https://www.sciencedirect.com/science/article/pii/B9781558608740500046

Overview of the Cortex-M3

Joseph Yiu , in The Definitive Guide to the ARM Cortex-M3 (Second Edition), 2010

2.2 Registers

The Cortex-M3 processor has registers R0 through R15 (see Figure 2.2). R13 (the stack pointer) is banked, with only one copy of the R13 visible at a time.

FIGURE 2.2. Registers in the Cortex-M3.

2.2.1 R0–R12: General-Purpose Registers

R0–R12 are 32-bit general-purpose registers for data operations. Some 16-bit Thumb® instructions can only access a subset of these registers (low registers, R0–R7).

2.2.2 R13: Stack Pointers

The Cortex-M3 contains two stack pointers (R13). They are banked so that only one is visible at a time. The two stack pointers are as follows:

Main Stack Pointer (MSP): The default stack pointer, used by the operating system (OS) kernel and exception handlers

Process Stack Pointer (PSP): Used by user application code

The lowest 2 bits of the stack pointers are always 0, which means they are always word aligned.

2.2.3 R14: The Link Register

When a subroutine is called, the return address is stored in the link register.

2.2.4 R15: The Program Counter

The program counter is the current program address. This register can be written to control the program flow.

2.2.5 Special Registers

The Cortex-M3 processor also has a number of special registers (see Figure 2.3). They are as follows:

Program Status registers (PSRs)

Interrupt Mask registers (PRIMASK, FAULTMASK, and BASEPRI)

Control register (CONTROL)

FIGURE 2.3. Special Registers in the Cortex-M3.

These registers have special functions and can be accessed only by special instructions. They cannot be used for normal data processing (see Table 2.1).

Table 2.1. Special Registers and Their Functions

Register    Function
xPSR        Provide arithmetic and logic processing flags (zero flag and carry flag), execution status, and current executing interrupt number
PRIMASK     Disable all interrupts except the nonmaskable interrupt (NMI) and hard fault
FAULTMASK   Disable all interrupts except the NMI
BASEPRI     Disable all interrupts of specific priority level or lower priority level
CONTROL     Define privileged status and stack pointer selection

For more information on these registers, see Chapter 3.

URL: https://www.sciencedirect.com/science/article/pii/B9781856179638000053

Early Intel® Architecture

In Power and Performance, 2015

1.1.2 Registers

Aside from the four segment registers introduced in the previous section, the 8086 has eight general purpose registers, and two status registers.

The general purpose registers are divided into two categories. Four registers, AX, BX, CX, and DX, are classified as data registers. These data registers are accessible as either the full 16-bit register, represented with the X suffix, the low byte of the full 16-bit register, designated with an L suffix, or the high byte of the 16-bit register, delineated with an H suffix. For instance, AX would access the full 16-bit register, whereas AL and AH would access the register's low and high bytes, respectively.
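The relationship between a 16-bit data register and its two byte halves can be shown with simple masking and shifting. This Python sketch treats the register as a plain integer; the helper names `split_x`, `write_low`, and `write_high` are hypothetical and chosen only for illustration.

```python
def split_x(x: int) -> tuple:
    """Return (high byte, low byte) of a 16-bit register value, e.g. (AH, AL) for AX."""
    return (x >> 8) & 0xFF, x & 0xFF


def write_low(x: int, value: int) -> int:
    """Model a write to the L half: only bits 7:0 of the 16-bit register change."""
    return (x & 0xFF00) | (value & 0xFF)


def write_high(x: int, value: int) -> int:
    """Model a write to the H half: only bits 15:8 of the 16-bit register change."""
    return (x & 0x00FF) | ((value & 0xFF) << 8)


ax = 0x1234
ah, al = split_x(ax)            # the high and low bytes of AX
ax_low = write_low(ax, 0xFF)    # writing AL leaves AH untouched
ax_high = write_high(ax, 0xAB)  # writing AH leaves AL untouched
```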

The second classification of registers are the pointer/index registers. This includes the following four registers: SP, BP, SI, and DI. The SP register, the stack pointer, is reserved for usage as a pointer to the top of the stack. The SI and DI registers are typically used implicitly as the source and destination pointers, respectively. Unlike the data registers, the pointer/index registers are only accessible as full 16-bit registers.

As this categorization may indicate, the general purpose registers come with some guidance for their intended usage. This guidance is reflected in the instruction forms with implicit operands. Instructions with implicit operands, that is, operands which are assumed to be a certain register and therefore don't require that operand to be encoded, allow for shorter encodings for common usages. For convenience, instructions with implicit forms typically also have explicit forms, which require more bytes to encode. The recommended uses for the registers are as follows:

AX Accumulator

BX Data (relative to DS)

CX Loop counter

DX Data

SI Source pointer (relative to DS)

DI Destination pointer (relative to ES)

SP Stack pointer (relative to SS)

BP Base pointer of stack frame (relative to SS)

Aside from allowing for shorter instruction encodings, this guidance is also an aid to the programmer who, once familiar with the various register meanings, will be able to deduce the meaning of assembly, assuming it conforms to the guidelines, much faster. This parallels, to some degree, how variable names help the programmer reason about their contents. It's important to note that these are just suggestions, not rules.

Additionally, there are two status registers, the instruction pointer and the flags register.

The instruction pointer, IP, is also often referred to as the program counter. This register contains the memory address of the next instruction to be executed. Until 64-bit mode was introduced, the instruction pointer was not directly accessible to the programmer, that is, it wasn't possible to access it like the other general purpose registers. Despite this, the instruction pointer was indirectly accessible. Whereas the instruction pointer couldn't be modified through a MOV instruction, it could be modified by any instruction that alters the program flow, such as the CALL or JMP instructions.

Reading the contents of the instruction pointer was also possible by taking advantage of how x86 handles function calls. Transfer from one function to another occurs through the CALL and RET instructions. The CALL instruction preserves the current value of the instruction pointer, pushing it onto the stack in order to support nested function calls, and then loads the instruction pointer with the new address, provided as an operand to the instruction. This value on the stack is referred to as the return address. Whenever the function has finished executing, the RET instruction pops the return address off of the stack and restores it into the instruction pointer, thus transferring control back to the function that initiated the function call. Leveraging this, the programmer can create a special thunk function that would simply copy the return address off of the stack, load it into one of the registers, and then return. For example, when compiling Position-Independent-Code (PIC), which is discussed in Chapter 12, the compiler will automatically add functions that use this technique to obtain the instruction pointer. These functions are usually called __x86.get_pc_thunk.bx(), __x86.get_pc_thunk.cx(), __x86.get_pc_thunk.dx(), and so on, depending on which register the instruction pointer is loaded.
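The thunk technique can be traced with a toy CPU model in Python. This is a sketch only: the `Cpu` class, its list-based stack, and the addresses are hypothetical, and the 5-byte call length is an assumption that corresponds to a near CALL with a 32-bit displacement.

```python
class Cpu:
    """Minimal model: an instruction pointer, a stack, and one register."""

    def __init__(self, ip: int):
        self.ip = ip
        self.stack = []
        self.regs = {"bx": 0}

    def call(self, target: int, call_length: int) -> None:
        self.stack.append(self.ip + call_length)  # push the return address
        self.ip = target                          # load IP with the new address

    def ret(self) -> None:
        self.ip = self.stack.pop()                # pop the return address into IP


def get_pc_thunk_bx(cpu: Cpu) -> None:
    cpu.regs["bx"] = cpu.stack[-1]  # copy the return address from the top of the stack
    cpu.ret()                       # then return to the caller as usual


cpu = Cpu(ip=0x1000)                 # a CALL instruction located at 0x1000
cpu.call(target=0x2000, call_length=5)
get_pc_thunk_bx(cpu)                 # runs as the body of the called function
```

After the thunk returns, the register holds the address of the instruction following the CALL, which is exactly the instruction pointer value the caller resumes at.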

The second status register, the EFLAGS register, is comprised of 1-bit status and control flags. These bits are set by various instructions, typically arithmetic or logic instructions, to indicate certain conditions. These condition flags can then be checked in order to make decisions. For a list of the flags modified by each instruction, see the Intel SDM. The 8086 defined the following status and control bits in EFLAGS:

Zero Flag (ZF) Set if the result of the instruction is zero.

Sign Flag (SF) Set if the result of the instruction is negative.

Overflow Flag (OF) Set if the result of the instruction overflowed.

Parity Flag (PF) Set if the result has an even number of bits set.

Carry Flag (CF) Used for storing the carry bit in instructions that perform arithmetic with carry (for implementing extended precision).

Adjust Flag (AF) Similar to the Carry Flag. In the parlance of the 8086 documentation, this was referred to as the Auxiliary Carry Flag.

Direction Flag (DF) For instructions that either autoincrement or autodecrement a pointer, this flag chooses which to perform. If set, autodecrement, otherwise autoincrement.

Interrupt Enable Flag (IF) Determines whether maskable interrupts are enabled.

Trap Flag (TF) If set, the CPU operates in single-step debugging mode.
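To make the status flags concrete, the following Python sketch computes ZF, SF, CF, OF, and PF for a 16-bit addition. It is an illustration, not a full EFLAGS model: the function name `add16_flags` is hypothetical, AF and the control flags are left out, and PF is computed over the low byte of the result only, which is how x86 hardware evaluates parity.

```python
def add16_flags(a: int, b: int) -> tuple:
    """Return (result, flags) for a 16-bit ADD, with flags as a dict of booleans."""
    a &= 0xFFFF
    b &= 0xFFFF
    full = a + b
    result = full & 0xFFFF
    flags = {
        "ZF": result == 0,                       # result is zero
        "SF": bool(result & 0x8000),             # sign bit of the result
        "CF": full > 0xFFFF,                     # carry out of bit 15
        # Signed overflow: operands share a sign that the result does not have.
        "OF": bool(~(a ^ b) & (a ^ result) & 0x8000),
        "PF": bin(result & 0xFF).count("1") % 2 == 0,  # even parity of low byte
    }
    return result, flags


signed_wrap = add16_flags(0x7FFF, 0x0001)    # largest positive value plus one
unsigned_wrap = add16_flags(0xFFFF, 0x0001)  # largest unsigned value plus one
```

The two sample additions separate the cases: the first overflows as a signed operation without producing a carry, and the second produces a carry without a signed overflow.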

URL: https://www.sciencedirect.com/science/article/pii/B978012800726600001X

Intel® Pentium® Processors

In Power and Performance, 2015

2.2.3 Out-of-Order Execution

As discussed in Section 2.1.1, prior to the 80486, the processor handled one instruction at a time. As a result, the processor's resources remained idle while the currently executing instruction was not utilizing them. With the introduction of pipelining, the pipeline was partitioned to allow multiple instructions to coexist simultaneously. Therefore, when the currently executing instruction had finished with some of the processor's resources, the next instruction could begin utilizing them before the first instruction had completely finished executing. The introduction of μops expanded significantly on this concept, splitting instruction execution into smaller steps.

Each type of μop has a corresponding type of execution unit. The Pentium Pro has five execution units: two for handling integer μops, two for handling floating point μops, and one for handling memory μops. Therefore, up to five μops can execute in parallel. An instruction, divided into one or more μops, is not done executing until all of its corresponding μops have finished. Obviously, μops from the same instruction have dependencies upon one another so they can't all execute simultaneously. Therefore, μops from multiple instructions are dispatched to the execution units.

Taking advantage of the fine granularity of μops, out-of-order execution significantly improves utilization of the execution units. Up until the Pentium Pro, Intel processors executed in-order, meaning that instructions were executed in the same sequence as they were organized in memory. With out-of-order execution, μops are scheduled based on the available resources, as opposed to their ordering. As instructions are fetched and decoded, the resulting μops are stored in the Reorder Buffer. As execution units and other resources become available, the Reservation Station dispatches the corresponding μop to one of the execution units. Once the μop has finished executing, the result is stored back into the Reorder Buffer. Once all of the μops associated with an instruction have completed execution, the μops retire, that is, they are removed from the Reorder Buffer and any results or side-effects are made visible to the rest of the system. While instructions can execute in any order, instructions always retire in-order, ensuring that the programmer does not need to worry about handling out-of-order execution.

To illustrate the problem with in-order execution and the benefit of out-of-order execution, consider the following hypothetical situation. Assume that a processor has two execution units capable of handling integer μops and one capable of handling floating point μops. With in-order scheduling, the most efficient usage of this processor would be to intermix integer and floating point instructions following the two-to-one ratio. This would involve carefully scheduling instructions based on their instruction latencies, along with the latencies for fetching any memory resources, to ensure that when an execution unit becomes available, the next μop in the queue would be executable with that unit.

For example, consider four instructions scheduled on this example processor, three integer instructions followed by a floating point instruction. Assume that each instruction corresponds to one μop, that these instructions have no interdependencies, and that all three execution units are currently available. The first two integer instructions would be dispatched to the two available integer execution units, but the floating point instruction would not be dispatched, even though the floating point execution unit was available. This is because the third integer instruction, waiting for one of the two integer execution units to become available, must be issued first. This underutilizes the processor's resources. With out-of-order execution, the first two integer instructions and the floating point instruction would be dispatched together.
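The four-instruction scenario can be replayed with a very small dispatch model in Python. This is a sketch of a single dispatch round under the stated assumptions (one μop per instruction, no dependencies, all units free); the function name `dispatch` and the string labels are hypothetical, and real schedulers are far more involved.

```python
def dispatch(uops: list, units: dict, in_order: bool) -> list:
    """Return the indexes of the μops issued in one round.

    uops  : μop kinds in program order, e.g. ["int", "int", "int", "fp"]
    units : number of free execution units per kind
    """
    free = dict(units)  # copy so the caller's dictionary is not modified
    issued = []
    for index, kind in enumerate(uops):
        if free.get(kind, 0) > 0:
            free[kind] -= 1
            issued.append(index)
        elif in_order:
            break  # a stalled μop blocks everything queued behind it
        # out-of-order: skip the stalled μop and keep looking for runnable work
    return issued


program = ["int", "int", "int", "fp"]
units = {"int": 2, "fp": 1}
issued_in_order = dispatch(program, units, in_order=True)
issued_out_of_order = dispatch(program, units, in_order=False)
```

With in-order dispatch the floating point unit stays idle because the third integer μop stalls ahead of it, while the out-of-order variant fills all three units in the same round.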

In other words, out-of-order execution improves the utilization of the processor's resources. Additionally, because μops are scheduled based on available resources, some instruction latencies, such as an expensive load from memory, may be partially or completely masked if other work can be scheduled instead.

Register Renaming

From the instruction set perspective, Intel processors have eight general purpose registers in 32-bit mode, and sixteen general purpose registers in 64-bit mode; however, from the internal hardware perspective, Intel processors have many more registers. For example, the Pentium Pro has forty registers, organized in a structure referred to as a Physical Register File.

While this many extra registers might seem like a performance boon, especially if the reader is familiar with the performance gain received from the eight extra registers in 64-bit mode, these registers serve a different purpose. Rather than providing the process with more registers, these extra registers serve to handle data dependencies in the out-of-order execution engine.

When a value is stored into a register, a new register file entry is assigned to contain that value. Once another value is stored into that register, a different register file entry is assigned to contain this new value. Internal to the processor core, each data dependency on the first value will reference the first entry, and each data dependency on the second value will reference the second entry. Therefore, the out-of-order engine is able to execute instructions in an order that would otherwise be impossible due to false data dependencies.
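The renaming idea can be sketched as a mapping from architectural register names to physical register file entries. The Python below is a deliberately simplified model: the class name `RenameTable` is hypothetical, entries are handed out from a simple free list, and the recycling of entries when instructions retire is not modeled.

```python
class RenameTable:
    """Maps each architectural register to its most recently assigned physical entry."""

    def __init__(self, num_physical: int):
        self.free = list(range(num_physical))  # unassigned physical entries
        self.mapping = {}                      # architectural name -> physical entry

    def write(self, arch_reg: str) -> int:
        """A new value for arch_reg gets a fresh physical entry."""
        entry = self.free.pop(0)
        self.mapping[arch_reg] = entry
        return entry

    def read(self, arch_reg: str) -> int:
        """A consumer depends on whichever entry currently holds arch_reg."""
        return self.mapping[arch_reg]


table = RenameTable(num_physical=40)  # forty entries, as on the Pentium Pro
first_entry = table.write("eax")      # first value stored into eax
first_reader = table.read("eax")      # depends on the first value
second_entry = table.write("eax")     # second, unrelated value stored into eax
second_reader = table.read("eax")     # depends on the second value
```

Because the two writes land in different entries, the second write does not have to wait for readers of the first value, which is how the false dependency on the register name disappears.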

URL: https://www.sciencedirect.com/science/article/pii/B9780128007266000021

Load/store and branch instructions

Larry D. Pyeatt , William Ughetta , in ARM 64-Bit Assembly Language, 2020

iii.two AArch64 user registers

As shown in Fig. 3.2, the AArch64 ISA provides 31 general-purpose registers, which are called [Image 2] through [Image 3]. These registers can each store 64 bits of data. To use all 64 bits, they are referred to as [Image 4] through [Image 5] (capitalization is optional). To use only the lower (least significant) 32 bits, they are referred to as [Image 6]. Since each register has a 64-bit name and a 32-bit name, we use [Image 7] through [Image 8] to specify a register without specifying the number of bits. For example, when we refer to [Image 9], we are really referring to either [Image 10] or [Image 11].

Figure 3.2. AArch64 general purpose registers ([Image 1]) and special registers.

3.2.1 General purpose registers

The general-purpose registers are each used according to specific conventions. These rules are defined in the application binary interface (ABI). The AArch64 ABI is called AAPCS64. The difference between callee saved and caller saved registers will also be explained in Section 5.4.4.

Registers [Image 12] are used for passing arguments when calling a procedure or function. Registers [Image 13] are scratch registers and can be used at any time because no assumptions are made about what they contain. They are called scratch registers because they are useful for holding temporary results of calculations. Registers [Image 14] can also be used as scratch registers, but their contents must be saved before they are used, and restored to their original contents before the procedure exits.

Some of the registers have alternate names. For example, [Image 15] is also known as [Image 16]. Most of these alternate names are only of interest to people writing compilers and operating systems. However, two of these registers are of interest to all AArch64 programmers.

3.2.2 Frame pointer

The frame pointer, [Image 17], is used by high-level language compilers to track the current stack frame. This register can be helpful when the program is running under a debugger, and can sometimes help the compiler to generate more efficient code for returning from a subroutine. The GNU C compiler can be instructed to use [Image 17] as a general-purpose register by using the -fomit-frame-pointer command line option. The use of [Image 17] as the frame pointer is a programming convention. Some instructions (e.g. branches) implicitly alter the program counter, the link register, and even the stack pointer, so they are considered to be hardware special registers. As far as the hardware is concerned, the frame pointer is exactly the same as the other general-purpose registers, but AArch64 programmers use it for the frame pointer because of the ABI.

3.2.3 PSTATE register

The PSTATE register contains bits that indicate the status of the current process, including information about the results of previous operations. Fig. 3.3 shows all of its bits. The dashed lines indicate unused space that may be reserved for future AArch64 architectural extensions. The PSTATE register is actually a collection of independent fields, most of which are only used by the operating system. User programs make use of the first four bits, N, Z, C, and V. These are referred to as the condition flags field. Most instructions can modify these flags, and later instructions can use the flags to control their operation. Their meaning is as follows:

Negative:

This bit is set to one if the signed result of an operation is negative, and set to zero if the result is positive or zero.

Zero:

This bit is set to one if the result of an operation is zero, and set to zero if the result is non-zero.

Carry:

This bit is set to one if an add operation results in a carry out of the most significant bit, or if a subtract operation does not result in a borrow. For shift operations, this flag is set to the last bit shifted out by the shifter.

oVerflow:

For addition and subtraction, this flag is set if a signed overflow occurred.
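As a worked illustration of the four condition flags, the following Python sketch computes N, Z, C, and V for a 64-bit addition. The function name `adds64` is hypothetical and the model covers only the add case; it is meant to show how each flag follows from the operands and the result, not to reproduce the processor's implementation.

```python
MASK64 = (1 << 64) - 1


def adds64(a: int, b: int) -> tuple:
    """Return (result, (N, Z, C, V)) for a flag-setting 64-bit addition."""
    a &= MASK64
    b &= MASK64
    full = a + b
    result = full & MASK64
    n = result >> 63                  # sign bit of the result
    z = int(result == 0)              # result is zero
    c = full >> 64                    # carry out of the most significant bit
    # Signed overflow: both operands share a sign that the result does not have.
    v = (~(a ^ b) & (a ^ result) & MASK64) >> 63
    return result, (n, z, c, v)


small_sum = adds64(1, 2)                  # no flags set
unsigned_wrap = adds64(MASK64, 1)         # wraps to zero with a carry out
signed_wrap = adds64((1 << 63) - 1, 1)    # largest positive value plus one
```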

Figure 3.3

Figure three.3. Fields in the PSTATE register.

three.ii.4 Link register

The procedure link annals,

Image 5
, is used to hold the return address for subroutines. Certain instructions cause the program counter to exist copied to the link register, then the program counter is loaded with a new address. These branch-and-link instructions are briefly covered in Section 3.five and in more detail in Section 5.4. The link annals could theoretically be used as a scratch register, but its contents are modified past hardware when a subroutine is chosen, in gild to save the correct return address. Using
Image 5
as a full general-purpose register is unsafe and is strongly discouraged.

3.two.v Stack pointer

The program stack was introduced in Section 1.4. The stack arrow,

Image 19
, is used to concord the accost where the stack ends. This is commonly referred to as the top of the stack, although on well-nigh systems the stack grows downwards and the stack pointer really refers to the everyman accost in the stack. The address where the stack ends may change when registers are pushed onto the stack, or when temporary local variables (automated variables) are allocated or deleted. The utilise of the stack for storing automatic variables is described in Affiliate 5. The stack pointer can only be modified or read past a modest prepare of instructions.

3.2.6 Naught register

The zero register can be referred to as a 64-bit register, xzr, or a 32-bit register, wzr. It always has the value zero. Most instructions can use the zero register as an operand, even as a destination register. If this is the case, the instruction will not change the destination register. However, it can still have side effects, including updating the PSTATE flags based on the ALU operation and incrementing a register in pre-indexed or post-indexed addressing. The zero register cannot always be used as an operand. It shares the same binary encoding with the stack pointer register, sp, which is the value 31 (binary 11111). Some instructions can access the zero register, while others can access the stack pointer.
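The behavior of the zero register as a source and as a destination can be sketched with a small register-file model. The class and method names are hypothetical, and only the N and Z flags are tracked. The subs method is a simplified stand-in for a flag-setting subtract; using the zero register as its destination gives the effect of a compare, since the result is discarded but the flags are still updated.

```python
MASK64 = (1 << 64) - 1
ZR = 31  # register number shared with the stack pointer encoding


class RegFile:
    """Toy register file: x0..x30 plus a hard-wired zero register."""

    def __init__(self):
        self.x = [0] * 31
        self.flags = {"N": 0, "Z": 0}

    def read(self, n):
        # Reading register 31 as the zero register always yields 0.
        return 0 if n == ZR else self.x[n]

    def write(self, n, value):
        # Writes to the zero register are silently discarded.
        if n != ZR:
            self.x[n] = value & MASK64

    def subs(self, rd, rn, rm):
        """Subtract and set flags, even when rd is the zero register."""
        result = (self.read(rn) - self.read(rm)) & MASK64
        self.flags["N"] = result >> 63
        self.flags["Z"] = int(result == 0)
        self.write(rd, result)


rf = RegFile()
rf.write(0, 5)
rf.write(1, 5)
rf.subs(ZR, 0, 1)   # compare x0 with x1: result dropped, Z flag set
```

Using the zero register as a source is just as useful: rf.subs(2, ZR, 0) computes 0 - x0, which negates x0 into x2.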

3.2.7 Program counter

The program counter, pc, always contains the address of the next instruction that will be executed. The processor increments this register by four, automatically, after each instruction is fetched from memory. By moving an address into this register, the programmer can cause the processor to fetch the next instruction from the new address. This gives the programmer the ability to jump to any address and begin executing code there. Only a small number of instructions can access the pc directly. For example, instructions that create a PC-relative address, such as adr, and instructions which load a register, such as ldr, are able to access the program counter directly.
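The fetch sequence described above can be sketched as a loop over a toy program. The opcode names ("nop", "adr", "b", "halt"), the tuple encoding, and the run function are all invented for this illustration and do not correspond to real machine encodings. The sketch shows the automatic increment by four, a branch that replaces the program counter, and a PC-relative address computed from the current instruction's address.

```python
def run(program, start, max_steps=100):
    """Execute a toy program; return the fetched addresses and registers."""
    pc = start
    trace = []
    regs = {}
    for _ in range(max_steps):
        if pc not in program:
            break
        op, *args = program[pc]
        trace.append(pc)
        next_pc = pc + 4                  # default: next sequential instruction
        if op == "b":
            next_pc = args[0]             # branch: load pc with a new address
        elif op == "adr":
            regs[args[0]] = pc + args[1]  # PC-relative address
        elif op == "halt":
            break
        pc = next_pc
    return trace, regs


program = {
    0x00: ("nop",),
    0x04: ("adr", "x0", 8),   # x0 = 0x04 + 8
    0x08: ("b", 0x10),        # skip the instruction at 0x0c
    0x0C: ("nop",),
    0x10: ("halt",),
}
trace, regs = run(program, start=0x00)
```

The trace skips 0x0c because the branch overwrote the program counter before the sequential increment could take effect.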

URL:

https://www.sciencedirect.com/science/article/pii/B9780128192214000109

Knights Landing architecture

Jim Jeffers, ... Avinash Sodani, in Intel Xeon Phi Processor High Performance Programming (Second Edition), 2016

Integer execution unit

The IEU executes integer μops, which are defined as those that operate on general-purpose registers R0–R15 (i.e., RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI, R8…R15). There are two IEUs in the core. Each IEU contains a 12-entry reservation station (RS) that issues one μop per cycle. The integer RSes are fully out-of-order in their scheduling. Most operations have 1-cycle latency and are supported by both IEUs, but a few operations have 3- or 5-cycle latency (e.g., multiplies) and are only supported by one of the IEUs.

URL:

https://www.sciencedirect.com/science/article/pii/B9780128091944000041

Computer Data Processing Hardware Architecture

Paul J. Fortier, Howard E. Michel, in Computer Systems Performance Evaluation and Prediction, 2003

2.3.1 Instruction types

Based on the number of registers available and the configuration of these registers, several types of instructions are possible. For example, if many registers are available, as would be the case in a stack computer, no address computations are needed and the instruction, therefore, can be much shorter both in format and execution time required. On the other hand, if there are no general registers and all computations are performed by memory movements of data, then instructions will be longer and require more time due to operand fetching and storage. The following are representative of instruction types:

0-address instructions: This type of instruction is found in machines where many general-purpose registers are available. This is the case in stack machines and in some reduced instruction set machines. Instructions of this type perform their function totally using registers. If we have three general registers, A, B, and C, a typical format would have the form:

(2.1) R[A] ← R[B] operator R[C]

which indicates that the contents of registers B and C have the operator (such as add, subtract, multiply, etc.) performed on them, with the result stored in general register A. Similarly, we could describe instructions that use just one or two registers as follows:

(2.2) R[B] ← R[B] operator R[C]

or

(2.3) operator R[C]

which represent two-register and one-register instructions, respectively. In the two-register case one of the operand registers is also used as the result register. In the single-register case the operand register is also the result register. The increment instruction is an example of a one-register instruction. This type of instruction is found in all machines.

1-address instructions: In this type of instruction a single memory address is found in the instruction. If another operand is used, it is typically an accumulator or the top of a stack in a stack computer. The typical format of these instructions has the form:

(2.4) operator M[address]

where the contents of the named memory address have the named operator performed on them in conjunction with an implied special register. An example of such an instruction could be as follows:

(2.5) Move M[100]

or

(2.6) Add M[100]

which moves the contents of memory location 100 into the ALU's accumulator or adds the contents of memory address 100 with the accumulator and stores the result in the accumulator. If the result must be stored in memory, we would need a store instruction:

(2.7) Store M[100]

1-and-1/2-address instructions: Once we have an architecture that has some general-purpose registers, we can provide more advanced operations combining memory contents and the general registers. The typical instruction performs an operation on a memory location's contents with that of a general register. For example, we could add the contents of a memory location with the contents of a general register, A, as shown:

(2.8) Add R[A], M[100]

This instruction typically stores the result in the first named location or register in the instruction. In this example it is register A.

2-address instructions: Two-address instructions use two memory locations to perform an instruction, for instance, a block move of N words from one location in memory to another, or a block add. The move may appear as follows:

(2.9) Move N, M[100], M[1000]

2-and-1/2-address instructions: This format uses two memory locations and a general register in the instruction. Typical of this type of instruction is an operation involving two memory locations storing the result in a register, or an operation with a general register and a memory location storing the result in another memory location, as shown:

(2.10) R[A] ← M[100] operator M[1000]
       M[1000] ← M[100] operator R[A]

3-address instructions: Another less common form of instruction format is the three-address instruction. These instructions involve three memory locations: two used for operands and one as the results location. A typical format is shown:

(2.11) M[200] ← M[100] operator M[300]
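The instruction formats listed above can be sketched as methods on a toy machine, one per format. The Machine class, its method names, and the choice of operators are assumptions for illustration only; the register-transfer behavior of each method follows the corresponding numbered form.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


class Machine:
    """Toy machine with registers A, B, C, an accumulator, and memory."""

    def __init__(self):
        self.R = {"A": 0, "B": 0, "C": 0}
        self.M = {}      # sparse memory: address -> value
        self.acc = 0     # implied register for 1-address instructions

    # 0-address (registers only): R[dst] <- R[src1] op R[src2]
    def reg_op(self, op, dst, src1, src2):
        self.R[dst] = OPS[op](self.R[src1], self.R[src2])

    # 1-address: one memory operand plus the implied accumulator
    def move(self, addr):
        self.acc = self.M.get(addr, 0)

    def add(self, addr):
        self.acc += self.M.get(addr, 0)

    def store(self, addr):
        self.M[addr] = self.acc

    # 1-and-1/2-address: R[reg] <- R[reg] + M[addr]
    def add_reg_mem(self, reg, addr):
        self.R[reg] += self.M.get(addr, 0)

    # 2-address: block move of n words from src to dst
    def block_move(self, n, src, dst):
        for i in range(n):
            self.M[dst + i] = self.M.get(src + i, 0)

    # 3-address: M[dst] <- M[src1] op M[src2]
    def mem_op(self, op, dst, src1, src2):
        self.M[dst] = OPS[op](self.M.get(src1, 0), self.M.get(src2, 0))


m = Machine()
m.M[100] = 7
m.move(100)     # acc = 7
m.add(100)      # acc = 14
m.store(200)    # M[200] = 14
```

The 1-address sequence needs three instructions to double a value in memory, whereas a single 3-address instruction such as m.mem_op("add", 200, 100, 100) does the same work at the cost of encoding three addresses.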

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781555582609500023

Advanced Encryption Standard

Tom St Denis , Simon Johnson , in Cryptography for Developers, 2007

x86 Performance

The AMD Opteron achieves a nice boost due to the addition of the eight new general-purpose registers. If we examine the GCC output for x86_64 and x86_32 platforms, we can see a nice difference between the two (Table 4.2).

Table 4.2. First Quarter of an AES Round

Both snippets accomplish (at least) the first MixColumns step of the first round in the loop. Note that the compiler has scheduled part of the second MixColumns during the first to achieve higher parallelism. Even though in Table 4.2 the x86_64 code looks longer, it executes faster, partially because it processes more of the second MixColumns in roughly the same time and makes good use of the extra registers.

From the x86_32 side, we can clearly see various spills to the stack (in bold). Each of those costs us three cycles (at a minimum) on the AMD processors (two cycles on most Intel processors). The 64-bit code was compiled to have zero stack spills during the main loop of rounds. The 32-bit code has about 15 stack spills during each round, which incurs a penalty of at least 45 cycles per round or 405 cycles over the course of the 9 full rounds.

Of course, we do not see the full penalty of 405 cycles, as more than one opcode is being executed at the same time. The penalty is also masked by parallel loads that are also on the critical path (such as loads from the Te tables or round key). Those delays occur anyway, so the fact that we are also loading from (or storing to) the stack at the same time does not add to the cycle count.

In either case, we can improve upon the code that GCC (4.1.1 in this case) emits. In the 64-bit code, we see a pairing of "shrq $24, %rdx" and "andl $255,%edx". The andl operation is not required since only the lower 32 bits of %rdx are guaranteed to have anything in them. This potentially saves up to 36 cycles over the course of nine rounds (depending on how the andl operation pairs up with other opcodes).

With the 32-bit code, the double loads from (%esp) (lines 2 and 3) incur a needless three-cycle penalty. In the case of the AMD Athlon (and Opterons), the load store unit will short the load operation (in certain circumstances), but the load will always take at least three cycles. Changing the second load to "movl %edx,%ebx" means that we stall waiting for %edx, but the penalty is only one cycle, not three. That change alone will free up at most 9*2*4 = 72 cycles from the nine rounds.
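The cycle estimates quoted in this section are simple products, and a short sketch makes the terms explicit. The function name is hypothetical. For the 72-cycle figure, the text gives only the product 9*2*4; reading the factors as nine rounds, two cycles saved per replaced load (three minus one), and four such loads per round is an assumption.

```python
def total_cycles(cycles_per_event, events_per_round, rounds=9):
    """Upper-bound cycle count, ignoring overlap between opcodes."""
    return cycles_per_event * events_per_round * rounds


# 15 stack spills per round at 3 cycles each, over 9 full rounds.
spill_penalty_per_round = total_cycles(3, 15, rounds=1)
spill_penalty = total_cycles(3, 15)

# Replacing a 3-cycle load with a 1-cycle register move saves 2 cycles;
# four such replacements per round is an assumed reading of "9*2*4".
load_savings = total_cycles(3 - 1, 4)
```

These are worst-case bounds. As the text notes, parallel execution hides part of the penalty, so measured differences will be smaller.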

URL:

https://www.sciencedirect.com/science/article/pii/B9781597491044500078

Embedded Processor Architecture

Peter Barry , Patrick Crowley , in Modern Embedded Computing, 2012

Register Operands

Source and destination operands can be any of the following registers depending on the instruction being executed:

32-bit general-purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, or EBP)

16-bit general-purpose registers (AX, BX, CX, DX, SI, DI, SP, or BP)

8-bit general-purpose registers (AH, BH, CH, DH, AL, BL, CL, DL)

Segment registers

EFLAGS register

MMX

Control (CR0 through CR4)

System Table registers (such as the Interrupt Descriptor Table register)

Debug registers

Machine-specific registers

On RISC embedded processors, there are generally fewer limitations in the registers that can be used by instructions. IA-32 often reduces the registers that can be used as operands for certain instructions.
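The 32-bit, 16-bit, and 8-bit general-purpose registers in the list above are overlapping views of the same storage: AX is the low 16 bits of EAX, and AH and AL are the high and low bytes of AX. A minimal sketch of that aliasing follows. The Reg32 class and its attribute names (e, x, h, l) are hypothetical, and the model assumes IA-32 behavior in which writing a narrower view leaves the remaining bits unchanged.

```python
class Reg32:
    """Toy model of one IA-32 register such as EAX with its sub-registers."""

    def __init__(self, value=0):
        self.e = value & 0xFFFFFFFF      # full 32-bit view (for example EAX)

    @property
    def x(self):                         # 16-bit view (for example AX)
        return self.e & 0xFFFF

    @x.setter
    def x(self, value):
        self.e = (self.e & 0xFFFF0000) | (value & 0xFFFF)

    @property
    def h(self):                         # high byte of the 16-bit view (AH)
        return (self.e >> 8) & 0xFF

    @h.setter
    def h(self, value):
        self.e = (self.e & 0xFFFF00FF) | ((value & 0xFF) << 8)

    @property
    def l(self):                         # low byte of the 16-bit view (AL)
        return self.e & 0xFF

    @l.setter
    def l(self, value):
        self.e = (self.e & 0xFFFFFF00) | (value & 0xFF)


eax = Reg32(0x12345678)
ax, ah, al = eax.x, eax.h, eax.l   # three views of the same bits
eax.l = 0xFF                       # writing AL changes only the low byte
```

Only four of the eight 32-bit registers have byte-sized views in the list above, which is one example of the operand restrictions the text attributes to IA-32.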

URL:

https://www.sciencedirect.com/science/article/pii/B9780123914903000059