Slides with (**) are noted for core lecture coverage
Other slides may be covered in discussion sessions
Abstraction: Separation of hardware and software by definition of an Instruction Set or Instruction Set Architecture (ISA). An ISA includes a specification of the set of opcodes and the native commands implemented by a processor (https://en.wikipedia.org/wiki/Instruction_set)
Tools (compilers and assemblers) allow designers to create applications and designs in a high-level programming language
Reuse, in general, is the ability to save design effort across multiple applications. Microprocessors themselves, the cores and peripherals attached to them through bus connections, and high-level software for microprocessors all represent reuse enabled by the use of microprocessors
Scalability: Microprocessors have themselves scaled to larger word lengths and have also been involved in scaling to multi-processor systems and SoCs
A compiler or assembler converts source code into object code
Object Code file contains opcodes (instructions) and constants, and the supporting information to organize them in memory
A linker combines object code files into a single standalone executable file. It resolves unknowns such as addresses of variables or entry points for routines.
At minimum, a processor requires a processing unit and memory for instructions
A loader (part of OS) is responsible for loading the executable and libraries into instruction memory for execution.
It is typical to have separate memory spaces defined for program memory, mutable data memory, and constants
The processor fetches instructions from memory and uses its datapath to execute them.
C program with function calls, arrays, and global variables:
A cross-compiler is required to create an executable for a processor different from the processor used to run the compiler.
We will make use of the GNU compiler toolchain.
The following commands run arm-linux-gcc. The default behavior covers both compiling and linking.
The command to generate the ARM assembly program is as follows.
$> /usr/local/arm/bin/arm-linux-gcc -c -S -O2 gcd.c -o gcd.s
The command to generate the ARM ELF executable is as follows.
$> /usr/local/arm/bin/arm-linux-gcc -O2 gcd.c -o gcd
$> /usr/local/arm/bin/arm-linux-objdump -d gcd
A pipelined processor trades latency for throughput. It typically uses higher clock rates, exploits hardware parallelism, and works on more than one instruction at a time.
5-stage Pipeline shown
Also common to see a 3-stage pipeline with Execute, Buffer, and Writeback as a single stage
The 5-stage pipeline has more cycle latency, but faster clock speeds mitigate this
pipeline breaks up instruction processing into multiple stages
multiple instructions are processed at any given time
Instruction Fetch: The processor retrieves the instruction addressed by the program counter register from the instruction memory.
Instruction Decode: The processor examines the instruction opcode. For the case of a branch-instruction, the program counter will be updated. For the case of a compute-instruction, the processor will retrieve the processor data registers that are used as operands.
Execute: The processor executes the computational part of the instruction on a datapath. In case the instruction will need to access data memory, the execute stage will prepare the address for the data memory.
Buffer: In this stage, the processor may access the data memory, for reading or for writing. In case the instruction does not need to access data memory, the data will be forwarded to the next pipeline stage.
Write Back: In the final stage of the pipeline, the processor registers are updated.
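Ignoring hazards, the latency/throughput trade described above can be quantified: n instructions on a k-stage pipeline finish in k + n - 1 cycles, since the first instruction fills the pipe and one instruction retires per cycle afterward. A small sketch (the helper name is ours):

```c
/* Cycles for n instructions on a k-stage pipeline, assuming no stalls;
   hazards and branches add cycles on top of this. */
long pipeline_cycles(long k, long n) {
    return k + n - 1;
}
```

For 100 instructions on the 5-stage pipeline this gives 104 cycles, versus 500 if each instruction had to complete before the next began.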
hazards related to branching and dependencies of instructions on each other lead to irregularity in timing
bgeid, Branch Immediate if Greater or Equal with Delay, is the delayed-branch form of bgei, Branch Immediate if Greater or Equal. The instruction j = j + i sits in the branch delay slot; because it is in a branch delay slot, it will be executed regardless of whether the conditional branch is taken or not. This results in i reaching 99 and executing the one-line body in the final condition, rather than i reaching 100 and bypassing the body in the last iteration.

Data hazards occur when an instruction must wait on the results of a previous instruction. Interlock hardware can resolve some dependencies by data forwarding, a hardware-supported process by which results are made directly available to another instruction in the pipeline.
However, sometimes a hazard is unavoidable. In the example, the add must wait for the slow memory read ldr. In general, data hazards in a pipeline are predictable and may be analyzed at compile time, though caching and other variable-time memory access patterns are exceptions to this.
Predictable timing may be important for an embedded design, sometimes more important than raw speed. Understanding the hardware pipeline, hazards, and interlock hardware is a critical aspect of understanding timing (especially what is predictable and what is not)
Knowing the hardware can lead to optimized software implementations, sometimes requiring analysis or coding in assembly
Use of caching is one key source of time variation in otherwise consistent loops
Understanding the OS is another concern.
A real-time system is characterized by the ability to make time guarantees to complete tasks. A real-time OS only schedules processes with achievable deadlines
Mapping from C data types to physical memory locations is affected by several factors.
Checking Endianness
A similar program can be written to examine alignment.
Consider a struct variable
typedef struct {char a; int b;} test_t; test_t test;
Define char *cPtr, assign (point) it to the address of your struct (use explicit casting), and print its dereferenced value as you advance the pointer forward in memory. This can be useful to understand struct packing compiler options.
ARM is bi-endian – it can be configured to work either way
Pg 208 code listing shows an example of pulling data from memory to registers, operating on them, and then sending data back to memory. As can be seen from the disassembly, the use of memory for an accumulator is very inefficient.
void accumulate(int *c, int a[10]) {
  int i;
  *c = 0;
  for (i = 0; i < 10; i++)
    *c += a[i];
}
/usr/local/arm/bin/arm-linux-gcc -O2 -c -S accumulate.c
Limited control of the mapping of variables in the memory hierarchy is provided in C. It is offered through storage class specifiers and type qualifiers. Important ones are shown in Table 7.2. Some examples provided here:
These are important for access to memory mapped interfaces, which are hardware software interfaces that appear as memory locations to software. We discussed this in Chapter 11
/usr/local/arm/bin/arm-linux-gcc -O2 -c accumulate.c -o accumulate.o
/usr/local/arm/bin/arm-linux-objdump -d accumulate.o
C:
int accumulate(int a[10]) {
  int i;
  int c = 0;
  for (i = 0; i < 10; i++)
    c += a[i];
  return c;
}

int a[10];
int one = 1;

int main() {
  return one + accumulate(a);
}
00000000 <accumulate>:
   0: e3a01000 mov r1, #0
   4: e1a02001 mov r2, r1
   8: e7903102 ldr r3, [r0, r2, lsl #2]
   c: e2822001 add r2, r2, #1
  10: e3520009 cmp r2, #9
  14: e0811003 add r1, r1, r3
  18: c1a00001 movgt r0, r1
  1c: c1a0f00e movgt pc, lr
  20: ea000000 b 8 <accumulate+0x8>

00000024 <main>:
  24: e52de004 str lr, [sp, #-4]!
  28: e59f0014 ldr r0, [pc, #20] ; 44 <main+0x20>
  2c: ebfffffe bl 0 <accumulate>
  30: e59f2010 ldr r2, [pc, #16] ; 48 <main+0x24>
  34: e5923000 ldr r3, [r2]
  38: e0833000 add r3, r3, r0
  3c: e1a00003 mov r0, r3
  40: e49df004 ldr pc, [sp], #4
text pg 210: (slightly modified)
Compiling without optimizations to see Full-Fledged Stack Frame:
/usr/local/arm/bin/arm-linux-gcc -c -S accumulate.c
Compiling WITHOUT OPTIMIZATIONS utilizes a full-fledged stack frame which conservatively uses main memory to transfer data instead of attempting to use registers (such as to pass arguments)
pg. 212:
The instructions on lines 2 and 3 are used to create the stack frame. On line 2, the frame pointer (fp), stack pointer (sp), link register or return address (lr), and current program counter (pc) are pushed onto the stack. The single instruction stmfd is able to perform multiple transfers (m), and it grows the stack downward (fd). These four elements take up 16 bytes of stack memory.
On line 3, the frame pointer is made to point to the first word of the stack frame. All variables stored in the stack frame will now be referenced based on the frame pointer fp. Since the first four words in the stack frame are already occupied, the first free word is at address fp - 16, the next free word is at address fp - 20, and so on. These addresses may be found in Listing 7.6.
The following local variables of the function accumulate are stored within the stack frame: the base address of a, the variable i, and the variable c. Finally on line 32, a return instruction is shown. With a single instruction, the frame pointer fp, the stack pointer sp, and the program counter pc are restored to the values just before calling the accumulate function.
Ex: static int i;
The heap area is shared by all threads, shared libraries, and dynamically loaded modules in a process and supports dynamic memory allocation
Stack is a LIFO structure supported in hardware by a stack pointer to track the top of the stack
A common memory organization has the stack growing toward (and possibly into) the heap area, while the heap has space allocated closer and closer to the stack as more memory is requested
A collision occurs, for instance, when the stack grows into the heap space; stack frames and heap variables there can be corrupted. This is a common cause of random-like behavior of the program counter, as corrupted return address values are read from the stack. This can be seen when observing a running program using a hardware debugging interface. It is very difficult to identify this problem through traditional debugging, and only experience will tell you where to look for an explanation.
Next we will discuss examining code section size.
toolchains include a size utility (e.g. arm-linux-size) to show the static size of a program – the amount of memory required to store instructions, constants, and global variables
The size utility can analyze the executable as well, which includes the linked libraries.
Libraries have a large impact on code size (and execution time) and should be selected with care in resource-constrained environments.
Often there are low-footprint/lightweight alternatives provided for standard functions (e.g., printf)
Note, the size utility does not show dynamic memory usage, and cannot predict the amount of stack or heap required since they are not determined by code compilation
A programmer must select the size for these based on factors like expected function call depth and the number of local variables in those functions.
Reports on the text, data, and bss sections, along with additional sections like debugging info, can be listed using the -h flag of objdump to get a section header listing from the ELF file
Symbols may be viewed using the -t flag with the objdump utility
Can also examine symbols using the -Wl,-Map=.. option with the linker to see the linker map file. This includes a listing of the object files used to create the executable.
Provided by the -S flag of the compiler (gcc)
Creates modulo.s:
Also provided from disassembly of object or executable using objdump utility with the -D flag
Being able to investigate assembly code (even for processors foreign to you) enables you to
From assembly, we can identify a similar optimization that would otherwise be directly described in C using pointer arithmetic
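For the accumulate example, the optimized assembly walks an address through the array; the equivalent C written with pointer arithmetic (the function name is ours) would be:

```c
/* Pointer-arithmetic form of accumulate: advance a pointer through the
   array instead of recomputing a[i] from the base address each time. */
int accumulate_ptr(const int *a, int n) {
    int c = 0;
    const int *end = a + n;
    while (a < end)
        c += *a++;      /* read, accumulate, advance */
    return c;
}
```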
https://en.wikipedia.org/wiki/Application_binary_interface
An embedded-application binary interface (EABI) specifies standard conventions for file formats, data types, register usage, stack frame organization, and function parameter passing of an embedded software program.
Compilers that support a given EABI create object code that is compatible with code generated by other compilers supporting the same EABI, allowing developers to link libraries generated with one compiler with object code generated with another compiler. Developers writing their own assembly language code may also use the EABI to interface with assembly generated by a compliant compiler.
The main differences between an EABI and an ABI for general-purpose operating systems are that privileged instructions are allowed in application code, dynamic linking is not required (sometimes it is completely disallowed), and a more compact stack frame organization is used to save memory. The choice of EABI can affect performance.
Widely used EABIs include PowerPC, ARM EABI2 and MIPS EABI.
Many processors can be simulated with an instruction-set simulator, a simulator designed to simulate the behavior of a specific processor
A cosimulation platform is one that supports different description languages/formats or different types of simulation.
Book provides a discussion on gplatform.
#include <stdio.h>

int gcd(int a, int b) {
  while (a != b) {
    if (a > b) a = a - b;
    else b = b - a;
  }
  return a;
}

void instructiontrace(unsigned a) {
  asm("swi 514");
}

int main() {
  int a;
  instructiontrace(1);
  a = gcd(6, 8);
  instructiontrace(0);
  printf("GCD = %d\n", a);
  return 0;
}
/usr/local/arm/bin/arm-linux-gcc -static -S gcd.c -o gcd.S
Fig 7.12 Mapping of address 0x8524 in a 32-set, 16-line, 32-bytes-per-line set-associative cache
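The split implied by the figure can be checked with a few one-line helpers (names are ours). With 32 bytes per line, the low 5 bits are the byte offset; the next 5 bits select one of the 32 sets; the remaining high bits are the tag. The 16 lines per set (the associativity) do not affect the split.

```c
/* Address split for a cache with 32-byte lines and 32 sets. */
unsigned cache_offset(unsigned addr) { return addr & 0x1Fu; }         /* bits 4..0 */
unsigned cache_set(unsigned addr)    { return (addr >> 5) & 0x1Fu; }  /* bits 9..5 */
unsigned cache_tag(unsigned addr)    { return addr >> 10; }           /* bits 31..10 */
```

For address 0x8524 this gives offset 0x04, set 0x09, and tag 0x21.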
HDL simulation can replace an instruction-set simulator and provide more detail, but there are caveats
Need a (detailed) model to perform detailed simulation; this can be a problem with complex, protected IP
Typically slower – though more detail is available, a higher level of abstraction may be appropriate
HDL simulation can be more cumbersome, such as loading program memory. A dedicated tool makes this process manageable
Using a uniform HDL simulation platform to replace cosimulation represents a loss of abstraction for analysis and understanding, such as the hardware – software divide where different types of analysis and tools may be appropriate
(**)