SecureRISC Instruction Set Architecture

Documentation Outline

This document is organized as successive expositions at increasing levels of detail, to give the reader an idea of the motivations and high-level differences from conventional processor architectures, eventually getting down to the detailed definitions that direct SecureRISC processor execution. So if the introductory material seems a little vague, that is because it attempts to sketch an overall context into which the details are later fit.

Work In Progress Documentation

Most of this document has not yet incorporated the planned SecureRISC2 changes. Over time I will update the figures and text. At this point the only changes are one description of the new pointer format and the addition of the XRs.

Table of Contents

Introduction

SecureRISC is my attempt to explore my thoughts on a security-conscious Instruction Set Architecture (ISA) appropriate for server-class systems, but which, with modern process technology (e.g. 5 nm), could even be used for IoT computing, given that the die area for a single such processor is a small fraction of one mm². I start with the assumption that the processor hardware should enforce bounds checking and that the virtual memory system should use older, more sophisticated security measures, such as those found in Multics, including rings, segmentation, and discretionary and non-discretionary access control. I also propose a new block-structured instruction set that allows for better Control Flow Integrity (CFI) and performance. Modern processors have various structures that are updated during program execution to improve performance, such as Caches, TLBs, Branch Target predictors (BTB), Return Address predictors (RAS), Conditional Branch predictors, Indirect Jump and Call predictors, and so on. In SecureRISC a few of these are moved into the ISA for performance and security. In particular, the BTB is made explicit in SecureRISC in the form of basic block descriptors, and the RAS is made explicit in the form of a return address stack in virtual memory, which is kept secure for purposes of CFI.

Perhaps the most controversial aspect of SecureRISC is the use of tags on words in memory and registers. The Basic Block descriptors may be more unusual, and I think the reader will come to appreciate that aspect of the proposal, but the reader may in the end not find memory tags convincing. I went in this direction because I thought tags were synergistic between security, support for Garbage Collection (itself a security advantage), and support for dynamically typed languages. The first reason was that tags allowed me to associate sizes with pointers with low overhead. Unlike CHERI, pointers have only a size, and not bottom and top values encoded. As a result SecureRISC is more suited to situations where indexing from a base is used rather than incrementing and decrementing pointers, and so SecureRISC is better suited to languages other than C++, primarily ones that emphasize array indexing over pointer arithmetic. My expectation is that running some C++ code on SecureRISC would be possible with bounds checking, but pointer-oriented C++ code would fail bounds checking. I suspect it would be a better target for Rust, Swift, or Julia. I have reserved a tag for C++ pointers, but using these would represent a less secure mode of operation. The supervisor would need to enable on a per-process basis whether C++ pointers can be used; if disabled they would cause exceptions. A secure system might only allow C++ pointers for applications without internet connectivity.

Background

The original motivation for block-structured ISAs was ILP studies that I did back in 1997 that showed that instruction fetch was the limiting factor in ILP. This was before modern branch prediction, e.g. TAGE, so that result may no longer be true. The idea was that instruction fetch is like linked list processing, with parsing at each list node to find the next link. I wanted to replace linked lists with vectors, but couldn’t figure out how, and settled for reducing the parsing at each list node. I still feel that this is worthwhile, but the exact tradeoffs might require updating older work in this area. The best validation of this dates from 2007, when Professor Christoforos Kozyrakis got his PhD student Dr. Ahmad Zmily to look at this approach in a PhD thesis. In the introduction of "Block-Aware Instruction Set Architecture" Dr. Zmily wrote, "We demonstrate that the new architecture improves upon conventional superscalar designs by 20% in performance and 16% in energy." Such an advantage is not enough to justify foisting a new ISA upon the world, but it encourages me to think that there is impetus for using such a base when creating a new ISA for other purposes, such as security.

Long after I began making notes on a block-structured ISA, I encountered the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) research effort. I found their work impressive, but I had concerns about the practicality of some aspects. I have extended SecureRISC to outline how it might support CHERI capabilities. The SecureRISC2 variant incorporates a new sized pointer format based on ideas from CHERI. This sized pointer is not as capable as a CHERI pointer, but it is 64 bits rather than 128 bits, which has the advantage of size. There is a more detailed discussion of CHERI and SecureRISC below.

Goals

The goals for SecureRISC in order of priority are:

  1. Security
  2. Performance
  3. Power efficiency
  4. Compatibility where not in conflict with the above
  5. Code size (primarily for performance)
  6. Support for Garbage Collection (GC)
  7. Support for languages with dynamic typing (e.g. Lisp, Python, Julia)

Non-goals for SecureRISC include (this list will probably grow):

Security can mean many things. One of the most important is preventing unassisted infiltration (e.g. through exploiting buffer overflows, and perhaps use-after-free errors). Another is preventing unintentionally-assisted infiltration (e.g. phishing attacks installing trojans), which may be accomplished through non-discretionary access control.

Tagged memory words are separable from other aspects of SecureRISC, such as the Multics aspects and the Basic Block descriptor aspects. One could imagine a version of SecureRISC2 without the tags and a 64‑bit word (72 bits with ECC in memory). Even in such a reduced ISA (call it SemiSecureRISC) I would keep the AR/XR/SR/VR model. SemiSecureRISC is still interesting for its performance and security advantages, but I do not plan to explore it at this time. There is also the possibility of combining SemiSecureRISC with CHERI and its 1‑bit tag, since the CHERI project has done a lot of important software work. Call such an ISA BlockCHERI. I suspect the CHERI researchers would say that the only advantage in BlockCHERI would be the performance advantage of the Block ISA and the AR/SR separation, with the ARs specialized for CHERI capabilities, and the SRs for non-capability data. My primary thought on BlockCHERI is that the difference between a 65‑bit memory (73 bits with ECC) and a 72‑bit memory (80 bits with ECC) is small, and that the 7 extra bits may be put to good use.

One could imagine variants of SecureRISC that have only some of its features:

Name Block ISA Segmentation Rings Tags CHERI Word Pointer
SecureRISC 72 72/144
SemiSecureRISC 64 64
BlockRISC 64 64
BlockCHERI ? ? 65 130

As I indicated earlier, I don’t think that BlockRISC is sufficient in itself to justify a new ISA. I am concentrating on the full package.

Status

Open Source

As indicated above, this effort began as an exploration of what a security-conscious ISA might look like. Should it someday turn into something more than an exploration, my intent would be to make it an Open Source ISA, along the lines of RISC-V.

Differences From SecureRISC

SecureRISC2 is a variation of SecureRISC that I am exploring. The primary differences are:

A pointer now has the following form:

Sized Pointer
71 67 66 64 63 61 60 51 50 0
SS ring SG segment size 0 byte address
5 3 3 10 2

where the boundary between the byte address and size, and the scaling of the size value, is determined by the SS field for values 0 to 22. Value 23 of SS is used for pointers with no size field. Values 24 to 31 of SS and the ring field are used for dynamic type tagging with either data in bits 63..0 or a pointer with a size implied by the tag (e.g. 2 words for the CONS tag). The byte address occupies bits SS+25..0 and the size occupies bits 50..SS+27 and has value $\mathrm{ptr}_{50..SS+27} \,\|\, 0^{SS \times 2+3}$ (the size field followed by SS×2+3 zero bits). For example:

SS    Address bits   Address width   Size bits   Size width   Granularity (bytes)
0     25..0          26              50..27      24           8
1     26..0          27              50..28      23           32
22    47..0          48              50..50      1            $2^{47}$
23    50..0          51              n.a.        n.a.         n.a.
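
The field boundaries described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 72-bit word is held in an ordinary integer, the names `SizedPointer` and `decode_sized_pointer` are hypothetical and not part of SecureRISC, and the size is computed directly from the formula in the text (size field followed by SS×2+3 zero bits).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SizedPointer:
    ss: int
    ring: int
    sg: int
    segment: int
    address: int               # byte address, bits SS+25..0 (50..0 when SS = 23)
    size_bytes: Optional[int]  # scaled size in bytes, or None when SS = 23


def decode_sized_pointer(word72: int) -> SizedPointer:
    """Split a 72-bit SecureRISC2 sized pointer into its fields (hypothetical helper)."""
    ss = (word72 >> 67) & 0x1F        # bits 71..67
    ring = (word72 >> 64) & 0x7       # bits 66..64
    sg = (word72 >> 61) & 0x7         # bits 63..61
    segment = (word72 >> 51) & 0x3FF  # bits 60..51
    if ss >= 24:
        raise ValueError("SS 24..31 are used for dynamic type tagging")
    if ss == 23:
        # No size field: the whole of bits 50..0 is the byte address.
        return SizedPointer(ss, ring, sg, segment, word72 & ((1 << 51) - 1), None)
    address = word72 & ((1 << (ss + 26)) - 1)   # bits SS+25..0
    size_lsb = ss + 27                          # size occupies bits 50..SS+27
    size_field = (word72 >> size_lsb) & ((1 << (51 - size_lsb)) - 1)
    size_bytes = size_field << (2 * ss + 3)     # append SS*2+3 zero bits
    return SizedPointer(ss, ring, sg, segment, address, size_bytes)
```

With SS = 1, for instance, a size field of 4 decodes to 4 × 32 = 128 bytes, matching the 32-byte granularity in the table.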

New And Old

Conventional Aspects

Some things remain unchanged from other RISCs. Addresses are byte addresses. Like other RISC ISAs, SecureRISC is mostly based upon loads and stores for memory access. Floating-point would be IEEE-754-2019 compatible. The Vector ISA will probably be similar to the RISC-V Vector ISA, but might use the 48‑bit instruction format to do more in the instruction word and less with vset. Also, there are four explicit vector mask registers, rather than using v0.

Readers will have to decide for themselves whether the proposed virtual memory is conventional because it is somewhat similar to Multics, or unconventional, because it is different from RISC ISAs of the last forty years. A similar comment could be made concerning the register architecture, since it echoes an ISA from 1976, but is somewhat different from RISCs since the 1980s.

Unconventional Aspects

Much more in SecureRISC is unconventional. To prepare the reader to put aside certain expectations, we list some of these things here at a high level, with details in later sections.

Advantages of the Basic Block Descriptor

The Basic Block descriptor aspect listed above is perhaps the most unfamiliar. To help motivate it for the reader, below are some of the advantages of this approach.

Open Aspects

I need to think more carefully about I/O in a SecureRISC system. Certainly some I/O will be done in terms of byte streams transferred via DMA to and from main memory (e.g. DRAM). Such I/O, if directed to tagged main memory, writes the bytes with an integer tag. Similarly, if processors use uncached writes of 8, 16, 32, or 64 bits (as opposed to 8‑word blocks) to tagged memory, the memory tag must be changed to integer. Tag-aware I/O of 8‑word units exists and may be used for paging and so forth. It may be useful to provide a general facility for reading tagged memory, including the tags, as a stream of 8‑bit bytes along with a cryptographic signature, and for writing such a stream back with signature checking.

Ports onto the system interconnect fabric will have to have rights and permissions assigned by the hypervisor, and perhaps hypervisor guests. This needs to be worked out.

Being able to support user-mode I/O would be desirable, but it seems difficult to make this work. The user ring code would be sending its own local virtual addresses to the I/O device for DMA, so the I/O device would have to be able to translate user addresses to system interconnect addresses via two-level page tables, and user mode would have to tell the I/O device which page table the supervisor assigned it, which it doesn’t know. At the moment, I have left this unaddressed.

Capability Hardware Enhanced RISC Instructions (CHERI)

I started with the assumption that pointers are a single word, which is expanded based on the 8-bit tag to a base and size when loaded into the doubleword (144‑bit) Address Registers (ARs). The address and size in an AR support pointers giving a base address and size, where indexing using the pointer checks the index value against the size. This supports programs oriented toward a[i] pointer usage, but not C++ *p++ pointer arithmetic. In contrast, the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) Project started with the assumption that capability pointers are four words (including lower and upper bounds, the pointer itself, and permissions and object type), and invented a compression technique to get them down to two words. SecureRISC can support CHERI by using its 128‑bit AR load and store instructions to transfer capabilities to and from the 144‑bit ARs, and therefore be able to accommodate either singleword or doubleword pointers. Support for the CHERI bottom and top decoding, its permissions, and its additional instructions would be required. The CHERI tag bit is replaced with two SecureRISC reserved tag values (one tag value in word 0, another in word 1). I would expect languages such as Julia and Lisp to prefer singleword pointers, so supporting both singleword and doubleword pointers allows both to exist on the same processor depending on the instructions generated by the compiler.

Documentation Conventions

Little Endian bit numbering is used in this documentation (bit 0 is the least significant bit). While not a documentation convention, I might as well mention up front that SecureRISC is similarly Little Endian in its byte addressing.

Basic Terminology

Multics Terminology (Multicians may mostly skip)

Segment
Segments are the basic unit of access control and sharing, and are typically used for mapping files into the address space of processes. Segments may be paged or may be directly mapped (e.g. to an I/O device). Segments have power-of-two sizes that are used for bounds checking and for determining the depth of the page table walk when paging is used. (Multics segment lengths were not limited to power-of-two sizes, but arbitrary lengths significantly increase the size of Segment Descriptors, whereas only six bits are required for powers of two.) Segment sizes are given as the base-2 logarithm of the length in bytes. Segment sizes < 12 (4096 bytes) would not be supported in most microarchitectures, and the maximum segment size is 61 (2 EiB, or exbibytes). Segment sizes > 48 (256 TiB, or tebibytes) require the Segment Descriptor Table (set by the supervisor) to have $2^{\mathrm{size}-48}$ entries with consistent values.
Segments and pages are also used to implement generational garbage collection (GC) by extending Segment Descriptors, Page Table Entries, and TLB entries to have a generation number, so that stores of pointers from an old generation to a new one are noted in PTE bits in a fashion similar to the usual Dirty bits. This allows pages that are about to be swapped out to be scanned for pointers to newer generations, with the result noted, so that these pages need not be swapped in during GC.
Ring
Rings enforce a layering upon what may be read or written by code on a per-segment basis, with ring 0 being the most privileged and ring 7 being the least privileged. Higher rings may also call gates in lower rings to request services from the lower rings. Ring numbers are stored in pointers so that pointer parameters passed to lower rings result in accesses to virtual memory using the access rights of the caller, and not the rights of the lower ring. Loading a pointer from memory sets the ring field to the maximum of the current ring of execution, the ring number in the base register of the load, and the ring number stored in memory. A special instruction used in gates is applied to pointers passed in registers to apply this maximum calculation using the ring number of the caller.
The number of rings could be reduced from 8 to 4 to gain one extra local virtual address bit (increasing the number of segments from 8192 to 16384). Perhaps in such a system ring 0 would be for the hypervisor, ring 1 for the guest operating system, ring 2 for user code, and ring 3 for sandboxed user code.
Ring brackets
Each segment has three 3‑bit ring numbers—R1, R2, and R3—used for bracketing accesses. Writes are permitted when the current ring of execution is in [0:R1], reads in [0:R2], execution in [R1:R2], and calls in [R2+1:R3].
Gate
Gates are the entry points into lower rings from higher rings and are marked as such by a bit in basic block descriptors. Higher rings may call directly to lower rings without employing the exception mechanism (not employing exceptions is a performance advantage). When the target segment does not allow execution by the current ring (i.e. the current ring is greater than the target segment’s R2), but does allow calls (i.e. the current ring is in the target segment’s [R2+1:R3]), a ring transition takes place. Only basic block descriptors marked as gates (as indicated in the descriptor) may be used for such transfers. Gates are responsible for stack switching, validating the ring numbers of pointer arguments passed in registers, and clearing non-preserved registers before return.
Discretionary Access Control
The operating system maintains an Access Control List (ACL) for files. When files are mapped into a user address space, this ACL is mapped to permissions in the Segment Descriptor for that user. Those permissions are Read (R), Write (W), Execute (X), Pointer (P), and Capability (C) permissions.
Non-Discretionary Access Control
Non-Discretionary Access Control prevents access independent of ACLs by implementing the Orange Book classification system. In addition to its primary purpose, it protects against trojan attacks. The Orange Book calls for two concepts: levels and categories. In SecureRISC this is simplified to just categories, with N levels encoded with 2N category bits. Read access is granted to a segment when SegmentCategories ⊆ ProcessCategories. Write access is granted when SegmentCategories = ProcessCategories.
It is possible that Non-Discretionary Access Control could be used to implement generational garbage collection. This needs to be considered.
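
The ring and category rules in the definitions above are simple enough to sketch in Python. This is a minimal illustration, not a definition of the hardware: the function names are hypothetical, ring numbers are plain integers 0..7, and category sets are assumed to be represented as bit masks.

```python
def loaded_pointer_ring(current_ring: int, base_ring: int, stored_ring: int) -> int:
    """Ring field given to a pointer loaded from memory: the maximum
    (least privileged) of the three ring numbers named in the text."""
    return max(current_ring, base_ring, stored_ring)


def allowed_accesses(ring: int, r1: int, r2: int, r3: int) -> set:
    """Accesses permitted to code running in `ring` on a segment whose
    ring brackets are R1, R2, R3."""
    allowed = set()
    if 0 <= ring <= r1:
        allowed.add("write")    # writes in [0:R1]
    if 0 <= ring <= r2:
        allowed.add("read")     # reads in [0:R2]
    if r1 <= ring <= r2:
        allowed.add("execute")  # execution in [R1:R2]
    if r2 + 1 <= ring <= r3:
        allowed.add("call")     # gate calls in [R2+1:R3]
    return allowed


def category_read_ok(segment_categories: int, process_categories: int) -> bool:
    """Read requires SegmentCategories to be a subset of ProcessCategories."""
    return (segment_categories & ~process_categories) == 0


def category_write_ok(segment_categories: int, process_categories: int) -> bool:
    """Write requires SegmentCategories to equal ProcessCategories."""
    return segment_categories == process_categories
```

For a segment with brackets R1 = 1, R2 = 3, R3 = 5, ring 2 code may read and execute, ring 4 code may only call through a gate, and ring 6 code has no access at all.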

Tagged Pointer Terminology

Word
Memory Word
71 64 63 0
tag data
8 64
Words are 72 bits in memory with 64 bits of data and 8 bits of tag. The tag is primarily used for pointers to give the size of the memory addressed by the pointer, but tag values ≥240 are reserved for 64 bits of data contained directly in the word, rather than what the word points to. This allows dynamically typed languages such as Lisp to have 64‑bit integer and 64‑bit floating-point data as objects without allocating memory to contain it. Tags <240 represent pointers.
A doubleword is two words of course. I avoid the terms half-word and quarter-word, because there is no such thing as half or a quarter of a tag. Loads and stores that reference 8‑bit, 16‑bit, and 32‑bit fields of a 64‑bit integer are named with the number of bits (8, 16, or 32) and a Signed/Unsigned indication (S/U). Such loads and stores take an exception if the tag in the word is not an integer tag. There may also be a few redundant tags that duplicate pointers to a few fixed-size areas for type-distinguishing purposes, such as Lisp’s CONS being a separate type from other two-word pointers. Various numeric types with more than 64 bits of data have their own tag values, making them easy to test for. Since most tagged pointers have the low three bits zero, it is also possible to encode dynamic type information there. Basic block descriptors have their own special tag, as does the header word of memory areas larger than 128 words.
Tagged Non-pointer Data
64‑bit integer data
71 64 63 0
240 integer
8 64

IEEE-754 64‑bit floating point data
71 64 63 0
244 float64
8 64
Null/Nil pointer
Null Pointer
71 64 63 61 60 0
0 ring 0
8 3 61
A pointer to 0-length data (tag 0) and address 0 is used as a null pointer. Any reference through this pointer causes an exception (Lisp may need a special load instruction to get nil on CAR and CDR of nil). There are BEQN and BNEN 16‑bit instructions for branching on null pointers, since this is so common.
Sized Pointers
Sized Word Pointer
71 64 63 61 60 3 2 0
1..128 ring word address 0
8 3 58 3

Sized Byte Pointer
71 64 63 61 60 0
129..135 ring byte address
8 3 61

C++ Pointer
71 64 63 61 60 0
137 ring byte address
8 3 61

Tags 1 to 128 represent pointers to that number of words (8 B to 1 KiB).
Tags 129 to 135 represent pointers to 1 to 7 bytes, and are primarily used for reference parameters.
Tag 137 is used for C++ pointers without an associated size. Use of these pointers disables bounds checking. The ability to use tag 137 must be enabled by the supervisor; if not enabled then these pointers cause an exception.
Tag 136 is used for pointers to regions > 128 words where the size is stored at the pointer address − 8:
Pointer with size at virtual address − 8
71 64 63 61 60 3 2 0
136 ring word address 0
8 3 58 3

Providing a special tag for headers and trailers of allocated blocks allows a backward scan to find the start of the block. This may be useful in some applications.
Size word stored at pointer − 8
71 64 63 61 60 3 2 0
254 0 word count 0
8 3 58 3

Size word stored at pointer + size
71 64 63 61 60 3 2 0
254 7 − word count 0
8 3 58 3
Code Pointers
Code pointers are used for function calls and returns, and for implementing switch statements. CHERI capabilities may also be used as code pointers. Calls and jumps using pointers without tag 138 or 139 trap.
Pointer to Basic Block Descriptor
71 64 63 61 60 3 2 0
138 ring BB descriptor word address 0
8 3 58 3
CHERI Capabilities
CHERI capabilities are stored in memory doublewords and may be loaded into ARs with the LAC instruction. Word 1 of a CHERI capability is given a special tag. The word 0 and 1 tag values of CHERI capabilities may only be created by ring 0 and by CHERI instructions that derive from other CHERI capabilities.
Word 0 of CHERI capability
71 64 63 61 60 0
139 ring Local virtual address
8 3 61

Word 1 of CHERI capability
71 64 63 0
252 CHERI capability bits
8 64
Trap on load or store
One tag is defined to cause an exception when it is referenced on a load or store. This is useful for detecting accesses to freed memory, which is a source of security issues. A special instruction is provided to overwrite such words. Trap on load is also useful for dynamic linking.
Trap on load or store tag
71 64 63 0
255 data
8 64
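
To make the tag ranges above concrete, here is a small Python sketch that splits a 72-bit word into tag and data and applies the size implied by tags 0 to 135 as a bounds check. The helper names are hypothetical, and only the tag values stated in this section are used; tag 136 (size stored at address − 8) and tag 137 (unsized C++ pointers) are deliberately left out because their size is not in the tag.

```python
from typing import Optional, Tuple


def split_word(word72: int) -> Tuple[int, int]:
    """Return (tag, data): tag in bits 71..64, data in bits 63..0."""
    return (word72 >> 64) & 0xFF, word72 & ((1 << 64) - 1)


def tag_extent_bytes(tag: int) -> Optional[int]:
    """Size in bytes implied directly by the tag, or None if the tag
    does not encode a size."""
    if tag == 0:
        return 0            # null pointer: zero-length data
    if 1 <= tag <= 128:
        return tag * 8      # pointer to `tag` words of 8 bytes
    if 129 <= tag <= 135:
        return tag - 128    # pointer to 1..7 bytes
    return None


def access_in_bounds(tag: int, byte_offset: int, width: int) -> bool:
    """Check an access of `width` bytes at `byte_offset` from the pointer."""
    extent = tag_extent_bytes(tag)
    if extent is None:
        raise ValueError("tag does not carry a size")
    return byte_offset >= 0 and byte_offset + width <= extent
```

A tag 2 pointer covers 16 bytes, so an 8-byte access at offset 8 is in bounds while one at offset 9 is not; any access through the null pointer fails the check.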

Dynamic Typing

As noted earlier, it is useful to provide tags for Common Lisp, Python, and Julia types, even when they are simply pointers to fixed-size memory and could theoretically use tags 1..128. This would consume perhaps 10 more tags, as illustrated in the following, with the assumption that other types could employ the structure type or something like it (perhaps some of the following could do so as well).

Tag Lisp Julia Data use
1..128 simple-vector? Tuple? Pointer to N words
129..135 no dynamic typing use
136 simple-vector? Tuple? Pointer to N words
137..223 no dynamic typing use
224 CONS Pointer to a pair
225 Function Pointer to a pair
226 Symbol Pointer to structure
227 Structure Structure? Pointer to structure
228..229 no dynamic typing use
230 Array Pointer to structure
231 Vector Pointer to structure
232 String Pointer to structure
233 Bit-vector Pointer to structure
234 Ratio Rational Pointer to pair
235 Complex Complex Pointer to pair
236 Bigfloat BigFloat Pointer to structure
237 Bignum BigInt Pointer to structure
238..239 no dynamic typing use
238 Int128 Pointer to pair, $-2^{127}..2^{127}-1$
239 UInt128 Pointer to pair, $0..2^{128}-1$
240 Fixnum Int64 $-2^{63}..2^{63}-1$
241 UInt64 $0..2^{64}-1$
242 Character Bool, Char, Int8, Int16, Int32, UInt8, UInt16, UInt32 UTF-32 + modifiers, subtype in upper 32 bits
251 no dynamic typing use
244 Float Float64 IEEE-754 binary64
245 Float16, Float32 subtype in upper 32 bits
246..255 no dynamic typing use

Python and Other Language Types

In addition to Lisp types, SecureRISC could define tags for other dynamically typed languages, such as Python. Tuples, ranges, and sets might be examples. Other types, such as modules, might use a general structure-like building block rather than individual tags, as suggested for Lisp above.

Block Oriented ISA Terminology

Basic Block
A series of instructions with control transfers only before the first instruction and after the last.
Basic Block descriptor
Basic Block Descriptor
71 64 63 61 60 50 49 41 40 25 24 15 14 11 10 9 6 5 0
253 hint targr targl start offset size c next prev
8 3 11 9 16 10 4 1 4 6

All control transfers are to Basic Block (BB) descriptors, not to instructions. The basic block descriptor points to the instructions and gives the details of the control transfers to successor basic blocks. For basic blocks with conditional branches, the conditional branch prediction is made when the basic block descriptor is executed, and checked when the conditional branch instruction in the basic block is executed. The conditional branch instruction only has the operands to decide on taken or not-taken; the branch offset is stored in the descriptor, not the instruction. Thus conditional branches look like other ALU instructions and may occur anywhere in the basic block, and need not be the last instruction (earlier placement may reduce the branch misprediction penalty).
The size of the basic block (0..8) is given in 32‑bit units (0 to 32 bytes, which would typically allow for 8 to 16 instructions in a BB, given 32‑bit and 16‑bit instruction sizes). The size values 9..15 are reserved. If the BB size is larger than 32 bytes, then it is continued using a fall-through next field. The 15‑bit start field gives a bit mask specifying which 2‑byte locations start instructions, which allows parallel instruction decode to begin as soon as the instruction bytes are read from the instruction cache. The start bit for the first 16 bits is implicitly 1 and is not stored. The last 1 bit in the start field represents the end of the last instruction. If the last instruction ends before a 32‑bit boundary, the last 16 bits are filled with an illegal instruction.
The tentative design choice is to increase locality by storing basic block descriptors and instructions into 4 KiB regions of the address space (called bages) with the basic block descriptors in one half and the instructions in the other half (the compiler might alternate the half used for even and odd bages to minimize set conflicts). This allows the pointer from the descriptor to 32‑bit aligned instructions to be only 10 bits, and in a paged system, the same TLB entry maps both the descriptors and instructions (since bage size ≤ page size), so only the BB engine requires a TLB (its translations are simply forwarded to the instruction fetch engine). The instructions are fetched from
$\mathrm{PC}_{63..12} \,\|\, \mathrm{offset} \,\|\, 0^{2}$
in parallel with the BB engine moving to fetch the next BB descriptor. For non-indirect branches and calls, the target is given by an 11‑bit signed relative 4 KiB delta from the current bage and a 9‑bit unsigned 8‑byte aligned descriptor address within that bage. Specifically
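$\mathrm{TargetPC} \leftarrow \mathrm{PC}_{63..61} \,\|\, (\mathrm{PC}_{60..12} + (\mathrm{targr}_{10}^{38} \,\|\, \mathrm{targr})) \,\|\, \mathrm{targl} \,\|\, 0^{3}$.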
The low targl field is sufficient to index a set-associative BB descriptor cache that uses bits 11..3 (or a subset) as a set index without waiting for the targr addition giving the high bits. As an example, a 32 KiB, 8‑way set-associative BB descriptor cache could read the tags in parallel with completing the addition giving the high address bits for tag comparison. When targr = 0, the TLB translation for the current BB remains valid, and energy can be saved by detecting this case. (Implementation note: of course the addition could also be done when the descriptor is fetched into the L1 BB Descriptor Cache to further speed things up at the cost of making this cache address space dependent.)
For even bages, BB descriptors start at the beginning of a bage, and instructions start on a 64‑byte boundary in the bage. Any padding between the last BB descriptor and the first instruction employs an illegal tag. For odd bages, BB descriptors are typically packed at the end starting on a 64‑byte boundary and the instructions start at the beginning. Intermixing BB descriptors and instructions is possible, but is not ideal for prefetch or cache utilization.
The c field indicates that the BB contains a LOOPX or LOOPXI instruction, which tells the BB engine to predict the count until the AR engine sends the actual loop count value back. Often the AR engine does so before the final iteration, and the loop is predicted precisely even if the loop count prediction is simply infinite.
The hint field will be defined in the future for prediction hints specific to each next field value. For example, conditional branches will use the hint field with a taken/not-taken initial value for prediction, a hysteresis bit (strong/weak), and an indication of whether global history is likely to be useful in prediction. Similarly indirect jumps and calls may have hints appropriate to their prediction. More hint bits would be nice to have.
Note: A future expansion of BB descriptor types is possible by employing another tag (e.g. 251).
Program Counter
The Program Counter (PC) is a processor register giving the current Basic Block Descriptor (an 8‑byte aligned pointer with tag 138). In normal operation a Basic Block is executed in its entirety, and so the instruction within the Basic Block need only be identified when exceptions stop execution in the middle of a basic block. Calls only store 8‑byte aligned values and returns trap on non-aligned values. Exceptions do store the full PC.
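
The descriptor layout and the two address computations described in this section can be sketched as follows. This is an illustrative Python model, not a specification: the function names are hypothetical, the PC is treated as the 64-bit data portion of the pointer (ring in bits 63..61), and the field positions are taken from the descriptor diagram above.

```python
# (name, least significant bit, width), taken from the descriptor diagram
BB_FIELDS = (
    ("tag", 64, 8), ("hint", 61, 3), ("targr", 50, 11), ("targl", 41, 9),
    ("start", 25, 16), ("offset", 15, 10), ("size", 11, 4), ("c", 10, 1),
    ("next", 6, 4), ("prev", 0, 6),
)


def decode_bb_descriptor(word72: int) -> dict:
    """Extract every field of a 72-bit basic block descriptor."""
    return {name: (word72 >> lsb) & ((1 << width) - 1)
            for name, lsb, width in BB_FIELDS}


def instruction_fetch_address(pc: int, offset: int) -> int:
    """PC[63..12] || offset || 00: instructions live in the same 4 KiB bage."""
    return ((pc >> 12) << 12) | ((offset & 0x3FF) << 2)


def target_pc(pc: int, targr: int, targl: int) -> int:
    """Direct branch or call target: sign-extend the 11-bit bage delta,
    add it to PC[60..12], keep the ring bits, and append targl || 000."""
    ring_bits = pc & (0x7 << 61)
    bage = (pc >> 12) & ((1 << 49) - 1)
    delta = targr - (1 << 11) if targr & (1 << 10) else targr
    new_bage = (bage + delta) & ((1 << 49) - 1)   # wrap within 49 bits
    return ring_bits | (new_bage << 12) | ((targl & 0x1FF) << 3)
```

Note that the low 12 bits of the result depend only on targl, which is why a descriptor cache lookup can start before the targr addition completes.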

Other SecureRISC Terms and Concepts

Stack
Stacks grow upward rather than downward. Each process thread is given two stacks per ring, each in its own segment. One stack of the pair is used only for return addresses, which are pushed and popped in a specialized return address stack cache containing several cache lines of return addresses (typically 2–8 lines representing 16–64 return addresses). The return address stack segment is typically write-protected from the current ring of execution except during call operations, and can only be manipulated by calling a lower ring. The return address stack cache is kept coherent with the processor data caches used by load and store instructions.
Downward growing stacks made sense when the heap and stack were at opposite ends of a small address space, and bi-directional growth allowed each to grow to fill the space between without predetermined limits to each. This is no longer necessary with each occupying its own segment. Downward stack growth also allows for positive offsets from the stack pointer to access stack locations, but to maximize the range of reach of limited immediate offsets, one would bias the SP by half of the immediate size anyway, which makes this unnecessary.
Address Translation
SecureRISC implements two levels of address translation, as in processors with hypervisor support and virtualization, but I have invented new terminology for the process, because physical address is somewhat ambiguous in a two-level translation. Programs operate using local virtual addresses. These addresses are translated to a system virtual address in a mapping specified by guest operating systems. The guest operating systems consider system virtual addresses as representing physical memory, but actually these addresses are translated again by a system-wide mapping specified by the hypervisor to system interconnect addresses that are used in the routing of accesses in the system fabric. All ports on the system interconnect translate system virtual addresses to system interconnect addresses in local TLBs at the boundary into the system interconnect. This allows guest operating systems to transmit system virtual addresses directly to I/O devices, which may transfer data to or from these addresses, employing the system-wide translation at the port boundary.
Local Virtual Address
63 61 60 58 57 48 47 0
ring SG SEG offset
3 3 10 48

System Virtual Address
63 50 49 0
region offset
14 50
Control-flow Integrity
Many attacks on conventional processors exploit sneaking trojan data into the memory of a process. Since that memory typically lacks execute permission, the attacker instead depends upon causing existing instructions to execute the attacker’s algorithm using bogus data. Return-oriented programming (ROP) is one method to exploit existing instructions by overwriting the return address on the stack so that the return transfers to a carefully chosen address that executes a few instructions and then returns to a new address. Often only a portion of a basic block containing a return is executed in this way. The basic block descriptor mechanism defeats this, but in addition it is possible to add bits to the descriptor to trap when a return targets a basic block that does not follow a call. In addition, by moving the return address stack into protected memory, overwrites are prohibited.
Branch avoidance
SecureRISC has several features that reduce the demands on branch prediction, which improves performance. The Boolean Registers (BRs) are one aspect of the ISA that enables some branch avoidance.
Trap instructions
SecureRISC contains a rich set of trap instructions that cause an exception based on various conditional tests. This allows the compiler to supplement the checking mandated by the SecureRISC ISA with its own checks. Trap instructions do not use branch prediction resources and in some micro-architectures are almost free to execute with minor performance impact, except for their code size and fetch bandwidth requirements.
Loop count
Rather than depending upon the conditional branch predictor to predict loop iteration counts, SecureRISC defines instructions to communicate inner loop iteration counts to the BB engine and to indicate how to check predictions made thereby. This feature is initiated by a LOOPX or LOOPXI instruction, which supplies the number of iterations prior to the start of the loop, in a BB with the c bit set in its descriptor. The microarchitecture employs count prediction on such BBs. This prediction is replaced by the actual value when the LOOPX or LOOPXI executes in the AR engine, which is often before the first or second loop back. When the last BB of the loop wants to loop back, it uses a BB descriptor next code of loop back or conditional loop back, which decrements the predicted count and branches to the target if the result is not zero. This feature allows SecureRISC to achieve DSP-like performance on simple loops and reduces the burden on the branch predictor, making it more effective on real conditional branches. The BB containing a loop test must also contain the SOB instruction to decrement the actual loop iteration count in an XR.
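
A rough behavioural model may help show how a predicted count and the actual SOB count interact. This is a simplification under my own assumptions: the predicted count is decremented once per loop back, the SOB decrements the real count, and any disagreement is counted as a misprediction and resynchronizes the prediction. The function name and the resynchronization policy are hypothetical, not taken from the architecture.

```python
def run_counted_loop(actual_count, predicted_count):
    """Return (iterations, mispredicted) for one counted loop.

    actual_count is the value an XR would hold after LOOPX/LOOPXI;
    predicted_count is what the BB engine guessed before that value arrived.
    """
    if actual_count < 1:
        raise ValueError("sketch assumes the loop body runs at least once")
    xr = actual_count
    pred = predicted_count
    iterations = 0
    mispredicted = False
    while True:
        iterations += 1
        pred -= 1                      # BB engine decrements its predicted count
        predicted_taken = pred != 0
        xr -= 1                        # SOB: XR[a] <- XR[a] - 1
        actual_taken = xr != 0         # loop back if XR[a] != 0
        if predicted_taken != actual_taken:
            mispredicted = True
            pred = xr                  # assumed recovery: adopt the actual count
        if not actual_taken:
            return iterations, mispredicted
```

With a correct prediction the loop runs with no misprediction at all, which is the case the count mechanism is meant to make common.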
Pointer Permission
In addition to Read, Write, and Execute (in the form of the number of BB descriptors per 4 KiB region) permissions, SecureRISC includes Pointer and Capability permission bits in Segment Descriptors. Only segments with the P bit set are allowed to contain pointers to other segments. Stack and heap segments would typically have P set, but code and mapped data files would have P clear. Segments with P clear may only contain local pointers, which consist of just the offset within the segment. A special instruction allows such pointers to be converted to full pointers when loaded into an Address Register. This allows a database to contain internal pointers that are independent of the address to which the segment is mapped at runtime.
Capability Permission
Capability Permission allows the segment to contain CHERI capabilities.

Instruction Set

The user process state includes:

Name Depth Width Read ports Write ports Description
PC 1 3 + 58 + 5 The Program Counter holds the current ring number, Basic Block descriptor address, and 5‑bit offset into the basic block of the next instruction. The 5‑bit offset is only visible on exceptions.
CSP 8 3 + 58 The Call Stack Pointer holds the ring number and address of the return address stack maintained by call and return basic blocks. The Program Stack Pointer is held in an AR designated by the Software ABI. There is one CSP per ring.
CARRY 1 64 The Carry register is an implicit input and output of multiplication, as follows:
p ← SR[c] + (SR[a] ×u SR[b]) + CARRY
SR[d] ← p_{63..0}
CARRY ← p_{127..64}

It could also be used in the ADDC instruction as follows:
s ← SR[a] +u SR[b] + CARRY_{0}
SR[d] ← s_{63..0}
CARRY ← 0^{63} ∥ s_{64}

but in this case it may be preferable to use BR source and destinations instead.
VL 1 64 The Vector Length register specifies the length of vector loads, stores, and operations.
VSTART 1 7 The Vector Start register is used to restart vector operations after exceptions. Details to follow.
VM 4 128 The Vector Mask register file stores a bit mask for elements of vector operations. VM[0] is hardwired to all 1s and is used for unmasked operations.
AR 16 133 2 1 The Address Register file holds pointers and integers to perform calculations related to control flow and to load and store address generation. No AR is hardwired to 0. Bits 63..0 are address or data (bits 63..61 are the ring number if address), bits 71..64 are the tag, and bits 132..72 are the size expanded from the tag, potentially set by reading address−8.
In some micro-architectures, operations on ARs are executed speculatively. (Non-AR operations may be queued until non-speculative, or may be speculatively executed as well.)
XR 16 72 2 1 The Index Register file holds integers to perform calculations related to control flow and to load and store address generation. No XR is hardwired to 0. Bits 63..0 are data and bits 71..64 are the tag. The XR primarily holds integer tagged data, but other tags may be loaded.
In some micro-architectures, operations on XRs are executed speculatively. (Non-XR operations may be queued until non-speculative, or may be speculatively executed as well.) The XR register file requires two read ports and one write port per instruction.
SR 16 72 3 1 The Scalar Register file holds data for computations not involved in address generation and primarily holds integer or floating-point values. Tags are stored, and so SRs may be used for copying arbitrary data, including pointers, but no instruction uses SRs as an address (e.g. base) register. Integer operations check for integer tags, and floating-point operations check for float tags. No SR is hardwired to 0.
In some micro-architectures, operations on SRs occur later in the pipeline than operations on ARs, separated by a queue, allowing these operations to wait for data cache misses while the AR engine continues to move ahead generating addresses. When multiple functional units operate in parallel, only some will support 3 source operands, with the others only two. The instructions with three SR source operands are multiply/add (both integer and floating-point), and funnel shifts.
BR 16 1 3 1 Boolean Registers hold boolean values, such as the result of comparisons and logical operations on other boolean values. BRs are typically used to hold SR register comparisons and may avoid branch prediction misses in some algorithms. BR[0] is hardwired to 0. Attempts to write 1 to BR[0] trap, which converts such instructions into negative assertions.
VR 16 72 × 128 3 1 Vector Registers hold vectors of tagged data, typically integers or floating-point data.

The SR register file must support 3 read and 1 write port per instruction for floating-point multiply/add instructions at least. Since it does, other operations on SRs may take advantage of the third source operand.

Basic Block Descriptor Types

The next field of the BB descriptor is used to specify how the successor to the current BB is determined. The values are given in the following table:

Value Description
0 Unconditional branch: The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below.
1 Conditional branch: The branch predictor is used to determine whether this branch is taken or not, and this prediction is checked against the branch decision given by a branch instruction in the basic block. There should be exactly one branch, which may be located anywhere in the basic block instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below, or is the fall-through BB descriptor at PC + 8.
2 Call: The address PC + 8 is written to the word pointed to by CSP[TargetPC_{63..61}], and CSP[TargetPC_{63..61}] is incremented by 8. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below.
3 Conditional Call: The branch predictor is used to determine whether this call is taken or not, and this prediction is checked against the branch decision given by a branch instruction in the basic block. There should be exactly one branch, which may be located anywhere in the basic block instructions. In the case the call is not taken the destination is the fall-through BB descriptor at PC + 8. In the case where the call is taken, the destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below, the address PC + 8 is written to the word pointed to by CSP[TargetPC_{63..61}], and CSP[TargetPC_{63..61}] is incremented by 8.
4 Loop back: The predicted loop iteration count is used to predict whether this loop is taken or not, and this prediction is checked by the SOB instruction in the instructions of the basic block. There should be exactly one SOB, which may be located anywhere in the basic block instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below, or is the fall-through BB descriptor at PC + 8.
5 Conditional Loop back: The branch predictor is used to determine whether this loop back is taken or not, and this prediction is checked against the branch decision given by a branch instruction in the basic block. If the loop back is enabled by the branch, the predicted loop iteration count is used to determine whether this loop is taken or not, and this prediction is checked by the SOB instruction in the instructions of the basic block. There should be exactly one SOB, which may be located anywhere in the basic block instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below, or is the fall-through BB descriptor at PC + 8.
6 Fall through: This Basic Block is unconditionally followed by the BB at PC + 8.
7 Reserved.
8 Jump Indirect: The indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JUMP instruction in the instructions of the basic block. There should be exactly one JUMP, which may be located anywhere in the basic block instructions.
9 Conditional Jump Indirect: The branch predictor is used to determine whether this jump indirect is taken or not, and this prediction is checked against the branch decision given by a branch instruction in the basic block. If the jump indirect is enabled by the branch, the indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JUMP instruction in the instructions of the basic block. There should be exactly one JUMP, which may be located anywhere in the basic block instructions. In the case the jump is not taken the destination is the fall-through BB descriptor at PC + 8.
This type is expected to be used for case dispatch, where the conditional test checks whether the value is within range, and the JUMP uses PC ← PC + (XR[b] × 8) to choose one of several dispatch basic block descriptors, presuming that the BBs fit in the same 4 KiB region (if not then a table and PC ← lvload72(AR[a] + XR[b]) should be used).
10 Call Indirect: The indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JUMP instruction in the instructions of the basic block. There should be exactly one JUMP, which may be located anywhere in the basic block instructions. The address PC + 8 is written to the word pointed to by CSP[TargetPC_{63..61}], and CSP[TargetPC_{63..61}] is incremented by 8.
11 Conditional Call Indirect: The branch predictor is used to determine whether this call indirect is taken or not, and this prediction is checked against the branch decision given by a branch instruction in the basic block. If the call indirect is enabled by the branch, the indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JUMP instruction in the instructions of the basic block. There should be exactly one JUMP, which may be located anywhere in the basic block instructions. In the case the call is not taken the destination is the fall-through BB descriptor at PC + 8. In the case where the call is taken, the address PC + 8 is written to the word pointed to by CSP[TargetPC_{63..61}], and CSP[TargetPC_{63..61}] is incremented by 8.
12 Return: The Call Stack cache is used to predict the return using CSP[PC63..61] − 8 as the index and CSP[PC63..61] is decremented by 8.
13 Reserved.
14 Reserved.
15 Reserved.

The prev field of the BB descriptor is used to specify what methods are allowed to get to this BB as a set of bits:

Bit   Description
1..0  Call/gate entry, encoded as a 2-bit value:
        0 ⇒ no call/gate, bits 5..2 used
        1 ⇒ reserved (potential for exception entry?)
        2 ⇒ call allowed
        3 ⇒ gate allowed
2     Fall through to this BB allowed
3     Branch/Loopback to this BB allowed
4     Jump to this BB allowed (for case dispatch)
5     Return to this BB allowed
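
The prev check can be sketched as a small predicate. This is only an illustration: the entry-kind names are mine, and I assume here that bits 5..2 are interpreted as listed regardless of the value in bits 1..0, and that the value 3 permits gate entry only. The text does not settle either point.

```python
# Hypothetical names for the ways control can arrive at a basic block.
CALL, GATE, FALL_THROUGH, BRANCH, JUMP, RETURN = (
    "call", "gate", "fall_through", "branch", "jump", "return")

# Bit positions in the prev field for the non-call entry kinds.
_PREV_BIT = {FALL_THROUGH: 2, BRANCH: 3, JUMP: 4, RETURN: 5}

def entry_allowed(prev, how):
    """Return True if a BB whose descriptor has this prev field may be entered via how."""
    low = prev & 0x3                   # bits 1..0: 0 none, 1 reserved, 2 call, 3 gate
    if how == CALL:
        return low == 2
    if how == GATE:
        return low == 3                # assumption: 3 means gate only
    return bool((prev >> _PREV_BIT[how]) & 1)
```

A descriptor with prev = 0b100100, for instance, accepts fall-through and return entry but rejects branches, jumps, and calls; a mismatch would trap in hardware.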

Call/Return Details

Basic Block descriptors with one of the four call types (Call, Conditional Call, Call Indirect, Conditional Call Indirect), push the return address on a protected stack addressed by the CSP indexed by the target ring number (which is the same as the current ring number unless a gate is addressed). Returns pop the address from the protected stack and jump to it. The ring number of the CSP pointer is used for the stores and loads, and typically this ring is not writeable by the current ring.
The call semantics are as follows:
lvstore72(CSP[TargetPC_{63..61}]) ← PC
CSP[TargetPC_{63..61}] ← CSP[TargetPC_{63..61}] +p 8
The return semantics are as follows:
PC ← lvload72(CSP[TargetPC_{63..61}] −p 8)
CSP[TargetPC_{63..61}] ← CSP[TargetPC_{63..61}] −p 8
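
The push and pop behaviour can be modelled with one call stack pointer per ring. This is a minimal sketch with several assumptions of mine: memory is a Python dictionary, the stored return address is PC + 8 as in the descriptor type table, the initial CSP values are arbitrary, and ring protection of the stack memory itself is not modelled.

```python
class CallStacks:
    """Toy model of the per-ring protected return address stacks."""

    def __init__(self):
        self.memory = {}                                  # address -> stored word
        self.csp = [0x1000 * (r + 1) for r in range(8)]   # one CSP per ring (arbitrary bases)

    @staticmethod
    def ring_of(pc):
        return (pc >> 61) & 0x7                           # ring number is bits 63..61

    def call(self, pc, target_pc):
        """Push the return address on the target ring's stack; return the new PC."""
        ring = self.ring_of(target_pc)
        self.memory[self.csp[ring]] = pc + 8              # fall-through descriptor address
        self.csp[ring] += 8
        return target_pc

    def ret(self, pc):
        """Pop the most recent return address for the current ring."""
        ring = self.ring_of(pc)
        self.csp[ring] -= 8
        return self.memory[self.csp[ring]]
```

Because ordinary stores from the current ring cannot reach this memory, a program has no way to overwrite the word that ret later loads.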

Overflow Checking

Overflow detection is important for implementing bignums in languages such as Lisp. SecureRISC provides a reasonably complete set of such instructions in addition to the usual mod 2^{64} add, subtract, negate, multiply, and shift left.

Unsigned overflow could be detected by using the ADDC and SUBC instructions with BR[0] as the carry-in and BR[0] as the carry-out. But it might also make sense to have ADDUO (Add Unsigned with Overflow).

In addition the ADDSO (Add Signed with Overflow), ADDUSO (Add Unsigned Signed with Overflow), SUBSO (Subtract Signed with Overflow), SUBSUO (Subtract Signed Unsigned with Overflow), SUBUSO (Subtract Unsigned Signed with Overflow), and NEGO (Negate with Overflow) instructions provide overflow checking for signed addition, subtraction, and negation, and signed-unsigned addition and subtraction. There are also SLLO (Shift Left Logical with Overflow) and SLAO (Shift Left Arithmetic with Overflow) in addition to the usual SLL. Finally there are MULUO, MULSO, and MULSUO for multiplication with overflow detection.
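
As an illustration of what the signed and unsigned overflow conditions mean for 64‑bit values, here is a small model. It returns an overflow flag rather than trapping, since how each instruction reports overflow is not specified here; the function names are mine.

```python
MASK64 = (1 << 64) - 1

def to_signed(x):
    """Interpret the low 64 bits of x as a two's complement integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

def add_signed_overflow(a, b):
    """Return (64-bit result, overflowed) for signed addition."""
    exact = to_signed(a) + to_signed(b)
    result = exact & MASK64
    return result, to_signed(result) != exact

def add_unsigned_overflow(a, b):
    """Return (64-bit result, overflowed) for unsigned addition."""
    exact = (a & MASK64) + (b & MASK64)
    return exact & MASK64, exact > MASK64
```

The same bit pattern can overflow under one interpretation and not the other: adding 1 to the all-ones word overflows as unsigned but not as signed (−1 + 1 = 0).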

Overflow in the unsigned addition of load/store effective address generation is trapped. Segment bounds are also checked during effective address generation: the segment size is determined from the base register, and the effective address must agree with the base register for bits 60..size (this is hard to implement—might need another cache?).
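
The agreement check on bits 60..size can be written as a mask comparison. In this sketch I assume that size is the base-2 logarithm of the segment size and that the offset is an unsigned quantity; the function names and the exception types are illustrative only.

```python
def same_segment(base, effective, size_log2):
    """True if base and effective agree on bits 60..size_log2."""
    mask = ((1 << 61) - 1) & ~((1 << size_log2) - 1)
    return ((base ^ effective) & mask) == 0

def effective_address(base, offset, size_log2):
    """Model effective address generation with the two checks described above."""
    ea = base + offset
    if ea >> 64:
        raise OverflowError("unsigned overflow in address generation")
    if not same_segment(base, ea, size_log2):
        raise ValueError("effective address leaves the segment")
    return ea
```

With a 4 KiB segment (size_log2 = 12) based at 0x1000, offset 0xFFF is accepted and offset 0x1000 is rejected.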

Branch Avoidance

SecureRISC has trap instructions and Boolean Registers (BRs) primarily as a way to avoid conditional branching for computation. For example, to compute the min of x1 and x3 into x6, the RISC-V ISA would use conditional branches:

	mv x6, x1
	blt x1, x3, L
	mv x6, x3
L:

The performance of the above on contemporary micro-architectures depends on the conditional branch prediction rate and the mispredict penalty, which in turn depends on how consistently x1 or x3 is the minimum value. In SecureRISC, the sequence could be as follows:

	lt b2, s1, s3
	sel s6, b2, s1, s3

This sequence involves no conditional branches, and has consistent performance.

As another example, the range test

	assert ((lo <= x) && (x <= hi));

on RISC-V would compile to

	blt x, lo, T
	bge hi, x, L
T:
	jal assertionfailed
L:

but on SecureRISC would compile to

	lt b1, x, lo
	orlt b0, b1, hi, x

which involves no conditional branches, but instead uses the write to b0 as a negative assertion check (trap if the value to be written is 1). The assertion fails when x < lo or hi < x, so the second comparison is a less-than of hi against x. The assembler would also accept torlt b1, hi, x as equivalent to the above orlt by supplying the b0 destination operand.

Even when conditional branches are used, the boolean registers sometimes permit several tests to be combined before branching, so if we were branching on the range test above, instead of asserting it, the code might be

	lt b1, x, lo
	borlt b1, hi, x, outofrange

which has one branch rather than two.
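
The negative assertion behaviour of BR[0] can be modelled in a few lines. This is a sketch with names of my own choosing: writes of 1 to b0 raise an exception standing in for the trap, and the range assertion is expressed as two comparisons combined with OR, trapping when x < lo or hi < x.

```python
class NegativeAssertion(Exception):
    """Stands in for the trap taken when 1 is written to BR[0]."""

class BoolRegs:
    def __init__(self):
        self.b = [0] * 16              # BR[0] always reads as 0

    def write(self, d, value):
        value = 1 if value else 0
        if d == 0:
            if value:
                raise NegativeAssertion("attempt to write 1 to b0")
            return                     # writing 0 to b0 has no effect
        self.b[d] = value

def assert_in_range(x, lo, hi):
    """Branch-free model of assert((lo <= x) && (x <= hi))."""
    br = BoolRegs()
    br.write(1, x < lo)                # b1 <- x < lo
    br.write(0, br.b[1] | (hi < x))    # b0 <- b1 | (hi < x); traps if 1
```

Both endpoints are inside the range, so assert_in_range(10, 1, 10) completes normally while assert_in_range(11, 1, 10) raises.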

Tag Checking

Operations on tagged values trap if the tags are unexpected values. Integer addition requires that both tags be integers, or one tag be a pointer type and the other an integer. Integer subtraction requires the subtrahend tag to be an integer tag and the minuend to be either an integer or pointer tag. The resulting tag is integer with all integer sources, or pointer if one operand is a pointer. Integer bitwise logical operations and shifts require integer tagged operands and produce an integer tagged result. Floating-point addition, subtraction, multiplication, division, and square root require floating-point tagged operands. Performing integer operations on floating-point tagged values (e.g. to extract the exponent) requires a CAST instruction to first change the tag. Similarly, performing logical operations on a pointer requires a CAST instruction to integer type.
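
The addition and subtraction rules can be summarized as two small functions over tag kinds. The tag names here are symbolic placeholders of mine, not the architectural tag encodings, and the exception stands in for the trap.

```python
INT, FLOAT, POINTER = "int", "float", "pointer"

class TagTrap(Exception):
    """Stands in for the trap on an unexpected tag."""

def add_result_tag(tag_a, tag_b):
    """Tag of a + b under the integer addition rule."""
    if tag_a == INT and tag_b == INT:
        return INT
    if {tag_a, tag_b} == {INT, POINTER}:
        return POINTER                 # pointer + integer stays a pointer
    raise TagTrap(f"cannot add {tag_a} and {tag_b}")

def sub_result_tag(minuend_tag, subtrahend_tag):
    """Tag of minuend - subtrahend under the integer subtraction rule."""
    if subtrahend_tag != INT:
        raise TagTrap("subtrahend must be integer tagged")
    if minuend_tag in (INT, POINTER):
        return minuend_tag
    raise TagTrap(f"cannot subtract from {minuend_tag}")
```

Note that adding two pointers traps, and that subtracting a pointer from anything traps because the subtrahend must be an integer.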

Comparisons of tagged values compare the entire word for =, ≠, <u, ≥u, etc. This allows sorting regardless of type. Similarly the CMPU operation produces −1, 0, or 1 according to whether the first word is <u, =, or >u the second.

Shifts

One advantage of the 3 read SR file is that shifts can be based upon a funnel shift where the value to be shifted is the catenation of SR[a] and SR[b], allowing for rotates by specifying the same operand for the high and low funnel operands, and multiword shifts by supplying adjacent source words of the multiword value. The basic operations are then
SR[d] ← (SR[b] ∥ SR[a]) >> imm6,
SR[d] ← (SR[b] ∥ SR[a]) >> (SR[c] mod 64), and
SR[d] ← (SR[b] ∥ SR[a]) >> (−SR[c] mod 64).
Conventional logical and arithmetic shifts are also provided. Left shifts supply 0 for the low side of the funnel and use a negative shift amount. Logical right shifts supply 0 on the high side of the funnel and arithmetic right shifts supply a sign-extended version of SR[a] on the high side of the funnel. Need to decide whether overflow detecting left shifts are required.

The CARRY register could be used as a funnel shift operand instead of an SR, but that seems less flexible.
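
The way rotates and conventional shifts fall out of a single funnel shift can be checked with a short model. The helper names are mine. One caveat of this naive formulation: a left shift by 0 maps to a funnel shift by 0, which selects the zero low word rather than the operand, so the left shift helper below is restricted to amounts 1..63; how the hardware handles that case is not covered here.

```python
MASK64 = (1 << 64) - 1

def funnel(hi, lo, n):
    """Low 64 bits of the 128-bit value (hi ∥ lo) shifted right by n mod 64."""
    wide = ((hi & MASK64) << 64) | (lo & MASK64)
    return (wide >> (n & 63)) & MASK64

def rotate_right(x, n):
    return funnel(x, x, n)             # same operand on both sides of the funnel

def shift_left(x, n):
    if not 1 <= n <= 63:
        raise ValueError("sketch covers shift amounts 1..63 only")
    return funnel(x, 0, -n)            # 0 on the low side, negative amount

def shift_right_logical(x, n):
    return funnel(0, x, n)             # 0 on the high side

def shift_right_arith(x, n):
    sign_fill = MASK64 if (x >> 63) & 1 else 0
    return funnel(sign_fill, x, n)     # sign extension on the high side
```

A multiword right shift uses the same primitive with adjacent words of the value as hi and lo.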

Multiword Addition

The Add with Carry instruction ADDC is defined to take a BR source as the carry-in and a BR destination as the carry-out for multiword addition. The definition is then
s ← SR[a] +u SR[b] +u BR[c]
SR[d] ← s_{63..0}
BR[e] ← s_{64}
An alternative requires fewer operands, but uses one bit in the 64‑bit CARRY register:
s ← SR[a] +u SR[b] +u CARRY_{0}
SR[d] ← s_{63..0}
CARRY ← 0^{63} ∥ s_{64}
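
A multiword addition built from the BR-based ADDC definition can be sketched as follows. The word lists are least significant word first, which is my choice for the illustration, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def addc(a, b, carry_in):
    """One ADDC step: returns (64-bit sum word, carry-out bit)."""
    s = (a & MASK64) + (b & MASK64) + (carry_in & 1)
    return s & MASK64, s >> 64         # bits 63..0 and bit 64 of the 65-bit sum

def multiword_add(x_words, y_words):
    """Add two equal-length little-endian word lists; returns (words, final carry)."""
    carry = 0
    out = []
    for a, b in zip(x_words, y_words):
        word, carry = addc(a, b, carry)
        out.append(word)
    return out, carry
```

Each step consumes the carry-out of the previous one, which is why a BR (or bit 0 of CARRY) must be threaded through the sequence.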

Multiword Multiplication

The ideal multiplication operation would be
SR[e],SR[d] ← (SR[a] ×u SR[b]) + SR[c] + SR[f]
to efficiently support multiword multiplication, but that requires 4 reads and 2 writes, which we clearly don’t want. The tentative alternative is to introduce a 64‑bit CARRY register to provide the additional 64‑bit input to the 128‑bit product and a place to store the high 64 bits of the product. This requires some careful thought for OoO micro-architectures and so is a tentative proposal. It may be that even an OoO processor will be called on to have a subset of instructions that are to be executed in-order relative to each other, and the multiword arithmetic instructions can be put in this queue.
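
To see why a single 64‑bit CARRY suffices, note that (2^{64} − 1)^2 + 2(2^{64} − 1) = 2^{128} − 1, so the product plus two 64‑bit addends always fits in 128 bits. The sketch below models one such multiply step and uses it to multiply a multiword value by a single word. The function names and the least-significant-word-first ordering are assumptions for the illustration.

```python
MASK64 = (1 << 64) - 1

def mul_step(a, b, c, carry):
    """p <- c + a*b + carry; returns (low 64 bits, new CARRY = high 64 bits)."""
    p = (c & MASK64) + (a & MASK64) * (b & MASK64) + (carry & MASK64)
    return p & MASK64, p >> 64

def multiword_times_word(x_words, y):
    """Multiply a little-endian multiword value by one 64-bit word."""
    carry = 0
    out = []
    for a in x_words:
        low, carry = mul_step(a, y, 0, carry)
        out.append(low)
    out.append(carry)                  # the final CARRY is the top word
    return out

def words_to_int(words):
    return sum(w << (64 * i) for i, w in enumerate(words))
```

The chain of steps is serial through CARRY, which is the ordering concern raised above for out-of-order implementations.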

Instruction Formats and Overview

The following outlines some of the instructions without giving their full definitions, which include tag and bounds checking. The full definitions will follow later.

The 16‑bit instruction formats are included for code density. Some evaluation of whether it is worth the cost should be considered. Note that the BB descriptor gives the sizes of all instructions in the basic block in the form of the start bit mask, and so the instruction size is not encoded in the opcodes. The start mask allows multiple instructions to be decoded in parallel without parsing the instruction stream.
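
To illustrate why a start mask removes the need to parse the instruction stream, the sketch below recovers every instruction's position and length directly from the mask. The encoding is an assumption of mine for the example: bit i of the mask is set when the i-th 16‑bit parcel of the basic block begins an instruction. The actual descriptor layout is defined elsewhere.

```python
def instruction_starts(start_mask, parcels):
    """Return [(first_parcel, length_in_parcels), ...] for a basic block.

    start_mask: bit i set means parcel i begins an instruction (assumed encoding).
    parcels: number of 16-bit parcels in the basic block.
    """
    starts = [i for i in range(parcels) if (start_mask >> i) & 1]
    bounds = starts + [parcels]
    return [(s, e - s) for s, e in zip(bounds, bounds[1:])]
```

For a five-parcel block with mask 0b01011, this yields one 16‑bit instruction at parcel 0 followed by 32‑bit instructions at parcels 1 and 3. Each entry depends only on the mask, so hardware can compute all of them in parallel.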

16‑bit instruction format destination 2 source
Bits:   15..12  11..8  7..4  3..0
Field:  b       a      d     op1
Width:  4       4      4     4

ADDA d, a, b    AR[d] ← AR[a] +p XR[b]
ADDX d, a, b    XR[d] ← XR[a] + XR[b]
ADD d, a, b     SR[d] ← SR[a] + SR[b]
LA d, a, b      AR[d] ← lvload72(AR[a] +p XR[b]×8)
LX d, a, b      XR[d] ← lvload72(AR[a] +p XR[b]×8)
L d, a, b       SR[d] ← lvload72(AR[a] +p XR[b]×8)
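
The field positions of this 16‑bit format can be exercised with a simple encode and decode pair. Opcode values are not assigned in this outline, so the op1 value used in the example is arbitrary; the function names are mine.

```python
def encode16(op1, d, a, b):
    """Pack the four 4-bit fields: b in 15..12, a in 11..8, d in 7..4, op1 in 3..0."""
    for field in (op1, d, a, b):
        if not 0 <= field <= 0xF:
            raise ValueError("each field is 4 bits")
    return (b << 12) | (a << 8) | (d << 4) | op1

def decode16(word):
    """Unpack a 16-bit instruction word into its four fields."""
    return {
        "b": (word >> 12) & 0xF,
        "a": (word >> 8) & 0xF,
        "d": (word >> 4) & 0xF,
        "op1": word & 0xF,
    }
```

For instance, encode16(op1=3, d=1, a=2, b=15) gives 0xF213, and decoding that word returns the same four fields.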

16‑bit instruction format destination source immediate
Bits:   15..12  11..8  7..4  3..0
Field:  imm4    a      d     op1
Width:  4       4      4     4

ADDAI d, a, imm4    AR[d] ← AR[a] +p imm4
ADDXI d, a, imm4    XR[d] ← XR[a] + imm4
ADDI d, a, imm4     SR[d] ← SR[a] + imm4
LAI d, a, imm4      AR[d] ← lvload72(AR[a] +p imm4×8)
LXI d, a, imm4      XR[d] ← lvload72(AR[a] +p imm4×8)
LI d, a, imm4       SR[d] ← lvload72(AR[a] +p imm4×8)

16‑bit instruction format destination 1 source
Bits:   15..12  11..8  7..4  3..0
Field:  op1da   a      d     op1
Width:  4       4      4     4

RTAG d, a     AR[d] ← 240 ∥ 0^{56} ∥ AR[a]_{71..64}
RSIZE d, a    AR[d] ← 240 ∥ 0^{3} ∥ AR[a]_{132..72}

16‑bit instruction format 2 source
15 12 11 8 7 4 3 0
b a op1ab op1
4 4 4 4

BEQAa, b branch if AR[a] = AR[b]
BEQXa, b branch if XR[a] = XR[b]
BNEAa, b branch if AR[a] ≠ AR[b]
BNEXa, b branch if XR[a] ≠ XR[b]
BLTAUa, b branch if AR[a] <u AR[b]
BLTXUa, b branch if XR[a] <u XR[b]
BGEAUa, b branch if AR[a] ≥u AR[b]
BGEXUa, b branch if XR[a] ≥u XR[b]
BLTXa, b branch if XR[a] <s XR[b]
BGEXa, b branch if XR[a] ≥s XR[b]
BNONEXa, b branch if (XR[a] & XR[b]) = 0
BANYXa, b branch if (XR[a] & XR[b]) ≠ 0
TEQAa, b trap if AR[a] = AR[b]
TEQXa, b trap if XR[a] = XR[b]
TNEAa, b trap if AR[a] ≠ AR[b]
TNEXa, b trap if XR[a] ≠ XR[b]
TLTAUa, b trap if AR[a] <u AR[b]
TLTXUa, b trap if XR[a] <u XR[b]
TGEAUa, b trap if AR[a] ≥u AR[b]
TGEXUa, b trap if XR[a] ≥u XR[b]
TLTXa, b trap if XR[a] <s XR[b]
TGEXa, b trap if XR[a] ≥s XR[b]
TNONEXa, b trap if (XR[a] & XR[b]) = 0
TANYXa, b trap if (XR[a] & XR[b]) ≠ 0

16‑bit instruction format 1 source
15 12 11 8 7 4 3 0
op1a a op1ab op1
4 4 4 4

BEQNAa branch if AR[a]71..64 = 0
BNENAa branch if AR[a]71..64 ≠ 0
BEQZXa branch if XR[a] = 0
BNEZXa branch if XR[a] ≠ 0
BLTZXa branch if XR[a] <s 0
BGEZXa branch if XR[a] ≥s 0
BLEZXa branch if XR[a] ≤s 0
BGTZXa branch if XR[a] >s 0
BFa branch if BR[a] = 0
BTa branch if BR[a] ≠ 0
TEQZXa trap if XR[a] = 0
TNEZXa trap if XR[a] ≠ 0
TLTZXa trap if XR[a] <s 0
TGEZXa trap if XR[a] ≥s 0
TLEZXa trap if XR[a] ≤s 0
TGTZXa trap if XR[a] >s 0
TFa trap if BR[a] = 0
TTa trap if BR[a] ≠ 0
JMPa PC ← AR[a]
SOB XR[a] ← XR[a] − 1
loop back if XR[a] ≠ 0

16‑bit instruction format destination immediate
Bits:   15..8  7..4  3..0
Field:  imm8   d     op1
Width:  8      4     4

XI d, imm8    XR[d] ← 240 ∥ imm8_{7}^{56} ∥ imm8
I d, imm8     SR[d] ← 240 ∥ imm8_{7}^{56} ∥ imm8
32‑bit instruction format 3 sources 1 destination
31 28 27 22 21 20 19 16 15 12 11 8 7 4 3 0
op20 op21 m c b a d op2
4 6 2 4 4 4 4 4

ao1ao2d, c, a, b SR[d] ← SR[c] ao1 (SR[a] ao2 SR[b])
Example ao1 might be: + (ADD) − (SUB)
Example ao2 might be: + (ADD) − (SUB) × (MUL)
FUNd, b, a, c t ← (SR[b]63..0∥SR[a]63..0) >> SR[c]5..0
SR[d] ← 240 ∥ t63..0
FUNNd, b, a, c t ← (SR[b]63..0∥SR[a]63..0) >> (−SR[c])5..0
SR[d] ← 240 ∥ t63..0
fo1fo2.Dd, c, a, b SR[d] ← SR[c] fo1 (SR[a] fo2 SR[b])
Vao1ao2 d, c, a, b, m VR[d] ← VR[c] ao1 (VR[a] ao2 VR[b]) masked by VM[m]
Vao1ao2 d, c, a, b, m VR[d] ← VR[c] ao1 (VR[a] ao2 SR[b]) masked by VM[m]
Vfo1fo2.Dd, c, a, b, m VR[d] ← VR[c] fo1 (VR[a] fo2 VR[b]) masked by VM[m]
Vfo1fo2.Dd, c, a, b, m VR[d] ← VR[c] fo1 (VR[a] fo2 SR[b]) masked by VM[m]
Example fo1 might be: +f (ADD) −f (SUB)
Example fo2 might be: +f (ADD) −f (SUB) ×f (MUL)
bo1bo2d, c, a, b SR[d] ← SR[c] bo1 (SR[a] bo2 SR[b])
Example bo1 might be: & (AND) | (OR) ^ (XOR)
Example bo2 might be: & (AND) | (OR) &~ (ANDC) |~ (ORC) ^ (XOR) ^~ (XORC) << (SLL) >>u (SRL) >>s (SRA)
SELd, c, a, b SR[d] ← BR[c] ? SR[a] : SR[b]
lo1lo2d, c, a, b BR[d] ← BR[c] lo1 (BR[a] lo2 BR[b])
lo1copAd, c, a, b BR[d] ← BR[c] lo1 (AR[a] cop AR[b])
lo1copXd, c, a, b BR[d] ← BR[c] lo1 (XR[a] cop XR[b])
lo1copd, c, a, b BR[d] ← BR[c] lo1 (SR[a] cop SR[b])
Vlo1copd, c, a, b VM[d] ← VM[c] lo1 (VR[a] cop VR[b])
Vlo1copd, c, a, b VM[d] ← VM[c] lo1 (VR[a] cop SR[b])

32‑bit instruction format with 2 sources 1 destination and 12‑bit immediate
31 28 27 20 19 16 15 12 11 8 7 4 3 0
op20 i c i a d op2
4 8 4 4 4 4 4

ao1ao2I d, c, a, imm SR[d] ← SR[c] ao1 (SR[a] ao2 imm12)
bo1bo2I d, c, a, imm SR[d] ← SR[c] bo1 (SR[a] bo2 imm12)
SELI d, c, a, imm12 SR[d] ← BR[c] ? SR[a] : imm12
lo1copI d, c, a, imm BR[d] ← BR[c] lo1 (AR[a] cop imm12)
lo1copI d, c, a, imm BR[d] ← BR[c] lo1 (SR[a] cop imm12)
Vlo1copId, a, b, imm VM[d] ← VM[a] lo1 (VR[b] cop imm12)
BR[0] is hardwired to 0. Using BR[0] as a destination acts as negative assertion, taking an exception if the value computed is 1.
Example lo1/lo2: & (AND) | (OR) ^ (XOR) &~ (ANDC) |~ (ORC) ^~ (XORC)
Example cop: = (EQ) ≠ (NE) <u (LTU) <s (LT) ≥u (GEU) ≥s (GE) tag= tag≠ tag< tag≥ word= word≠ word< word≥

32‑bit instruction format 2 sources 1 destination
31 28 27 22 21 16 15 12 11 8 7 4 3 0
op20 op21 op23 b a d op2
4 6 6 4 4 4 4

op0Xd, a, b XR[d] ← XR[a] op0 XR[b]
Example op0 might be: + (ADD) − (SUB) << (SLL) >>u (SRL) >>s (SRA)
Possible op0 might include: minu mins, maxu maxs
ao2 d, a, b SR[d] ← SR[a] ao2 SR[b]
LX32Ud, a, b t ← lvload32(AR[a] +p XR[b]×4)
XR[d] ← 240 ∥ 032 ∥ t
L32Ud, a, b t ← lvload32(AR[a] +p XR[b]×4)
SR[d] ← 240 ∥ 032 ∥ t
LX32Sd, a, b t ← lvload32(AR[a] +p XR[b]×4)
XR[d] ← 240 ∥ t3132 ∥ t
L32Sd, a, b t ← lvload32(AR[a] +p XR[b]×4)
SR[d] ← 240 ∥ t3132 ∥ t
LX16Ud, a, b t ← lvload16(AR[a] +p XR[b]×2)
XR[d] ← 240 ∥ 048 ∥ t
L16Ud, a, b t ← lvload16(AR[a] +p XR[b]×2)
SR[d] ← 240 ∥ 048 ∥ t
LX16Sd, a, b t ← lvload16(AR[a] +p XR[b]×2)
XR[d] ← 240 ∥ t1548 ∥ t
L16Sd, a, b t ← lvload16(AR[a] +p XR[b]×2)
SR[d] ← 240 ∥ t1548 ∥ t
LX8Ud, a, b t ← lvload8(AR[a] +p XR[b])
XR[d] ← 240 ∥ 056 ∥ t
L8Ud, a, b t ← lvload8(AR[a] +p XR[b])
SR[d] ← 240 ∥ 056 ∥ t
LX8Sd, a, b t ← lvload8(AR[a] +p XR[b])
XR[d] ← 240 ∥ t756 ∥ t
L8Sd, a, b t ← lvload8(AR[a] +p XR[b])
SR[d] ← 240 ∥ t756 ∥ t
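
The sub-word loads above all build a 72‑bit register value from a tag of 240, a zero or sign extension, and the loaded bits. The following sketch shows that construction. I assume, from its use on integer results throughout these tables, that 240 is the integer tag; the function names are illustrative.

```python
MASK64 = (1 << 64) - 1
INT_TAG = 240                          # assumed to be the integer tag

def tagged_int(value64):
    """Form the 72-bit value tag ∥ data for an integer."""
    return (INT_TAG << 64) | (value64 & MASK64)

def extend_load(t, bits, signed):
    """Zero- or sign-extend a loaded value of the given width to a tagged word."""
    t &= (1 << bits) - 1
    if signed and (t >> (bits - 1)) & 1:
        t |= MASK64 & ~((1 << bits) - 1)   # replicate the sign bit upward
    return tagged_int(t)
```

So an 8‑bit load of 0x80 gives data 0x80 for the unsigned form and 0xFFFFFFFFFFFFFF80 for the signed form, each under tag 240.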

32‑bit instruction format with 1 source 1 destination and 12‑bit immediate
31 28 27 20 19 16 15 12 11 8 7 4 3 0
op20 i op24 i a d op2
4 8 4 4 4 4 4

op0AId, a, imm AR[d] ← AR[a] op0 imm12
op0XId, a, imm XR[d] ← XR[a] op0 imm12
ao2I d, a, imm SR[d] ← SR[a] ao2 imm12
LAId, a, imm AR[d] ← lvload72(AR[a] +p imm12×8)
LXId, a, imm XR[d] ← lvload72(AR[a] +p imm12×8)
LId, a, imm SR[d] ← lvload72(AR[a] +p imm12×8)
LX32UId, a, imm t ← lvload32(AR[a] +p imm12×4)
XR[d] ← 240 ∥ 032 ∥ t
L32UId, a, imm t ← lvload32(AR[a] +p imm12×4)
SR[d] ← 240 ∥ 032 ∥ t
LX32SId, a, imm t ← lvload32(AR[a] +p imm12×4)
XR[d] ← 240 ∥ t3132 ∥ t
L32SId, a, imm t ← lvload32(AR[a] +p imm12×4)
SR[d] ← 240 ∥ t3132 ∥ t
LX16UId, a, imm t ← lvload16(AR[a] +p imm12×2)
XR[d] ← 240 ∥ 048 ∥ t
L16UId, a, imm t ← lvload16(AR[a] +p imm12×2)
SR[d] ← 240 ∥ 048 ∥ t
LX16SId, a, imm t ← lvload16(AR[a] +p imm12×2)
XR[d] ← 240 ∥ t1548 ∥ t
L16SId, a, imm t ← lvload16(AR[a] +p imm12×2)
SR[d] ← 240 ∥ t1548 ∥ t
LX8UId, a, imm t ← lvload8(AR[a] +p imm12)
XR[d] ← 240 ∥ 056 ∥ t
L8UId, a, imm t ← lvload8(AR[a] +p imm12)
SR[d] ← 240 ∥ 056 ∥ t
LX8SId, a, imm t ← lvload8(AR[a] +p imm12)
XR[d] ← 240 ∥ t756 ∥ t
L8SId, a, imm t ← lvload8(AR[a] +p imm12)
SR[d] ← 240 ∥ t756 ∥ t
LOOPXd XR[d] ← XR[a] − XR[b]
LOOPXId XR[d] ← XR[a] + imm12
MOVASd, a AR[d] ← SR[a]
MOVSAd, a SR[d] ← AR[a]
MOVABd, a AR[d] ← 240 ∥ 063 ∥ BR[a]
MOVBAd, a, imm6 BR[d] ← AR[a]imm6
MOVSBd, a SR[d] ← 240 ∥ 063 ∥ BR[a]
MOVBSd, a, imm6 BR[d] ← SR[a]imm6
MOVSBALLd SR[d] ← 240 ∥ 048 ∥ BR[15]∥BR[14]∥…∥BR[1]∥0
MOVSVMd, m, w SR[d] ← 240 ∥ VM[m]w×64+63..w×64
MOVVMSd, a, w VM[d]w×64+63..w×64 ← SR[a]

32‑bit instruction format with 2 sources 1 destination and 6‑bit immediate
31 28 27 22 21 16 15 12 11 8 7 4 3 0
op20 op21 imm6 b a d op2
4 6 6 4 4 4 4

FUNId, a, b, i t ← (SR[b]63..0∥SR[a]63..0) >> imm6
SR[d] ← 240 ∥ t63..0

32‑bit instruction format 3 sources 0 destination
31 28 27 22 21 20 19 16 15 12 11 8 7 4 3 0
op20 op21 m c b a op22 op2
4 6 2 4 4 4 4 4

SAc, a, b lvstore72(AR[a] +p XR[b]×8) ← AR[c]
SXc, a, b lvstore72(AR[a] +p XR[b]×8) ← XR[c]
Sc, a, b lvstore72(AR[a] +p XR[b]×8) ← SR[c]
SX32c, a, b lvstore32(AR[a] +p XR[b]×4) ← XR[c]31..0
S32c, a, b lvstore32(AR[a] +p XR[b]×4) ← SR[c]31..0
SX16c, a, b lvstore16(AR[a] +p XR[b]×2) ← XR[c]15..0
S16c, a, b lvstore16(AR[a] +p XR[b]×2) ← SR[c]15..0
SX8c, a, b lvstore8(AR[a] +p XR[b]) ← XR[c]7..0
S8c, a, b lvstore8(AR[a] +p XR[b]) ← SR[c]7..0
Blo2a, b branch if BR[a] lo2 BR[b]
(equivalent to BORlo2 b0, a, b)
BEQAa, b branch if AR[a] = AR[b]
(equivalent to BOREQA b0, a, b)
BEQXa, b branch if XR[a] = XR[b]
BNEAa, b branch if AR[a] ≠ AR[b]
BNEXa, b branch if XR[a] ≠ XR[b]
BLTAUa, b branch if AR[a] <u AR[b]
BLTXUa, b branch if XR[a] <u XR[b]
BGEAUa, b branch if AR[a] ≥u AR[b]
BGEXUa, b branch if XR[a] ≥u XR[b]
BLTXa, b branch if XR[a] <s XR[b]
BGEXa, b branch if XR[a] ≥s XR[b]
BNONEXa, b branch if (XR[a] & XR[b]) = 0
BANYXa, b branch if (XR[a] & XR[b]) ≠ 0
Blo1lo2c, a, b branch if BR[c] lo1 (BR[a] lo2 BR[b])
Blo1EQAc, a, b branch if BR[c] lo1 (AR[a] = AR[b])
Blo1EQXc, a, b branch if BR[c] lo1 (XR[a] = XR[b])
Blo1NEAc, a, b branch if BR[c] lo1 (AR[a] ≠ AR[b])
Blo1NEXc, a, b branch if BR[c] lo1 (XR[a] ≠ XR[b])
Blo1LTAUc, a, b branch if BR[c] lo1 (AR[a] <u AR[b])
Blo1LTXUc, a, b branch if BR[c] lo1 (XR[a] <u XR[b])
Blo1GEAUc, a, b branch if BR[c] lo1 (AR[a] ≥u AR[b])
Blo1GEXUc, a, b branch if BR[c] lo1 (XR[a] ≥u XR[b])
Blo1LTXc, a, b branch if BR[c] lo1 (XR[a] <s XR[b])
Blo1GEXc, a, b branch if BR[c] lo1 (XR[a] ≥s XR[b])
Blo1NONEXc, a, b branch if BR[c] lo1 ((XR[a] & XR[b]) = 0)
Blo1ANYXc, a, b branch if BR[c] lo1 ((XR[a] & XR[b]) ≠ 0)

32‑bit instruction format 2 sources 0 destination with 12‑bit immediate
31 28 27 20 19 16 15 12 11 8 7 4 3 0
op20 i c i a op22 op2
4 8 4 4 4 4 4

SAIc, a, imm lvstore72(AR[a] +p imm12×8) ← AR[c]
SIc, a, imm lvstore72(AR[a] +p imm12×8) ← SR[c]
SA32Ic, a, imm lvstore32(AR[a] +p imm12×4) ← AR[c]31..0
S32Ic, a, imm lvstore32(AR[a] +p imm12×4) ← SR[c]31..0
SA16Ic, a, imm lvstore16(AR[a] +p imm12×2) ← AR[c]15..0
S16Ic, a, imm lvstore16(AR[a] +p imm12×2) ← SR[c]15..0
SA8Ic, a, imm lvstore8(AR[a] +p imm12) ← AR[c]7..0
S8Ic, a, imm lvstore8(AR[a] +p imm12) ← SR[c]7..0
BEQXIa, imm12 branch if XR[a] = imm12
(equivalent to BOREQXI b0, a, imm12)
BNEXIa, imm12 branch if XR[a] ≠ imm12
BLTXUIa, imm12 branch if XR[a] <u imm12
BGEXUIa, imm12 branch if XR[a] ≥u imm12
BLTXIa, imm12 branch if XR[a] <s imm12
BGEXIa, imm12 branch if XR[a] ≥s imm12
BNONEXIa, imm12 branch if (XR[a] & imm12) = 0
BANYXI a, imm12 branch if (XR[a] & imm12) ≠ 0
Blo1XEQI c, a, imm12 branch if BR[c] lo1 (XR[a] = imm12)
Blo1XNEIc, a, imm12 branch if BR[c] lo1 (XR[a] ≠ imm12)
Blo1XLTUIc, a, imm12 branch if BR[c] lo1 (XR[a] <u imm12)
Blo1XGEUIc, a, imm12 branch if BR[c] lo1 (XR[a] ≥u imm12)
Blo1XLTIc, a, imm12 branch if BR[c] lo1 (XR[a] <s imm12)
Blo1XGEIc, a, imm12 branch if BR[c] lo1 (XR[a] ≥s imm12)
Blo1XNONEIc, a, imm12 branch if BR[c] lo1 ((XR[a] & imm12) = 0)
Blo1XANYIc, a, imm12 branch if BR[c] lo1 ((XR[a] & imm12) ≠ 0)
SWITCHRb PC ← PC +p (XR[b]×8)
Used when all cases are in the current bage.
SWITCHIa, imm12 PC ← AR[a] +p (imm12×8)
SWITCHa, b PC ← AR[a] +p (XR[b]×8)
LJMPIa, imm12 PC ← lvload72(AR[a] +p imm12×8)
LJMPa, b PC ← lvload72(AR[a] +p XR[b]×8)

32‑bit instruction format with 1 source 1 destination and 16‑bit immediate
31 28 27 12 11 8 7 4 3 0
op20 imm16 a d op2
4 16 4 4 4

ALLOCId, a, imm7 AR[d] ← 051∥imm7∥03
∥ imm7
∥ max(AR[a]63..61, PC63..61)
∥ (AR[a]60..0 + AR[a]132..72)
ALLOCId, a, imm16 lvstore72(AR[a] + 8) ← 254∥048∥imm16
AR[d] ← −(048∥imm16)
∥ 254
∥ max(AR[a]63..61, PC63..61)
∥ (AR[a]60..0+AR[a]132..72+16)
Primarily used for allocating stack frames with a15:
ALLOCIsp, sp, imm

32‑bit instruction format with 24‑bit immediate
31 8 7 4 3 0
imm24 d op2
24 4 4

XId, imm XR[d] ← 240 ∥ imm242340∥imm24
Id, imm SR[d] ← 240 ∥ imm242340∥imm24

Software Conventions

Data Types

I expect SecureRISC software to use the ILP64 model, where integers and pointers are both 64 bits. Even in the 1980s when MIPS was defining its 64‑bit ISA, I argued that integers should be 64 bits, but keeping integers 32 bits for C was considered sacred by others. The result is that an integer cannot index a large array, which is terrible. With ILP64, I don’t expect SecureRISC to need special 32‑bit add instructions (that sign-extend from bit 31 to bits 63..32).

Register Names and Uses

Direct Mapping and Paging

Translation of local virtual addresses to system interconnect addresses is typically performed in a single processor cycle in one of several L1 TLBs, which may be supplemented with one or more L2 TLBs. If the TLBs fail to translate the address, then the processor performs a more lengthy procedure, and if that succeeds, then the result is written into the TLBs to speed later translations. This TLB miss procedure determines the memory architecture. As described above, SecureRISC uses both segmentation and paging in its memory architecture. The first step of a TLB miss is therefore to determine a segment descriptor and then proceed as that directs. One way of thinking about SecureRISC segmentation is that it is a specialized first-level page table that controls the subsequent levels, including giving the page size and table depth (derived from the segment size).

SecureRISC segments may be directly mapped to an aligned system virtual address range equal to the segment size, or they may be paged. Direct mapping may be appropriate to I/O regions, for example. It consists of simply changing the high bits (above the segment size) of the local virtual address to the appropriate system virtual address bits, and leaving the low bits (below the segment size) unchanged.

Paging

Paging in SecureRISC takes advantage of segment sizes to be more efficient than in some ISAs and also supports multiple page sizes. The proposal for SecureRISC here is that each segment has a page size that is used for all subsequent levels, but this could be generalized to allow a programmable size at each level at the cost of complexity in hardware. My current thoughts on those page sizes are 4 KiB, 16 KiB, and 1 MiB, but which sizes would eventually be supported is a matter for evaluation. Only the last level page size affects the TLB in many micro-architectures, so page size at earlier levels is primarily a question of complexity in the hardware table walk. Supporting multiple page sizes in TLBs is costly, and should be done in a limited way, and I am concerned even about proposing three sizes.

Aside: 1024 words (which became 4 KiB with byte addressing) was frequently chosen as the page size back in the 1960s as the trade-off between the memory wasted by allocating in page units and the size of the page table. This size has been carried forward with some variation for decades. The trade-offs are different in the 2020s from the 1960s, so it deserves another look. Even the old 1024 words would suggest a page size of 8 KiB today, given 8‑byte words. Today, with much larger address spaces, multi-level page tables are typically used, often with the same page size at each level. The number of levels, and therefore the TLB miss penalty, is then a factor in the page size consideration that did not exist in the 1960s.

Aside: RISC-V’s Sv39 model has three page sizes for TLBs to match: 4 KiB, 2 MiB, and 1 GiB. The large page sizes were chosen as early outs from multi-level table walks, and don’t necessarily represent optimal sizes for things like I/O mapping or large HPC workloads.

Address Space Identifiers (ASIDs) for TLB Sharing

TLBs introduce one other complication. Typically when the supervisor switches from one process to another, it changes the segment and page tables. Absent an optimization, it would be necessary to flush the TLBs on any change in the tables, which is costly both in the cycles to flush and in the misses that follow while the TLBs are reloaded after the switch. Most processors with TLBs introduce a mechanism to reduce how often the TLB must be flushed, such as the Address Space Identifier (ASID) found in the MIPS translation hardware. The ASID is stored in the TLB, and when the supervisor switches to a new process, it either uses the process’ previous ASID, or assigns a new one if the TLB has been flushed since the last time the process ran. This allows its previous TLB entries to be used if they are still present in the TLB, and also avoids the TLB flush. When the ASIDs are used up, the TLB is flushed, and then ASID assignment starts fresh as processes are run. For example, a 5‑bit ASID would then require a TLB flush only when the 33rd distinct process is run after the last flush. The supervisor often uses translation and paging for its own data structures, some of which are process-specific, and some of which are common. To not require multiple TLB entries for the supervisor pages common between processes, a Global bit was introduced in the MIPS and other TLBs. This bit caused the TLB entry to ignore the ASID during the match process; such entries match any ASID. This whole issue occurs a second time when hypervisors switch between multiple guest operating systems, each of which thinks it controls the ASIDs in the TLB. RISC-V for example introduced a VMID controlled by the hypervisor that works analogously to the ASID.
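
The ASID reuse policy described above can be sketched as a small model. This is a minimal sketch under stated assumptions: the class AsidAllocator, its generation counter, and the flush_tlb callback are all hypothetical names chosen for illustration, and a real supervisor would track this per processor.

```python
class AsidAllocator:
    """Toy model of supervisor ASID assignment (all names are illustrative)."""

    def __init__(self, asid_bits, flush_tlb):
        self.num_asids = 1 << asid_bits
        self.flush_tlb = flush_tlb      # callback standing in for a hardware TLB flush
        self.generation = 0             # incremented on every flush
        self.next_asid = 0
        self.assigned = {}              # process id -> (generation, asid)

    def asid_for(self, pid):
        entry = self.assigned.get(pid)
        if entry is not None and entry[0] == self.generation:
            return entry[1]             # no flush since this process last ran: reuse
        if self.next_asid == self.num_asids:
            self.flush_tlb()            # ASIDs used up: flush, then start fresh
            self.generation += 1
            self.next_asid = 0
        asid = self.next_asid
        self.next_asid += 1
        self.assigned[pid] = (self.generation, asid)
        return asid
```

With asid_bits=5, the first 32 distinct processes each receive an ASID without any flush, and the flush callback first fires when a 33rd distinct process is scheduled, matching the example in the text.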

SecureRISC needs an ASID mechanism and a way to ignore the ASID for the same reason as in other ISAs. The question is whether this mechanism needs to be generalized, just as rings are a generalization of supervisor and user mode. I propose one such possible generalization with eight possible sharing opportunities, but whether this is required may be reevaluated. Perhaps SecureRISC will revert to a simple Global bit or just ASID=0 to mean Global. There is no particular reason to choose eight. Below is the mechanism proposed. Again, we expect that various service levels in the system will have some segments common to all of the service levels that they support, and that these should require only a single TLB entry, but that other segments might change their translation for each supported service level.

Segment Descriptor Table Pointer Registers and ASIDs

The simplest implementation for a Segment Descriptor Table (SDT) is to have a single Segment Descriptor Table Pointer (SDTP) register and use a Global bit in PTEs. My alternative ASID generalization is to group segments into eight segment groups (SG), and give each group its own SDT, as addressed by eight SDTP registers. These eight registers are then the zero level table, followed by the chosen Segment Descriptor Table (the first level), followed by zero to four levels of page table. Since the registers are not in memory, there are one to five levels of memory tables to walk starting with the Segment Descriptor Table. The segment size in the SDT allows the length of the walk to be per-segment, so most code segments (e.g. shared libraries) will have only one level of page table, but a process with a segment for weather data might require two or three levels (and might use a large page size as well to minimize TLB misses). Some hypervisor segments might be direct-mapped, and require only the SDT level of mapping. In addition, if the hypervisor is not paging the supervisor, it might direct map many supervisor segments.

Here are the details. After a TLB miss, the processor starts by using the 3 high bits of the segment field of the address to pick one of eight Segment Descriptor Table Pointer registers (sdtp[0] to sdtp[7]). The low 10 bits of the segment field are then an index into the table at the system virtual address in the specified register. The size field of the sdtp register is used to bounds check the low 10 bits of the segment number before indexing, which allows each portion of the Segment Descriptor Table to be 0, 256, 512, 768, or 1024 entries in multiples of 256 entries (4 KiB to 16 KiB in multiples of 4 KiB). A size field of 0 disables the segment group; otherwise, the check is that svaddr57..56 < sdtp[svaddr60..58]11..9. If the bounds check succeeds, the doubleword Segment Descriptor Entry is read from (sdtp[svaddr60..58]63..12 ∥ 0^12) | (svaddr57..48 ∥ 0^4) and this descriptor is used to bounds check the segment offset, and to generate a system virtual address. When TLB entries are created to speed future translations, they use the Address Space Identifier (ASID) specified in bits 8..0 of the selected sdtp.
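
The lookup just described (pick a register from bits 60..58, bounds check bits 57..56 against the size field, then index with bits 57..48 at 16 bytes per entry) can be sketched as follows. This is a sketch only, assuming the SDTP layout of svaddress in bits 63..12, size in bits 11..9, and ASID in bits 8..0; the function name sde_address and the use of None to signal a failed bounds check are illustrative choices, and the tag byte is ignored.

```python
def sde_address(addr, sdtp):
    """Return (address of the Segment Descriptor Entry, ASID), or None if the bounds check fails.

    addr is the address being translated; sdtp is a list of eight 64-bit register
    values. sde_address is an illustrative name, not an ISA-defined operation.
    """
    group = (addr >> 58) & 0x7            # bits 60..58 pick one of sdtp[0]..sdtp[7]
    index = (addr >> 48) & 0x3FF          # bits 57..48 index the selected table
    reg = sdtp[group]
    size = (reg >> 9) & 0x7               # table size in units of 256 entries
    if (index >> 8) >= size:              # bits 57..56 must be < size; size 0 disables the group
        return None
    base = reg & ~0xFFF                   # svaddress 63..12 followed by 12 zero bits
    return base | (index << 4), reg & 0x1FF   # 16-byte entries; ASID is bits 8..0
```

Because the table base is 4 KiB aligned and the scaled index is below 16 KiB, a bitwise OR would only be safe for suitably aligned tables; the sketch follows the text in using OR rather than addition.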

This method can be used to provide the functionality of two levels of other architectures (i.e. supervisor common using Global=1 and per-process using Global=0). A SecureRISC supervisor might simply use 256-1024 segments for supervisor common (with ASID=0), and another 256-1024 segments for per-process mappings, with ASIDs dynamically assigned as processes are run. Such a system might set sdtp[7] at initialization, change sdtp[0] on process switch, and leave the other six groups unused (size=0).

Segment Descriptor Table Pointer registers are only readable and writable by ring 0. Other rings must use ring 0 calls to read and write these registers.

Segment Descriptor Table Pointers
71 64 63 12 11 9 8 0
240 svaddress63..12 size ASID
8 52 3 9

Segment Descriptor Entry Word 0
71 64 63 33 32 31 30 29 28 24 23 21 20 18 17 15 14 12 11 10 9 8 7 6 5 0
240 0 G1 G0 PS 0 R3 R2 R1 0 C P X W R size
8 31 2 2 5 3 3 3 3 1 1 1 1 1 1 6

Fields of Segment Descriptor Entries (SDEs)
Field(s) Width Description
size 6 Segment size is 2^size bytes. Value 0 indicates an invalid segment (or should there be a V bit?). Values 1..11 are reserved.
R 1 Read permission
W 1 Write permission
X 1 Execute permission
P 1 Pointer permission (pointers with segment numbers are permitted)
C 1 CHERI Capability permission
R1, R2, R3 3 Ring brackets as described elsewhere.
PS 5 Page size:
0 ⇒ 4 KiB
1 ⇒ reserved
2 ⇒ 16 KiB
3 to 7 ⇒ reserved
8 ⇒ 1 MiB
9 to 30 ⇒ reserved
31 ⇒ direct mapped, unpaged
G0 2 Generation number of this segment for GC.
G1 2 Largest generation of any contained pointer for GC. Storing a pointer with a greater generation number to this segment traps and software lowers the G1 field. This feature is turned off by setting G1 to 3.
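
As an illustration of the Word 0 layout, the fields can be unpacked as below. This is a hedged sketch: the names SDE0, decode_sde_word0, and page_bytes are hypothetical, the tag byte in bits 71..64 is ignored, and the rule that a defined PS value encodes a page of 2^(12+PS) bytes is inferred from the three defined encodings rather than stated as a general rule.

```python
from collections import namedtuple

# Illustrative container; field names follow the SDE Word 0 figure.
SDE0 = namedtuple("SDE0", "size R W X P C R1 R2 R3 PS G0 G1")

def decode_sde_word0(word):
    """Unpack the low 64 bits of Segment Descriptor Entry Word 0."""
    def bits(hi, lo):
        return (word >> lo) & ((1 << (hi - lo + 1)) - 1)
    return SDE0(size=bits(5, 0), R=bits(6, 6), W=bits(7, 7), X=bits(8, 8),
                P=bits(9, 9), C=bits(10, 10), R1=bits(14, 12), R2=bits(17, 15),
                R3=bits(20, 18), PS=bits(28, 24), G0=bits(30, 29), G1=bits(32, 31))

def page_bytes(ps):
    """Page size in bytes for a PS value; None means direct mapped (PS=31)."""
    if ps == 31:
        return None
    if ps not in (0, 2, 8):               # every other encoding is reserved
        raise ValueError("reserved PS encoding")
    return 1 << (12 + ps)                 # inferred: 0, 2, 8 give 4 KiB, 16 KiB, 1 MiB
```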

Segment Descriptor Entry Word 1
71 64 63 12 11 0
240 svaddress63..12 0
8 52 12

The interpretation of the system virtual address in SDE word 1 depends on the PS field in SDE word 0. For direct mapping (PS=31), it is simply the high bits of the System Virtual Address to combine with the segment offset, and bits size-1..0 must be zero. For paging, it is the address of the first-level page table, and bits PS+11..0 must be zero.

Local Virtual Address Direct Mapping

For direct mapping, the segment mapping consists of:

  1. Checking that the offset is not out of bounds for segments ≤ 2^48 bytes or clearing bits 60..size for segments > 2^48 bytes.
  2. Checking that the mapping is aligned to the segment size.
  3. Oring the offset with the mapping. The two checks above ensure that the OR never sees two ones in the same bit position.

For segments ≤ 2^48 bytes, the offset is simply bits 47..0 of the local virtual address, and so the first check is that bits 47..size are zero, or equivalently that svaddr47..0 < 2^size. For segments > 2^48 bytes, the offset extends into the segment number field, and no checking need be done during mapping (such sizes are however used during checking address arithmetic), but bits 60..size must be cleared before oring. The second check is that bits size−1..0 of the mapping are zero. The supervisor is responsible for providing the appropriate values in the Segment Descriptor Entries for each portion of segments > 2^48 bytes. Thus paging does not need to handle segments larger than 2^48 bytes (the SDT for such segments is in effect the first level of the page table).
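
The three direct-mapping steps can be sketched as follows. This is a minimal sketch, assuming size is the 6‑bit size field from SDE word 0 and mapping is the system virtual address from SDE word 1; the function name direct_map is hypothetical, and ValueError stands in for whatever trap the hardware would raise.

```python
def direct_map(lvaddr, size, mapping):
    """Translate a 61-bit local virtual address through a direct-mapped segment.

    direct_map is an illustrative name; ValueError is a stand-in for a trap.
    """
    if size <= 48:
        offset = lvaddr & ((1 << 48) - 1)     # offset is bits 47..0
        if offset >> size:                    # bits 47..size must be zero
            raise ValueError("offset out of bounds")
    else:
        offset = lvaddr & ((1 << size) - 1)   # keep bits size-1..0, clearing 60..size
    if mapping & ((1 << size) - 1):           # bits size-1..0 of the mapping must be zero
        raise ValueError("mapping not aligned to segment size")
    return mapping | offset                   # no bit position has two ones
```

For example, a 64 KiB segment (size=16) mapped at 0xABCD0000 translates offset 0x1234 to 0xABCD1234, and any offset of 0x10000 or more fails the first check.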

Local Virtual Address Paging

When paging is used, the page tables can be one to four levels deep, with each level after the first using the same page size. The first level uses some fraction of the specified page size depending on the segment size. The hypervisor and supervisors are free to allocate only as much memory as needed for the first level page table. The following tables provide examples of how the local virtual address is used to index levels of the page table for several page and segment sizes. In the figures below, the 13‑bit segment number is split into a 3‑bit segment group (SG) number (used to pick the SDTP register) and the offset (SEG) within that group.

Local Virtual Address with 4 KiB page size and 2^48 segment size: 4‑level page table
60 58 57 48 47 39 38 30 29 21 20 12 11 0
SG SEG V1 V2 V3 V4 offset
3 10 9 9 9 9 12

Local Virtual Address with 4 KiB page size and 2^30 segment size: 2‑level page table
60 58 57 48 47 30 29 21 20 12 11 0
SG SEG 0 V1 V2 offset
3 10 18 9 9 12

Local Virtual Address with 16 KiB page size and 2^48 segment size: 4‑level page table
60 58 57 48 47 46 36 35 25 24 14 13 0
SG SEG V1 V2 V3 V4 offset
3 10 1 11 11 11 14

Local Virtual Address with 1 MiB page size and 2^48 segment size: 2‑level page table
60 58 57 48 47 37 36 20 19 0
SG SEG V1 V2 offset
3 10 11 17 20
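
The index widths in the figures above follow from the segment size and page size. The sketch below reproduces them, assuming each page table entry occupies one 8‑byte word of address space, so a full page indexes log2(page size) − 3 bits, and that any leftover bits go to the first level. The function name page_table_split is hypothetical.

```python
def page_table_split(segment_bits, page_bits):
    """Return the index width of each page table level, first level first.

    segment_bits and page_bits are log2 of the segment and page sizes.
    page_table_split is an illustrative name.
    """
    per_level = page_bits - 3                      # 8-byte entries per page, as a bit count
    remaining = segment_bits - page_bits           # bits to translate above the page offset
    if remaining <= 0:
        return []                                  # segment fits in a single page
    levels = -(-remaining // per_level)            # ceiling division
    first = remaining - (levels - 1) * per_level   # first level may use part of a page
    return [first] + [per_level] * (levels - 1)
```

For the four figures this gives [9, 9, 9, 9], [9, 9], [1, 11, 11, 11], and [11, 17] respectively, which shows how a partially used first level arises when the bit count does not divide evenly.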

The format of a segment page table is multiple levels, each level consisting of 72‑bit words with integer tags in the following format:

Page Table Entry (PTE)
71 64 63 12 11 8 7 6 5 4 3 2 1 0
240 svaddress63..12 S G D A X W R V
8 52 4 2 1 1 1 1 1 1

Segments are meant as the unit of access control, but including Read, Write, and Execute permissions in the PTE might make ports of less aware operating systems easier.

Fields of Page Table Entries (PTEs)
Field(s) Width Description
V 1 Valid:
0 ⇒ invalid, bits 63..1 available for software
1 ⇒ valid, bits 63..1 as described below
R 1 Read permission
W 1 Write permission
X 1 Execute permission
A 1 Accessed:
0 ⇒ trap on any access (software sets A to continue)
1 ⇒ access allowed
D 1 Dirty:
0 ⇒ trap on any write (software sets D to continue)
1 ⇒ writes allowed
G 2 Largest generation of any contained pointer for GC. Storing a pointer with a greater generation number to this page traps and software lowers the G field. This feature is turned off by setting G to 3.
S 4 For software use
svaddress63..12 52 For last level of page table, this is the translation
For earlier levels, this is the pointer to the next level
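
A last-level PTE check along the lines of the field table might look like the following. This is a sketch under assumptions: the function name pte_outcome, the result tuples, and the order in which the checks are applied are illustrative choices that the text does not specify, and the G and S fields are ignored.

```python
def pte_outcome(pte, access):
    """Classify an access ('r', 'w', or 'x') against a last-level PTE.

    Returns ("ok", page base) or ("fault", reason). The check order is an
    assumption made for illustration only.
    """
    if not pte & 0x1:                               # V, bit 0
        return ("fault", "invalid")
    perm_bit = {"r": 1, "w": 2, "x": 3}[access]     # R, W, X are bits 1, 2, 3
    if not (pte >> perm_bit) & 1:
        return ("fault", "permission")
    if not (pte >> 4) & 1:                          # A = 0 traps on any access
        return ("fault", "accessed")
    if access == "w" and not (pte >> 5) & 1:        # D = 0 traps on any write
        return ("fault", "dirty")
    return ("ok", pte & ~0xFFF & ((1 << 64) - 1))   # svaddress 63..12 with zero low bits
```

The "accessed" and "dirty" outcomes correspond to the traps in which software sets A or D and then continues.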

System Virtual to System Interconnect Address Mapping

After 61‑bit Local Virtual Addresses are mapped to 64‑bit System Virtual Addresses, these 64‑bit System Virtual Addresses are mapped to 64‑bit System Interconnect Addresses. This mapping is similar, but not identical, to the mapping above, as it starts with a 14‑bit region number rather than a 13‑bit segment number. There is one such mapping set by the hypervisor for the entire system using a Region Descriptor Table (RDT) at a fixed system address. For the maximum 16,384 regions, with 16 bytes for an RDT entry, the maximum size RDT is 256 KiB. A system configuration parameter allows the size of the RDT to be reduced when the full number of regions is not required (which is likely).

The format of the Region Descriptor Entries is a simplified version of Segment Descriptor Entries as shown below.

Region Descriptor Entry Word 0
71 64 63 26 25 14 13 12 11 10 6 5 0
240 0 NDA C W R PS size
8 38 12 1 1 1 5 6

Region Descriptor Entry Word 1
71 64 63 4 3 0
240 System Interconnect Address 0
8 60 4

The format of a region page table is multiple levels, each level consisting of 72‑bit words with integer tags in the same format as PTEs for local virtual to system virtual mapping, except there are no X or G fields.

TLB Flushing

Reading Segment Descriptor Entries (SDEs) from the Segment Descriptor Table (SDT) and Region Descriptor Entries (RDEs) from the Region Descriptor Table (RDT) would typically be done through the L2 Data Cache. Since the L2 Data Cache is coherent with respect to this and other processors in the system, it is possible that the L2 Data Cache might note that the TLB contains entries from the line, and send an invalidate to the TLB when the L2 line is invalidated. This might avoid the need for some TLB flushes. However, this requires the L2 to store the TLB location, which might require 8 bits per L2 tag. It is unclear whether this is worthwhile.

Region Protection

Ports into the system interconnect are limited in which regions they are permitted to access. The exact mechanism is TBD.

Memory Encryption

An optional system feature of RDEs is to specify that the contents of the memory of the region should be protected with data at rest encryption. A separate table (perhaps in a secure enclave) would give the symmetric encryption key for encrypting and decrypting data transferred to and from the region and the system virtual address would be used as the tweak. An obvious possibility is a 144‑bit block size cipher (e.g. a variant of AES based upon 9‑bit S‑boxes) used in Galois/Counter Mode (GCM), resulting in a 144‑bit authentication code, which would be stored in memory with the block. For SecureRISC, with a cache line size of 8 words of 72 bits, this results in 576‑bit entities for data at rest protection, which becomes 720 bits with the authentication code, or eight 90‑bit words, which would be ECC protected with 8 check bits, producing a 98‑bit memory word. This would be an unusual width for standard DRAMs, and 9 ECC bits per 180 would also be unusual. Instead consider the 576 bits to be 9 words of 64 bits, use a more standard 128‑bit block cipher (e.g. standard AES) nine times, add the 128‑bit authentication, resulting in 704 bits, or eight words of 88 bits. Adding 8 bits of ECC results in 96 bits per memory word, which might use three 32‑bit or 64‑bit DRAM modules. Reads of encrypted memory would compute the 576 GCM xor bits during the read latency, resulting in a single xor when the data arrives at the system interconnect port boundary (either 96, 192, 384, or 576 bits per cycle). This xor would be much less time than the ECC check. Regenerating ECC for the decrypted data for writing into the L2 cache can be done by also precomputing the 64 bits to xor with the 8 ECC codes. Only if an ECC error is detected and corrected is it necessary to recompute the ECC before writing into the L2 cache. Writes would incur the GCM computation latency (primarily nine AES computations). 
Because the memory width and interconnect fabric would be sized for encryption, the only point in not encrypting a region would be to reduce write latency or to support non-block writes (it being impossible to update the authentication code without doing a read, modify, write).
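
The width arithmetic for the two options above can be checked with a few lines. This is only a worked calculation; the function name memory_word_bits is hypothetical, and the split into eight memory words with 8 ECC bits each is taken from the text.

```python
def memory_word_bits(auth_bits, line_words=8, word_bits=72, ecc_bits=8):
    """Return (line bits with authentication code, bits per memory word, bits per word with ECC).

    memory_word_bits is an illustrative name for a worked calculation.
    """
    line_bits = line_words * word_bits          # 8 words of 72 bits = 576 bits
    total = line_bits + auth_bits               # authentication code stored with the block
    per_word, leftover = divmod(total, line_words)
    assert leftover == 0, "total must divide evenly into memory words"
    return total, per_word, per_word + ecc_bits
```

With a 144‑bit authentication code this gives 720 bits, 90‑bit words, and 98 bits with ECC; with a 128‑bit code it gives 704 bits, 88‑bit words, and 96 bits with ECC, which is the more conventional width.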

An Example Micro-architecture

Structure Description
Basic Block Descriptor Fetch
Predicted PC 64‑bit lvaddr and ring
Predicted loop iteration 64‑bit integer
Predicted CSP 64‑bit lvaddr and ring
L1 BB Descriptor TLB 32 entry, 8‑way set associative,
mapping lvaddr61..12 to siaddr63..12 in parallel with BB Descriptor Cache,
filled from L2 Descriptor/Instruction TLB
L2 BB Descriptor TLB 256 entry, 8‑way set associative,
filled from L2 Data Cache
BB Descriptor Cache 32 KiB (4096 descriptors), 8‑way set associative,
64‑byte line size, 8‑byte read, 64‑byte write,
lvaddr11..3 index, ?siaddr35..12 tag?,
1.5 cycles latency, 2 cycles to predicted PC,
filled from L2 Descriptor/Instruction Cache on miss and by prefetch
Next Descriptor Index Predictor 32×10+12, direct mapped
lvaddr7..3 index, lvaddr19..8 tag,
1 cycle to predicted BB Descriptor Cache index,
most recent flow change hits from BB Descriptor Cache
Return Address Prediction 64-entry (512 B)
Branch Predictor ~16 KiB BATAGE
Indirect Jump/Call Predictor ~16 KiB ITTAGE?
BB Fetch Output 8‑entry BB Descriptor Queue of PC, BB type, fetch count, fetch siaddr63..2, instruction start mask, prediction to check
Instruction Fetch
L1 Instruction Cache 64 KiB, 4‑way set associative, 64‑byte line, read, write
siaddr13..4 index, siaddr63..14 tag,
2-cycle latency, use 0*-2 times per basic block descriptor, so 0 or 2-3 cycles for entire BB instruction fetch,
filled from L2 Descriptor/Instruction Cache on miss and prefetch,
experiment with prefetch on BB descriptor fill
* 0 fetches required if the previous 64B fetch covers the current one
L2 Fetch
L2 Combined Descriptor/Instruction Cache 512 KiB, 8‑way set associative, 64‑byte line, read, write,
siaddr15..6 index, siaddr63..16 tag,
filled from system interconnect or L3 on miss and prefetch, evictions to L3
Instruction Fetch Output 32‑entry Instruction Queue of 50‑bit decoded instructions
(16‑bit and 32‑bit instructions expanded)
AR Execution Unit
PC, CSP Committed values
Register renaming for ARs 16×6 4‑read, 4‑write register file mapping 4‑bit a, b, c fields to physical AR numbers and assigning d from AR free list.
Register renaming for XRs 16×6 8‑read, 4‑write register file mapping 4‑bit a, b fields to physical XR numbers and assigning d from XR free list.
Register renaming for BRs 16×6 6‑read, 2‑write register file mapping 4‑bit a, b, c fields to physical BR numbers and assigning d from BR free list.
Register renaming for SRs 16×6 8‑read, 4‑write register file mapping 4‑bit a, b, c fields to physical SR numbers and assigning d from SR free list.
(VRs are not renamed)
AR physical register file 64×144 (+ parity) 6‑read, 4‑write
XR physical register file 64×72 (+ parity) 6‑read, 4‑write
L1 Data TLB 64 entry, 8‑way set associative,
mapping lvaddr to siaddr, filled from L2 Data TLB
L2 Data TLB 256 entry, 8‑way set associative,
filled from L2 Data Cache
L1 Data Cache 32 KiB, 4‑way set associative, 64‑byte line, 16‑byte read, 64‑byte write,
lvaddr index, siaddr tag, write-thru,
filled from L2 Data Cache on miss or prefetch
Return Address Stack Cache 64-entry (512 B), 64‑byte line size, no tags, fill and writeback to L2 Data Cache, subset and coherent with L2 Data Cache
L2 Data Cache 512 KiB, 8‑way set associative, 64‑byte line, read, write,
siaddr15..6 index, siaddr63..16 tag, write-back,
filled from system interconnect or L3 on miss or prefetch, eviction to L3
AR Engine Output 64‑entry BR/SR/VR operation queue
SR/VR Execution Unit
(tends to run about a L2 Data Cache latency behind the AR Execution Unit)
BR physical register file 64×1 6‑read, 2‑write
SR physical register file 64×72 (+ parity) 8‑read, 4‑write
VR register file 16×72×128 (+ parity) 4‑read, 2‑write
Combined Fetch/Data
System virtual address TLB 128 entry, 8‑way set associative,
mapping system virtual addresses to system interconnect addresses
(maintained by hypervisor)
L3 Eviction Cache serving L2 Instruction and L2 Data caches 8 MiB, 8‑way set associative, 64‑byte line size, non-inclusive, plus 8‑way set associative directory for sub caches,
filled from evictions from L2 Instruction and Data caches
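
The index bit ranges quoted for the caches in the table follow from capacity, associativity, and access granule. The sketch below recomputes them; the function name index_bit_range is hypothetical, and using the read width as the low index bit for the BB Descriptor Cache and the line size for the L2 caches is an inference from the table, not a stated rule.

```python
def index_bit_range(size_bytes, ways, granule_bytes):
    """Return (high, low) address bit numbers that index one way of a set-associative cache.

    All three arguments are assumed to be powers of two.
    index_bit_range is an illustrative name.
    """
    way_bytes = size_bytes // ways              # bytes covered by one way
    high = way_bytes.bit_length() - 2           # log2(way_bytes) - 1
    low = granule_bytes.bit_length() - 1        # log2(granule_bytes)
    return high, low
```

A 512 KiB, 8‑way cache with 64‑byte lines yields bits 15..6, and a 32 KiB, 8‑way cache read 8 bytes at a time yields bits 11..3, matching the L2 and BB Descriptor Cache rows.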

Questions and Things Still Undecided

Tag Summary

Tag Use
0..183 Sized pointers
184..191 Unsized pointers
192 Pointer to Basic Block Descriptor
193 CHERI Capability
194..223 Reserved
224 Lisp CONS
225 Lisp Function
226 Lisp Symbol
227 Lisp/Julia Structure
228..229 Reserved
230 Lisp Array
231 Lisp Vector
232 Lisp String
233 Lisp Bit-vector
234 Lisp Ratio, Julia Rational
235 Lisp/Julia Complex
236 Lisp Bigfloat
237 Lisp Bignum
238 128‑bit integer
239 128‑bit unsigned integer
240 64‑bit integer
241 64‑bit unsigned integer
242 Small integer types
243 Reserved
244 Double-precision floating-point
245 8, 16, and 32‑bit floating-point
246..251 Reserved
252 Bits 143..136 of AR doubleword store (used for save/restore and CHERI capabilities)
253 Basic Block Descriptor
254 Size header/trailer words
255 Trap on load or store
Earl Killian <webmaster at killian.com>
SecureRISC/index.html 2022-08-16