Physical Memory¶
Linux is available for a wide range of architectures so there is a need for an architecture-independent abstraction to represent the physical memory. This chapter describes the structures used to manage physical memory in a running system.
The first principal concept prevalent in the memory management is Non-Uniform Memory Access (NUMA). With multi-core and multi-socket machines, memory may be arranged into banks that incur a different cost to access depending on the “distance” from the processor. For example, there might be a bank of memory assigned to each CPU or a bank of memory very suitable for DMA near peripheral devices.
Each bank is called a node and the concept is represented under Linux by a
struct pglist_data even if the architecture is UMA. This structure is
always referenced by its typedef pg_data_t. A pg_data_t structure
for a particular node can be referenced by NODE_DATA(nid) macro where
nid is the ID of that node.
For NUMA architectures, the node structures are allocated by the architecture
specific code early during boot. Usually, these structures are allocated
locally on the memory bank they represent. For UMA architectures, only one
static pg_data_t structure called contig_page_data is used. Nodes will
be discussed further in Section Nodes.
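The lookup described above can be sketched in user space. This is only a toy model: the `toy_` names, the `TOY_MAX_NUMNODES` limit and the two separate macros are assumptions made for illustration, and the real definitions live in architecture and core kernel headers.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NUMNODES 4	/* hypothetical limit for this sketch */

/* Toy stand-in for struct pglist_data; only the node ID is modelled. */
typedef struct toy_pglist_data {
	int node_id;
} toy_pg_data_t;

/* NUMA flavour: an array of per-node pointers filled in during "boot". */
static toy_pg_data_t *toy_node_data[TOY_MAX_NUMNODES];
#define TOY_NUMA_NODE_DATA(nid)	(toy_node_data[(nid)])

/* UMA flavour: one static structure, whatever nid is passed. */
static toy_pg_data_t toy_contig_page_data = { .node_id = 0 };
#define TOY_UMA_NODE_DATA(nid)	((void)(nid), &toy_contig_page_data)

/* Register a node structure, as architecture code would do early in boot. */
static void toy_register_node(toy_pg_data_t *pgdat, int nid)
{
	assert(pgdat != NULL);
	assert(nid >= 0 && nid < TOY_MAX_NUMNODES);
	pgdat->node_id = nid;
	toy_node_data[nid] = pgdat;
}
```

In this model the NUMA macro returns whatever structure was registered for the node ID, while the UMA macro ignores its argument and always yields the single static structure.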
The entire physical address space is partitioned into one or more blocks
called zones which represent ranges within memory. These ranges are usually
determined by architectural constraints for accessing the physical memory.
The memory range within a node that corresponds to a particular zone is
described by a struct zone. Each zone has
one of the types described below.
ZONE_DMA and ZONE_DMA32 historically represented memory suitable for DMA by peripheral devices that cannot access all of the addressable memory. For many years there have been better and more robust interfaces to get memory with DMA specific requirements (Dynamic DMA mapping using the generic device), but ZONE_DMA and ZONE_DMA32 still represent memory ranges that have restrictions on how they can be accessed. Depending on the architecture, either or both of these zone types can be disabled at build time using the CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 configuration options. Some 64-bit platforms may need both zones as they support peripherals with different DMA addressing limitations.
ZONE_NORMAL is for normal memory that can be accessed by the kernel all the time. DMA operations can be performed on pages in this zone if the DMA devices support transfers to all addressable memory. ZONE_NORMAL is always enabled.
ZONE_HIGHMEM is the part of the physical memory that is not covered by a permanent mapping in the kernel page tables. The memory in this zone is only accessible to the kernel using temporary mappings. This zone is available only on some 32-bit architectures and is enabled with CONFIG_HIGHMEM.
ZONE_MOVABLE is for normal accessible memory, just like ZONE_NORMAL. The difference is that the contents of most pages in ZONE_MOVABLE are movable. That means that while virtual addresses of these pages do not change, their content may move between different physical pages. Often ZONE_MOVABLE is populated during memory hotplug, but it may also be populated on boot using one of the kernelcore, movablecore and movable_node kernel command line parameters. See Page migration and Memory Hot(Un)Plug for additional details.
ZONE_DEVICE represents memory residing on devices such as PMEM and GPU. It has different characteristics than RAM zone types and it exists to provide struct page and memory map services for device driver identified physical address ranges. ZONE_DEVICE is enabled with the configuration option CONFIG_ZONE_DEVICE.
It is important to note that many kernel operations can only take place using
ZONE_NORMAL so it is the most performance critical zone. Zones are
discussed further in Section Zones.
The relation between node and zone extents is determined by the physical memory map reported by the firmware, architectural constraints for memory addressing and certain parameters in the kernel command line.
For example, with a 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the
entire memory will be on node 0 and there will be three zones: ZONE_DMA,
ZONE_NORMAL and ZONE_HIGHMEM:
0                                                            2G
+-------------------------------------------------------------+
|                           node 0                            |
+-------------------------------------------------------------+
0         16M                    896M                        2G
+----------+-----------------------+--------------------------+
| ZONE_DMA |      ZONE_NORMAL      |       ZONE_HIGHMEM       |
+----------+-----------------------+--------------------------+
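The layout in the diagram above can be expressed as a small lookup table. The following is a minimal user-space sketch, not kernel code: the `toy_` names and the table are hypothetical, and only the 16M, 896M and 2G boundaries are taken from the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MiB(x)	((uint64_t)(x) << 20)
#define TOY_GiB(x)	((uint64_t)(x) << 30)
#define TOY_ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* One zone: its name and the first physical address past its end. */
struct toy_zone_range {
	const char *name;
	uint64_t end;		/* exclusive */
};

/* Zones of the 32-bit x86 example, lowest first. */
static const struct toy_zone_range toy_x86_32_zones[] = {
	{ "ZONE_DMA",     TOY_MiB(16)  },
	{ "ZONE_NORMAL",  TOY_MiB(896) },
	{ "ZONE_HIGHMEM", TOY_GiB(2)   },
};

/*
 * Return the index of the zone containing physical address 'phys',
 * or -1 if the address lies beyond the last zone.
 */
static int toy_zone_of(const struct toy_zone_range *zones, size_t n,
		       uint64_t phys)
{
	size_t i;

	assert(zones != NULL && n > 0);
	for (i = 0; i < n; i++) {
		if (phys < zones[i].end)
			return (int)i;
	}
	return -1;
}
```

With this table, an address just below 16M maps to index 0 (ZONE_DMA), 16M itself maps to index 1 (ZONE_NORMAL), and 896M maps to index 2 (ZONE_HIGHMEM).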
With a kernel built with ZONE_DMA disabled and ZONE_DMA32 enabled and
booted with movablecore=80% parameter on an arm64 machine with 16 Gbytes of
RAM equally split between two nodes, there will be ZONE_DMA32,
ZONE_NORMAL and ZONE_MOVABLE on node 0, and ZONE_NORMAL and
ZONE_MOVABLE on node 1:
1G                                9G                        17G
+--------------------------------+ +--------------------------+
|             node 0             | |          node 1          |
+--------------------------------+ +--------------------------+
1G       4G        4200M          9G          9320M         17G
+---------+----------+-----------+ +------------+-------------+
|  DMA32  |  NORMAL  |  MOVABLE  | |   NORMAL   |   MOVABLE   |
+---------+----------+-----------+ +------------+-------------+
Memory banks may belong to interleaving nodes. In the example below an x86 machine has 16 Gbytes of RAM in 4 memory banks; even banks belong to node 0 and odd banks belong to node 1:
0              4G              8G             12G           16G
+-------------+ +-------------+ +-------------+ +-------------+
|    node 0   | |    node 1   | |    node 0   | |    node 1   |
+-------------+ +-------------+ +-------------+ +-------------+
0    16M      4G
+-----+-------+ +-------------+ +-------------+ +-------------+
| DMA | DMA32 | |    NORMAL   | |    NORMAL   | |    NORMAL   |
+-----+-------+ +-------------+ +-------------+ +-------------+
In this case node 0 will span from 0 to 12 Gbytes and node 1 will span from 4 to 16 Gbytes.
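The spans quoted above follow from taking the lowest start and the highest end over all banks of a node. A minimal sketch of that arithmetic, using hypothetical `toy_` types and GiB units rather than page frame numbers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One memory bank of the interleaved example, in GiB. */
struct toy_bank {
	uint64_t start_gib;	/* inclusive */
	uint64_t end_gib;	/* exclusive */
	int nid;
};

struct toy_span {
	uint64_t start_gib;
	uint64_t end_gib;
};

/* The four banks from the diagram: even banks on node 0, odd on node 1. */
static const struct toy_bank toy_banks[] = {
	{  0,  4, 0 },
	{  4,  8, 1 },
	{  8, 12, 0 },
	{ 12, 16, 1 },
};

/* Span of a node: lowest start to highest end over all of its banks. */
static struct toy_span toy_node_span(const struct toy_bank *banks, size_t n,
				     int nid)
{
	struct toy_span span = { UINT64_MAX, 0 };
	size_t i;

	assert(banks != NULL);
	for (i = 0; i < n; i++) {
		if (banks[i].nid != nid)
			continue;
		if (banks[i].start_gib < span.start_gib)
			span.start_gib = banks[i].start_gib;
		if (banks[i].end_gib > span.end_gib)
			span.end_gib = banks[i].end_gib;
	}
	return span;
}

/* Memory actually present on a node: the sum of its bank sizes. */
static uint64_t toy_node_present(const struct toy_bank *banks, size_t n,
				 int nid)
{
	uint64_t total = 0;
	size_t i;

	assert(banks != NULL);
	for (i = 0; i < n; i++) {
		if (banks[i].nid == nid)
			total += banks[i].end_gib - banks[i].start_gib;
	}
	return total;
}
```

For node 0 the span covers 12 Gbytes while only 8 Gbytes are present, because the bank that belongs to node 1 appears as a hole inside node 0's span.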
Nodes¶
As we have mentioned, each node in memory is described by a pg_data_t which
is a typedef for a struct pglist_data. When allocating a page, by default
Linux uses a node-local allocation policy to allocate memory from the node
closest to the running CPU. As processes tend to run on the same CPU, it is
likely the memory from the current node will be used. The allocation policy can
be controlled by users as described in
NUMA Memory Policy.
Most NUMA architectures maintain an array of pointers to the node structures. The actual structures are allocated early during boot when architecture specific code parses the physical memory map reported by the firmware. The bulk of the node initialization happens slightly later in the boot process in the free_area_init() function, described later in Section Initialization.
Along with the node structures, the kernel maintains an array of nodemask_t
bitmasks called node_states. Each bitmask in this array represents a set of
nodes with particular properties as defined by enum node_states:
N_POSSIBLE
    The node could become online at some point.
N_ONLINE
    The node is online.
N_NORMAL_MEMORY
    The node has regular memory.
N_HIGH_MEMORY
    The node has regular or high memory. When CONFIG_HIGHMEM is disabled, it is aliased to N_NORMAL_MEMORY.
N_MEMORY
    The node has memory (regular, high, movable).
N_CPU
    The node has one or more CPUs.
For each node that has a property described above, the bit corresponding to the
node ID in the node_states[<property>] bitmask is set.
For example, for node 2 with normal memory and CPUs, bit 2 will be set in
node_states[N_POSSIBLE]
node_states[N_ONLINE]
node_states[N_NORMAL_MEMORY]
node_states[N_HIGH_MEMORY]
node_states[N_MEMORY]
node_states[N_CPU]
For various operations possible with nodemasks please refer to
include/linux/nodemask.h.
Among other things, nodemasks are used to provide macros for node traversal,
namely for_each_node() and for_each_online_node().
For instance, to call a function foo() for each online node:
	for_each_online_node(nid) {
		pg_data_t *pgdat = NODE_DATA(nid);

		foo(pgdat);
	}
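To make the bitmask idea concrete, here is a self-contained user-space sketch of per-state node masks and an iterator over online nodes. It is a toy model under stated assumptions: every `toy_`/`TOY_` name is hypothetical, each mask is a single `unsigned long` rather than a nodemask_t, and the real helpers are the ones in include/linux/nodemask.h.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_NUMNODES 8	/* hypothetical; small enough for one word */

enum toy_node_states {
	TOY_N_POSSIBLE,
	TOY_N_ONLINE,
	TOY_N_NORMAL_MEMORY,
	TOY_N_HIGH_MEMORY,
	TOY_N_MEMORY,
	TOY_N_CPU,
	TOY_NR_NODE_STATES
};

/* One bitmask per state; bit nid is set when node nid has that property. */
static unsigned long toy_state_masks[TOY_NR_NODE_STATES];

static void toy_node_set_state(int nid, enum toy_node_states state)
{
	assert(nid >= 0 && nid < TOY_MAX_NUMNODES);
	toy_state_masks[state] |= 1UL << nid;
}

static bool toy_node_state(int nid, enum toy_node_states state)
{
	assert(nid >= 0 && nid < TOY_MAX_NUMNODES);
	return (toy_state_masks[state] >> nid) & 1UL;
}

/* Visit every node whose bit is set in the online mask, in ID order. */
#define toy_for_each_online_node(nid)				\
	for ((nid) = 0; (nid) < TOY_MAX_NUMNODES; (nid)++)	\
		if (toy_node_state((nid), TOY_N_ONLINE))

/* Count online nodes by walking the mask with the iterator above. */
static int toy_count_online_nodes(void)
{
	int nid, count = 0;

	toy_for_each_online_node(nid)
		count++;
	return count;
}
```

Setting the online bit for nodes 0 and 2 and then calling `toy_count_online_nodes()` visits exactly those two nodes; node 1 is skipped because its bit is clear.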
Node structure¶
The node structure, struct pglist_data, is declared in
include/linux/mmzone.h. Here we briefly describe the fields of this
structure:
General¶
node_zones
    The zones for this node. Not all of the zones may be populated, but it is the full list. It is referenced by this node’s node_zonelists as well as other nodes’ node_zonelists.
node_zonelists
    The list of all zones in all nodes. This list defines the order of zones that allocations are preferred from. The node_zonelists is set up by build_zonelists() in mm/page_alloc.c during the initialization of core memory management structures.
nr_zones
    Number of populated zones in this node.
node_mem_map
    For UMA systems that use the FLATMEM memory model, node 0’s node_mem_map is an array of struct pages representing each physical frame.
node_page_ext
    For UMA systems that use the FLATMEM memory model, node 0’s node_page_ext is an array of extensions of struct pages. Available only in kernels built with CONFIG_PAGE_EXTENSION enabled.
node_start_pfn
    The page frame number of the starting page frame in this node.
node_present_pages
    Total number of physical pages present in this node.
node_spanned_pages
    Total size of physical page range, including holes.
node_size_lock
    A lock that protects the fields defining the node extents. Only defined when at least one of the CONFIG_MEMORY_HOTPLUG or CONFIG_DEFERRED_STRUCT_PAGE_INIT configuration options is enabled. pgdat_resize_lock() and pgdat_resize_unlock() are provided to manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG or CONFIG_DEFERRED_STRUCT_PAGE_INIT.
node_id
    The Node ID (NID) of the node, starts at 0.
totalreserve_pages
    This is a per-node reserve of pages that are not available to userspace allocations.
first_deferred_pfn
    If memory initialization on large machines is deferred then this is the first PFN that needs to be initialized. Defined only when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
deferred_split_queue
    Per-node queue of huge pages whose split was deferred. Defined only when CONFIG_TRANSPARENT_HUGEPAGE is enabled.
__lruvec
    Per-node lruvec holding LRU lists and related parameters. Used only when memory cgroups are disabled. It should not be accessed directly; use mem_cgroup_lruvec() to look up lruvecs instead.
Reclaim control¶
See also Page Reclaim.
kswapd
    Per-node instance of kswapd kernel thread.
kswapd_wait, pfmemalloc_wait, reclaim_wait
    Wait queues used to synchronize memory reclaim tasks.
nr_writeback_throttled
    Number of tasks that are throttled waiting on dirty pages to clean.
nr_reclaim_start
    Number of pages written while reclaim is throttled waiting for writeback.
kswapd_order
    Controls the order kswapd tries to reclaim.
kswapd_highest_zoneidx
    The highest zone index to be reclaimed by kswapd.
kswapd_failures
    Number of runs kswapd was unable to reclaim any pages.
min_unmapped_pages
    Minimal number of unmapped file backed pages that cannot be reclaimed. Determined by the vm.min_unmapped_ratio sysctl. Only defined when CONFIG_NUMA is enabled.
min_slab_pages
    Minimal number of SLAB pages that cannot be reclaimed. Determined by the vm.min_slab_ratio sysctl. Only defined when CONFIG_NUMA is enabled.
flags
    Flags controlling reclaim behavior.
Compaction control¶
kcompactd_max_order
    Page order that kcompactd should try to achieve.
kcompactd_highest_zoneidx
    The highest zone index to be compacted by kcompactd.
kcompactd_wait
    Wait queue used to synchronize memory compaction tasks.
kcompactd
    Per-node instance of kcompactd kernel thread.
proactive_compact_trigger
    Determines if proactive compaction is enabled. Controlled by the vm.compaction_proactiveness sysctl.
Statistics¶
per_cpu_nodestats
    Per-CPU VM statistics for the node.
vm_stat
    VM statistics for the node.
Zones¶
As we have mentioned, each zone in memory is described by a struct zone
which is an element of the node_zones array of the node it belongs to.
struct zone is the core data structure of the page allocator. A zone
represents a range of physical memory and may have holes.
The page allocator uses the GFP flags (see Memory Allocation Controls) specified by
a memory allocation to determine the highest zone in a node from which the
memory allocation can allocate memory. The page allocator first tries to allocate
from that zone. If it cannot allocate the requested amount of memory there, it
falls back to the next lower zone in the node, and the process continues down to
and including the lowest zone. For example, if
a node contains ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE and the
highest zone of a memory allocation is ZONE_MOVABLE, the order of the zones
from which the page allocator allocates memory is ZONE_MOVABLE >
ZONE_NORMAL > ZONE_DMA32.
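The fallback order in this example can be reproduced with a few lines of C. This is a simplified sketch of the ordering only, restricted to a single node; the `toy_` names and the reduced zone enum are hypothetical, and the kernel's actual zonelists also span other nodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Zone types in ascending order, lowest first (toy subset). */
enum toy_zone_type {
	TOY_ZONE_DMA,
	TOY_ZONE_DMA32,
	TOY_ZONE_NORMAL,
	TOY_ZONE_MOVABLE,
	TOY_MAX_NR_ZONES
};

/*
 * Fill 'order' with the populated zones an allocation may try, starting
 * at 'highest' and walking down to the lowest zone. Returns the number
 * of entries written.
 */
static size_t toy_zone_fallback_order(const bool populated[TOY_MAX_NR_ZONES],
				      enum toy_zone_type highest,
				      enum toy_zone_type order[TOY_MAX_NR_ZONES])
{
	size_t n = 0;
	int z;

	assert(highest < TOY_MAX_NR_ZONES);
	for (z = (int)highest; z >= 0; z--) {
		if (populated[z])
			order[n++] = (enum toy_zone_type)z;
	}
	return n;
}
```

For a node with ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE populated, a highest zone of `TOY_ZONE_MOVABLE` yields the three zones in the order given above, while a highest zone of `TOY_ZONE_NORMAL` never touches the movable zone.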
At runtime, free pages in a zone are in the Per-CPU Pagesets (PCP) or free areas of the zone. The Per-CPU Pagesets are a vital mechanism in the kernel’s memory management system. By handling most frequent allocations and frees locally on each CPU, the Per-CPU Pagesets improve performance and scalability, especially on systems with many cores. The page allocator in the kernel employs a two-step strategy for memory allocation, starting with the Per-CPU Pagesets before falling back to the buddy allocator. Pages are transferred between the Per-CPU Pagesets and the global free areas (managed by the buddy allocator) in batches. This minimizes the overhead of frequent interactions with the global buddy allocator.
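The effect of batching can be illustrated with a deliberately small model. Everything here is an assumption made for the sketch: the `toy_` structures, the single `high` limit and the `buddy_ops` counter do not exist in the kernel, and real per-CPU pagesets track page lists per order and migrate type rather than plain counts.

```c
#include <assert.h>

/* Toy zone: a count of free pages held by the "buddy allocator". */
struct toy_pcp_zone {
	long buddy_free;
	long buddy_ops;		/* how often the global free area was touched */
};

/* Toy per-CPU pageset; 'batch' and 'high' are hypothetical tunables. */
struct toy_pcp {
	long count;		/* pages currently cached on this CPU */
	long batch;		/* pages moved per refill or drain */
	long high;		/* drain once count exceeds this */
};

/* Allocate one page: use the per-CPU cache, refilling it in one batch. */
static int toy_alloc_page(struct toy_pcp_zone *zone, struct toy_pcp *pcp)
{
	assert(pcp->batch > 0);
	if (pcp->count == 0) {
		long n = pcp->batch < zone->buddy_free ?
			 pcp->batch : zone->buddy_free;

		if (n == 0)
			return -1;	/* nothing left anywhere */
		zone->buddy_free -= n;
		zone->buddy_ops++;
		pcp->count += n;
	}
	pcp->count--;
	return 0;
}

/* Free one page: cache it locally, drain one batch when above 'high'. */
static void toy_free_page(struct toy_pcp_zone *zone, struct toy_pcp *pcp)
{
	assert(pcp->batch > 0);
	pcp->count++;
	if (pcp->count > pcp->high) {
		long n = pcp->batch < pcp->count ? pcp->batch : pcp->count;

		pcp->count -= n;
		zone->buddy_free += n;
		zone->buddy_ops++;
	}
}
```

With a batch of 8, eight consecutive allocations touch the global free count only once; the other seven are served entirely from the per-CPU cache.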
Architecture specific code calls free_area_init() to initialize zones.
Zone structure¶
The zone structure, struct zone, is defined in include/linux/mmzone.h.
Here we briefly describe the fields of this structure:
General¶
_watermark
    The watermarks for this zone. When the amount of free pages in a zone is below the min watermark, boosting is ignored and an allocation may trigger direct reclaim and direct compaction; the min watermark is also used to throttle direct reclaim. When the amount of free pages in a zone is below the low watermark, kswapd is woken up. When the amount of free pages in a zone is above the high watermark, kswapd stops reclaiming (a zone is balanced) when the NUMA_BALANCING_MEMORY_TIERING bit of sysctl_numa_balancing_mode is not set. The promo watermark is used for memory tiering and NUMA balancing. When the amount of free pages in a zone is above the promo watermark, kswapd stops reclaiming when the NUMA_BALANCING_MEMORY_TIERING bit of sysctl_numa_balancing_mode is set. The watermarks are set by __setup_per_zone_wmarks(). The min watermark is calculated according to the vm.min_free_kbytes sysctl. The other three watermarks are set according to the distance between two watermarks. The distance itself is calculated taking the vm.watermark_scale_factor sysctl into account.
watermark_boost
    The number of pages by which the watermarks are boosted. Boosting increases reclaim pressure to reduce the likelihood of future fallbacks, and wakes kswapd now because the node may be balanced overall and kswapd would not wake naturally.
nr_reserved_highatomic
    The number of pages which are reserved for high-order atomic allocations.
nr_free_highatomic
    The number of free pages in reserved highatomic pageblocks.
lowmem_reserve
    The array of the amounts of the memory reserved in this zone for memory allocations. For example, if the highest zone a memory allocation can allocate memory from is ZONE_MOVABLE, the amount of memory reserved in this zone for this allocation is lowmem_reserve[ZONE_MOVABLE] when attempting to allocate memory from this zone. This is a mechanism the page allocator uses to prevent allocations which could use highmem from using too much lowmem. For some specialised workloads on highmem machines, it is dangerous for the kernel to allow process memory to be allocated from the lowmem zone. This is because that memory could then be pinned via the mlock() system call, or by unavailability of swapspace. The vm.lowmem_reserve_ratio sysctl determines how aggressive the kernel is in defending these lower zones. This array is recalculated by setup_per_zone_lowmem_reserve() at runtime if the vm.lowmem_reserve_ratio sysctl changes.
node
    The index of the node this zone belongs to. Available only when CONFIG_NUMA is enabled because there is only one node in a UMA system.
zone_pgdat
    Pointer to the struct pglist_data of the node this zone belongs to.
per_cpu_pageset
    Pointer to the Per-CPU Pagesets (PCP) allocated and initialized by setup_zone_pageset(). By handling most frequent allocations and frees locally on each CPU, PCP improves performance and scalability on systems with many cores.
pageset_high_min
    Copied to the high_min of the Per-CPU Pagesets for faster access.
pageset_high_max
    Copied to the high_max of the Per-CPU Pagesets for faster access.
pageset_batch
    Copied to the batch of the Per-CPU Pagesets for faster access. The batch, high_min and high_max of the Per-CPU Pagesets are used to calculate the number of elements the Per-CPU Pagesets obtain from the buddy allocator under a single hold of the lock for efficiency. They are also used to decide if the Per-CPU Pagesets return pages to the buddy allocator in the page free process.
pageblock_flags
    The pointer to the flags for the pageblocks in the zone (see include/linux/pageblock-flags.h for the flags list). The memory is allocated in setup_usemap(). Each pageblock occupies NR_PAGEBLOCK_BITS bits. Defined only when CONFIG_FLATMEM is enabled. The flags are stored in mem_section when CONFIG_SPARSEMEM is enabled.
zone_start_pfn
    The start pfn of the zone. It is initialized by calculate_node_totalpages().
managed_pages
    The present pages managed by the buddy system, which is calculated as: managed_pages = present_pages - reserved_pages, where reserved_pages includes pages allocated by the memblock allocator. It should be used by the page allocator and the vm scanner to calculate all kinds of watermarks and thresholds. It is accessed using atomic_long_xxx() functions. It is initialized in free_area_init_core() and then is reinitialized when the memblock allocator frees pages into the buddy system.
spanned_pages
    The total pages spanned by the zone, including holes, which is calculated as: spanned_pages = zone_end_pfn - zone_start_pfn. It is initialized by calculate_node_totalpages().
present_pages
    The physical pages existing within the zone, which is calculated as: present_pages = spanned_pages - absent_pages (pages in holes). It may be used by memory hotplug or memory power management logic to figure out unmanaged pages by checking (present_pages - managed_pages). Write access to present_pages at runtime should be protected by mem_hotplug_begin/done(). Any reader who cannot tolerate drift of present_pages should use get_online_mems() to get a stable value. It is initialized by calculate_node_totalpages().
present_early_pages
    The present pages existing within the zone located on memory available since early boot, excluding hotplugged memory. Defined only when CONFIG_MEMORY_HOTPLUG is enabled and initialized by calculate_node_totalpages().
cma_pages
    The pages reserved for CMA use. These pages behave like ZONE_MOVABLE when they are not used for CMA. Defined only when CONFIG_CMA is enabled.
name
    The name of the zone. It is a pointer to the corresponding element of the zone_names array.
nr_isolate_pageblock
    Number of isolated pageblocks. It is used to solve the incorrect freepage counting problem caused by racy retrieval of a pageblock’s migratetype. Protected by zone->lock. Defined only when CONFIG_MEMORY_ISOLATION is enabled.
span_seqlock
    The seqlock to protect zone_start_pfn and spanned_pages. It is a seqlock because it has to be read outside of zone->lock, and it is done in the main allocator path. However, the seqlock is written quite infrequently. Defined only when CONFIG_MEMORY_HOTPLUG is enabled.
initialized
    The flag indicating if the zone is initialized. Set by init_currently_empty_zone() during boot.
free_area
    The array of free areas, where each element corresponds to a specific order which is a power of two. The buddy allocator uses this structure to manage free memory efficiently. When allocating, it tries to find the smallest sufficient block. If the smallest sufficient block is larger than the requested size, it is recursively split into the next smaller blocks until the required size is reached. When a page is freed, it may be merged with its buddy to form a larger block. It is initialized by zone_init_free_lists().
unaccepted_pages
    The list of pages to be accepted. All pages on the list are MAX_PAGE_ORDER. Defined only when CONFIG_UNACCEPTED_MEMORY is enabled.
flags
    The zone flags. The three least significant bits are used and defined by enum zone_flags.
    ZONE_BOOSTED_WATERMARK (bit 0): zone recently boosted watermarks. Cleared when kswapd is woken.
    ZONE_RECLAIM_ACTIVE (bit 1): kswapd may be scanning the zone.
    ZONE_BELOW_HIGH (bit 2): zone is below high watermark.
lock
    The main lock that protects the internal data structures of the page allocator specific to the zone, in particular free_area.
percpu_drift_mark
    When free pages are below this point, additional steps are taken when reading the number of free pages to avoid per-cpu counter drift allowing watermarks to be breached. It is updated in refresh_zone_stat_thresholds().
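The splitting behaviour described for free_area can be sketched with a per-order count of free blocks. This is a toy model under stated assumptions: the `toy_` names and the four-order limit are hypothetical, only counts are tracked (no page lists or migrate types), and merging on free is not modelled.

```c
#include <assert.h>

#define TOY_NR_ORDERS 4		/* hypothetical: orders 0..3 only */

/* Toy free_area: number of free blocks of each order (2^order pages). */
struct toy_free_area {
	unsigned long nr_free[TOY_NR_ORDERS];
};

/*
 * Allocate a block of 2^order pages. Take the smallest free block that
 * is large enough and halve it repeatedly, leaving the unused half of
 * each split free at the next lower order. Returns 0 on success and -1
 * if no block is large enough.
 */
static int toy_alloc_order(struct toy_free_area *fa, unsigned int order)
{
	unsigned int cur;

	assert(order < TOY_NR_ORDERS);
	for (cur = order; cur < TOY_NR_ORDERS; cur++) {
		if (fa->nr_free[cur] == 0)
			continue;
		fa->nr_free[cur]--;
		while (cur > order) {
			cur--;
			fa->nr_free[cur]++;	/* the buddy half stays free */
		}
		return 0;
	}
	return -1;
}

/* Total number of free pages across all orders. */
static unsigned long toy_free_pages(const struct toy_free_area *fa)
{
	unsigned long pages = 0;
	unsigned int order;

	for (order = 0; order < TOY_NR_ORDERS; order++)
		pages += fa->nr_free[order] << order;
	return pages;
}
```

Starting from a single free order-3 block (8 pages), an order-0 allocation leaves one free block at each of orders 0, 1 and 2, which is 7 free pages in total.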
Compaction control¶
compact_cached_free_pfn
    The PFN where the compaction free scanner should start in the next scan.
compact_cached_migrate_pfn
    The PFNs where the compaction migration scanner should start in the next scan. This array has two elements: the first one is used in MIGRATE_ASYNC mode, and the other one is used in MIGRATE_SYNC mode.
compact_init_migrate_pfn
    The initial migration PFN which is initialized to 0 at boot time, and to the first pageblock with migratable pages in the zone after a full compaction finishes. It is used to check if a scan is a whole zone scan or not.
compact_init_free_pfn
    The initial free PFN which is initialized to 0 at boot time and to the last pageblock with free MIGRATE_MOVABLE pages in the zone. It is used to check if it is the start of a scan.
compact_considered
    The number of compactions attempted since the last failure. It is reset in defer_compaction() when a compaction fails to result in a page allocation success. It is increased by 1 in compaction_deferred() when a compaction should be skipped. compaction_deferred() is called before compact_zone() is called; compaction_defer_reset() is called when compact_zone() returns COMPACT_SUCCESS; defer_compaction() is called when compact_zone() returns COMPACT_PARTIAL_SKIPPED or COMPACT_COMPLETE.
compact_defer_shift
    The number of compactions skipped before trying again is 1<<compact_defer_shift. It is increased by 1 in defer_compaction(). It is reset in compaction_defer_reset() when a direct compaction results in a page allocation success. Its maximum value is COMPACT_MAX_DEFER_SHIFT.
compact_order_failed
    The minimum compaction failed order. It is set in compaction_defer_reset() when a compaction succeeds and in defer_compaction() when a compaction fails to result in a page allocation success.
compact_blockskip_flush
    Set to true when the compaction migration scanner and free scanner meet, which means the PB_migrate_skip bits should be cleared.
contiguous
    Set to true when the zone is contiguous (in other words, no hole).
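The interplay of compact_considered, compact_defer_shift and compact_order_failed amounts to an exponential back-off, which the following user-space sketch tries to capture. It is a toy model and not the kernel implementation: the `toy_` names are hypothetical, the cap of 6 is an assumed stand-in for COMPACT_MAX_DEFER_SHIFT, and the reset helper only models the case where the allocation succeeded.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_DEFER_SHIFT 6U	/* assumed cap for this sketch */

/* The three back-off fields of a zone, and nothing else. */
struct toy_compact_state {
	unsigned int compact_considered;
	unsigned int compact_defer_shift;
	int compact_order_failed;
};

/* A compaction failed to produce a page: double the back-off window. */
static void toy_defer_compaction(struct toy_compact_state *zone, int order)
{
	assert(order >= 0);
	zone->compact_considered = 0;
	if (zone->compact_defer_shift < TOY_MAX_DEFER_SHIFT)
		zone->compact_defer_shift++;
	if (order < zone->compact_order_failed)
		zone->compact_order_failed = order;
}

/* Should this attempt be skipped? Each call counts as one attempt. */
static bool toy_compaction_deferred(struct toy_compact_state *zone, int order)
{
	unsigned int limit = 1U << zone->compact_defer_shift;

	assert(order >= 0);
	if (order < zone->compact_order_failed)
		return false;	/* smaller orders have not failed */
	if (++zone->compact_considered >= limit) {
		zone->compact_considered = limit;	/* avoid overflow */
		return false;
	}
	return true;
}

/* A compaction led to a successful allocation: drop the back-off. */
static void toy_compaction_defer_reset(struct toy_compact_state *zone,
				       int order)
{
	assert(order >= 0);
	zone->compact_considered = 0;
	zone->compact_defer_shift = 0;
	if (order >= zone->compact_order_failed)
		zone->compact_order_failed = order + 1;
}
```

In this sketch, after two consecutive failures the shift is 2, so the next three attempts are skipped and the fourth is allowed to run; a success resets the window and raises compact_order_failed above the order that just succeeded.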
Statistics¶
vm_stat
    VM statistics for the zone. The items tracked are defined by enum zone_stat_item.
vm_numa_event
    VM NUMA event statistics for the zone. The items tracked are defined by enum numa_stat_item.
per_cpu_zonestats
    Per-CPU VM statistics for the zone. It records VM statistics and VM NUMA event statistics on a per-CPU basis. It reduces updates to the global vm_stat and vm_numa_event fields of the zone to improve performance.
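How per-CPU counters reduce updates to a shared counter can be shown with a small model. The structure, the fixed threshold and the `toy_` function names below are all hypothetical choices for the sketch; the kernel computes its thresholds dynamically and uses its own data types.

```c
#include <assert.h>

#define TOY_NR_CPUS 4		/* hypothetical CPU count */
#define TOY_STAT_THRESHOLD 8	/* hypothetical fold threshold */

struct toy_zone_stat {
	long global;			/* shared counter, costly to update */
	long global_updates;		/* how often 'global' was written */
	int diff[TOY_NR_CPUS];		/* cheap per-CPU pending deltas */
};

/* Add 'delta' on 'cpu'; fold into the global counter past the threshold. */
static void toy_mod_zone_stat(struct toy_zone_stat *st, int cpu, int delta)
{
	int d;

	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	d = st->diff[cpu] + delta;
	if (d > TOY_STAT_THRESHOLD || d < -TOY_STAT_THRESHOLD) {
		st->global += d;
		st->global_updates++;
		d = 0;
	}
	st->diff[cpu] = d;
}

/* Precise read: the global value plus every CPU's pending delta. */
static long toy_zone_stat_snapshot(const struct toy_zone_stat *st)
{
	long sum = st->global;
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		sum += st->diff[cpu];
	return sum;
}
```

Twenty increments on one CPU write the shared counter only twice in this model. The shared value alone then lags behind the true count, which is the kind of drift that a precise read has to correct by also summing the per-CPU deltas.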
Pages¶
Stub
This section is incomplete. Please list and describe the appropriate fields.
Folios¶
Stub
This section is incomplete. Please list and describe the appropriate fields.
Initialization¶
Stub
This section is incomplete. Please list and describe the appropriate fields.