After this documentation was released in July 2003, I was approached by Prentice Hall and asked to write a book on the Linux VM under the Bruce Perens' Open Book Series.

The book is available and is called simply "Understanding The Linux Virtual Memory Manager". It contains a lot of additional material that is not available here, including details on later 2.4 kernels, introductions to 2.6, a whole new chapter on the shared memory filesystem, coverage of TLB management, a lot more code commentary, countless other additions and clarifications, and a CD with lots of cool stuff on it. This material (although now dated and lacking in comparison to the book) will remain available, although I obviously encourage you to buy the book from your favourite book store :-). As the book is under the Bruce Perens' Open Book Series, it will be available for download 90 days after appearing on the book shelves, which means it is not available right now. When it is available, it will be downloadable from http://www.phptr.com/perens so check there for more information.

To be fully clear, this webpage is not the actual book.


3.1 Nodes

As we have mentioned, each node in memory is described by a pg_data_t struct. When allocating a page, Linux uses a node-local allocation policy to allocate memory from the node closest to the running CPU. As processes tend to run on the same CPU, it is likely the memory from the current node will be used. The struct is declared as follows in <linux/mmzone.h>:

        typedef struct pglist_data {
                zone_t node_zones[MAX_NR_ZONES];
                zonelist_t node_zonelists[GFP_ZONEMASK+1];
                int nr_zones;
                struct page *node_mem_map;
                unsigned long *valid_addr_bitmap;
                struct bootmem_data *bdata;
                unsigned long node_start_paddr;
                unsigned long node_start_mapnr;
                unsigned long node_size;
                int node_id;
                struct pglist_data *node_next;
        } pg_data_t;

We now briefly describe each of these fields:

node_zones The zones for this node: ZONE_HIGHMEM, ZONE_NORMAL and ZONE_DMA;

node_zonelists This is the order of zones that allocations are preferred from. build_zonelists() in page_alloc.c sets up the order when called by free_area_init_core(). A failed allocation in ZONE_HIGHMEM may fall back to ZONE_NORMAL or back to ZONE_DMA;

nr_zones Number of zones in this node, between 1 and 3. Not all nodes will have three. A CPU bank may not have ZONE_DMA, for example;

node_mem_map This is the first page of the struct page array representing each physical frame in the node. It will be placed somewhere within the global mem_map array;

valid_addr_bitmap A bitmap which describes ``holes'' in the memory node that no memory exists for;

bdata This is only of interest to the boot memory allocator discussed in Chapter 6;

node_start_paddr The starting physical address of the node. An unsigned long is not an ideal type for this field: on ia32[3.1] with Physical Address Extension (PAE)[3.2], for example, physical addresses can be wider than the 32 bits an unsigned long holds, so the value can be truncated. A more suitable solution would be to record this as a Page Frame Number (PFN), which could be trivially defined as (page_phys_addr >> PAGE_SHIFT);

node_start_mapnr This gives the page offset within the global mem_map. It is calculated in free_area_init_core() by counting the number of pages between mem_map and the local mem_map for this node, called lmem_map;

node_size The total number of pages in this node;

node_id The ID of the node, starts at 0;

node_next Pointer to next node in a NULL terminated list.
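The remarks on node_start_paddr and node_start_mapnr can be illustrated with a small userspace sketch. It is not kernel code: the 4KiB page size, the cut-down struct page and the helper names phys_to_pfn(), truncate_to_32bit() and mapnr_of() are all assumptions made for illustration. It shows that a PFN of a PAE address still fits in 32 bits where the address itself does not, and that the mem_map offset is a plain pointer difference.

```c
#include <stdint.h>

/* Assumption: 4KiB pages, as on ia32. */
#define PAGE_SHIFT 12

/* Hypothetical stand-in for the kernel's struct page. */
struct page {
        unsigned long flags;
};

/* Hypothetical helper: physical address to Page Frame Number.
 * The address is taken as 64 bits so a 36-bit PAE address fits. */
uint32_t phys_to_pfn(uint64_t page_phys_addr)
{
        return (uint32_t)(page_phys_addr >> PAGE_SHIFT);
}

/* Hypothetical helper: what survives when the same address is
 * stored in a 32-bit unsigned long, as on ia32. */
uint32_t truncate_to_32bit(uint64_t page_phys_addr)
{
        return (uint32_t)page_phys_addr;
}

/* Hypothetical helper: number of struct page entries between the
 * start of the global mem_map and a node's local map (lmem_map). */
unsigned long mapnr_of(struct page *mem_map, struct page *lmem_map)
{
        return (unsigned long)(lmem_map - mem_map);
}
```

With these helpers, a node starting at the 4GiB boundary (0x100000000) truncates to 0 as a 32-bit address, while its PFN is 0x100000 and is stored without loss.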

All nodes in the system are maintained on a list called pgdat_list. The nodes are placed on this list as they are initialised by the init_bootmem_core() function, described later in Section 6.2.2. Up until late 2.4 kernels (> 2.4.18), blocks of code that traversed the list looked something like:

        pg_data_t * pgdat;
        pgdat = pgdat_list;
        do {
              /* do something with pg_data_t */
              ...
        } while ((pgdat = pgdat->node_next));

In more recent kernels, a macro for_each_pgdat(), which is trivially defined as a for loop, is provided to improve code readability.
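As a rough illustration of such a macro, the following userspace sketch defines a for-loop version of for_each_pgdat() over a cut-down pg_data_t. The reduced struct and the total_node_pages() helper are assumptions for this example only, and the macro body is a sketch of the idea rather than a copy of any particular kernel version.

```c
#include <stddef.h>

/* Cut-down stand-in for pg_data_t; only the fields used here. */
typedef struct pglist_data {
        int node_id;
        unsigned long node_size;        /* pages in this node */
        struct pglist_data *node_next;  /* NULL terminates the list */
} pg_data_t;

/* Head of the node list; empty until nodes are added. */
pg_data_t *pgdat_list = NULL;

/* Sketch of the macro: walk the list from its head until NULL. */
#define for_each_pgdat(pgdat) \
        for ((pgdat) = pgdat_list; (pgdat) != NULL; (pgdat) = (pgdat)->node_next)

/* Hypothetical helper: sum node_size over every node on the list. */
unsigned long total_node_pages(void)
{
        pg_data_t *pgdat;
        unsigned long total = 0;

        for_each_pgdat(pgdat)
                total += pgdat->node_size;

        return total;
}
```

One difference from the do/while form is worth noting: the for loop tests the pointer before the first iteration, so it is also safe when the list is empty, whereas the do/while loop assumes at least one node exists.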



Footnotes

[3.1] FYI from Jeff Haran: Some PowerPC variants appear to have this same problem (e.g. PPC440GP).
[3.2] PAE is discussed further in Section 3.4.

Mel 2004-02-15