|
|
|
@@ -9,8 +9,8 @@ using huge pages for the backing of virtual memory with huge pages |
|
|
|
|
that supports the automatic promotion and demotion of page sizes and |
|
|
|
|
without the shortcomings of hugetlbfs. |
|
|
|
|
|
|
|
|
|
Currently it only works for anonymous memory mappings but in the |
|
|
|
|
future it can expand over the pagecache layer starting with tmpfs. |
|
|
|
|
Currently it only works for anonymous memory mappings and tmpfs/shmem. |
|
|
|
|
But in the future it can expand to other filesystems. |
|
|
|
|
|
|
|
|
|
The reason applications are running faster is because of two |
|
|
|
|
factors. The first factor is almost completely irrelevant and it's not |
|
|
|
@@ -57,10 +57,6 @@ miss is going to run faster. |
|
|
|
|
feature that applies to all dynamic high order allocations in the |
|
|
|
|
kernel) |
|
|
|
|
|
|
|
|
|
- this initial support only offers the feature in the anonymous memory |
|
|
|
|
regions but it'd be ideal to move it to tmpfs and the pagecache |
|
|
|
|
later |
|
|
|
|
|
|
|
|
|
Transparent Hugepage Support maximizes the usefulness of free memory |
|
|
|
|
if compared to the reservation approach of hugetlbfs by allowing all |
|
|
|
|
unused memory to be used as cache or other movable (or even unmovable |
|
|
|
@@ -94,21 +90,21 @@ madvise(MADV_HUGEPAGE) on their critical mmapped regions. |
|
|
|
|
|
|
|
|
|
== sysfs == |
|
|
|
|
|
|
|
|
|
Transparent Hugepage Support can be entirely disabled (mostly for |
|
|
|
|
debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to |
|
|
|
|
avoid the risk of consuming more memory resources) or enabled system |
|
|
|
|
wide. This can be achieved with one of: |
|
|
|
|
Transparent Hugepage Support for anonymous memory can be entirely disabled |
|
|
|
|
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE |
|
|
|
|
regions (to avoid the risk of consuming more memory resources) or enabled |
|
|
|
|
system wide. This can be achieved with one of: |
|
|
|
|
|
|
|
|
|
echo always >/sys/kernel/mm/transparent_hugepage/enabled |
|
|
|
|
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled |
|
|
|
|
echo never >/sys/kernel/mm/transparent_hugepage/enabled |
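Reading the same file back shows every accepted value, with the active one
in square brackets (for example "always [madvise] never"). A minimal sketch
for extracting the active value follows; thp_active_mode is a hypothetical
helper name, not something the kernel provides, and it accepts an alternative
file path so it can be tried on a copy.

```shell
# Print the active policy from a sysfs-style policy file, where the
# selected value is shown in square brackets, e.g. "always [madvise] never".
# thp_active_mode is a hypothetical helper name used for illustration.
thp_active_mode() {
    file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
    # Fail quietly if the kernel was built without THP support.
    [ -r "$file" ] || return 1
    # Keep only the text between the first "[" and its matching "]".
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$file"
}
```

Calling thp_active_mode with no argument reads the real sysfs file; it
returns non-zero if that file is not readable.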
|
|
|
|
|
|
|
|
|
It's also possible to limit defrag efforts in the VM to generate |
|
|
|
|
hugepages in case they're not immediately free to madvise regions or |
|
|
|
|
to never try to defrag memory and simply fallback to regular pages |
|
|
|
|
unless hugepages are immediately available. Clearly if we spend CPU |
|
|
|
|
time to defrag memory, we would expect to gain even more by the fact |
|
|
|
|
we use hugepages later instead of regular pages. This isn't always |
|
|
|
|
anonymous hugepages in case they're not immediately free to madvise |
|
|
|
|
regions or to never try to defrag memory and simply fallback to regular |
|
|
|
|
pages unless hugepages are immediately available. Clearly if we spend CPU |
|
|
|
|
time to defrag memory, we would expect to gain even more by the fact we |
|
|
|
|
use hugepages later instead of regular pages. This isn't always |
|
|
|
|
guaranteed, but it may be more likely in case the allocation is for a |
|
|
|
|
MADV_HUGEPAGE region. |
|
|
|
|
|
|
|
|
@@ -133,9 +129,9 @@ that are have used madvise(MADV_HUGEPAGE). This is the default behaviour. |
|
|
|
|
|
|
|
|
|
"never" should be self-explanatory. |
|
|
|
|
|
|
|
|
|
By default kernel tries to use huge zero page on read page fault. |
|
|
|
|
It's possible to disable huge zero page by writing 0 or enable it |
|
|
|
|
back by writing 1: |
|
|
|
|
By default the kernel tries to use the huge zero page on a read page fault to |
|
|
|
|
an anonymous mapping. It's possible to disable the huge zero page by writing 0 |
|
|
|
|
or enable it back by writing 1: |
|
|
|
|
|
|
|
|
|
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page |
|
|
|
|
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page |
|
|
|
@@ -204,21 +200,67 @@ Support by passing the parameter "transparent_hugepage=always" or |
|
|
|
|
"transparent_hugepage=madvise" or "transparent_hugepage=never" |
|
|
|
|
(without "") to the kernel command line. |
|
|
|
|
|
|
|
|
|
== Hugepages in tmpfs/shmem == |
|
|
|
|
|
|
|
|
|
You can control the hugepage allocation policy in tmpfs with the mount option |
|
|
|
|
"huge=". It can have the following values: |
|
|
|
|
|
|
|
|
|
- "always": |
|
|
|
|
Attempt to allocate huge pages every time we need a new page; |
|
|
|
|
|
|
|
|
|
- "never": |
|
|
|
|
Do not allocate huge pages; |
|
|
|
|
|
|
|
|
|
- "within_size": |
|
|
|
|
Only allocate huge page if it will be fully within i_size. |
|
|
|
|
Also respect fadvise()/madvise() hints; |
|
|
|
|
|
|
|
|
|
- "advise": |
|
|
|
|
Only allocate huge pages if requested with fadvise()/madvise(); |
|
|
|
|
|
|
|
|
|
The default policy is "never". |
|
|
|
|
|
|
|
|
|
"mount -o remount,huge= /mountpoint" works fine after mount: remounting |
|
|
|
|
huge=never will not attempt to break up huge pages at all, just stop more |
|
|
|
|
from being allocated. |
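As an illustration of the mount option, the sketch below builds a remount
command for one of the four "huge=" values listed above and rejects anything
else. It only prints the command and does not run it. The helper name
tmpfs_huge_remount_cmd is hypothetical and /mountpoint is a placeholder.

```shell
# Build a tmpfs remount command for a given huge= policy without running it.
# tmpfs_huge_remount_cmd is a hypothetical helper; /mountpoint is a placeholder.
tmpfs_huge_remount_cmd() {
    policy="$1"
    mountpoint="${2:-/mountpoint}"
    case "$policy" in
        # The four values accepted by the tmpfs "huge=" mount option.
        always|never|within_size|advise)
            printf 'mount -o remount,huge=%s %s\n' "$policy" "$mountpoint"
            ;;
        *)
            # "deny" and "force" are shmem_enabled values, not mount options.
            echo "unknown huge= policy: $policy" >&2
            return 1
            ;;
    esac
}
```

The printed command has to be run as root to take effect, and as noted above
it only changes how pages are allocated from then on.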
|
|
|
|
|
|
|
|
|
There's also a sysfs knob to control the hugepage allocation policy for the internal |
|
|
|
|
shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount |
|
|
|
|
is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or |
|
|
|
|
MAP_ANONYMOUS), GPU drivers' DRM objects, and Ashmem. |
|
|
|
|
|
|
|
|
|
In addition to the policies listed above, shmem_enabled allows two further |
|
|
|
|
values: |
|
|
|
|
|
|
|
|
|
- "deny": |
|
|
|
|
For use in emergencies, to force the huge option off from |
|
|
|
|
all mounts; |
|
|
|
|
- "force": |
|
|
|
|
Force the huge option on for all mounts; very useful for testing. |
|
|
|
|
|
|
|
|
|
== Need of application restart == |
|
|
|
|
|
|
|
|
|
The transparent_hugepage/enabled values only affect future |
|
|
|
|
behavior. So to make them effective you need to restart any |
|
|
|
|
application that could have been using hugepages. This also applies to |
|
|
|
|
the regions registered in khugepaged. |
|
|
|
|
The transparent_hugepage/enabled values and tmpfs mount option only affect |
|
|
|
|
future behavior. So to make them effective you need to restart any |
|
|
|
|
application that could have been using hugepages. This also applies to the |
|
|
|
|
regions registered in khugepaged. |
|
|
|
|
|
|
|
|
|
== Monitoring usage == |
|
|
|
|
|
|
|
|
|
The number of transparent huge pages currently used by the system is |
|
|
|
|
available by reading the AnonHugePages field in /proc/meminfo. To |
|
|
|
|
identify what applications are using transparent huge pages, it is |
|
|
|
|
necessary to read /proc/PID/smaps and count the AnonHugePages fields |
|
|
|
|
for each mapping. Note that reading the smaps file is expensive and |
|
|
|
|
reading it frequently will incur overhead. |
|
|
|
|
The number of anonymous transparent huge pages currently used by the |
|
|
|
|
system is available by reading the AnonHugePages field in /proc/meminfo. |
|
|
|
|
To identify what applications are using anonymous transparent huge pages, |
|
|
|
|
it is necessary to read /proc/PID/smaps and count the AnonHugePages fields |
|
|
|
|
for each mapping. |
|
|
|
|
|
|
|
|
|
The number of file transparent huge pages mapped to userspace is available |
|
|
|
|
by reading ShmemPmdMapped and ShmemHugePages fields in /proc/meminfo. |
|
|
|
|
To identify what applications are mapping file transparent huge pages, it |
|
|
|
|
is necessary to read /proc/PID/smaps and count the FileHugeMapped fields |
|
|
|
|
for each mapping. |
|
|
|
|
|
|
|
|
|
Note that reading the smaps file is expensive and reading it |
|
|
|
|
frequently will incur overhead. |
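The per-process accounting described above can be sketched as a sum over the
AnonHugePages lines of one smaps file. This is only an illustration;
anon_thp_kb is a hypothetical helper name, and it assumes the usual
"Field:  value kB" layout of smaps.

```shell
# Sum the AnonHugePages fields (in kB) of an smaps-format file.
# anon_thp_kb is a hypothetical helper name used for illustration.
anon_thp_kb() {
    # $1 is the path to an smaps file, e.g. /proc/PID/smaps.
    # "sum + 0" prints 0 when no AnonHugePages line is present.
    awk '$1 == "AnonHugePages:" { sum += $2 } END { print sum + 0 }' "$1"
}
```

For example, "anon_thp_kb /proc/self/smaps" reports the total for the calling
shell. Because reading smaps is expensive, a helper like this is better suited
to occasional inspection than to frequent polling.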
|
|
|
|
|
|
|
|
|
There are a number of counters in /proc/vmstat that may be used to |
|
|
|
|
monitor how successfully the system is providing huge pages for use. |
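A short sketch for pulling these counters out of /proc/vmstat, assuming they
all share the "thp_" prefix and the file keeps its "name value" per-line
layout. thp_counters is a hypothetical helper name.

```shell
# Print the thp_* counters from a vmstat-format file ("name value" per line).
# thp_counters is a hypothetical helper name used for illustration.
thp_counters() {
    # Default to the live counters; a path can be passed to read a saved copy.
    awk '$1 ~ /^thp_/ { print $1, $2 }' "${1:-/proc/vmstat}"
}
```

Saving the output at two points in time and comparing the values gives a
rough picture of how often huge pages were allocated, mapped, or split in
between.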
|
|
|
@@ -238,6 +280,12 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range |
|
|
|
|
of pages that should be collapsed into one huge page but failed |
|
|
|
|
the allocation. |
|
|
|
|
|
|
|
|
|
thp_file_alloc is incremented every time a file huge page is successfully |
|
|
|
|
allocated. |
|
|
|
|
|
|
|
|
|
thp_file_mapped is incremented every time a file huge page is mapped into |
|
|
|
|
user address space. |
|
|
|
|
|
|
|
|
|
thp_split_page is incremented every time a huge page is split into base |
|
|
|
|
pages. This can happen for a variety of reasons but a common |
|
|
|
|
reason is that a huge page is old and is being reclaimed. |
|
|
|
@@ -403,19 +451,27 @@ pages: |
|
|
|
|
on relevant sub-page of the compound page. |
|
|
|
|
|
|
|
|
|
- map/unmap of the whole compound page accounted in compound_mapcount |
|
|
|
|
(stored in first tail page). |
|
|
|
|
(stored in first tail page). For file huge pages, we also increment |
|
|
|
|
->_mapcount of all sub-pages in order to have race-free detection of |
|
|
|
|
last unmap of subpages. |
|
|
|
|
|
|
|
|
|
PageDoubleMap() indicates that ->_mapcount in all subpages is offset up by one. |
|
|
|
|
This additional reference is required to get race-free detection of unmap of |
|
|
|
|
subpages when we have them mapped with both PMDs and PTEs. |
|
|
|
|
PageDoubleMap() indicates that the page is *possibly* mapped with PTEs. |
|
|
|
|
|
|
|
|
|
For anonymous pages PageDoubleMap() also indicates ->_mapcount in all |
|
|
|
|
subpages is offset up by one. This additional reference is required to |
|
|
|
|
get race-free detection of unmap of subpages when we have them mapped with |
|
|
|
|
both PMDs and PTEs. |
|
|
|
|
|
|
|
|
|
This is an optimization required to lower the overhead of per-subpage mapcount |
|
|
|
|
tracking. The alternative is to alter ->_mapcount in all subpages on each |
|
|
|
|
map/unmap of the whole compound page. |
|
|
|
|
|
|
|
|
|
We set PG_double_map when a PMD of the page got split for the first time, |
|
|
|
|
but still have PMD mapping. The additional references go away with last |
|
|
|
|
compound_mapcount. |
|
|
|
|
For anonymous pages, we set PG_double_map when a PMD of the page got split |
|
|
|
|
for the first time, but still have PMD mapping. The additional references |
|
|
|
|
go away with last compound_mapcount. |
|
|
|
|
|
|
|
|
|
File pages get PG_double_map set on the first map of the page with a PTE, and |
|
|
|
|
the flag goes away when the page gets evicted from the page cache. |
|
|
|
|
|
|
|
|
|
split_huge_page internally has to distribute the refcounts in the head |
|
|
|
|
page to the tail pages before clearing all PG_head/tail bits from the page |
|
|
|
@@ -427,7 +483,7 @@ sum of mapcount of all sub-pages plus one (split_huge_page caller must |
|
|
|
|
have reference for head page). |
|
|
|
|
|
|
|
|
|
split_huge_page uses migration entries to stabilize page->_refcount and |
|
|
|
|
page->_mapcount. |
|
|
|
|
page->_mapcount of anonymous pages. File pages just get unmapped. |
|
|
|
|
|
|
|
|
|
We are safe against physical memory scanners too: the only legitimate way |
|
|
|
|
a scanner can get a reference to a page is get_page_unless_zero(). |
|
|
|
|