Scheduler code is very hot and every little optimization counts. Instead
of constantly checking sched_numa_balancing when NUMA is disabled,
compile it out.
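Roughly, the idea looks like this (a sketch, not the actual patch; the
numa_balancing_enabled() wrapper is a made-up name for illustration):

  /* With NUMA balancing compiled out, the check folds to a constant */
  #ifdef CONFIG_NUMA_BALANCING
  DECLARE_STATIC_KEY_FALSE(sched_numa_balancing);
  #define numa_balancing_enabled() \
          static_branch_unlikely(&sched_numa_balancing)
  #else
  #define numa_balancing_enabled() false
  #endif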
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: I7334594fbe835f615a199cfe02ee526135abab06
This is tuned to match energy model characteristics and scheduler
efficiency enhancements.
Change-Id: Ia60e1ea888457fa1c0c0273cdd4b0180f0a87abf
Co-authored-by: Diep Quynh <remilia.1505@gmail.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
It's needed to speed up tracepoints and other dynamic debugging facilities.
Bug: 145162121
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I811b538bc5280a633c56e0544ba1f54cd6b234f2
Switch to 1 MiB static log buffer in __log_buf[]:
  #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
  static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);

instead of having the log buffer reallocated at boot by:

  setup_log_buf()
  log_buf_add_cpu()
  log_buf_len_update()
  new_log_buf = memblock_virt_alloc_nopanic()
There is no need to do this reallocation for the log buffer.
Change-Id: I8bf00b1fe45e9f6393e332e88642ee0c8a85ad7e
Signed-off-by: Petri Gynther <pgynther@google.com>
As on previous projects, disabling sched autogroup helps
reduce jank in certain workloads.
Bug: 144961955
Bug: 143857245
Test: build and boot to home
Change-Id: I5f7cf53fede9e70aa389eed741bc2f9a624ee39d
Signed-off-by: Chiawei Wang <chiaweiwang@google.com>
HW tracing features shouldn't be enabled in any final product, so
disable them.
Bug: 154966878
Signed-off-by: Saravana Kannan <saravanak@google.com>
Change-Id: I6603e71b0912dd89d653bb0bd36a0a4cb8b504e1
This reserved memory dump region is intended to be used with the memory
dump v2 driver, but we've disabled that and we don't need this memory
dumping functionality. Remove the unused region and associated driver
node to save 36 MiB of memory.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Change-Id: I5f784d6a88ff00b26a49fc507bd05881a9822965
Disable the MSM watchdog during suspend by removing "qcom,wakeup-enable".
Some watchdog-timeout resets showed the watchdog expiring during
suspend, especially while waiting on s2idle_wait_head in s2idle_enter.
Because external interrupts are disabled before this function runs, the
watchdog petting timer may not fire as expected. To avoid a watchdog
reset while suspend/resume execution is not actually hung, remove
"qcom,wakeup-enable" to disable the feature.
Bug: 190429220
Change-Id: I7ce0ef57da15925cd024d602039d303c523bfd9b
Merged-In: I7ce0ef57da15925cd024d602039d303c523bfd9b
Signed-off-by: Woody Lin <woodylin@google.com>
(cherry picked from commit da702ade8884424ee578a3db2d4aaa217d8b85d6)
[dereference23: Apply for atoll]
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
This is primarily intended for generic x86 kernels which may run on all
sorts of broken systems. Our kernel runs on known hardware, so this is
unnecessary.
Disable it for a minor IRQ handler overhead reduction.
Suggested-by: Tyler Nijmeh <tylernij@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Change-Id: I1992d2c88d3f3b9a9d15748d6c3c6bf6709d8812
Coresight is used for debugging purposes. When the debugging configs are
disabled, having these included causes power regressions due to clocks
being left on. So let's disable all the coresight DT entries by default.
Signed-off-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Bug: 156429236
Test: compile, verify list of probed devices
Change-Id: I84f9c874f2f5e8720ced23c7b4268d1b536b96a7
Userspace reads /proc/config.gz and spits out an error message after boot
finishes when it doesn't like the kernel's configuration. In order to
preserve our freedom to customize the kernel however we'd like, show
userspace the stock config so that it never complains about our
kernel configuration.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: BlackMesa123 <giangrecosalvo9@gmail.com>
When Simple LMK is enabled, the page allocator slowpath always thinks that
no OOM kill progress is made because out_of_memory() returns false. As a
result, spurious page allocation failures are observed when memory is low
and Simple LMK is killing tasks, simply because the page allocator slowpath
doesn't think that any OOM killing is taking place.
Fix this by simply making out_of_memory() always return true when Simple
LMK is enabled.
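The change amounts to something like this (a sketch, assuming Simple
LMK's Kconfig symbol is CONFIG_ANDROID_SIMPLE_LMK):

  bool out_of_memory(struct oom_control *oc)
  {
          /* Simple LMK kills in the background; always report progress
           * so the allocator slowpath retries instead of failing */
          if (IS_ENABLED(CONFIG_ANDROID_SIMPLE_LMK))
                  return true;

          /* ... existing OOM killer path ... */
  }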
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The OOM reaper makes it possible to immediately release anonymous memory
from a dying process in order to free up memory faster. This provides
immediate relief under heavy memory pressure instead of waiting for victim
processes to naturally release their memory.
Utilize the OOM reaper by creating another kthread in Simple LMK to perform
victim reaping. Similar to the OOM reaper kthread (which is unused with
Simple LMK), this new kthread allows reaping to race with exit_mmap() in
order to preclude the need to take a reference to an mm's address space and
thus potentially mmput() an mm's last reference. Doing so would stall the
reaper kthread, preventing it from being able to quickly reap new victims.
Reaping is done on victims one at a time by descending order of anonymous
pages, so that the most promising victims with the most anonymous pages
are reaped first. Victims are also marked for reaping via MMF_OOM_VICTIM so
that they reap themselves first in exit_mmap(). Even if a victim isn't
reaped by the reaper thread, it'll free its anonymous memory first thing in
exit_mmap() as a small win towards making memory available sooner.
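A heavily simplified sketch of the reaping loop (next_reap_victim() is
a hypothetical helper, and this assumes the 4.x-era __oom_reap_task_mm()
interface; the real code is more careful about the exit_mmap() race):

  static int simple_lmk_reaper(void *data)
  {
          while (1) {
                  struct mm_struct *mm;

                  /* Victims arrive in descending order of anon pages */
                  wait_event(reaper_waitq, (mm = next_reap_victim()));

                  /* Racing with exit_mmap(); back off if it already won */
                  if (down_read_trylock(&mm->mmap_sem)) {
                          __oom_reap_task_mm(mm);
                          up_read(&mm->mmap_sem);
                  }
          }

          return 0;
  }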
By relieving memory pressure faster via reaping, Simple LMK not only
doesn't need to kill as many processes, but also improves system
responsiveness when memory is low since memory pressure is relieved sooner.
Although not strictly required, Simple LMK should be the only one utilizing
the OOM reaper. Any other code that may utilize the OOM reaper, such as
patches that invoke the OOM reaper for all SIGKILLs, should be disabled.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
We can check if the waitqueue is actually active before calling wake_up()
in order to avoid an unnecessary wake_up() if the reclaim thread is already
running. Furthermore, the release barrier when zeroing needs_reclaim is
unnecessary, so remove it.
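In code form (names assumed from context):

  /* Skip the wakeup entirely if the reclaim thread isn't waiting */
  if (waitqueue_active(&reclaim_waitq))
          wake_up(&reclaim_waitq);

  /* Plain store; release ordering was never needed here */
  atomic_set(&needs_reclaim, 0);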
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Under extreme simulated memory pressure, the 'no processes available to
kill' message can be spammed hundreds of thousands of times, which is not
productive. Ratelimit it.
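For example, with the kernel's ratelimited printk helpers:

  /* At most one burst of these per ratelimit interval */
  pr_err_ratelimited("No processes available to kill!\n");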
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
As it turns out, victim scheduling priority elevation has always been
broken for two reasons:
1. The minimum valid RT priority is 1, not 0. As a result,
sched_setscheduler_nocheck() always fails with -EINVAL.
2. The thread within a victim thread group which happens to hold the mm is
not necessarily the only thread with references to the mm, and isn't
necessarily the thread which will release the final mm reference. As a
result, victim threads which hold mm references may take a while to
release them, and the unlucky thread which puts the final mm reference
may take a very long time to release all memory if it doesn't have RT
scheduling priority.
These issues cause victims to often take a very long time to release their
memory, possibly up to several seconds depending on system load. This, in
turn, causes Simple LMK to constantly hit the reclaim timeout and kill more
processes, with Simple LMK being rather ineffective since victims may not
release any memory for several seconds.
Fix the broken scheduling priority elevation by changing the RT priority to
the valid lowest priority of 1 and applying it to all threads in the thread
group, instead of just the thread which holds the mm.
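Sketched out (locking around the thread walk elided; min_rt_prio is a
name made up for illustration):

  static const struct sched_param min_rt_prio = { .sched_priority = 1 };
  struct task_struct *t;

  /* Elevate every thread in the group, not just the mm holder */
  for_each_thread(vtsk, t)
          sched_setscheduler_nocheck(t, SCHED_RR, &min_rt_prio);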
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
With freezable cgroups and their recent utilization in Android, it's
possible for some of Simple LMK's victims to be frozen at the time that
they're selected for killing. The forced SIGKILL used for killing
victims can only wake up processes containing TASK_WAKEKILL and/or
TASK_INTERRUPTIBLE, not TASK_UNINTERRUPTIBLE, which is the state used on
frozen tasks. In order to wake frozen tasks from their uninterruptible
slumber so that they can die, we must thaw them. Leaving victims frozen
can otherwise make them take an indefinite amount of time to process our
SIGKILL and thus free memory.
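Along the lines of:

  /* Frozen tasks sleep in TASK_UNINTERRUPTIBLE and won't see SIGKILL;
   * thaw the victim so the kill can actually be processed */
  if (frozen(vtsk))
          __thaw_task(vtsk);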
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There are two problems with the current uninterruptible wait used in the
reclaim thread: the hung task detector is upset about an uninterruptible
thread being asleep for so long, and killing processes can generate I/O.
Since killing a process can generate I/O, the reclaim thread should
participate in system-wide suspend operations. This neatly solves the
hung task detector issue since wait_event_freezable() puts the current
process into an interruptible sleep.
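That is, roughly:

  /* Interruptible and freezer-aware: keeps the hung task detector
   * quiet and freezes along with the rest of the system on suspend */
  wait_event_freezable(reclaim_waitq, atomic_read(&needs_reclaim));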
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
If it's possible for a task to have no pages, then there could be a case
where `pages_found` is zero while `nr_found` isn't, which would cause
the found tasks' locks to never be unlocked, and thus mayhem. We can
change the `pages_found` check to use `nr_found` instead in order to
naturally defend against this scenario, in case it is indeed possible.
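In other words (sketch):

  /* Bail out on the victim count rather than the page count, so
   * locked victims are never leaked when pages_found is zero */
  if (!nr_found)
          return 0;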
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Throttled direct reclaimers will wake up kswapd and wait for kswapd to
satisfy their page allocation request, even when the failed allocation
lacks the __GFP_KSWAPD_RECLAIM flag in its gfp mask. As a result, kswapd
may think that there are no waiters and thus exit prematurely, causing
throttled direct reclaimers lacking __GFP_KSWAPD_RECLAIM to stall on
waiting for kswapd to wake them up. Incrementing the kswapd_waiters
counter when such direct reclaimers become throttled fixes the problem.
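Roughly (assuming kswapd_waiters is an atomic counter, as the
description implies):

  /* A throttled direct reclaimer waits on kswapd even without
   * __GFP_KSWAPD_RECLAIM, so count it as a waiter too */
  if (!(gfp_mask & __GFP_KSWAPD_RECLAIM))
          atomic_long_inc(&kswapd_waiters);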
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Since the previous commit removed any case where grow_buffers()
would return failure due to memory allocations, we can safely
remove the case where we have to call free_more_memory() in
this function.
Since this is also the last user of free_more_memory(), kill
it off completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently use __GFP_NOFAIL for find_or_create_page(), which means
that it cannot fail. Ensure we also pass in 'retry == true' to
alloc_page_buffers(), which ensures that it cannot fail either.
After this, there are no failure cases in grow_dev_page() that
occur because of a failed memory allocation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of adding weird retry logic in that function, utilize
__GFP_NOFAIL to ensure that the vm takes care of handling any
potential retries appropriately. This means we don't have to
call free_more_memory() from here.
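The gist, inside alloc_page_buffers() (simplified):

  /* Let the VM handle any retries instead of open-coding them */
  gfp_t gfp = GFP_NOFS;

  if (retry)
          gfp |= __GFP_NOFAIL;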
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After a period of intense memory pressure is over, it's common for
vmpressure to still have old reclaim efficiency data accumulated from
this time. When memory pressure starts to rise again, this stale data
will factor into vmpressure's calculations, and can cause vmpressure to
report an erroneously high pressure. The reverse is possible, too:
vmpressure may report pressures that are erroneously low due to stale
data that's been stored.
Furthermore, since kswapd can still be performing reclaim when there are
no failed memory allocations stuck in the page allocator's slow path,
vmpressure may still report pressures when there aren't any memory
allocations to satisfy. This can cause last-resort memory reclaimers to
kill processes to free memory when it's not needed.
To fix the rampant stale data, keep track of when there are processes
utilizing reclaim in the page allocator's slow path, and reset the
accumulated data in vmpressure when a new period of elevated memory
pressure begins. Extra measures are taken for the kswapd issue mentioned
above by ignoring all reclaim efficiency data reported by kswapd when
there aren't any failed memory allocations in the page allocator which
utilize reclaim.
Note that since sr_lock can now be used from IRQ context, IRQs must be
disabled whenever sr_lock is used to prevent deadlocks.
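For instance:

  unsigned long flags;

  /* sr_lock can now be taken from IRQ context, so IRQs must be
   * disabled for the whole critical section */
  spin_lock_irqsave(&vmpr->sr_lock, flags);
  vmpr->scanned += scanned;
  vmpr->reclaimed += reclaimed;
  spin_unlock_irqrestore(&vmpr->sr_lock, flags);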
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Since the code that determines whether data should be cleared and the
code that actually clears the data are in separate spin-locked critical
sections, new data could be generated on another CPU after it is
determined that the existing data should be cleared, but before the
current CPU clears the existing data. This would cause the new data
reported by the other CPU to be lost.
Fix the race by clearing accumulated data within the same spin-locked
critical section that determines whether or not data should be cleared.
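Schematically (IRQ flag handling elided; reset_needed stands in for the
actual condition):

  spin_lock(&vmpr->sr_lock);
  /* Decide and clear under a single hold of the lock so data added
   * by another CPU in between can't be lost */
  if (reset_needed) {
          vmpr->scanned = 0;
          vmpr->reclaimed = 0;
  }
  vmpr->scanned += scanned;
  vmpr->reclaimed += reclaimed;
  spin_unlock(&vmpr->sr_lock);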
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The direct reclaim vmpressure path was erroneously excluded from the
PAGE_ALLOC_COSTLY_ORDER check which was added in commit "mm: vmpressure:
Ignore allocation orders above PAGE_ALLOC_COSTLY_ORDER".
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Hard-coding adj ranges to search for victims results in a few problems.
Firstly, the hard-coded adjs must be vigilantly updated to match what
userspace uses, which makes long-term support a headache. Secondly, a
full traversal of every running process must be done for each adj range,
which can turn out to be quite expensive, especially if userspace
assigns many different adj values and we want to enumerate them all.
This leads us to the final problem, which is that processes with
different adjs within the same hard-coded adj range will be treated the
same, even though they're not: the process with a higher adj is less
important, and the process with a lower adj is more important. This
could be fixed by enumerating every possible adj, but again, that would
necessitate several scans through the active process list, which is bad
for performance, especially since latency is critical here.
Since adjs are only 16 bits, and we only care about positive adjs, that
leaves us with 15 bits of the adj that matter. This is a relatively
small number of potential adjs (32,768), which makes it possible to
allocate a static array that's indexed using the adj. Each entry in this
array is a pointer to the first task_struct in a singly-linked list of
task_structs sharing an adj. A `simple_lmk_next` member is added to
task_struct to accommodate this linked list. The victim finder now
iterates downward through the array searching for linked lists of tasks,
starting from the highest adj found, so that the lowest-priority
processes are always considered first for reclaim. This fixes all of the
problems mentioned above, and now there is only one traversal through
every running process. The array itself only takes up 256 KiB of memory
on 64-bit, which is a very small price to pay for the advantages gained.
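A condensed sketch of the layout (helper names are illustrative):

  /* One list head per positive adj; 32768 pointers = 256 KiB on 64-bit */
  static struct task_struct *task_bucket[SHRT_MAX + 1];

  /* Insertion: push each eligible task onto its adj's list */
  tsk->simple_lmk_next = task_bucket[adj];
  task_bucket[adj] = tsk;

  /* Victim finder: highest (least important) adj first, in one pass */
  for (adj = max_adj; adj >= 0; adj--)
          for (tsk = task_bucket[adj]; tsk; tsk = tsk->simple_lmk_next)
                  consider_victim(tsk);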
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The victims array and mm_free_lock data structures can be used very
heavily in parallel on SMP, in which case they would benefit from being
cacheline-aligned. Make it so for SMP.
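For example:

  /* Keep hot shared data cacheline-aligned on SMP to avoid false
   * sharing between CPUs; mm_free_lock gets the same treatment */
  static struct victim_info victims[MAX_VICTIMS] __cacheline_aligned_in_smp;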
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When sort() isn't provided with a custom swap function, it falls back
onto its generic implementation of just swapping one byte at a time,
which is quite slow. Since we know the type of the objects being sorted,
we can provide our own swap function which simply uses the swap() macro.
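Roughly (names approximate):

  static void victim_swap(void *lhs, void *rhs, int size)
  {
          /* Swap whole structs instead of sort()'s byte-at-a-time loop */
          swap(*(struct victim_info *)lhs, *(struct victim_info *)rhs);
  }

  sort(victims, nr_victims, sizeof(*victims), victim_cmp, victim_swap);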
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When there aren't enough pages found, it means all of the victims that
were found need to be killed. The additional processing that attempts to
reduce the number of victims can be skipped in this case.
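That is, roughly:

  /* Not enough pages: every victim must be killed anyway, so skip
   * the pass that tries to shrink the victim list */
  if (pages_found >= pages_needed)
          reduce_victims();  /* hypothetical helper */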
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>