Since the code that determines whether data should be cleared and the
code that actually clears the data are in separate spin-locked critical
sections, new data could be generated on another CPU after it is
determined that the existing data should be cleared, but before the
current CPU clears the existing data. This would cause the new data
reported by the other CPU to be lost.
Fix the race by clearing accumulated data within the same spin-locked
critical section that determines whether or not data should be cleared.
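For illustration, a minimal sketch of the fixed flow in the work function,
assuming the mainline vmpressure field and helper names (sr_lock,
tree_scanned, tree_reclaimed, work_to_vmpressure()):

static void vmpressure_work_fn(struct work_struct *work)
{
        struct vmpressure *vmpr = work_to_vmpressure(work);
        unsigned long scanned, reclaimed;

        /*
         * Snapshot and clear the accumulated counters under one lock
         * acquisition so that data added by another CPU after the check
         * can no longer be wiped out before it is reported.
         */
        spin_lock(&vmpr->sr_lock);
        scanned = vmpr->tree_scanned;
        if (!scanned) {
                spin_unlock(&vmpr->sr_lock);
                return;
        }
        reclaimed = vmpr->tree_reclaimed;
        vmpr->tree_scanned = 0;
        vmpr->tree_reclaimed = 0;
        spin_unlock(&vmpr->sr_lock);

        /* ... deliver the pressure event based on scanned/reclaimed ... */
}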
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The direct reclaim vmpressure path was erroneously excluded from the
PAGE_ALLOC_COSTLY_ORDER check which was added in commit "mm: vmpressure:
Ignore allocation orders above PAGE_ALLOC_COSTLY_ORDER".
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Hard-coding adj ranges to search for victims results in a few problems.
Firstly, the hard-coded adjs must be vigilantly updated to match what
userspace uses, which makes long-term support a headache. Secondly, a
full traversal of every running process must be done for each adj range,
which can turn out to be quite expensive, especially if userspace
assigns many different adj values and we want to enumerate them all.
This leads us to the final problem, which is that processes with
different adjs within the same hard-coded adj range will be treated the
same, even though they're not: the process with a higher adj is less
important, and the process with a lower adj is more important. This
could be fixed by enumerating every possible adj, but again, that would
necessitate several scans through the active process list, which is bad
for performance, especially since latency is critical here.
Since adjs are only 16 bits, and we only care about positive adjs, that
leaves us with 15 bits of the adj that matter. This is a relatively
small number of potential adjs (32,768), which makes it possible to
allocate a static array that's indexed using the adj. Each entry in this
array is a pointer to the first task_struct in a singly-linked list of
task_structs sharing an adj. A `simple_lmk_next` member is added to
task_struct to accommodate this linked list. The victim finder now
iterates downward through the array searching for linked lists of tasks,
starting from the highest adj found, so that the lowest-priority
processes are always considered first for reclaim. This fixes all of the
problems mentioned above, and only a single pass over the running
process list is needed. The array itself only takes up 256 KiB of memory
on 64-bit, which is a very small price to pay for the advantages gained.
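A rough sketch of the structure described above (eligibility checks and
locking omitted; only the bucketing pass is shown):

#define MAX_ADJ 32768   /* positive 16-bit adjs: 0..32767 */

static struct task_struct *task_bucket[MAX_ADJ];

static void simple_lmk_bucket_tasks(void)
{
        struct task_struct *tsk;

        /*
         * One pass over every process: push each task with a positive adj
         * onto the list headed at its adj's slot, linked together through
         * the new simple_lmk_next member.
         */
        for_each_process(tsk) {
                short adj = READ_ONCE(tsk->signal->oom_score_adj);

                if (adj < 0)
                        continue;
                tsk->simple_lmk_next = task_bucket[adj];
                task_bucket[adj] = tsk;
        }
}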
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The victims array and mm_free_lock data structures can be used very
heavily in parallel on SMP, in which case they would benefit from being
cacheline-aligned. Make it so for SMP.
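For reference, this is just the standard attribute appended to the
declarations (the array name and size here are illustrative); it expands
to a cacheline alignment on SMP builds and to nothing on UP:

static struct victim_info victims[MAX_VICTIMS] ____cacheline_aligned_in_smp;

mm_free_lock gets the same ____cacheline_aligned_in_smp annotation so the
two heavily used objects don't share a cache line.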
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When sort() isn't provided with a custom swap function, it falls back
to its generic implementation, which swaps one byte at a time and is
quite slow. Since we know the type of the objects being sorted, we can
provide our own swap function that simply uses the swap() macro.
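A sketch of what such a swap callback looks like (the struct and function
names here are illustrative):

static void victim_swap(void *lhs, void *rhs, int size)
{
        /* Swap two victim_info entries wholesale instead of byte by byte. */
        swap(*(struct victim_info *)lhs, *(struct victim_info *)rhs);
}

It would then be passed as the final argument to sort(), e.g.
sort(victims, nr_victims, sizeof(*victims), victim_cmp, victim_swap),
with victim_cmp standing in for the existing compare callback.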
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When there aren't enough pages found, it means all of the victims that
were found need to be killed. The additional processing that attempts to
reduce the number of victims can be skipped in this case.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When the mm_free_lock write lock is held, it means that reclaim is
either starting or ending, in which case there's nothing that needs to
be done in simple_lmk_mm_freed(). We can use a trylock here instead to
avoid blocking.
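A sketch of the resulting shape, assuming mm_free_lock is a plain rwlock
and simple_lmk_mm_freed() only ever needs the read side:

void simple_lmk_mm_freed(struct mm_struct *mm)
{
        /*
         * The write side is only held while reclaim is starting or ending,
         * and there is nothing to mark freed at those points, so bail out
         * instead of spinning on the lock.
         */
        if (!read_trylock(&mm_free_lock))
                return;

        /* ... find the victim entry matching mm and mark it freed ... */

        read_unlock(&mm_free_lock);
}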
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Userspace could change these tunables and make Simple LMK function
poorly. Don't export them.
Reported-by: attack11 <fernandobouchet@gmail.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Simple LMK is now driven by VM pressure rather than by a kswapd hook as
before. Update the Kconfig description to reflect this.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When PSI is enabled, lmkd in userspace will use PSI notifications to
perform low memory kills. Therefore, to ensure that Simple LMK is the
only active LMK implementation, add a !PSI dependency.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This aids in selecting an adequate timeout. If the timeout is hit often
and Simple LMK is killing too much, then the timeout should be
lengthened. If the timeout is rarely hit and Simple LMK is not killing
fast enough under pressure, then the timeout should be shortened.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The synchronize_rcu() in namespace_unlock() is called every time
a filesystem is unmounted. If a great many filesystems are mounted,
this can cause a noticeable slow-down in, for example, system shutdown.
The sequence:
mkdir -p /tmp/Mtest/{0..5000}
time for i in /tmp/Mtest/*; do mount -t tmpfs tmpfs $i ; done
time umount /tmp/Mtest/*
on a 4-cpu VM can report 8 seconds to mount the tmpfs filesystems, and
100 seconds to unmount them.
Boot the same VM with 1 CPU and it takes 18 seconds to mount the
tmpfs filesystems, but only 36 seconds to unmount them.
If we change the synchronize_rcu() to synchronize_rcu_expedited(),
the umount time on a 4-cpu VM drops to 0.6 seconds.
I think this 200-fold speed-up is worth the slightly higher system
impact of using synchronize_rcu_expedited().
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> (from general rcu perspective)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Zeroing out the mm struct pointers when the timeout is hit isn't needed
because mm_free_lock prevents any readers from accessing the mm struct
pointers while clean-up occurs, and since the simple_lmk_mm_freed() loop
bound is set to zero during clean-up, there is no possibility of dying
processes ever reading stale mm struct pointers.
Therefore, the only remaining step when the timeout is reached is to
re-init the completion, and since reinit_completion() just sets a struct
member to zero, call it unconditionally rather than wrapping it in a
conditional statement.
Also take this opportunity to rename some variables and tidy up some
code indentation.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
We already check to see if each eligible process isn't already dying, so
an RCU read lock can be used to speed things up instead of holding the
tasklist read lock.
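Roughly, the traversal becomes the following (the eligibility checks
shown are illustrative):

static void simple_lmk_scan(void)
{
        struct task_struct *tsk;

        rcu_read_lock();
        for_each_process(tsk) {
                if (tsk->flags & PF_KTHREAD)
                        continue;
                /* ... existing adj and not-already-dying checks ... */
        }
        rcu_read_unlock();
}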
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The page allocator wakes all kswapds in an allocation context's allowed
nodemask in the slow path, so it doesn't make sense to keep the
kswapd-waiter count per NUMA node. Instead, it should be a global
counter so that all kswapds can be stopped once there are no outstanding
failed allocation requests.
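As a sketch, the per-node field becomes a single global counter that
kswapd can consult (the name is illustrative):

static atomic_long_t kswapd_waiters = ATOMIC_LONG_INIT(0);

static bool kswapd_still_needed(void)
{
        /*
         * The slow path increments this before waking the kswapds and
         * decrements it once its allocation is satisfied; when it hits
         * zero, every kswapd can stop early and go back to sleep.
         */
        return atomic_long_read(&kswapd_waiters) != 0;
}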
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
We are allowed to kill any process with a positive adj, so we shouldn't
exclude any processes with adjs greater than 999. This would present a
problem with quirky applications that set their own adj score, such as
stress-ng. In the case of stress-ng, it would set its adj score to 1000
and thus exempt itself from being killed by Simple LMK. This shouldn't
be allowed; any process with a positive adj, up to the highest possible
adj (32767), should be killable.
Reported-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
PAGE_ALLOC_COSTLY_ORDER allocations can cause vmpressure to incorrectly
think that memory pressure is high, when it's really just that the
allocation's high order is difficult to satisfy. When this rare scenario
occurs, ignore the input to vmpressure to avoid sending out a spurious
high-pressure signal.
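In code this amounts to an early bail-out before any pressure accounting,
assuming the allocation order is available at that point in this tree:

        /*
         * Costly high-order allocations can fail even when plenty of
         * memory is reclaimable, so don't let them feed the statistics.
         */
        if (order > PAGE_ALLOC_COSTLY_ORDER)
                return;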
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
It can be normal for a dying process to have its page allocation request
fail when it has an OOM or LMK kill pending. In this case, printing a
massive allocation failure message is actually detrimental: the process
needs to die quickly and release its memory, and that is slowed down
slightly by the large kmsg splat. The
allocation failure message is also a false positive in this case, since
the failure is intentional rather than being the result of an inability
to allocate memory.
Suppress the allocation failure warning for processes that are killed to
release memory in order to expedite their death and remedy the kmsg
confusion from seeing spurious allocation failure messages.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The page allocator uses tsk_is_oom_victim() to determine when to
fast-path memory allocations in order to get an allocating process out
of the page allocator and into do_exit() quickly. Unfortunately,
tsk_is_oom_victim()'s check to see if a process is killed for OOM
purposes is to look for the presence of an OOM reaper artifact that only
the OOM killer sets. This means that for processes killed by Simple LMK,
there is no fast-pathing done in the page allocator to get them to die
faster.
Remedy this by changing tsk_is_oom_victim() to look for the existence of
the TIF_MEMDIE flag, which Simple LMK sets for its victims.
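The change boils down to something like this (sketch; mainline instead
tests an OOM artifact in the signal struct):

static inline bool tsk_is_oom_victim(struct task_struct *tsk)
{
        /*
         * TIF_MEMDIE is set by the OOM killer and by Simple LMK for their
         * victims, so both kinds of kills now take the fast path.
         */
        return test_tsk_thread_flag(tsk, TIF_MEMDIE);
}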
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Caching the window size can result in delayed or inaccurate pressure
reports. Since calculating a fresh window size is cheap, do so all the
time instead of relying on a stale, cached value.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When no pages are scanned, it usually means no zones were reclaimable
and nothing could be done. In this case, the reported pressure should be
100 to elicit help from any listeners. This fixes the vmpressure
framework not working when memory pressure is very high.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Although userspace processes can't directly help with kernel memory
pressure, killing userspace processes can relieve kernel memory if they
are responsible for that pressure in the first place. It doesn't make
sense to exclude any allocation types knowing that userspace can indeed
affect all memory pressure, so don't exclude any allocation types from
the pressure calculations.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Android 10 changed its adj assignments. Update Simple LMK to use the
new adjs, which also requires looking at each pair of adjs as a range.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Using kswapd's scan depth to trigger task kills is inconsistent and
unreliable. When memory pressure quickly spikes, the kswapd scan depth
trigger fails to kick off Simple LMK fast enough, causing severe lag.
Additionally, kswapd could stop scanning prematurely before reaching the
desired scan depth to trigger Simple LMK, which could also cause stalls.
To remedy this, use the vmpressure framework instead, since it provides
more consistent and accurate readings on memory pressure. This is not
very tunable though, so remove CONFIG_ANDROID_SIMPLE_LMK_AGGRESSION.
Triggering Simple LMK to kill when the reported memory pressure is 100
should yield good results on all setups.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Right now the vmpressure window has a constant size of 2 MB, which
works well except in the following cases.
1) False vmpressure triggers are seen when the RAM size is greater
than 3GB. This results in lowmemorykiller, which uses vmpressure
events, killing tasks unnecessarily.
2) Vmpressure events are received late under memory pressure. This
behaviour is seen prominently in <=2GB RAM targets. This results in
lowmemorykiller kicking in late to kill tasks, resulting in avoidable
page cache reclaim.
The problem analysis shows that the issue is with the constant size
of the vmpressure window, which does not adapt to varying memory
conditions. This patch recalculates the vmpressure window size at
the end of each window. The chosen window size is proportional to
the total of free and cached memory at that point.
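A minimal sketch of the idea, written as a pure helper so no particular
page-counter API is assumed; the divisor is a placeholder rather than the
tuned scale factor from this patch:

/* Called at the end of each window; returns the size of the next one. */
static unsigned long vmpressure_calc_win(unsigned long free_pages,
                                         unsigned long cached_pages)
{
        return (free_pages + cached_pages) >> 4;
}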
Change-Id: I7e9ef4ddd82e2c2dd04ce09ec8d58a8829cfb64d
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
At present any vmpressure value is scaled up if the pages are
reclaimed through direct reclaim. This can result in false
vmpressure values. Consider a case where a device is booted up and most
of the memory is occupied by file pages. kswapd will make sure that the
high watermark is maintained. Now when a sudden huge allocation request
comes in, the system will definitely have to get into direct reclaim.
The vmpressure values can be very low, but because of the allocstall
accounting logic even these low values will be scaled to values nearing
100. This can result in unnecessary LMK kills, for example. So define a
tunable threshold
for vmpressure above which the allocstalls will be accounted.
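Illustrative shape of such a tunable (the name and default value are
placeholders):

/*
 * Direct-reclaim pressure below this raw value is reported as-is
 * instead of being scaled up by the allocstall accounting.
 */
static unsigned int allocstall_threshold = 70;
module_param(allocstall_threshold, uint, 0644);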
Change-Id: Idd7c6724264ac89f1f68f2e9d70a32390ffca3e5
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
The existing calculation of vmpressure takes into account only
the ratio of reclaimed to scanned pages, but not the time spent
or the difficulty in reclaiming those pages. For example, when there
are quite a number of file pages in the system, an allocation
request can be satisfied by reclaiming the file pages alone. If
such a reclaim is successful, the vmpressure value will remain low
irrespective of the time spent by the reclaim code to free up the
file pages. With a feature like lowmemorykiller, killing a task
can be faster than reclaiming the file pages alone. So if the
vmpressure values reflect the reclaim difficulty level, clients
can make a decision based on that, for example, to kill a task early.
This patch monitors the number of pages scanned in the direct
reclaim path and scales the vmpressure level according to that.
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Change-Id: I6e643d29a9a1aa0814309253a8b690ad86ec0b13
Currently, vmpressure is tied to memcg and its events are
available only to userspace clients. This patch removes
the dependency on CONFIG_MEMCG and adds a mechanism for
in-kernel clients to subscribe to vmpressure events (in
fact, raw vmpressure values are delivered instead of vmpressure
levels, to give clients more flexibility to take action
on custom pressure levels which are not currently defined
by the vmpressure module).
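An in-kernel subscriber would then look roughly like this; the callback
receives the raw 0-100 value, and the registration helper name follows
the usual notifier convention but is an assumption here:

static int my_vmpressure_cb(struct notifier_block *nb,
                            unsigned long pressure, void *data)
{
        /* Raw 0-100 pressure, not a memcg level, so clients can pick
         * their own thresholds. */
        if (pressure >= 95)
                pr_info("high memory pressure: %lu\n", pressure);
        return NOTIFY_OK;
}

static struct notifier_block my_vmpressure_nb = {
        .notifier_call = my_vmpressure_cb,
};

/* registered at init, e.g. vmpressure_notifier_register(&my_vmpressure_nb); */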
Change-Id: I38010f166546e8d7f12f5f355b5dbfd6ba04d587
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Resolve -Wenum-compare issue when comparing vmpressure level/model
against -1 (invalid state).
Change-Id: I1c76667ee8390e2d396c96e5ed73f30d0700ffa8
Signed-off-by: David Ng <dave@codeaurora.org>
Keeping kswapd running when all the failed allocations that invoked it
are satisfied incurs a high overhead due to unnecessary page eviction
and writeback, as well as spurious VM pressure events to various
registered shrinkers. When kswapd doesn't need to work to make an
allocation succeed anymore, stop it prematurely to save resources.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Swap memory usage is important when determining what to kill, so include
it in the victim size calculation.
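Sketch of the size estimate meant here, using the standard mm counters:

        /* Anonymous + file RSS plus pages the task has pushed out to swap. */
        unsigned long size = get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS);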
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
wake_up() executes a full memory barrier when waking a process up, so
there's no need for the acquire in the wait event. Additionally,
because of this, the atomic_cmpxchg() only needs a read barrier.
The cmpxchg() in simple_lmk_mm_freed() is atomic when it doesn't need to
be, so replace it with an extra line of code.
The atomic_inc_return() in simple_lmk_mm_freed() lies within a lock, so
it doesn't need explicit memory barriers.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Just raising the victim's priority to the minimum nice value isn't
enough to make it fully preempt everything in SCHED_FAIR, which is
important to make sure victims die quickly. Resource-wise, this isn't
very burdensome since the RT priority is just set to zero, and because
dying victims don't have much to do: they only need to finish whatever
they're doing quickly. SCHED_RR is used over SCHED_FIFO so that CPU time
between the victims is divided evenly to help them all finish at around
the same time, as fast as possible.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Simple LMK tries to wait until all of the victims it kills have their
memory freed; however, sometimes victims can take a while to die, which
can block Simple LMK from killing more processes in time when needed.
After the specified timeout elapses, Simple LMK will stop waiting and
make itself available to kill more processes.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
set_user_nice() doesn't schedule, and although set_cpus_allowed_ptr()
can schedule, it will only do so when the specified task cannot run on
the new set of allowed CPUs. Since cpu_all_mask is used,
set_cpus_allowed_ptr() will never schedule. Therefore, both the priority
elevation and cpus_allowed change can be moved to inside the task lock
to simplify and speed things up.
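The resulting sequence looks roughly like this (MIN_NICE stands in for
whatever priority value the code already uses):

        task_lock(victim);
        /*
         * Neither call can schedule here: set_user_nice() never does, and
         * set_cpus_allowed_ptr() won't because cpu_all_mask always allows
         * the CPU the victim is currently on.
         */
        set_user_nice(victim, MIN_NICE);
        set_cpus_allowed_ptr(victim, cpu_all_mask);
        task_unlock(victim);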
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
exit_mmap() is responsible for freeing the vast majority of an mm's
memory; in order to unblock Simple LMK faster, report an mm as freed as
soon as exit_mmap() finishes.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The OOM killer sets the TIF_MEMDIE thread flag for its victims to alert
other kernel code that the current process was killed due to memory
pressure, and needs to finish whatever it's doing quickly. In the page
allocator this allows victim processes to quickly allocate memory using
emergency reserves. This is especially important when memory pressure is
high; if all processes are taking a while to allocate memory, then our
victim processes will face the same problem and can potentially get
stuck in the page allocator for a while rather than die expeditiously.
To ensure that victim processes die quickly, set TIF_MEMDIE for the
entire victim thread group.
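Sketch of marking the whole thread group (the locking around the thread
walk is omitted):

        struct task_struct *t;

        for_each_thread(victim, t)
                set_tsk_thread_flag(t, TIF_MEMDIE);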
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Queuing up reclaim requests while a reclaim is in progress doesn't make
sense, since the additional reclaims may not be needed after the
existing reclaim completes. This would cause Simple LMK to go berserk
during periods of high memory pressure where kswapd would fire off
reclaim requests nonstop.
Make Simple LMK ignore new reclaim requests until an existing reclaim is
finished to prevent a slaughter-fest.
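One simple way to picture the gate (names are hypothetical; the real code
can just as well reuse its existing locking instead of a dedicated flag):

static atomic_t lmk_busy = ATOMIC_INIT(0);

static void simple_lmk_request(void)
{
        /*
         * Drop the request outright if a reclaim is already in flight;
         * whatever pressure remains afterwards will trigger a new one.
         */
        if (atomic_cmpxchg(&lmk_busy, 0, 1))
                return;

        simple_lmk_reclaim();   /* hypothetical worker */
        atomic_set(&lmk_busy, 0);
}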
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
After commit "simple_lmk: Make reclaim deterministic", Simple LMK's
behavior changed and thus requires some slight re-tuning to make it work
well again.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Using a parameter to pass around an unmodified pointer to a global
variable is crufty; just use the `victims` variable directly instead.
Also, compress the code in simple_lmk_init_set() a bit to make it look
cleaner.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The 20 ms delay in the reclaim thread is a hacky fudge factor that can
cause Simple LMK to behave wildly differently depending on the
circumstances of when it is invoked. When kswapd doesn't get enough CPU
time to finish up and go back to sleep within 20 ms, Simple LMK performs
superfluous reclaims.
This is suboptimal, so make Simple LMK more deterministic by eliminating
the delay and instead queuing up reclaim requests from kswapd.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When the reclaim thread writes to victims_to_kill on one CPU, it expects
the updated value to be immediately reflected on all CPUs in order for
simple_lmk_mm_freed() to work correctly. Due to the lack of memory
barriers to guarantee multicopy atomicity, simple_lmk_mm_freed() can be
given a victim's mm without knowing the correct victims_to_kill value,
which can cause the reclaim thread to remain stuck waiting forever for
all victims to be freed. This scenario, despite being rare, has been
observed.
Fix this by using proper atomic helpers with memory barriers.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
cmpxchg() is only atomic with respect to the local CPU, so it cannot be
relied on with how it's used in Simple LMK. Switch to fully atomic
operations instead for full atomic guarantees.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>