Just raising a victim's priority to the highest SCHED_FAIR priority (the
minimum nice value) isn't enough to make it fully preempt everything in
SCHED_FAIR, which is important to make sure victims die quickly.
Resource-wise this isn't very burdensome, since the RT priority is just
set to zero and dying victims don't have much work left to do: they only
need enough CPU time to finish exiting. SCHED_RR is used over SCHED_FIFO
so that CPU time is divided evenly between the victims, helping them all
finish at around the same time, as fast as possible.
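As a minimal sketch of the approach (not the verbatim Simple LMK code;
the helper name is illustrative), promoting a victim task to SCHED_RR
with RT priority 0 could look like this:

  #include <linux/sched.h>

  /* Promote one victim task so it preempts every SCHED_FAIR task */
  static void promote_victim(struct task_struct *vtsk)
  {
          static const struct sched_param rt_prio_zero; /* .sched_priority = 0 */

          /* _nocheck: no permission checks, the kernel is doing this itself */
          sched_setscheduler_nocheck(vtsk, SCHED_RR, &rt_prio_zero);
  }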
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Simple LMK tries to wait until all of the victims it kills have their
memory freed; however, sometimes victims can take a while to die, which
can block Simple LMK from killing more processes in time when needed.
After the specified timeout elapses, Simple LMK will stop waiting and
make itself available to kill more processes.
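A minimal sketch of the bounded wait, with illustrative names (the real
tunable and completion variable in Simple LMK may differ):

  #include <linux/completion.h>
  #include <linux/jiffies.h>
  #include <linux/printk.h>

  #define KILL_TIMEOUT_MS 200 /* illustrative value, not the real tunable */

  static DECLARE_COMPLETION(victims_freed);

  static void wait_for_victims_to_die(void)
  {
          /* Stop waiting after the timeout so new kills aren't blocked */
          if (!wait_for_completion_timeout(&victims_freed,
                                           msecs_to_jiffies(KILL_TIMEOUT_MS)))
                  pr_debug("simple_lmk: timed out waiting for victims\n");
  }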
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
set_user_nice() doesn't schedule, and although set_cpus_allowed_ptr()
can schedule, it will only do so when the specified task cannot run on
the new set of allowed CPUs. Since cpu_all_mask is used,
set_cpus_allowed_ptr() will never schedule. Therefore, both the priority
elevation and the cpus_allowed change can be moved inside the task lock
to simplify and speed things up.
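An illustrative ordering based on the reasoning above (a sketch, not the
verbatim Simple LMK code):

  #include <linux/cpumask.h>
  #include <linux/sched.h>

  static void boost_victim(struct task_struct *victim)
  {
          task_lock(victim);
          /* Neither call schedules here, so both are safe under the task lock */
          set_cpus_allowed_ptr(victim, cpu_all_mask);
          set_user_nice(victim, MIN_NICE);
          /* ... mark the task as a victim, send SIGKILL, etc. ... */
          task_unlock(victim);
  }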
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
exit_mmap() is responsible for freeing the vast majority of an mm's
memory; in order to unblock Simple LMK faster, report an mm as freed as
soon as exit_mmap() finishes.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The OOM killer sets the TIF_MEMDIE thread flag for its victims to alert
other kernel code that the current process was killed due to memory
pressure, and needs to finish whatever it's doing quickly. In the page
allocator this allows victim processes to quickly allocate memory using
emergency reserves. This is especially important when memory pressure is
high; if all processes are taking a while to allocate memory, then our
victim processes will face the same problem and can potentially get
stuck in the page allocator for a while rather than die expeditiously.
To ensure that victim processes die quickly, set TIF_MEMDIE for the
entire victim thread group.
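A minimal sketch of marking the whole thread group (illustrative helper
name, not the exact patch):

  #include <linux/rcupdate.h>
  #include <linux/sched.h>
  #include <linux/sched/signal.h>

  static void mark_victim_thread_group(struct task_struct *victim)
  {
          struct task_struct *t;

          rcu_read_lock();
          for_each_thread(victim, t)
                  set_tsk_thread_flag(t, TIF_MEMDIE);
          rcu_read_unlock();
  }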
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Queuing up reclaim requests while a reclaim is in progress doesn't make
sense, since the additional reclaims may not be needed after the
existing reclaim completes. This would cause Simple LMK to go berserk
during periods of high memory pressure where kswapd would fire off
reclaim requests nonstop.
Make Simple LMK ignore new reclaim requests until an existing reclaim is
finished to prevent a slaughter-fest.
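A sketch of the gating logic, with illustrative names:

  #include <linux/atomic.h>

  static atomic_t reclaim_in_progress = ATOMIC_INIT(0);

  void do_reclaim(void); /* hypothetical helper: picks and kills victims */

  static void simple_lmk_reclaim_request(void)
  {
          /* Ignore the request if a reclaim is already running */
          if (atomic_cmpxchg(&reclaim_in_progress, 0, 1))
                  return;

          do_reclaim();
          atomic_set(&reclaim_in_progress, 0);
  }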
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
After commit "simple_lmk: Make reclaim deterministic", Simple LMK's
behavior changed and thus requires some slight re-tuning to make it work
well again.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Using a parameter to pass around an unmodified pointer to a global
variable is crufty; just use the `victims` variable directly instead.
Also, compress the code in simple_lmk_init_set() a bit to make it look
cleaner.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The 20 ms delay in the reclaim thread is a hacky fudge factor that can
cause Simple LMK to behave wildly differently depending on the
circumstances of when it is invoked. When kswapd doesn't get enough CPU
time to finish up and go back to sleep within 20 ms, Simple LMK performs
superfluous reclaims.
This is suboptimal, so make Simple LMK more deterministic by eliminating
the delay and instead queuing up reclaim requests from kswapd.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When the reclaim thread writes to victims_to_kill on one CPU, it expects
the updated value to be immediately reflected on all CPUs in order for
simple_lmk_mm_freed() to work correctly. Due to the lack of memory
barriers to guarantee multicopy atomicity, simple_lmk_mm_freed() can be
given a victim's mm without knowing the correct victims_to_kill value,
which can cause the reclaim thread to remain stuck waiting forever for
all victims to be freed. This scenario, despite being rare, has been
observed.
Fix this by using proper atomic helpers with memory barriers.
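One way to express the required ordering (a sketch; the actual patch may
use different primitives and names): publish the count with release
semantics and read it with acquire semantics so simple_lmk_mm_freed()
always observes a count that matches the victims it is handed.

  #include <linux/atomic.h>

  static atomic_t victims_to_kill = ATOMIC_INIT(0);

  /* Reclaim thread: publish the count after the victim list is written */
  static void publish_victim_count(int nr)
  {
          atomic_set_release(&victims_to_kill, nr);
  }

  /* simple_lmk_mm_freed(): read the count before touching the victim list */
  static int read_victim_count(void)
  {
          return atomic_read_acquire(&victims_to_kill);
  }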
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
cmpxchg() is only atomic with respect to the local CPU, so it cannot be
relied on with how it's used in Simple LMK. Switch to fully atomic
operations instead.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Previously, pages_found would be calculated using an uninitialized
variable. Fix it.
Reported-by: Julian Liu <wlootlxt123@gmail.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This is a complete low memory killer solution for Android that is small
and simple. Processes are killed according to the priorities that
Android gives them, so that the least important processes are always
killed first. Processes are killed until memory deficits are satisfied,
as observed from kswapd struggling to free up pages. Simple LMK stops
killing processes when kswapd finally goes back to sleep.
The only tunables are the desired amount of memory to be freed per
reclaim event and desired frequency of reclaim events. Simple LMK tries
to free at least the desired amount of memory per reclaim and waits
until all of its victims' memory is freed before proceeding to kill more
processes.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
WALT has check_for_migration(), which calls find_energy_efficient_cpu().
With CASS, find_energy_efficient_cpu() is irrelevant, and
check_for_migration() isn't particularly useful even without that call,
so don't use it when CASS is used with WALT. There's no need for
IS_ENABLED(CONFIG_SCHED_WALT) here since the function is already guarded
by CONFIG_SCHED_WALT.
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
The Capacity Aware Superset Scheduler (CASS) optimizes runqueue selection
of CFS tasks. By using CPU capacity as a basis for comparing the relative
utilization between different CPUs, CASS fairly balances load across CPUs
of varying capacities. This results in improved multi-core performance,
especially when CPUs are overutilized because CASS doesn't clip a CPU's
utilization when it eclipses the CPU's capacity.
As a superset of capacity-aware scheduling, CASS implements a hierarchy
of criteria to decide which CPU a task should wake on when multiple CPUs
have the same relative utilization. This way, single-core performance,
latency, and cache affinity are all optimized where possible.
CASS doesn't feature explicit energy awareness but its basic load balancing
principle results in decreased overall energy, often better than what is
possible with explicit energy awareness. By fairly balancing load based on
relative utilization, all CPUs are kept at their lowest P-state necessary
to satisfy the overall load at any given moment.
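A minimal sketch of the core comparison (not the actual CASS code):
relative utilization is a CPU's utilization scaled against its capacity,
and the less relatively utilized CPU wins.

  /*
   * Returns true if CPU a is less relatively utilized than CPU b, i.e.
   * util_a / cap_a < util_b / cap_b, computed without division.
   */
  static bool cass_less_utilized(unsigned long util_a, unsigned long cap_a,
                                 unsigned long util_b, unsigned long cap_b)
  {
          return util_a * cap_b < util_b * cap_a;
  }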
This version of CASS is adjusted to work on older kernels.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: clarencelol <clarencekuiek@icloud.com>
When the rotator is actually used (still an unsolved question in
computer science), these PM QoS requests block some CPUs in the LITTLE
cluster from entering deep idle because the driver assumes that display
rotating work occurs on a hardcoded set of CPUs, which is false. We
already have the IRQ PM QoS machinery for display rendering operations
that actually matter, so this cruft is unneeded.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
These are blocking some CPUs in the LITTLE cluster from entering deep
idle because the driver assumes that display rendering work occurs on a
hardcoded set of CPUs, which is false. The scope of this is also quite
large, which increases power consumption.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
Combined with LTO, this yields a consistent 5% boost to procfs I/O
performance right off the bat (as measured with callbench). The spin
lock functions constitute some of the hottest code paths in the kernel;
inlining them to improve performance makes sense.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
There's plenty of room on the stack for a few more inlined bytes here
and there. The measured stack usage at runtime is still safe without
this, and performance is surely improved at a microscopic level, so
remove it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
A measurably significant amount of CPU time is spent in these routines
while the camera is open. These are also responsible for a grotesque
amount of dmesg spam, so let's nuke them.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
This call to smp_processor_id() forces gic_raise_softirq() to require
being called while preemption is disabled, which isn't an actual
requirement. When called without preemption disabled, smp_processor_id()
is thus used incorrectly and generates a warning splat with the relevant
kernel debug options enabled.
Get rid of the useless pr_devel message outright to fix the incorrect
smp_processor_id() usage.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
In order to prevent redundant entry creation by racing against itself,
mb_cache_entry_create scans through a large hash-list of all current
entries in order to see if another allocation for the requested new
entry has been made. Furthermore, it allocates memory for a new entry
before scanning through this hash-list, which results in that allocated
memory being discarded when the requested new entry is already present.
This happens more than half the time.
Speed up cache entry creation by keeping a small linked list of
requested new entries in progress, and scanning through that first
instead of the large hash-list. Additionally, don't bother allocating
memory for a new entry until it's known that the allocated memory will
be used.
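A sketch of the idea (illustrative structure and names, not the actual
mbcache patch): keep a short list of creations in flight and check it
before allocating a new entry or walking the big hash list.

  #include <linux/list.h>
  #include <linux/spinlock.h>

  struct pending_create {
          struct list_head list;
          u32 key;
          u64 value;
  };

  static LIST_HEAD(pending_creates);
  static DEFINE_SPINLOCK(pending_lock);

  /* Returns true if another caller is already creating this entry */
  static bool entry_create_in_progress(u32 key, u64 value)
  {
          struct pending_create *p;
          bool busy = false;

          spin_lock(&pending_lock);
          list_for_each_entry(p, &pending_creates, list) {
                  if (p->key == key && p->value == value) {
                          busy = true;
                          break;
                  }
          }
          spin_unlock(&pending_lock);

          return busy;
  }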
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
For the vast majority of mmio operations in this driver, explicit memory
barriers aren't needed either because a data dependency between a read
and write already exists, or because of the presence of the spin locks
which execute a full memory barrier.
Removing all the unneeded explicit barriers considerably reduces
overhead for pinctrl operations, which in turn benefits things like i2c.
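An illustrative before/after of the accessor change (the helper name is
hypothetical, not the exact msm pinctrl code): inside a spinlock-protected
section, the relaxed MMIO helpers avoid the extra barriers that readl()
and writel() imply.

  #include <linux/bitops.h>
  #include <linux/io.h>
  #include <linux/spinlock.h>

  static void pctrl_set_bit(void __iomem *reg, unsigned int bit,
                            spinlock_t *lock)
  {
          unsigned long flags;
          u32 val;

          spin_lock_irqsave(lock, flags);
          val = readl_relaxed(reg);            /* was: readl(reg) */
          writel_relaxed(val | BIT(bit), reg); /* was: writel(...) */
          spin_unlock_irqrestore(lock, flags);
  }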
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
There's no reason to hold an RCU read lock the entire time while
optimistically spinning for a rwsem. This can needlessly lengthen RCU
grace periods and slow down synchronize_rcu() when it doesn't brute
force the RCU grace period via rcupdate.rcu_expedited=1.
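A simplified sketch of the narrower RCU scope (not the exact rwsem code;
assumes the owner field provided by CONFIG_RWSEM_SPIN_ON_OWNER):

  #include <linux/rcupdate.h>
  #include <linux/rwsem.h>
  #include <linux/sched.h>

  /* Spin while the owner is running, taking the RCU read lock only
   * around the owner dereference instead of across the whole loop. */
  static void spin_while_owner_running(struct rw_semaphore *sem)
  {
          for (;;) {
                  struct task_struct *owner;
                  bool on_cpu;

                  rcu_read_lock();
                  owner = READ_ONCE(sem->owner);
                  on_cpu = owner && owner->on_cpu;
                  rcu_read_unlock();

                  if (!on_cpu)
                          break;

                  cpu_relax();
          }
  }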
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
There's no reason to hold an RCU read lock the entire time while
optimistically spinning for a mutex lock. This can needlessly lengthen
RCU grace periods and slow down synchronize_rcu() when it doesn't brute
force the RCU grace period via rcupdate.rcu_expedited=1.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
It isn't guaranteed a CPU will idle upon calling lpm_cpuidle_enter(),
since it could abort early at the need_resched() check. In this case,
it's possible for an IPI to be sent to this "idle" CPU needlessly, thus
wasting power. For the same reason, it's also wasteful to keep a CPU
marked idle even after it's woken up.
Shrink the window during which CPUs are marked idle to be as small as
possible in order to improve power consumption.
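A sketch of the shrunken marked-idle window (the mask, helper names, and
the assumption that NR_CPUS fits in an atomic_t are all illustrative):

  #include <linux/atomic.h>
  #include <linux/bitops.h>
  #include <linux/sched.h>

  static atomic_t cpus_in_idle = ATOMIC_INIT(0);

  static int lpm_enter(int cpu)
  {
          /* Abort before being marked idle, so no wakeup IPI is wasted */
          if (need_resched())
                  return -EBUSY;

          atomic_or(BIT(cpu), &cpus_in_idle);
          cpu_do_idle();  /* stand-in for the platform's low-power entry */
          atomic_andnot(BIT(cpu), &cpus_in_idle);

          return 0;
  }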
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
The pm_qos callback currently suffers from a number of pitfalls: it
sends IPIs to CPUs that may not be idle, waits for those IPIs to finish
propagating while preemption is disabled (resulting in a long busy wait
for the pm_qos_update_target() caller), and needlessly calls a no-op
function when the IPIs are processed.
Optimize the pm_qos notifier by only sending IPIs to CPUs that are
idle, and by using arch_send_wakeup_ipi_mask() instead of
smp_call_function_many(). Using IPI_WAKEUP instead of IPI_CALL_FUNC,
which is what smp_call_function_many() uses behind the scenes, has the
benefit of doing zero work upon receipt of the IPI; IPI_WAKEUP is
designed purely for sending an IPI without a payload, whereas
IPI_CALL_FUNC does unwanted extra work just to run the empty
smp_callback() function.
Determining which CPUs are idle is done efficiently with an atomic
bitmask instead of the wake_up_if_idle() API, which checks the CPU's
runqueue inside an RCU read-side critical section and under a spin lock;
that's not very efficient compared to a simple atomic bitwise operation.
A cpumask isn't needed for this because NR_CPUS is guaranteed to fit
within a word.
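A simplified sketch of the notifier flow described above, reusing the
illustrative cpus_in_idle mask from the sketch earlier (the callback name
is hypothetical; arch_send_wakeup_ipi_mask() is the arm64 wakeup-IPI API
exposed by the IPI_WAKEUP change below):

  #include <linux/atomic.h>
  #include <linux/bitops.h>
  #include <linux/cpumask.h>
  #include <linux/notifier.h>

  static int lpm_latency_notify(struct notifier_block *nb,
                                unsigned long latency_us, void *data)
  {
          unsigned long idle = (unsigned long)atomic_read(&cpus_in_idle);
          struct cpumask mask;
          int cpu;

          cpumask_clear(&mask);
          for_each_set_bit(cpu, &idle, nr_cpu_ids)
                  cpumask_set_cpu(cpu, &mask);

          /* Payload-free wakeup IPI, only to CPUs that are actually idle */
          if (!cpumask_empty(&mask))
                  arch_send_wakeup_ipi_mask(&mask);

          return NOTIFY_OK;
  }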
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
An empty IPI is useful for cpuidle to wake sleeping CPUs without causing
them to do unnecessary work upon receipt of the IPI. IPI_WAKEUP fills
this use-case nicely, so let it be used outside of the ACPI parking
protocol.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
None of the pm_qos functions actually run in interrupt context; if some
driver calls pm_qos_update_target in interrupt context then it's already
broken. There's no need to disable interrupts while holding pm_qos_lock,
so don't do it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
This reverts commit 1e5a5b5e00e9706cd48e3c87de1607fcaa5214d2.
This doesn't make sense for a few reasons. Firstly, upstream uses this
mutex code and it works fine on all arches; why should arm be any
different?
Secondly, once the mutex owner starts to spin on `wait_lock`,
preemption is disabled and the owner will be in an actively-running
state. The optimistic mutex spinning occurs when the lock owner is
actively running on a CPU, and while the optimistic spinning takes
place, no attempt to acquire `wait_lock` is made by the new waiter.
Therefore, it is guaranteed that new mutex waiters which optimistically
spin will not contend the `wait_lock` spin lock that the owner needs to
acquire in order to make forward progress.
Another potential source of `wait_lock` contention can come from tasks
that call mutex_trylock(), but this isn't actually problematic (and if
it were, it would affect the MUTEX_SPIN_ON_OWNER=n use-case too). This
won't introduce significant contention on `wait_lock` because the
trylock code exits before attempting to lock `wait_lock`, specifically
when the atomic mutex counter indicates that the mutex is already
locked. So in reality, the amount of `wait_lock` contention that can
come from mutex_trylock() amounts to only one task. And once it
finishes, `wait_lock` will no longer be contended and the previous
mutex owner can proceed with clean up.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
This reverts commit 0db49c2550a09458db188fb7312c66783c5af104.
This results in kmalloc() abuse to find a large number of contiguous
pages, which thrashes the page allocator and hurts overall performance.
I couldn't reproduce the improved MTP throughput that this commit
claimed either, so just revert it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
This reverts commit a9a60c58e0fa21c41ac284282949187b13bdd756.
This results in kmalloc() abuse to find a large number of contiguous
pages, which thrashes the page allocator and hurts overall performance.
I couldn't reproduce the improved MTP throughput that this commit
claimed either, so just revert it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
The scope of this driver's lock usage is extremely wide, leading to
excessively long lock hold times. Additionally, there is a lot of
excessive linked-list traversal and unnecessary dynamic memory
allocation in a critical path, causing poor performance across the
board.
Fix all of this by greatly reducing the scope of the locks used and by
significantly reducing the amount of operations performed when
msm_dma_map_sg_attrs() is called. The entire driver's code is overhauled
for better cleanliness and performance.
Note that ION must be modified to pass a known structure via the private
dma_buf pointer, so that the IOMMU driver can prevent races when
operating on the same buffer concurrently. This is the only way to
eliminate said buffer races without hurting the IOMMU driver's
performance.
Some additional members are added to the device struct as well to make
these various performance improvements possible.
This also removes the manual cache maintenance since ION already handles
it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
commit b312b4f0e2f9 ("iommu: arm-smmu: Preallocate memory for map
only on failure") had the following two errors:
1. The return code checked when map_sg fails and we preallocate is
wrong. The check should be for 0, not -ENOMEM, so the preallocation
never happens when map_sg fails.
2. map_sg could have mapped some elements of the sglist before
failing. With the proper check, we would call map_sg again on the
same size, which would lead to double-mapping the previously mapped
elements of the sglist.
Fix this by returning the actual ret code from arm_lpae_map_sg() and
checking it against -ENOMEM to decide whether to preallocate. Also,
unmap any partial iovas that were mapped previously.
Change-Id: Ifee7c0bed6b9cf1c35ebb4a03d51a1a80ab0ed58
Signed-off-by: Sudarshan Rajagopalan <sudaraja@codeaurora.org>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
page allocation failure: order:0, mode:0x2088020(GFP_ATOMIC|__GFP_ZERO)
Call trace:
[<ffffff80080f15c8>] dump_backtrace+0x0/0x248
[<ffffff80080f1894>] show_stack+0x18/0x28
[<ffffff8008484984>] dump_stack+0x98/0xc0
[<ffffff8008231b0c>] warn_alloc+0x114/0x134
[<ffffff8008231f7c>] __alloc_pages_nodemask+0x3e8/0xd30
[<ffffff8008232b2c>] alloc_pages_exact+0x4c/0xa4
[<ffffff800866bec4>] arm_smmu_alloc_pages_exact+0x188/0x1bc
[<ffffff8008664b28>] io_pgtable_alloc_pages_exact+0x30/0xa0
[<ffffff8008664ff8>] __arm_lpae_alloc_pages+0x40/0x1c8
[<ffffff8008665cb4>] __arm_lpae_map+0x224/0x3b4
[<ffffff8008665b98>] __arm_lpae_map+0x108/0x3b4
[<ffffff8008666474>] arm_lpae_map+0x78/0x9c
[<ffffff800866aed4>] arm_smmu_map+0x80/0xdc
[<ffffff800866015c>] iommu_map+0x118/0x284
[<ffffff8008c66294>] cam_smmu_alloc_firmware+0x188/0x3c0
[<ffffff8008cc8afc>] cam_icp_mgr_hw_open+0x88/0x874
[<ffffff8008cca030>] cam_icp_mgr_acquire_hw+0x2d4/0xc9c
[<ffffff8008c5fe84>] cam_context_acquire_dev_to_hw+0xb0/0x26c
[<ffffff8008cd0ce0>] __cam_icp_acquire_dev_in_available+0x1c/0xf0
[<ffffff8008c5ea98>] cam_context_handle_acquire_dev+0x5c/0x1a8
[<ffffff8008c619b4>] cam_node_handle_ioctl+0x30c/0xdc8
[<ffffff8008c62640>] cam_subdev_compat_ioctl+0xe4/0x1dc
[<ffffff8008bcf8bc>] subdev_compat_ioctl32+0x40/0x68
[<ffffff8008bd3858>] v4l2_compat_ioctl32+0x64/0x1780
In order to avoid order-0 page allocation failures during the SMMU map
operation, the existing implementation preallocates the required memory
using GFP_KERNEL so as to make sure that there is sufficient page table
memory available and the atomic allocation succeeds during the map
operation. This might not be necessary for every single map call, as the
atomic allocation might succeed most of the time. Hence, preallocate the
necessary memory only when the map operation fails due to insufficient
memory, and then retry the map operation with the preallocated memory.
This solution applies only to map calls made from a non-atomic context.
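A simplified sketch of that retry flow; every struct and helper name
here is hypothetical rather than taken from the actual arm-smmu code:

  #include <linux/gfp.h>

  struct smmu_map_req;                                 /* hypothetical */
  int do_map_sg(struct smmu_map_req *req);             /* hypothetical: returns
                                                          -ENOMEM when page-table
                                                          allocation fails */
  int prealloc_pgtable_mem(struct smmu_map_req *req,
                           gfp_t gfp);                 /* hypothetical */

  static int smmu_map_sg_with_prealloc(struct smmu_map_req *req)
  {
          int ret;

          /* First attempt: page tables come from atomic allocations */
          ret = do_map_sg(req);
          if (ret != -ENOMEM)
                  return ret;

          /*
           * Atomic allocation failed and we're in a non-atomic context:
           * preallocate page-table memory with GFP_KERNEL and retry.
           */
          ret = prealloc_pgtable_mem(req, GFP_KERNEL);
          if (ret)
                  return ret;

          return do_map_sg(req);
  }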
Change-Id: I417f311c2224eb863d6c99612b678bbb2dd3db58
Signed-off-by: Swathi Sridhar <swatsrid@codeaurora.org>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
When memory is leaking, it's going to be harder to allocate more memory,
making it more likely for this failure condition inside of kmemleak to
manifest itself. This is extremely frustrating since kmemleak kills
itself upon the first instance of memory allocation failure.
Bypass that and make kmemleak more resilient when memory is running low.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
The memory allocated dynamically here is just used to store a single
instance of a struct. Allocate both possible structs on the stack
instead of allocating them dynamically to improve performance.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
Trying to wait for fences that have already been signaled incurs a high
setup cost, since dynamic memory allocation must be used. Avoiding this
overhead when it isn't needed improves performance.
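A sketch of the early-out, assuming the dma_fence API (the driver's
actual fence type and wait helper may differ):

  #include <linux/dma-fence.h>

  static long wait_fence(struct dma_fence *fence, long timeout)
  {
          /* Already signaled: skip allocating and registering a waiter */
          if (dma_fence_is_signaled(fence))
                  return timeout ? timeout : 1;

          return dma_fence_wait_timeout(fence, true, timeout);
  }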
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
A measurably significant amount of CPU time is spent on logging events
for debugging purposes in lpm_cpuidle_enter. Kill the useless logging to
reduce overhead.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
A lot of CPU time is wasted on allocating, populating, and copying
debug names back and forth with userspace when they're not actually
needed. We can't just remove the name buffers from the various sync data
structures though because we must preserve ABI compatibility with
userspace, but instead we can just pretend the name fields of the
user-shared structs aren't there. This massively reduces the size of the
memory allocated for these data structures and the amount of data copied
to and from userspace, and eliminates a kzalloc() entirely from
sync_file_ioctl_fence_info(), thus improving graphics performance.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
Giving userspace intimate control over CPU latency requirements is
nonsense. Userspace can't even stop itself from being preempted, so
there's no reason for it to have access to a mechanism primarily used to
eliminate CPU delays on the order of microseconds.
Remove userspace's ability to send pm_qos requests so that it can't hurt
power consumption.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>
This allows pm_qos votes of, say, 100 us to select power levels with
exit latencies equal to 100 us. The extra microsecond of
exit latency doesn't hurt.
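The change boils down to making the comparison inclusive (illustrative
helper, not the actual lpm-levels code):

  #include <linux/types.h>

  static bool lpm_level_allowed(u32 exit_latency_us, u32 latency_req_us)
  {
          /* '<=' rather than '<': an exact match is now allowed */
          return exit_latency_us <= latency_req_us;
  }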
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Ruchit <ruchitmarathe@gmail.com>