             Central, scheduler-driven, power-performance control
                                (EXPERIMENTAL)

Abstract
========

The topic of a single simple power-performance tunable, that is wholly
scheduler centric and has well defined and predictable properties, has come
up on several occasions in the past [1,2]. With techniques such as
scheduler-driven DVFS [3], we now have a good framework for implementing such
a tunable. This document describes the overall ideas behind its design and
implementation.

Table of Contents
=================

1. Motivation
2. Introduction
3. Signal Boosting Strategy
4. OPP selection using boosted CPU utilization
5. Per task group boosting
6. Per-task wakeup-placement-strategy Selection
7. Questions and Answers
   - What about "auto" mode?
   - How are multiple groups of tasks with different boost values managed?
8. References

1. Motivation
=============

Schedutil [3] is a utilization-driven cpufreq governor which allows the
scheduler to select the optimal DVFS operating point (OPP) for running a task
allocated to a CPU.

However, sometimes it may be desirable to intentionally boost the performance
of a workload even if that implies a reasonable increase in energy
consumption. For example, in order to reduce the response time of a task, we
may want to run the task at a higher OPP than the one actually required by its
CPU bandwidth demand.

This last requirement is especially important if we consider that one of the
main goals of the utilization-driven governor is to replace all currently
available CPUFreq policies. Since schedutil is event-based, as opposed to the
sampling-driven governors we currently have, it is already more responsive at
selecting the optimal OPP to run tasks allocated to a CPU. However, just
tracking the actual task utilization may not be enough from a performance
standpoint. For example, it is not possible to get behaviors similar to those
provided by the "performance" and "interactive" CPUFreq governors.

This document describes an implementation of a tunable, stacked on top of the
utilization-driven governor, which extends its functionality to support task
performance boosting.

By "performance boosting" we mean the reduction of the time required to
complete a task activation, i.e. the time elapsed from a task wakeup to its
next deactivation (e.g. because it goes back to sleep or it terminates). For
example, if we consider a simple periodic task which executes the same
workload for 5[s] every 20[s] while running at a certain OPP, a boosted
execution of that task must complete each of its activations in less than
5[s].

The rest of this document introduces the proposed solution, named SchedTune,
in more detail.

2. Introduction
===============

SchedTune exposes a simple user-space interface through a new CGroup
controller 'stune' which provides two power-performance tunables per group:

  /<stune cgroup mount point>/schedtune.prefer_idle
  /<stune cgroup mount point>/schedtune.boost

The CGroup implementation permits arbitrary user-space defined task
classification, so that the scheduler can be tuned for different goals
depending on the specific nature of the task, e.g. background vs interactive
vs low-priority.

More details are given in section 5.

2.1 Boosting
============

The boost value is expressed as an integer in the range [0..100].

A value of 0 (default) configures the CFS scheduler for maximum energy
efficiency. This means that schedutil runs the tasks at the minimum OPP
required to satisfy their workload demand.

A value of 100 configures the scheduler for maximum performance, which
translates to the selection of the maximum OPP on that CPU.

Values between 0 and 100 can be used to suit other scenarios, for example to
satisfy interactive response requirements, or to react to other system events
(battery level, etc.).

The overall design of the SchedTune module is built on top of "Per-Entity
Load Tracking" (PELT) signals and schedutil, by introducing a bias on the OPP
selection. Each time a task is allocated on a CPU, cpufreq is given the
opportunity to tune the operating frequency of that CPU to better match the
workload demand. The selection of the actual OPP being activated is
influenced by the boost value for the task CGroup.

This simple biasing approach leverages existing frameworks, which means
minimal modifications to the scheduler, and yet it makes it possible to
achieve a range of different behaviours, all from a single simple tunable
knob. In EAS schedulers, the boosted task and CPU utilization are also used
for energy calculation and energy-aware task placement.

2.2 prefer_idle
===============

This is a flag which indicates to the scheduler whether userspace would like
it to focus on energy or on performance.

A value of 0 (default) signals to the CFS scheduler that tasks in this group
can be placed according to the energy-aware wakeup strategy.

A value of 1 signals to the CFS scheduler that tasks in this group should be
placed to minimise wakeup latency.

Android platforms typically use this flag for application tasks which the
user is currently interacting with.

3. Signal Boosting Strategy
===========================

The whole PELT machinery works based on the value of a few load tracking
signals which basically track the CPU bandwidth requirements of tasks and the
capacity of CPUs. The basic idea behind the SchedTune knob is to artificially
inflate some of these load tracking signals, to make a task or RQ appear more
demanding than it actually is.

Which signals have to be inflated depends on the specific "consumer".
However, independently of the specific (signal, consumer) pair, it is
important to define a simple and possibly consistent strategy for the concept
of boosting a signal.

A boosting strategy defines how the "abstract" user-space defined
sched_cfs_boost value is translated into an internal "margin" value to be
added to a signal to get its inflated value:

  margin         := boosting_strategy(sched_cfs_boost, signal)
  boosted_signal := signal + margin

The boosting strategy currently implemented in SchedTune is called 'Signal
Proportional Compensation' (SPC). With SPC, the sched_cfs_boost value is used
to compute a margin which is proportional to the complement of the original
signal. When a signal has a maximum possible value, its complement is defined
as the delta between the actual value and that maximum.

Since the tunable implementation uses signals which have SCHED_CAPACITY_SCALE
as the maximum possible value, the margin becomes:

  margin := sched_cfs_boost * (SCHED_CAPACITY_SCALE - signal)

Using this boosting strategy:
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
- each value in the range of sched_cfs_boost effectively inflates the signal
  in question by a quantity which is proportional to the signal's distance
  from the maximum value

For example, by applying the SPC boosting strategy to the selection of the
OPP to run a task, it is possible to achieve these behaviors:

- 0% boosting:   run the task at the minimum OPP required by its workload
- 100% boosting: run the task at the maximum OPP available for the CPU
- 50% boosting:  run at the half-way OPP between minimum and maximum

This means that, at 50% boosting, a task will be scheduled to run at half of
the maximum theoretically achievable performance on the specific target
platform.

A graphical representation of an SPC boosted signal is given in the following
figure, where:
 a) "-" represents the original signal
 b) "b" represents a 50% boosted signal
 c) "p" represents a 100% boosted signal

  ^
  |  SCHED_CAPACITY_SCALE
  +-----------------------------------------------------------------+
  |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
  |
  |                                             boosted_signal
  |                                          bbbbbbbbbbbbbbbbbbbbbbbb
  |
  |                                            original signal
  |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
  |                                          |
  |bbbbbbbbbbbbbbbbbb                        |
  |                                          |
  |                                          |
  |                                          |
  |                  +-----------------------+
  |                  |
  |                  |
  |                  |
  |------------------+
  |
  |
  +----------------------------------------------------------------------->

The plot above shows a ramped load signal (titled 'original_signal') and its
boosted equivalent. For each step of the original signal the boosted signal
corresponding to a 50% boost is midway between the original signal and the
upper bound. Boosting by 100% generates a boosted signal which is always
saturated to the upper bound.

4. OPP selection using boosted CPU utilization
==============================================

It is worth calling out that the implementation does not introduce any new
load signals. Instead, it provides an API to tune existing signals. This
tuning is done on demand and only in scheduler code paths where it is
sensible to do so. The new API calls are defined to return either the default
signal or a boosted one, depending on the value of sched_cfs_boost. This is a
clean and non-invasive modification of the existing code paths.

The signal representing a CPU's utilization is boosted according to the
previously described SPC boosting strategy. To schedutil, this allows a CPU
(i.e. a CFS run-queue) to appear more used than it actually is.

Thus, with sched_cfs_boost enabled, we have the following main functions to
get the current utilization of a CPU:

  cpu_util()
  boosted_cpu_util()

The new boosted_cpu_util() is similar to the first but returns a boosted
utilization signal which is a function of the sched_cfs_boost value.

This function is used in the CFS scheduler code paths where schedutil needs
to decide the OPP to run a CPU at. For example, this allows selecting the
highest OPP for a CPU which has the boost value set to 100%.

5. Per task group boosting
==========================

On battery powered devices there are usually many background services which
are long running and need energy-efficient scheduling. On the other hand,
some applications are more performance sensitive and require an interactive
response and/or maximum performance, regardless of the energy cost.

To better service such scenarios, the SchedTune implementation has an
extension that provides a more fine-grained boosting interface.

A new CGroup controller, namely "schedtune", can be enabled which allows
task groups with different boosting values to be defined and configured.
Tasks that require special performance can be put into separate CGroups.
The value of the boost associated with the tasks in this group can be
specified using a single knob exposed by the CGroup controller:

  schedtune.boost

This knob allows the definition of a boost value that is to be used for
SPC boosting of all tasks attached to this group.

The current schedtune controller implementation is really simple and has
these main characteristics:

  1) It is only possible to create 1 level depth hierarchies

     The root control group defines the system-wide boost value to be applied
     by default to all tasks. Its direct subgroups are named "boost groups"
     and they define the boost value for specific sets of tasks.
     Further nested subgroups are not allowed since they do not have a
     sensible meaning from a user-space standpoint.

  2) It is possible to define only a limited number of "boost groups"

     This number is defined at compile time and by default configured to 16.
     This is a design decision motivated by two main reasons:
     a) In a real system we do not expect utilization scenarios with more
        than a few boost groups. For example, a reasonable collection of
        groups could be just "background", "interactive" and "performance".
     b) It simplifies the implementation considerably, especially for the
        code which has to compute the per-CPU boosting once there are
        multiple RUNNABLE tasks with different boost values.

Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks.
Moreover, this interface can be easily integrated by user-space run-times
(e.g. Android, ChromeOS) to implement a QoS solution for task boosting based
on task classification, which has been a long-standing requirement.

Setup and usage
---------------

0. Use a kernel with CONFIG_SCHED_TUNE support enabled

1. Check that the "schedtune" CGroup controller is available:

   root@linaro-nano:~# cat /proc/cgroups
   #subsys_name	hierarchy	num_cgroups	enabled
   cpuset	0	1	1
   cpu	0	1	1
   schedtune	0	1	1

2. Mount a tmpfs to create the CGroups mount point (Optional)

   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup

3. Mount the "schedtune" controller

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune

4. Create task groups and configure their specific boost value (Optional)

   For example, here we create a "performance" boost group configured to
   boost all its tasks to 100%:

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost

5. Move tasks into the boost group

   For example, the following moves the task with PID $TASKPID (and all its
   threads) into the "performance" boost group:

   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs

This simple configuration allows only the threads of the $TASKPID task to
run, when needed, at the highest OPP on the most capable CPU of the system.

6. Per-task wakeup-placement-strategy Selection
===============================================

Many devices have a number of CFS tasks in use which require an absolute
minimum wakeup latency, and many tasks for which wakeup latency is not
important.

For touch-driven environments, removing additional wakeup latency can be
critical.

When you use the SchedTune CGroup controller, you have access to a second
parameter which allows a group to be marked such that energy-aware task
placement is bypassed for tasks belonging to that group.

  prefer_idle=0 (default - use energy-aware task placement if available)
  prefer_idle=1 (never use energy-aware task placement for these tasks)

Since the regular wakeup task placement algorithm in CFS is biased for
performance, this has the effect of restoring minimum wakeup latency for the
desired tasks whilst still allowing energy-aware wakeup placement to save
energy for other tasks.

7. Questions and Answers
========================

What about "auto" mode?
-----------------------

The 'auto' mode as described in [5] can be implemented by interfacing
SchedTune with a suitable user-space element. This element could use the
exposed system-wide or cgroup-based interface.

How are multiple groups of tasks with different boost values managed?
---------------------------------------------------------------------

The current SchedTune implementation keeps track of the boosted RUNNABLE
tasks on a CPU. The CPU utilization seen by schedutil (and used to select an
appropriate OPP) is boosted with a value which is the maximum of the boost
values of the currently RUNNABLE tasks in its RQ.

This allows cpufreq to boost a CPU only while there are boosted tasks ready
to run, and to switch back to the energy-efficient mode as soon as the last
boosted task is dequeued.

8. References
=============

[1] http://lwn.net/Articles/552889
[2] http://lkml.org/lkml/2012/5/18/91
[3] https://lkml.org/lkml/2016/3/29/1041