Discussion:
[PATCH v10 00/12] support "task_isolation" mode
Chris Metcalf
2016-03-02 20:10:09 UTC
Here is the latest version of the task-isolation patch set, adopting
various suggestions made about the v9 patch series, including some
feedback from testing on the new EZchip NPS ARC platform (although the
new arch/arc support is not included in this patch series). All of the
suggestions were relatively non-controversial.

Perhaps we are getting close to being able to merge this. :-)

Changes since v9:

- task_isolation is now set up by adding its required cpus to both the
nohz_full and isolcpus cpumasks. This allows users to separately
specify all three flags, if so desired, and still get reasonably
sane semantics. This is done with a new tick_nohz_full_add_cpus()
method for nohz, and just directly updating the isolcpus cpumask.

- We add a /sys/devices/system/cpu/task_isolation file to parallel
the equivalent nohz_full file. (This should have been in v8; once
task_isolation is no longer equivalent to nohz_full, it needs its
own way to let userspace know where to run.)

- We add a new Kconfig option, TASK_ISOLATION_ALL, which sets all but
the boot processor to run in task isolation mode. This parallels
the existing NO_HZ_FULL_ALL and works around the fact that you can't
easily specify a boot argument with the desired semantics.

- For task_isolation_debug, we add a check of the context_tracking
state of the remote cpu before issuing a warning; if the remote cpu
is actually in the kernel, we don't need to warn.

- A cloned child of a task_isolation task is not enabled for
task_isolation, since otherwise they would both fight over who could
safely return to userspace without requiring scheduling interrupts.

- The quiet_vmstat() function's semantics were changed since the v9
patch series, so I introduce a quiet_vmstat_sync() for isolation.

- The lru_add_drain_needed() function is updated to know about the new
lru_deactivate_pvecs variable.

- The arm64 patch factoring assembly into C has been modified based
on an earlier patch by Mark Rutland.

- I simplified the enabling patch for arm64 by realizing we could just
test TIF_NOHZ as the only bit for TIF_WORK_MASK for task isolation,
so I didn't have to renumber all the TIF_xxx bits.

- Small fixes to avoid preemption warnings.

- Rebased on v4.5-rc5

For changes in earlier versions of the patch series, please see:

http://lkml.kernel.org/r/1451936091-29247-1-git-send-email-***@ezchip.com

A couple of the tile patches that refactored the context tracking
code were taken into 4.5 so are no longer present in this series.

This version of the patch series has been tested on arm64 and tile,
and build-tested on x86.

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.

The series is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (12):
vmstat: add quiet_vmstat_sync function
vmstat: add vmstat_idle function
lru_add_drain_all: factor out lru_add_drain_needed
task_isolation: add initial support
task_isolation: support CONFIG_TASK_ISOLATION_ALL
task_isolation: support PR_TASK_ISOLATION_STRICT mode
task_isolation: add debug boot flag
arm, tile: turn off timer tick for oneshot_stopped state
arch/x86: enable task isolation functionality
arch/tile: enable task isolation functionality
arm64: factor work_pending state machine to C
arch/arm64: enable task isolation functionality

Documentation/kernel-parameters.txt | 16 ++
arch/arm64/include/asm/thread_info.h | 8 +-
arch/arm64/kernel/entry.S | 12 +-
arch/arm64/kernel/ptrace.c | 12 +-
arch/arm64/kernel/signal.c | 34 ++++-
arch/arm64/kernel/smp.c | 2 +
arch/arm64/mm/fault.c | 4 +
arch/tile/kernel/process.c | 6 +-
arch/tile/kernel/ptrace.c | 6 +
arch/tile/kernel/single_step.c | 5 +
arch/tile/kernel/smp.c | 28 ++--
arch/tile/kernel/time.c | 1 +
arch/tile/kernel/unaligned.c | 3 +
arch/tile/mm/fault.c | 3 +
arch/tile/mm/homecache.c | 2 +
arch/x86/entry/common.c | 18 ++-
arch/x86/kernel/traps.c | 2 +
arch/x86/mm/fault.c | 2 +
drivers/base/cpu.c | 18 +++
drivers/clocksource/arm_arch_timer.c | 2 +
include/linux/context_tracking_state.h | 6 +
include/linux/isolation.h | 83 +++++++++++
include/linux/sched.h | 3 +
include/linux/swap.h | 1 +
include/linux/tick.h | 1 +
include/linux/vmstat.h | 4 +
include/uapi/linux/prctl.h | 8 +
init/Kconfig | 30 ++++
kernel/Makefile | 1 +
kernel/fork.c | 3 +
kernel/irq_work.c | 5 +-
kernel/isolation.c | 261 +++++++++++++++++++++++++++++++++
kernel/sched/core.c | 18 +++
kernel/signal.c | 5 +
kernel/smp.c | 6 +-
kernel/softirq.c | 33 +++++
kernel/sys.c | 9 ++
kernel/time/tick-sched.c | 31 ++--
mm/swap.c | 15 +-
mm/vmstat.c | 24 +++
40 files changed, 676 insertions(+), 55 deletions(-)
create mode 100644 include/linux/isolation.h
create mode 100644 kernel/isolation.c
--
2.1.2
Chris Metcalf
2016-03-02 20:10:16 UTC
In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned. For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function. Add a
quiet_vmstat_sync() function with that semantic.

Signed-off-by: Chris Metcalf <***@ezchip.com>
---
include/linux/vmstat.h | 2 ++
mm/vmstat.c | 9 +++++++++
2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 73fae8c4a5fb..43b2f1c33266 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -190,6 +190,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
extern void __dec_zone_state(struct zone *, enum zone_stat_item);

void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
void cpu_vm_stats_fold(int cpu);
void refresh_zone_stat_thresholds(void);

@@ -251,6 +252,7 @@ static inline void __dec_zone_page_state(struct page *page,
static inline void refresh_zone_stat_thresholds(void) { }
static inline void cpu_vm_stats_fold(int cpu) { }
static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }

static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 084c6725b373..ff569246fcb5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1469,6 +1469,15 @@ void quiet_vmstat(void)
refresh_cpu_vm_stats(false);
}

+/*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+ cpumask_set_cpu(smp_processor_id(), cpu_stat_off);
+ cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+ refresh_cpu_vm_stats(false);
+}

/*
* Shepherd worker thread that checks the
--
2.1.2
Chris Metcalf
2016-03-02 20:10:32 UTC
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well. The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flag, to the value
passed by prctl(). When the _ENABLE bit is set for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

The task_isolation_ready() call plays an equivalent role to the
TIF_xxx flags when returning to userspace, and should be checked
in the loop check of the prepare_exit_to_usermode() routine or its
architecture equivalent. It is called with interrupts disabled and
inspects the kernel state to determine if it is safe to return into
an isolated state. In particular, if it sees that the scheduler
tick is still enabled, it reports that it is not yet safe.

Each time through the loop of TIF work to do, we call the new
task_isolation_enter() routine, which takes any actions that might
avoid a future interrupt to the core, such as a worker thread
being scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache). In addition, it
requests rescheduling if the scheduler dyntick is still running.

As a result of these tests on the "return to userspace" path, system
calls (and page faults, etc.) can be inordinately slow. However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Separate patches that follow provide these changes for x86, arm64,
and tile.

Signed-off-by: Chris Metcalf <***@ezchip.com>
---
Documentation/kernel-parameters.txt | 8 +++
drivers/base/cpu.c | 18 ++++++
include/linux/isolation.h | 53 ++++++++++++++++
include/linux/sched.h | 3 +
include/linux/tick.h | 1 +
include/uapi/linux/prctl.h | 5 ++
init/Kconfig | 20 ++++++
kernel/Makefile | 1 +
kernel/fork.c | 3 +
kernel/isolation.c | 118 ++++++++++++++++++++++++++++++++++++
kernel/sys.c | 9 +++
kernel/time/tick-sched.c | 31 ++++++----
12 files changed, 257 insertions(+), 13 deletions(-)
create mode 100644 include/linux/isolation.h
create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9a53c929f017..c8d0b42d984a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3747,6 +3747,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.

+ task_isolation= [KNL]
+ In kernels built with CONFIG_TASK_ISOLATION=y, specify
+ the list of CPUs on which tasks will be able to use
+ prctl(PR_SET_TASK_ISOLATION) to enter task isolation
+ mode. Setting this boot flag implicitly also enables
+ nohz_full and isolcpus mode for the listed
+ set of CPUs.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..eaf40f4264ee 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -17,6 +17,7 @@
#include <linux/of.h>
#include <linux/cpufeature.h>
#include <linux/tick.h>
+#include <linux/isolation.h>

#include "base.h"

@@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
#endif

+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t print_cpus_task_isolation(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ int n = 0, len = PAGE_SIZE-2;
+
+ n = scnprintf(buf, len, "%*pbl\n", cpumask_pr_args(task_isolation_map));
+
+ return n;
+}
+static DEVICE_ATTR(task_isolation, 0444, print_cpus_task_isolation, NULL);
+#endif
+
static void cpu_device_release(struct device *dev)
{
/*
@@ -460,6 +475,9 @@ static struct attribute *cpu_root_attrs[] = {
#ifdef CONFIG_NO_HZ_FULL
&dev_attr_nohz_full.attr,
#endif
+#ifdef CONFIG_TASK_ISOLATION
+ &dev_attr_task_isolation.attr,
+#endif
#ifdef CONFIG_GENERIC_CPU_AUTOPROBE
&dev_attr_modalias.attr,
#endif
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..c564cf1886bb
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,53 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+extern int task_isolation_init(void);
+
+static inline bool task_isolation_possible(int cpu)
+{
+ return tick_nohz_full_enabled() &&
+ cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+static inline bool task_isolation_enabled(void)
+{
+ return (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE) &&
+ task_isolation_possible(raw_smp_processor_id());
+}
+
+extern bool _task_isolation_ready(void);
+extern void _task_isolation_enter(void);
+
+static inline bool task_isolation_ready(void)
+{
+ return !task_isolation_enabled() || _task_isolation_ready();
+}
+
+static inline void task_isolation_enter(void)
+{
+ if (task_isolation_enabled())
+ _task_isolation_enter();
+}
+
+#else
+static inline void task_isolation_init(void) { }
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_enabled(void) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a94cc3..df6eb22510c3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1830,6 +1830,9 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+#ifdef CONFIG_TASK_ISOLATION
+ unsigned int task_isolation_flags;
+#endif
/* CPU-specific state of this task */
struct thread_struct thread;
/*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 97fd4e543846..b4c0f4dc909f 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -158,6 +158,7 @@ extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(void);
+extern void tick_nohz_full_add_cpus(const struct cpumask *mask);
#else
static inline int housekeeping_any_cpu(void)
{
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..67224df4b559 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
# define PR_CAP_AMBIENT_LOWER 3
# define PR_CAP_AMBIENT_CLEAR_ALL 4

+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION 48
+#define PR_GET_TASK_ISOLATION 49
+# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 22320804fbaf..6cab348fe454 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -782,6 +782,26 @@ config RCU_EXPEDITE_BOOT

endmenu # "RCU Subsystem"

+config TASK_ISOLATION
+ bool "Provide hard CPU isolation from the kernel on demand"
+ depends on NO_HZ_FULL
+ help
+ Allow userspace processes to place themselves on task_isolation
+ cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+ themselves from the kernel. On return to userspace,
+ isolated tasks will first arrange that no future kernel
+ activity will interrupt the task while the task is running
+ in userspace. This "hard" isolation from the kernel is
+ required for tasks with hard real-time requirements in
+ userspace, such as a 10 Gbit network driver.
+
+ Without this option, but with NO_HZ_FULL enabled, the kernel
+ will make a best-effort, "soft" attempt to shield a single userspace
+ process from interrupts, but makes no guarantees.
+
+ You should say "N" unless you are intending to run a
+ high-performance userspace driver or similar task.
+
config BUILD_BIN2C
bool
default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf008ecb3..693a2ba35679 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
obj-$(CONFIG_MEMBARRIER) += membarrier.o

obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o

$(obj)/configs.o: $(obj)/config_data.h

diff --git a/kernel/fork.c b/kernel/fork.c
index 2e391c754ae7..1890ea0dcd5b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1413,6 +1413,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->sequential_io = 0;
p->sequential_io_avg = 0;
#endif
+#ifdef CONFIG_TASK_ISOLATION
+ p->task_isolation_flags = 0; /* do not isolate */
+#endif

/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..e954afd8cce8
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,118 @@
+/*
+ * linux/kernel/isolation.c
+ *
+ * Implementation for task isolation.
+ *
+ * Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+static bool saw_boot_arg;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed. Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+ saw_boot_arg = true;
+
+ alloc_bootmem_cpumask_var(&task_isolation_map);
+ if (cpulist_parse(str, task_isolation_map) < 0) {
+ pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+ return 1;
+ }
+
+ return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+int __init task_isolation_init(void)
+{
+ /* For offstack cpumask, ensure we allocate an empty cpumask early. */
+ if (!saw_boot_arg) {
+ zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+ return 0;
+ }
+
+ /*
+ * Add our task_isolation cpus to nohz_full and isolcpus. Note
+ * that we are called relatively early in boot, from tick_init();
+ * at this point neither nohz_full nor isolcpus has been used
+ * to configure the system, but isolcpus has been allocated
+ * already in sched_init().
+ */
+ tick_nohz_full_add_cpus(task_isolation_map);
+ cpumask_or(cpu_isolated_map, cpu_isolated_map, task_isolation_map);
+
+ return 0;
+}
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core or we will
+ * return EINVAL. Although the application could later re-affinitize
+ * to a housekeeping core and lose task isolation semantics, this
+ * initial test should catch 99% of bugs with task placement prior to
+ * enabling task isolation.
+ */
+int task_isolation_set(unsigned int flags)
+{
+ if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+ !task_isolation_possible(raw_smp_processor_id()))
+ return -EINVAL;
+
+ current->task_isolation_flags = flags;
+ return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again. This test
+ * is run with interrupts disabled to test that everything we need
+ * to be true is true before we can return to userspace.
+ */
+bool _task_isolation_ready(void)
+{
+ WARN_ON_ONCE(!irqs_disabled());
+
+ return (!lru_add_drain_needed(smp_processor_id()) &&
+ vmstat_idle() &&
+ tick_nohz_tick_stopped());
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void _task_isolation_enter(void)
+{
+ WARN_ON_ONCE(irqs_disabled());
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ /* Quieten the vmstat worker so it won't interrupt us. */
+ quiet_vmstat_sync();
+
+ /*
+ * Request rescheduling unless we are in full dynticks mode.
+ * We would eventually get pre-empted without this, and if there's
+ * another task waiting, it would run; but by explicitly requesting
+ * the reschedule, we reduce the latency. We could directly call
+ * schedule() here as well, but since our caller is the standard
+ * place where schedule() is called, we defer to the caller.
+ */
+ if (!tick_nohz_tick_stopped())
+ set_tsk_need_resched(current);
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 78947de6f969..c1d621f1137e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
#include <linux/syscore_ops.h>
#include <linux/version.h>
#include <linux/ctype.h>
+#include <linux/isolation.h>

#include <linux/compat.h>
#include <linux/syscalls.h>
@@ -2266,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_TASK_ISOLATION
+ case PR_SET_TASK_ISOLATION:
+ error = task_isolation_set(arg2);
+ break;
+ case PR_GET_TASK_ISOLATION:
+ error = me->task_isolation_flags;
+ break;
+#endif
default:
error = -EINVAL;
break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 0b17424349eb..14eb7dd06f26 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
#include <linux/posix-timers.h>
#include <linux/perf_event.h>
#include <linux/context_tracking.h>
+#include <linux/isolation.h>

#include <asm/irq_regs.h>

@@ -311,30 +312,34 @@ static int tick_nohz_cpu_down_callback(struct notifier_block *nfb,
return NOTIFY_OK;
}

-static int tick_nohz_init_all(void)
+void tick_nohz_full_add_cpus(const struct cpumask *mask)
{
- int err = -1;
+ if (!cpumask_weight(mask))
+ return;

-#ifdef CONFIG_NO_HZ_FULL_ALL
- if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
+ if (tick_nohz_full_mask == NULL &&
+ !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
- return err;
+ return;
}
- err = 0;
- cpumask_setall(tick_nohz_full_mask);
+
+ cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, mask);
tick_nohz_full_running = true;
-#endif
- return err;
}

void __init tick_nohz_init(void)
{
int cpu;

- if (!tick_nohz_full_running) {
- if (tick_nohz_init_all() < 0)
- return;
- }
+ task_isolation_init();
+
+#ifdef CONFIG_NO_HZ_FULL_ALL
+ if (!tick_nohz_full_running)
+ tick_nohz_full_add_cpus(cpu_possible_mask);
+#endif
+
+ if (!tick_nohz_full_running)
+ return;

if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
--
2.1.2
Chris Metcalf
2016-03-02 20:10:37 UTC
With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves. In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies. Add a simple flag that puts the process into
a state where any such kernel entry is fatal; this is defined as
happening immediately before the SECCOMP test.

By default, the task is signalled with SIGKILL, but we add prctl()
bits to support requesting a specific signal instead.

To allow the state to be entered and exited, the prctl() syscall is
exempted so that the bit can be cleared again later, and exit/exit_group
are exempted so that a task can exit without a pointless signal being
delivered as it does so.

Signed-off-by: Chris Metcalf <***@ezchip.com>
---
include/linux/isolation.h | 25 +++++++++++++++++++
include/uapi/linux/prctl.h | 3 +++
kernel/isolation.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 88 insertions(+)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index c564cf1886bb..ba6c4d510db8 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -42,12 +42,37 @@ static inline void task_isolation_enter(void)
_task_isolation_enter();
}

+extern bool task_isolation_syscall(int nr);
+extern void task_isolation_exception(const char *fmt, ...);
+extern void task_isolation_interrupt(struct task_struct *, const char *buf);
+
+static inline bool task_isolation_strict(void)
+{
+ return ((current->task_isolation_flags &
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) &&
+ task_isolation_possible(raw_smp_processor_id());
+}
+
+static inline bool task_isolation_check_syscall(int nr)
+{
+ return task_isolation_strict() && task_isolation_syscall(nr);
+}
+
+#define task_isolation_check_exception(fmt, ...) \
+ do { \
+ if (task_isolation_strict()) \
+ task_isolation_exception(fmt, ## __VA_ARGS__); \
+ } while (0)
+
#else
static inline void task_isolation_init(void) { }
static inline bool task_isolation_possible(int cpu) { return false; }
static inline bool task_isolation_enabled(void) { return false; }
static inline bool task_isolation_ready(void) { return true; }
static inline void task_isolation_enter(void) { }
+static inline bool task_isolation_check_syscall(int nr) { return false; }
+static inline void task_isolation_check_exception(const char *fmt, ...) { }
#endif

#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,8 @@ struct prctl_mm_map {
#define PR_SET_TASK_ISOLATION 48
#define PR_GET_TASK_ISOLATION 49
# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+# define PR_TASK_ISOLATION_STRICT (1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)

#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 42ad7a746a1e..5621fdf15b17 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
#include <linux/vmstat.h>
#include <linux/isolation.h>
#include <linux/syscalls.h>
+#include <asm/unistd.h>
#include "time/tick-sched.h"

cpumask_var_t task_isolation_map;
@@ -122,3 +123,62 @@ void _task_isolation_enter(void)
if (!tick_nohz_tick_stopped())
set_tsk_need_resched(current);
}
+
+void task_isolation_interrupt(struct task_struct *task, const char *buf)
+{
+ siginfo_t info = {};
+ int sig;
+
+ pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
+ task->comm, task->pid, buf);
+
+ sig = PR_TASK_ISOLATION_GET_SIG(task->task_isolation_flags);
+ if (sig == 0)
+ sig = SIGKILL;
+
+ /*
+ * Turn off task isolation mode entirely to avoid spamming
+ * the process with signals. It can re-enable task isolation
+ * mode in the signal handler if it wants to.
+ */
+ task->task_isolation_flags = 0;
+ info.si_signo = sig;
+ send_sig_info(sig, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(const char *fmt, ...)
+{
+ va_list args;
+ char buf[100];
+
+ /* RCU should have been enabled prior to this point. */
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+
+ task_isolation_interrupt(current, buf);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+bool task_isolation_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return false;
+ }
+
+ task_isolation_exception("syscall %d", syscall);
+ return true;
+}
--
2.1.2
Chris Metcalf
2016-03-02 20:10:44 UTC
When the schedule tick is disabled in tick_nohz_stop_sched_tick(),
we call hrtimer_cancel(), which eventually calls down into
__remove_hrtimer() and thus into hrtimer_force_reprogram().
That function's call to tick_program_event() detects that
we are trying to set the expiration to KTIME_MAX and calls
clockevents_switch_state() to set the state to ONESHOT_STOPPED,
and returns. See commit 8fff52fd5093 ("clockevents: Introduce
CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background.

However, by default the internal __clockevents_switch_state() code
doesn't have a "set_state_oneshot_stopped" function pointer for
the arm_arch_timer or tile clock_event_device structures, so that
code returns -ENOSYS, and we end up not setting the state, and more
importantly, we don't actually turn off the hardware timer.
As a result, the timer tick we were waiting for before is still
queued, and fires shortly afterwards, only to discover there was
nothing for it to do, at which point it quiesces.

The fix is to provide that function pointer field, and like the
other function pointers, have it just turn off the timer interrupt.
Any call to set a new timer interval will properly re-enable it.

This fix avoids a small performance hiccup for regular applications,
but for TASK_ISOLATION code, it fixes a potentially serious
kernel timer interruption to the time-sensitive application.

Signed-off-by: Chris Metcalf <***@ezchip.com>
---
arch/tile/kernel/time.c | 1 +
drivers/clocksource/arm_arch_timer.c | 2 ++
2 files changed, 3 insertions(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 178989e6d3e3..fbedf380d9d4 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -159,6 +159,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
.set_next_event = tile_timer_set_next_event,
.set_state_shutdown = tile_timer_shutdown,
.set_state_oneshot = tile_timer_shutdown,
+ .set_state_oneshot_stopped = tile_timer_shutdown,
.tick_resume = tile_timer_shutdown,
};

diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index c64d543d64bf..727a669afb1f 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -288,6 +288,8 @@ static void __arch_timer_setup(unsigned type,
}
}

+ clk->set_state_oneshot_stopped = clk->set_state_shutdown;
+
clk->set_state_shutdown(clk);

clockevents_config_and_register(clk, arch_timer_rate, 0xf, 0x7fffffff);
--
2.1.2
Chris Metcalf
2016-03-02 20:10:48 UTC
In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.

In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.

We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <***@ezchip.com>
---
arch/x86/entry/common.c | 18 ++++++++++++++++--
arch/x86/kernel/traps.c | 2 ++
arch/x86/mm/fault.c | 2 ++
3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 03663740c866..27c71165416b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
#include <linux/context_tracking.h>
#include <linux/user-return-notifier.h>
#include <linux/uprobes.h>
+#include <linux/isolation.h>

#include <asm/desc.h>
#include <asm/traps.h>
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
*/
if (work & _TIF_NOHZ) {
enter_from_user_mode();
+ if (task_isolation_check_syscall(regs->orig_ax)) {
+ regs->orig_ax = -1;
+ return 0;
+ }
work &= ~_TIF_NOHZ;
}
#endif
@@ -254,17 +259,26 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
if (cached_flags & _TIF_USER_RETURN_NOTIFY)
fire_user_return_notifiers();

+ task_isolation_enter();
+
/* Disable IRQs and retry */
local_irq_disable();

cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);

- if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+ if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
+ task_isolation_ready())
break;

}
}

+#ifdef CONFIG_TASK_ISOLATION
+# define EXIT_TO_USERMODE_FLAGS (EXIT_TO_USERMODE_LOOP_FLAGS | _TIF_NOHZ)
+#else
+# define EXIT_TO_USERMODE_FLAGS EXIT_TO_USERMODE_LOOP_FLAGS
+#endif
+
/* Called with IRQs disabled. */
__visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
{
@@ -278,7 +292,7 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
cached_flags =
READ_ONCE(pt_regs_to_thread_info(regs)->flags);

- if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+ if (unlikely(cached_flags & EXIT_TO_USERMODE_FLAGS))
exit_to_usermode_loop(regs, cached_flags);

user_enter();
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ade185a46b1d..82bf53ec1e98 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -36,6 +36,7 @@
#include <linux/mm.h>
#include <linux/smp.h>
#include <linux/io.h>
+#include <linux/isolation.h>

#ifdef CONFIG_EISA
#include <linux/ioport.h>
@@ -398,6 +399,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
case 2: /* Bound directory has invalid entry. */
if (mpx_handle_bd_fault())
goto exit_trap;
+ task_isolation_check_exception("bounds check");
break; /* Success, it was handled */
case 1: /* Bound violation. */
info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e830c71a1323..e2b42bc79d81 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
#include <linux/prefetch.h> /* prefetchw */
#include <linux/context_tracking.h> /* exception_enter(), ... */
#include <linux/uaccess.h> /* faulthandler_disabled() */
+#include <linux/isolation.h> /* task_isolation_check_exception */

#include <asm/traps.h> /* dotraplinkage, ... */
#include <asm/pgalloc.h> /* pgd_*(), ... */
@@ -1155,6 +1156,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
local_irq_enable();
error_code |= PF_USER;
flags |= FAULT_FLAG_USER;
+ task_isolation_check_exception("page fault at %#lx", address);
} else {
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
--
2.1.2
Andy Lutomirski
2016-03-03 00:37:24 UTC
Post by Chris Metcalf
In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.
In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.
We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.
---
arch/x86/entry/common.c | 18 ++++++++++++++++--
arch/x86/kernel/traps.c | 2 ++
arch/x86/mm/fault.c | 2 ++
3 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 03663740c866..27c71165416b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
#include <linux/context_tracking.h>
#include <linux/user-return-notifier.h>
#include <linux/uprobes.h>
+#include <linux/isolation.h>
#include <asm/desc.h>
#include <asm/traps.h>
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
*/
if (work & _TIF_NOHZ) {
enter_from_user_mode();
+ if (task_isolation_check_syscall(regs->orig_ax)) {
+ regs->orig_ax = -1;
+ return 0;
+ }
This needs a comment indicating the intended semantics.

And I've still heard no explanation of why this part can't use seccomp.
Post by Chris Metcalf
work &= ~_TIF_NOHZ;
}
#endif
@@ -254,17 +259,26 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
if (cached_flags & _TIF_USER_RETURN_NOTIFY)
fire_user_return_notifiers();
+ task_isolation_enter();
+
/* Disable IRQs and retry */
local_irq_disable();
cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
- if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+ if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
+ task_isolation_ready())
break;
}
}
+#ifdef CONFIG_TASK_ISOLATION
+# define EXIT_TO_USERMODE_FLAGS (EXIT_TO_USERMODE_LOOP_FLAGS | _TIF_NOHZ)
+#else
+# define EXIT_TO_USERMODE_FLAGS EXIT_TO_USERMODE_LOOP_FLAGS
+#endif
+
TIF_NOHZ is already a hack and IMO this just makes it worse. At the
very least this should have a comment. It really ought to be either a
static_branch or a flag that's actually switched per cpu.

But this is also a confusing change. Don't override the definition
here -- stick it in the header where it belongs.
Chris Metcalf
2016-03-03 19:52:38 UTC
Post by Andy Lutomirski
Post by Chris Metcalf
In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.
In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.
We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.
---
arch/x86/entry/common.c | 18 ++++++++++++++++--
arch/x86/kernel/traps.c | 2 ++
arch/x86/mm/fault.c | 2 ++
3 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 03663740c866..27c71165416b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
#include <linux/context_tracking.h>
#include <linux/user-return-notifier.h>
#include <linux/uprobes.h>
+#include <linux/isolation.h>
#include <asm/desc.h>
#include <asm/traps.h>
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
*/
if (work & _TIF_NOHZ) {
enter_from_user_mode();
+ if (task_isolation_check_syscall(regs->orig_ax)) {
+ regs->orig_ax = -1;
+ return 0;
+ }
This needs a comment indicating the intended semantics.
And I've still heard no explanation of why this part can't use seccomp.
Here's an excerpt from my earlier reply to you from:

https://lkml.kernel.org/r/***@ezchip.com

Admittedly this patch series has been moving very slowly through
review, so it's not surprising we have to revisit some things!
Post by Andy Lutomirski
Post by Chris Metcalf
If a user wants a syscall to kill them, use
seccomp. The kernel isn't at fault if the user does a syscall when it
didn't want to enter the kernel.
Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT
was to what I wanted here. One concern is that there doesn't seem
to be a way to "escape" from seccomp strict mode, i.e. you can't
call seccomp() again to turn it off - which makes sense for seccomp
since it's a security issue, but not so much sense with cpu_isolated.
So, do you think there's a good role for the seccomp() API to play
in achieving this goal? It's certainly not a question of "the kernel at
fault" but rather "asking the kernel to help catch user mistakes"
(typically third-party libraries in our customers' experience). You
could imagine a SECCOMP_SET_MODE_ISOLATED or something.
Alternatively, we could stick with the API proposed in my patch
series, or something similar, and just try to piggy-back on the seccomp
internals to make it happen. It would require Kconfig to ensure
that SECCOMP was enabled though, which obviously isn't currently
required to do cpu isolation.
On looking at this again just now, one thing that strikes me is that
it may not be necessary to forbid the syscall like seccomp does.
It may be sufficient just to trigger the task isolation strict signal
and then allow the syscall to complete. After all, we don't "fail"
any of the other things that upset strict mode, like page faults; we
let them complete, but add a signal. So for consistency, I think it
may in fact make sense to simply trigger the signal but let the
syscall do its thing. After all, perhaps the signal is handled
and logged and we don't mind having the application continue; the
signal handler can certainly choose to fail hard, or in the usual
case of no signal handler, that kills the task just fine too.
Allowing the syscall to complete is really kind of incidental.

After the changes proposed lower down in this email, this call
site will end up looking like:

/* Generate a task isolation strict signal if necessary. */
if ((work & _TIF_TASK_ISOLATION) && task_isolation_strict())
task_isolation_syscall(regs->orig_ax);

But do let me know if you think more specifically that there is
a way seccomp can be helpful.
Post by Andy Lutomirski
Post by Chris Metcalf
work &= ~_TIF_NOHZ;
}
#endif
@@ -254,17 +259,26 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
if (cached_flags & _TIF_USER_RETURN_NOTIFY)
fire_user_return_notifiers();
+ task_isolation_enter();
+
/* Disable IRQs and retry */
local_irq_disable();
cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
- if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+ if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
+ task_isolation_ready())
break;
}
}
+#ifdef CONFIG_TASK_ISOLATION
+# define EXIT_TO_USERMODE_FLAGS (EXIT_TO_USERMODE_LOOP_FLAGS | _TIF_NOHZ)
+#else
+# define EXIT_TO_USERMODE_FLAGS EXIT_TO_USERMODE_LOOP_FLAGS
+#endif
+
TIF_NOHZ is already a hack and IMO this just makes it worse. At the
very least this should have a comment. It really ought to be either a
static_branch or a flag that's actually switched per cpu.
But this is also a confusing change. Don't override the definition
here -- stick it in the header where it belongs.
The EXIT_TO_USERMODE_xxx stuff is only defined in common.c, not in a header.

The more I look at this, though, the more I think it would be cleanest
to add a new flag _TIF_TASK_ISOLATION that is set just for tasks that have
enabled task isolation. That can be used unconditionally to check to see
if we need to call exit_to_usermode_loop() from prepare_exit_to_usermode(),
and also means that non-task-isolation tasks don't need to go into the
exit loop unconditionally, which is obviously a performance drag for them.

So I will move the EXIT_TO_USERMODE_FLAGS definition to immediately follow
the EXIT_TO_USERMODE_LOOP_FLAGS definition and write it this way:

/*
* Task-isolated tasks have to enter exit_to_usermode_loop unconditionally,
* but the state isn't cleared in the loop.
*/
#define EXIT_TO_USERMODE_FLAGS \
(EXIT_TO_USERMODE_LOOP_FLAGS | _TIF_TASK_ISOLATION)

Thanks for the feedback!
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Andy Lutomirski
2016-03-03 23:47:01 UTC
Post by Chris Metcalf
Post by Andy Lutomirski
Post by Chris Metcalf
In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.
In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.
We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.
---
arch/x86/entry/common.c | 18 ++++++++++++++++--
arch/x86/kernel/traps.c | 2 ++
arch/x86/mm/fault.c | 2 ++
3 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 03663740c866..27c71165416b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
#include <linux/context_tracking.h>
#include <linux/user-return-notifier.h>
#include <linux/uprobes.h>
+#include <linux/isolation.h>
#include <asm/desc.h>
#include <asm/traps.h>
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct
pt_regs *regs, u32 arch)
*/
if (work & _TIF_NOHZ) {
enter_from_user_mode();
+ if (task_isolation_check_syscall(regs->orig_ax)) {
+ regs->orig_ax = -1;
+ return 0;
+ }
This needs a comment indicating the intended semantics.
And I've still heard no explanation of why this part can't use seccomp.
Admittedly this patch series has been moving very slowly through
review, so it's not surprising we have to revisit some things!
Post by Andy Lutomirski
Post by Chris Metcalf
If a user wants a syscall to kill them, use
seccomp. The kernel isn't at fault if the user does a syscall when it
didn't want to enter the kernel.
Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT
was to what I wanted here. One concern is that there doesn't seem
to be a way to "escape" from seccomp strict mode, i.e. you can't
call seccomp() again to turn it off - which makes sense for seccomp
since it's a security issue, but not so much sense with cpu_isolated.
So, do you think there's a good role for the seccomp() API to play
in achieving this goal? It's certainly not a question of "the kernel at
fault" but rather "asking the kernel to help catch user mistakes"
(typically third-party libraries in our customers' experience). You
could imagine a SECCOMP_SET_MODE_ISOLATED or something.
Alternatively, we could stick with the API proposed in my patch
series, or something similar, and just try to piggy-back on the seccomp
internals to make it happen. It would require Kconfig to ensure
that SECCOMP was enabled though, which obviously isn't currently
required to do cpu isolation.
On looking at this again just now, one thing that strikes me is that
it may not be necessary to forbid the syscall like seccomp does.
It may be sufficient just to trigger the task isolation strict signal
and then allow the syscall to complete. After all, we don't "fail"
any of the other things that upset strict mode, like page faults; we
let them complete, but add a signal. So for consistency, I think it
may in fact make sense to simply trigger the signal but let the
syscall do its thing. After all, perhaps the signal is handled
and logged and we don't mind having the application continue; the
signal handler can certainly choose to fail hard, or in the usual
case of no signal handler, that kills the task just fine too.
Allowing the syscall to complete is really kind of incidental.
No, don't do that. First, if you have a signal pending, a lot of
syscalls will abort with -EINTR. Second, if you fire a signal on
entry via sigreturn, you're not going to like the results.
Post by Chris Metcalf
After the changes proposed lower down in this email, this call site will end up looking like:
/* Generate a task isolation strict signal if necessary. */
if ((work & _TIF_TASK_ISOLATION) && task_isolation_strict())
task_isolation_syscall(regs->orig_ax);
But do let me know if you think more specifically that there is
a way seccomp can be helpful.
Let task isolation users who want to detect when they screw up and do
a syscall do it with seccomp.
Post by Chris Metcalf
Post by Andy Lutomirski
Post by Chris Metcalf
work &= ~_TIF_NOHZ;
}
#endif
@@ -254,17 +259,26 @@ static void exit_to_usermode_loop(struct pt_regs
*regs, u32 cached_flags)
if (cached_flags & _TIF_USER_RETURN_NOTIFY)
fire_user_return_notifiers();
+ task_isolation_enter();
+
/* Disable IRQs and retry */
local_irq_disable();
cached_flags =
READ_ONCE(pt_regs_to_thread_info(regs)->flags);
- if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+ if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
+ task_isolation_ready())
break;
}
}
+#ifdef CONFIG_TASK_ISOLATION
+# define EXIT_TO_USERMODE_FLAGS (EXIT_TO_USERMODE_LOOP_FLAGS | _TIF_NOHZ)
+#else
+# define EXIT_TO_USERMODE_FLAGS EXIT_TO_USERMODE_LOOP_FLAGS
+#endif
+
TIF_NOHZ is already a hack and IMO this just makes it worse. At the
very least this should have a comment. It really ought to be either a
static_branch or a flag that's actually switched per cpu.
But this is also a confusing change. Don't override the definition
here -- stick it in the header where it belongs.
The EXIT_TO_USERMODE_xxx stuff is only defined in common.c, not in a header.
The more I look at this, though, the more I think it would be cleanest
to add a new flag _TIF_TASK_ISOLATION that is set just for tasks that have
enabled task isolation. That can be used unconditionally to check to see
if we need to call exit_to_usermode_loop() from prepare_exit_to_usermode(),
and also means that non-task-isolation tasks don't need to go into the
exit loop unconditionally, which is obviously a performance drag for them.
Sounds better to me.

--Andy
Chris Metcalf
2016-03-07 20:51:58 UTC
Post by Andy Lutomirski
Post by Chris Metcalf
Post by Andy Lutomirski
Post by Chris Metcalf
In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.
In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.
[...]
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct
pt_regs *regs, u32 arch)
*/
if (work & _TIF_NOHZ) {
enter_from_user_mode();
+ if (task_isolation_check_syscall(regs->orig_ax)) {
+ regs->orig_ax = -1;
+ return 0;
+ }
This needs a comment indicating the intended semantics.
And I've still heard no explanation of why this part can't use seccomp.
Admittedly this patch series has been moving very slowly through
review, so it's not surprising we have to revisit some things!
Post by Andy Lutomirski
Post by Chris Metcalf
If a user wants a syscall to kill them, use
seccomp. The kernel isn't at fault if the user does a syscall when it
didn't want to enter the kernel.
Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT
was to what I wanted here. One concern is that there doesn't seem
to be a way to "escape" from seccomp strict mode, i.e. you can't
call seccomp() again to turn it off - which makes sense for seccomp
since it's a security issue, but not so much sense with cpu_isolated.
So, do you think there's a good role for the seccomp() API to play
in achieving this goal? It's certainly not a question of "the kernel at
fault" but rather "asking the kernel to help catch user mistakes"
(typically third-party libraries in our customers' experience). You
could imagine a SECCOMP_SET_MODE_ISOLATED or something.
Alternatively, we could stick with the API proposed in my patch
series, or something similar, and just try to piggy-back on the seccomp
internals to make it happen. It would require Kconfig to ensure
that SECCOMP was enabled though, which obviously isn't currently
required to do cpu isolation.
On looking at this again just now, one thing that strikes me is that
it may not be necessary to forbid the syscall like seccomp does.
It may be sufficient just to trigger the task isolation strict signal
and then allow the syscall to complete. After all, we don't "fail"
any of the other things that upset strict mode, like page faults; we
let them complete, but add a signal. So for consistency, I think it
may in fact make sense to simply trigger the signal but let the
syscall do its thing. After all, perhaps the signal is handled
and logged and we don't mind having the application continue; the
signal handler can certainly choose to fail hard, or in the usual
case of no signal handler, that kills the task just fine too.
Allowing the syscall to complete is really kind of incidental.
No, don't do that. First, if you have a signal pending, a lot of
syscalls will abort with -EINTR. Second, if you fire a signal on
entry via sigreturn, you're not going to like the results.
OK, you've convinced me to stick with the previous model of just
forbidding the syscall in this case.
Post by Andy Lutomirski
Let task isolation users who want to detect when they screw up and do
a syscall do it with seccomp.
Can you give me more details on what you're imagining here? Remember
that a key use case is that these applications can remove the syscall
prohibition voluntarily; it's only there to prevent unintended uses
(by third party libraries or just straight-up programming bugs).
As far as I can tell, seccomp does not allow you to go from "less
permissive" to "more permissive" settings at all, which means that as
it exists, it's not a good solution for this use case.

Or were you thinking about a new seccomp API that allows this?

Or were you thinking that I could just use seccomp internals, i.e.
allow the prctl() to set a special SECCOMP_MODE_TASK_ISOLATION
and handle it appropriately in seccomp_phase1(), maybe? But, not
touch the actual seccomp() API?

I'm happy to spec something out, but I'd definitely benefit from some
sense from you as to what you think is the better approach.
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Andy Lutomirski
2016-03-07 20:55:58 UTC
Post by Chris Metcalf
Post by Andy Lutomirski
Post by Chris Metcalf
Post by Andy Lutomirski
Post by Chris Metcalf
In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.
In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.
[...]
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct
pt_regs *regs, u32 arch)
*/
if (work & _TIF_NOHZ) {
enter_from_user_mode();
+ if (task_isolation_check_syscall(regs->orig_ax)) {
+ regs->orig_ax = -1;
+ return 0;
+ }
This needs a comment indicating the intended semantics.
And I've still heard no explanation of why this part can't use seccomp.
Admittedly this patch series has been moving very slowly through
review, so it's not surprising we have to revisit some things!
Post by Andy Lutomirski
Post by Chris Metcalf
If a user wants a syscall to kill them, use
seccomp. The kernel isn't at fault if the user does a syscall when it
didn't want to enter the kernel.
Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT
was to what I wanted here. One concern is that there doesn't seem
to be a way to "escape" from seccomp strict mode, i.e. you can't
call seccomp() again to turn it off - which makes sense for seccomp
since it's a security issue, but not so much sense with cpu_isolated.
So, do you think there's a good role for the seccomp() API to play
in achieving this goal? It's certainly not a question of "the kernel at
fault" but rather "asking the kernel to help catch user mistakes"
(typically third-party libraries in our customers' experience). You
could imagine a SECCOMP_SET_MODE_ISOLATED or something.
Alternatively, we could stick with the API proposed in my patch
series, or something similar, and just try to piggy-back on the seccomp
internals to make it happen. It would require Kconfig to ensure
that SECCOMP was enabled though, which obviously isn't currently
required to do cpu isolation.
On looking at this again just now, one thing that strikes me is that
it may not be necessary to forbid the syscall like seccomp does.
It may be sufficient just to trigger the task isolation strict signal
and then allow the syscall to complete. After all, we don't "fail"
any of the other things that upset strict mode, like page faults; we
let them complete, but add a signal. So for consistency, I think it
may in fact make sense to simply trigger the signal but let the
syscall do its thing. After all, perhaps the signal is handled
and logged and we don't mind having the application continue; the
signal handler can certainly choose to fail hard, or in the usual
case of no signal handler, that kills the task just fine too.
Allowing the syscall to complete is really kind of incidental.
No, don't do that. First, if you have a signal pending, a lot of
syscalls will abort with -EINTR. Second, if you fire a signal on
entry via sigreturn, you're not going to like the results.
OK, you've convinced me to stick with the previous model of just
forbidding the syscall in this case.
Post by Andy Lutomirski
Let task isolation users who want to detect when they screw up and do
a syscall do it with seccomp.
Can you give me more details on what you're imagining here? Remember
that a key use case is that these applications can remove the syscall
prohibition voluntarily; it's only there to prevent unintended uses
(by third party libraries or just straight-up programming bugs).
As far as I can tell, seccomp does not allow you to go from "less
permissive" to "more permissive" settings at all, which means that as
it exists, it's not a good solution for this use case.
Or were you thinking about a new seccomp API that allows this?
I was. This is at least the second time I've wanted a way to ask
seccomp to allow a layer to be removed.
Chris Metcalf
2016-03-08 20:40:59 UTC
Post by Andy Lutomirski
Post by Andy Lutomirski
Let task isolation users who want to detect when they screw up and do
a syscall do it with seccomp.
Can you give me more details on what you're imagining here? Remember
that a key use case is that these applications can remove the syscall
prohibition voluntarily; it's only there to prevent unintended uses
(by third party libraries or just straight-up programming bugs).
As far as I can tell, seccomp does not allow you to go from "less
permissive" to "more permissive" settings at all, which means that as
it exists, it's not a good solution for this use case.
Or were you thinking about a new seccomp API that allows this?
I was. This is at least the second time I've wanted a way to ask
seccomp to allow a layer to be removed.
Andy,

Please take a look at this draft patch that intends to enable seccomp
as something that task isolation can use.

The basic idea is to add a notion of "removable" seccomp filters.
You can tag a filter that way when installing it (using the new
SECCOMP_FILTER_FLAG_REMOVABLE flag bit for SECCOMP_SET_MODE_FILTER),
and if the most recently-added filter is marked as removable, you can
remove it with the new SECCOMP_POP_FILTER operation. It is currently
implemented to be incompatible with SECCOMP_FILTER_FLAG_TSYNC, which
is plausible since the obvious use is for thread-local push and pop,
but the API allows for future implementation by including a flag word
with the pop_filter operation (now always zero).

I did not make this supported via the prctl() since the "removable"
flag requires seccomp(), so making pop work with prctl() seemed silly.

One interesting result of this is that now it is no longer true
that once current->seccomp.mode becomes non-zero, it may not be
changed, since it can now be changed back to DISABLED when you push a
removable filter and later pop it.

My preference would be not to have to require all task-isolation users
to also figure out all the complexities of creating BPF programs, so
my intention is to have task isolation automatically generate a BPF
program (just allowing prctl/exit/exit_group and failing everything
else with SIGSYS). To support having it work this way, I open up
the seccomp stuff a little so that kernel clients can effectively
push/pop a BPF program into seccomp:

long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp)
long seccomp_pop_filter(unsigned int flags);

We mark filters from this API with a new "extern_prog" boolean in the
seccomp_filter struct so the BPF program isn't freed when the
seccomp_filter itself is freed. Note that doing it this way avoids
having to go through the substantial overhead of creating a brand-new
BPF filter every time we enter task isolation mode.

Not shown here is the additional code needed in task isolation to
create a suitable BPF program and then push and pop it as we go in and
out of task isolation mode.

For what it's worth, I'm a little dubious about the tradeoff of adding
a substantial chunk of code to seccomp to handle what the v10 task
isolation code did with a single extra TIF flag test and a dozen lines
of code that got called. But given that you said there were other
potential users for the "filter pop" idea, it may indeed make sense.

This is still all untested, but I wanted to get your sense of whether
this was even going in the right direction before spending more time
on it.

Thanks!

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 2296e6b2f690..feeba7a23d20 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -3,13 +3,15 @@

#include <uapi/linux/seccomp.h>

-#define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC)
+#define SECCOMP_FILTER_FLAG_MASK \
+ (SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_REMOVABLE)

#ifdef CONFIG_SECCOMP

#include <linux/thread_info.h>
#include <asm/seccomp.h>

+struct bpf_prog;
struct seccomp_filter;
/**
* struct seccomp - the state of a seccomp'ed process
@@ -41,6 +43,8 @@ static inline int secure_computing(void)

extern u32 seccomp_phase1(struct seccomp_data *sd);
int seccomp_phase2(u32 phase1_result);
+long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp);
+long seccomp_pop_filter(unsigned int flags);
#else
extern void secure_computing_strict(int this_syscall);
#endif
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0f238a43ff1e..6e65ac2a7262 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -13,9 +13,11 @@
/* Valid operations for seccomp syscall. */
#define SECCOMP_SET_MODE_STRICT 0
#define SECCOMP_SET_MODE_FILTER 1
+#define SECCOMP_POP_FILTER 2

/* Valid flags for SECCOMP_SET_MODE_FILTER */
#define SECCOMP_FILTER_FLAG_TSYNC 1
+#define SECCOMP_FILTER_FLAG_REMOVABLE 2

/*
* All BPF programs must return a 32-bit value.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 15a1795bbba1..c22eb3a56556 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -41,8 +41,9 @@
* outside of a lifetime-guarded section. In general, this
* is only needed for handling filters shared across tasks.
* @prev: points to a previously installed, or inherited, filter
- * @len: the number of instructions in the program
- * @insnsi: the BPF program instructions to evaluate
+ * @prog: the BPF program to evaluate
+ * @removable: if this filter is removable with seccomp_pop_filter()
+ * @extern_prog: if @prog should not be freed in seccomp_free_filter()
*
* seccomp_filter objects are organized in a tree linked via the @prev
* pointer. For any task, it appears to be a singly-linked list starting
@@ -58,6 +59,8 @@ struct seccomp_filter {
atomic_t usage;
struct seccomp_filter *prev;
struct bpf_prog *prog;
+ bool removable;
+ bool extern_prog;
};

/* Limit any path through the tree to 256KB worth of instructions. */
@@ -470,7 +473,8 @@ void get_seccomp_filter(struct task_struct *tsk)
static inline void seccomp_filter_free(struct seccomp_filter *filter)
{
if (filter) {
- bpf_prog_destroy(filter->prog);
+ if (!filter->extern_prog)
+ bpf_prog_destroy(filter->prog);
kfree(filter);
}
}
@@ -722,6 +726,7 @@ long prctl_get_seccomp(void)
* seccomp_set_mode_strict: internal function for setting strict seccomp
*
* Once current->seccomp.mode is non-zero, it may not be changed.
+ * (other than to reset to DISABLED after removing the last removable filter).
*
* Returns 0 on success or -EINVAL on failure.
*/
@@ -749,33 +754,34 @@ out:

#ifdef CONFIG_SECCOMP_FILTER
/**
- * seccomp_set_mode_filter: internal function for setting seccomp filter
+ * do_push_filter: internal function for setting seccomp filter
* @flags: flags to change filter behavior
- * @filter: struct sock_fprog containing filter
+ * @prepared: struct seccomp_filter to install
*
* This function may be called repeatedly to install additional filters.
* Every filter successfully installed will be evaluated (in reverse order)
* for each system call the task makes.
*
- * Once current->seccomp.mode is non-zero, it may not be changed.
+ * Once current->seccomp.mode is non-zero, it may not be changed
+ * (other than to reset to DISABLED after removing the last removable filter).
*
* Returns 0 on success or -EINVAL on failure.
*/
-static long seccomp_set_mode_filter(unsigned int flags,
- const char __user *filter)
+long do_push_filter(unsigned int flags, struct seccomp_filter *prepared)
{
const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
- struct seccomp_filter *prepared = NULL;
long ret = -EINVAL;

/* Validate flags. */
if (flags & ~SECCOMP_FILTER_FLAG_MASK)
return -EINVAL;

- /* Prepare the new filter before holding any locks. */
- prepared = seccomp_prepare_user_filter(filter);
- if (IS_ERR(prepared))
- return PTR_ERR(prepared);
+ if (flags & SECCOMP_FILTER_FLAG_REMOVABLE) {
+ /* The intended use case is for thread-local push/pop. */
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC)
+ goto out_free;
+ prepared->removable = true;
+ }

/*
* Make sure we cannot change seccomp or nnp state via TSYNC
@@ -805,12 +811,87 @@ out_free:
seccomp_filter_free(prepared);
return ret;
}
+
+static long seccomp_set_mode_filter(unsigned int flags,
+ const char __user *filter)
+{
+ struct seccomp_filter *prepared;
+
+ /* Prepare the new filter before holding any locks. */
+ prepared = seccomp_prepare_user_filter(filter);
+ if (IS_ERR(prepared))
+ return PTR_ERR(prepared);
+	return do_push_filter(flags, prepared);
+}
+
+long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp)
+{
+ struct seccomp_filter *sfilter;
+
+ sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL);
+ if (!sfilter)
+		return -ENOMEM;
+
+ sfilter->prog = fp;
+ sfilter->extern_prog = true;
+ atomic_set(&sfilter->usage, 1);
+
+ return do_push_filter(flags, sfilter);
+}
+
+/**
+ * seccomp_pop_filter: internal function for removing filter
+ * @flags: flags to change pop behavior
+ *
+ * This function removes the most recently installed filter, if it was
+ * installed with the SECCOMP_FILTER_FLAG_REMOVABLE flag. Any previously
+ * installed filters are left intact.
+ *
+ * If the last filter is removed, the seccomp state reverts to DISABLED.
+ *
+ * Returns 0 on success or -EINVAL on failure.
+ */
+long seccomp_pop_filter(unsigned int flags)
+{
+ struct seccomp_filter *filter;
+
+ /* The intended use case is for temporary thread-local push/pop. */
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC)
+ return -EINVAL;
+
+ spin_lock_irq(&current->sighand->siglock);
+
+ if (current->seccomp.mode != SECCOMP_MODE_FILTER)
+ goto out;
+
+ filter = current->seccomp.filter;
+	if (WARN_ON(filter == NULL) || !filter->removable)
+ goto out;
+
+ if (filter->prev == NULL) {
+ clear_tsk_thread_flag(current, TIF_SECCOMP);
+ current->seccomp.mode = SECCOMP_MODE_DISABLED;
+ }
+
+ current->seccomp.filter = filter->prev;
+
+ spin_unlock_irq(&current->sighand->siglock);
+ seccomp_filter_free(filter);
+ return 0;
+out:
+ spin_unlock_irq(&current->sighand->siglock);
+ return -EINVAL;
+}
#else
static inline long seccomp_set_mode_filter(unsigned int flags,
const char __user *filter)
{
return -EINVAL;
}
+static inline long seccomp_pop_filter(unsigned int flags)
+{
+ return -EINVAL;
+}
#endif

/* Common entry point for both prctl and syscall. */
@@ -824,6 +905,8 @@ static long do_seccomp(unsigned int op, unsigned int flags,
return seccomp_set_mode_strict();
case SECCOMP_SET_MODE_FILTER:
return seccomp_set_mode_filter(flags, uargs);
+ case SECCOMP_POP_FILTER:
+ return seccomp_pop_filter(flags);
default:
return -EINVAL;
}
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Andy Lutomirski
2016-03-09 20:59:19 UTC
Post by Chris Metcalf
Post by Andy Lutomirski
Post by Andy Lutomirski
Let task isolation users who want to detect when they screw up and do
a syscall do it with seccomp.
Can you give me more details on what you're imagining here? Remember
that a key use case is that these applications can remove the syscall
prohibition voluntarily; it's only there to prevent unintended uses
(by third party libraries or just straight-up programming bugs).
As far as I can tell, seccomp does not allow you to go from "less
permissive" to "more permissive" settings at all, which means that as
it exists, it's not a good solution for this use case.
Or were you thinking about a new seccomp API that allows this?
I was. This is at least the second time I've wanted a way to ask
seccomp to allow a layer to be removed.
Andy,
Please take a look at this draft patch that intends to enable seccomp
as something that task isolation can use.
Kees, this sounds like it may solve your self-instrumentation problem.
Want to take a look?

--Andy
Post by Chris Metcalf
The basic idea is to add a notion of "removable" seccomp filters.
You can tag a filter that way when installing it (using the new
SECCOMP_FILTER_FLAG_REMOVABLE flag bit for SECCOMP_SET_MODE_FILTER),
and if the most recently-added filter is marked as removable, you can
remove it with the new SECCOMP_POP_FILTER operation. It is currently
implemented to be incompatible with SECCOMP_FILTER_FLAG_TSYNC, which
is plausible since the obvious use is for thread-local push and pop,
but the API allows for future implementation by including a flag word
with the pop_filter operation (now always zero).
I did not make this supported via the prctl() since the "removable"
flag requires seccomp(), so making pop work with prctl() seemed silly.
One interesting result of this is that now it is no longer true
that once current->seccomp.mode becomes non-zero, it may not be
changed, since it can now be changed back to DISABLED when you push a
removable filter and later pop it.
My preference would be not to have to require all task-isolation users
to also figure out all the complexities of creating BPF programs, so
my intention is to have task isolation automatically generate a BPF
program (just allowing prctl/exit/exit_group and failing everything
else with SIGSYS). To support having it work this way, I open up
the seccomp stuff a little so that kernel clients can effectively
That sounds like a great use case for the new libtaskisolation that
someone is surely writing :)
Post by Chris Metcalf
long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp)
long seccomp_pop_filter(unsigned int flags);
We mark filters from this API with a new "extern_prog" boolean in the
seccomp_filter struct so the BPF program isn't freed when the
seccomp_filter itself is freed. Note that doing it this way avoids
having to go through the substantial overhead of creating a brand-new
BPF filter every time we enter task isolation mode.
Not shown here is the additional code needed in task isolation to
create a suitable BPF program and then push and pop it as we go in and
out of task isolation mode.
For what it's worth, I'm a little dubious about the tradeoff of adding
a substantial chunk of code to seccomp to handle what the v10 task
isolation code did with a single extra TIF flag test and a dozen lines
of code that got called. But given that you said there were other
potential users for the "filter pop" idea, it may indeed make sense.
This is still all untested, but I wanted to get your sense of whether
this was even going in the right direction before spending more time
on it.
Thanks!
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 2296e6b2f690..feeba7a23d20 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -3,13 +3,15 @@
#include <uapi/linux/seccomp.h>
-#define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC)
+#define SECCOMP_FILTER_FLAG_MASK \
+ (SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_REMOVABLE)
#ifdef CONFIG_SECCOMP
#include <linux/thread_info.h>
#include <asm/seccomp.h>
+struct bpf_prog;
struct seccomp_filter;
/**
* struct seccomp - the state of a seccomp'ed process
@@ -41,6 +43,8 @@ static inline int secure_computing(void)
extern u32 seccomp_phase1(struct seccomp_data *sd);
int seccomp_phase2(u32 phase1_result);
+long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp);
+long seccomp_pop_filter(unsigned int flags);
#else
extern void secure_computing_strict(int this_syscall);
#endif
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0f238a43ff1e..6e65ac2a7262 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -13,9 +13,11 @@
/* Valid operations for seccomp syscall. */
#define SECCOMP_SET_MODE_STRICT 0
#define SECCOMP_SET_MODE_FILTER 1
+#define SECCOMP_POP_FILTER 2
/* Valid flags for SECCOMP_SET_MODE_FILTER */
#define SECCOMP_FILTER_FLAG_TSYNC 1
+#define SECCOMP_FILTER_FLAG_REMOVABLE 2
/*
* All BPF programs must return a 32-bit value.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 15a1795bbba1..c22eb3a56556 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -41,8 +41,9 @@
* outside of a lifetime-guarded section. In general, this
* is only needed for handling filters shared across tasks.
*
* pointer. For any task, it appears to be a singly-linked list starting
@@ -58,6 +59,8 @@ struct seccomp_filter {
atomic_t usage;
struct seccomp_filter *prev;
struct bpf_prog *prog;
+ bool removable;
+ bool extern_prog;
};
/* Limit any path through the tree to 256KB worth of instructions. */
@@ -470,7 +473,8 @@ void get_seccomp_filter(struct task_struct *tsk)
static inline void seccomp_filter_free(struct seccomp_filter *filter)
{
if (filter) {
- bpf_prog_destroy(filter->prog);
+ if (!filter->extern_prog)
+ bpf_prog_destroy(filter->prog);
kfree(filter);
}
}
@@ -722,6 +726,7 @@ long prctl_get_seccomp(void)
* seccomp_set_mode_strict: internal function for setting strict seccomp
*
* Once current->seccomp.mode is non-zero, it may not be changed.
+ * (other than to reset to DISABLED after removing the last removable filter).
*
* Returns 0 on success or -EINVAL on failure.
*/
#ifdef CONFIG_SECCOMP_FILTER
/**
- * seccomp_set_mode_filter: internal function for setting seccomp filter
+ * do_push_filter: internal function for setting seccomp filter
*
* This function may be called repeatedly to install additional filters.
* Every filter successfully installed will be evaluated (in reverse order)
* for each system call the task makes.
*
- * Once current->seccomp.mode is non-zero, it may not be changed.
+ * Once current->seccomp.mode is non-zero, it may not be changed
+ * (other than to reset to DISABLED after removing the last removable filter).
*
* Returns 0 on success or -EINVAL on failure.
*/
-static long seccomp_set_mode_filter(unsigned int flags,
- const char __user *filter)
+long do_push_filter(unsigned int flags, struct seccomp_filter *prepared)
{
const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
- struct seccomp_filter *prepared = NULL;
long ret = -EINVAL;
/* Validate flags. */
if (flags & ~SECCOMP_FILTER_FLAG_MASK)
return -EINVAL;
- /* Prepare the new filter before holding any locks. */
- prepared = seccomp_prepare_user_filter(filter);
- if (IS_ERR(prepared))
- return PTR_ERR(prepared);
+ if (flags & SECCOMP_FILTER_FLAG_REMOVABLE) {
+ /* The intended use case is for thread-local push/pop. */
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC)
+ goto out_free;
+ prepared->removable = true;
+ }
/*
* Make sure we cannot change seccomp or nnp state via TSYNC
seccomp_filter_free(prepared);
return ret;
}
+
+static long seccomp_set_mode_filter(unsigned int flags,
+ const char __user *filter)
+{
+ struct seccomp_filter *prepared;
+
+ /* Prepare the new filter before holding any locks. */
+ prepared = seccomp_prepare_user_filter(filter);
+ if (IS_ERR(prepared))
+ return PTR_ERR(prepared);
+	return do_push_filter(flags, prepared);
+}
+
+long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp)
+{
+ struct seccomp_filter *sfilter;
+
+ sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL);
+ if (!sfilter)
+		return -ENOMEM;
+
+ sfilter->prog = fp;
+ sfilter->extern_prog = true;
+ atomic_set(&sfilter->usage, 1);
+
+ return do_push_filter(flags, sfilter);
+}
+
+/**
+ * seccomp_pop_filter: internal function for removing filter
+ *
+ * This function removes the most recently installed filter, if it was
+ * installed with the SECCOMP_FILTER_FLAG_REMOVABLE flag. Any previously
+ * installed filters are left intact.
+ *
+ * If the last filter is removed, the seccomp state reverts to DISABLED.
+ *
+ * Returns 0 on success or -EINVAL on failure.
+ */
+long seccomp_pop_filter(unsigned int flags)
+{
+ struct seccomp_filter *filter;
+
+ /* The intended use case is for temporary thread-local push/pop. */
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC)
+ return -EINVAL;
+
+ spin_lock_irq(&current->sighand->siglock);
+
+ if (current->seccomp.mode != SECCOMP_MODE_FILTER)
+ goto out;
+
+ filter = current->seccomp.filter;
+	if (WARN_ON(filter == NULL) || !filter->removable)
+ goto out;
+
+ if (filter->prev == NULL) {
+ clear_tsk_thread_flag(current, TIF_SECCOMP);
+ current->seccomp.mode = SECCOMP_MODE_DISABLED;
+ }
+
+ current->seccomp.filter = filter->prev;
+
+ spin_unlock_irq(&current->sighand->siglock);
+ seccomp_filter_free(filter);
+	return 0;
+out:
+	spin_unlock_irq(&current->sighand->siglock);
+	return -EINVAL;
+}
#else
static inline long seccomp_set_mode_filter(unsigned int flags,
const char __user *filter)
{
return -EINVAL;
}
+static inline long seccomp_pop_filter(unsigned int flags)
+{
+ return -EINVAL;
+}
#endif
/* Common entry point for both prctl and syscall. */
@@ -824,6 +905,8 @@ static long do_seccomp(unsigned int op, unsigned int flags,
return seccomp_set_mode_strict();
return seccomp_set_mode_filter(flags, uargs);
+ return seccomp_pop_filter(flags);
return -EINVAL;
}
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
--
Andy Lutomirski
AMA Capital Management, LLC
Chris Metcalf
2016-03-09 21:06:32 UTC
Post by Andy Lutomirski
Post by Chris Metcalf
My preference would be not to have to require all task-isolation users
to also figure out all the complexities of creating BPF programs, so
my intention is to have task isolation automatically generate a BPF
program (just allowing prctl/exit/exit_group and failing everything
else with SIGSYS). To support having it work this way, I open up
the seccomp stuff a little so that kernel clients can effectively
That sounds like a great use case for the new libtaskisolation that
someone is surely writing :)
Happily, task isolation is so simple an API that all that is needed is a prctl().

... Unless somehow a requirement to inflict a huge blob of eBPF into the kernel
just to use task isolation safely is added, of course :-)
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Andy Lutomirski
2016-03-09 21:08:28 UTC
Post by Chris Metcalf
Post by Andy Lutomirski
Post by Chris Metcalf
My preference would be not to have to require all task-isolation users
to also figure out all the complexities of creating BPF programs, so
my intention is to have task isolation automatically generate a BPF
program (just allowing prctl/exit/exit_group and failing everything
else with SIGSYS). To support having it work this way, I open up
the seccomp stuff a little so that kernel clients can effectively
That sounds like a great use case for the new libtaskisolation that
someone is surely writing :)
Happily, task isolation is so simple an API that all that is needed is a prctl().
... Unless somehow a requirement to inflict a huge blob of eBPF into the kernel
just to use task isolation safely is added, of course :-)
BPF, not eBPF. Also, it's a tiny blob.

And this still has nothing to do with using it safely. This has to do
with catching your own bugs.

--Andy
Post by Chris Metcalf
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
--
Andy Lutomirski
AMA Capital Management, LLC
Chris Metcalf
2016-03-09 21:13:56 UTC
Post by Andy Lutomirski
Post by Chris Metcalf
Post by Andy Lutomirski
Post by Chris Metcalf
My preference would be not to have to require all task-isolation users
to also figure out all the complexities of creating BPF programs, so
my intention is to have task isolation automatically generate a BPF
program (just allowing prctl/exit/exit_group and failing everything
else with SIGSYS). To support having it work this way, I open up
the seccomp stuff a little so that kernel clients can effectively
That sounds like a great use case for the new libtaskisolation that
someone is surely writing :)
Happily, task isolation is so simple an API that all that is needed is a prctl().
... Unless somehow a requirement to inflict a huge blob of eBPF into the
kernel just to use task isolation safely is added, of course :-)
BPF, not eBPF. Also, it's a tiny blob.
And this still has nothing to do with using it safely. This has to do
with catching your own bugs.
Fair enough, I suppose. But I was exaggerating for effect: I still think that
this is something that can be easily hidden under the prctl() to avoid adding
a noticeable burden on users who want to be able to catch bugs. (And
those bugs can come from third-party libraries in complex code; the amount
of code in a task-isolation driver is not always easily audited, so having this
kind of a backstop can be pretty useful.)

If you think the basic direction of the previous patch is sound, I'll spin
up the code that hooks it into task isolation, and we can see more directly
whether the tradeoff of a bit more code in the kernel seems worth it.
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Kees Cook
2016-03-09 21:11:02 UTC
Post by Andy Lutomirski
Post by Chris Metcalf
Post by Andy Lutomirski
Post by Andy Lutomirski
Let task isolation users who want to detect when they screw up and do
a syscall do it with seccomp.
Can you give me more details on what you're imagining here? Remember
that a key use case is that these applications can remove the syscall
prohibition voluntarily; it's only there to prevent unintended uses
(by third party libraries or just straight-up programming bugs).
As far as I can tell, seccomp does not allow you to go from "less
permissive" to "more permissive" settings at all, which means that as
it exists, it's not a good solution for this use case.
Or were you thinking about a new seccomp API that allows this?
I was. This is at least the second time I've wanted a way to ask
seccomp to allow a layer to be removed.
Andy,
Please take a look at this draft patch that intends to enable seccomp
as something that task isolation can use.
Kees, this sounds like it may solve your self-instrumentation problem.
Want to take a look?
Errrr... I'm pretty uncomfortable with this. I really would like to
keep the basic semantics of seccomp as simple as possible: filtering
only gets more restricted.

This doesn't really solve my self-instrumentation desires since I
still can't sanely deliver signals. I would need a lot more
convincing. :)
Post by Andy Lutomirski
Post by Chris Metcalf
The basic idea is to add a notion of "removable" seccomp filters.
You can tag a filter that way when installing it (using the new
SECCOMP_FILTER_FLAG_REMOVABLE flag bit for SECCOMP_SET_MODE_FILTER),
and if the most recently-added filter is marked as removable, you can
remove it with the new SECCOMP_POP_FILTER operation. It is currently
implemented to be incompatible with SECCOMP_FILTER_FLAG_TSYNC, which
is plausible since the obvious use is for thread-local push and pop,
but the API allows for future implementation by including a flag word
with the pop_filter operation (now always zero).
I did not make this supported via the prctl() since the "removable"
flag requires seccomp(), so making pop work with prctl() seemed silly.
One interesting result of this is that now it is no longer true
that once current->seccomp.mode becomes non-zero, it may not be
changed, since it can now be changed back to DISABLED when you push a
removable filter and later pop it.
My preference would be not to have to require all task-isolation users
to also figure out all the complexities of creating BPF programs, so
my intention is to have task isolation automatically generate a BPF
program (just allowing prctl/exit/exit_group and failing everything
else with SIGSYS). To support having it work this way, I open up
the seccomp stuff a little so that kernel clients can effectively
That sounds like a great use case for the new libtaskisolation that
someone is surely writing :)
Post by Chris Metcalf
long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp)
long seccomp_pop_filter(unsigned int flags);
We mark filters from this API with a new "extern_prog" boolean in the
seccomp_filter struct so the BPF program isn't freed when the
seccomp_filter itself is freed. Note that doing it this way avoids
having to go through the substantial overhead of creating a brand-new
BPF filter every time we enter task isolation mode.
Not shown here is the additional code needed in task isolation to
create a suitable BPF program and then push and pop it as we go in and
out of task isolation mode.
For what it's worth, I'm a little dubious about the tradeoff of adding
a substantial chunk of code to seccomp to handle what the v10 task
isolation code did with a single extra TIF flag test and a dozen lines
of code that got called. But given that you said there were other
potential users for the "filter pop" idea, it may indeed make sense.
I think the extra TIF flag makes more sense than overloading seccomp.

-Kees
Post by Andy Lutomirski
Post by Chris Metcalf
This is still all untested, but I wanted to get your sense of whether
this was even going in the right direction before spending more time
on it.
Thanks!
--
Andy Lutomirski
AMA Capital Management, LLC
--
Kees Cook
Chrome OS & Brillo Security
Andy Lutomirski
2016-03-09 21:18:37 UTC
Post by Kees Cook
Post by Andy Lutomirski
Post by Chris Metcalf
Post by Andy Lutomirski
Post by Andy Lutomirski
Let task isolation users who want to detect when they screw up and do
a syscall do it with seccomp.
Can you give me more details on what you're imagining here? Remember
that a key use case is that these applications can remove the syscall
prohibition voluntarily; it's only there to prevent unintended uses
(by third party libraries or just straight-up programming bugs).
As far as I can tell, seccomp does not allow you to go from "less
permissive" to "more permissive" settings at all, which means that as
it exists, it's not a good solution for this use case.
Or were you thinking about a new seccomp API that allows this?
I was. This is at least the second time I've wanted a way to ask
seccomp to allow a layer to be removed.
Andy,
Please take a look at this draft patch that intends to enable seccomp
as something that task isolation can use.
Kees, this sounds like it may solve your self-instrumentation problem.
Want to take a look?
Errrr... I'm pretty uncomfortable with this. I really would like to
keep the basic semantics of seccomp as simple as possible: filtering
only gets more restricted.
This doesn't really solve my self-instrumentation desires since I
still can't sanely deliver signals. I would need a lot more
convincing. :)
I think you could do it by adding a filter that turns all the unknown
things into SIGSYS, allows sigreturn, and allows the seccomp syscall,
at least in the pop-off-the-filter variant. Then you add this
removably.

In the SIGSYS handler, you pop off the filter, do your bookkeeping,
update the filter, and push it back on.

--Andy
Kees Cook
2016-03-09 21:25:24 UTC
Post by Andy Lutomirski
Post by Kees Cook
Post by Andy Lutomirski
Post by Chris Metcalf
Post by Andy Lutomirski
Post by Andy Lutomirski
Let task isolation users who want to detect when they screw up and do
a syscall do it with seccomp.
Can you give me more details on what you're imagining here? Remember
that a key use case is that these applications can remove the syscall
prohibition voluntarily; it's only there to prevent unintended uses
(by third party libraries or just straight-up programming bugs).
As far as I can tell, seccomp does not allow you to go from "less
permissive" to "more permissive" settings at all, which means that as
it exists, it's not a good solution for this use case.
Or were you thinking about a new seccomp API that allows this?
I was. This is at least the second time I've wanted a way to ask
seccomp to allow a layer to be removed.
Andy,
Please take a look at this draft patch that intends to enable seccomp
as something that task isolation can use.
Kees, this sounds like it may solve your self-instrumentation problem.
Want to take a look?
Errrr... I'm pretty uncomfortable with this. I really would like to
keep the basic semantics of seccomp as simple as possible: filtering
only gets more restricted.
The other problem is that this won't work if the third-party code
actually uses seccomp itself... this isn't composable as-is.
Post by Andy Lutomirski
Post by Kees Cook
This doesn't really solve my self-instrumentation desires since I
still can't sanely deliver signals. I would need a lot more
convincing. :)
I think you could do it by adding a filter that turns all the unknown
things into SIGSYS, allows sigreturn, and allows the seccomp syscall,
at least in the pop-off-the-filter variant. Then you add this
removably.
In the SIGSYS handler, you pop off the filter, do your bookkeeping,
update the filter, and push it back on.
No, this won't let the original syscall through. I wanted to be able
to document the syscalls as they happened without needing audit or a
ptrace monitor. I am currently convinced that my desire for this is no
good, and it should just be done with a ptrace monitor...

-Kees
Post by Andy Lutomirski
--Andy
--
Kees Cook
Chrome OS & Brillo Security
Andy Lutomirski
2016-03-09 21:58:21 UTC
[adding Kenton -- you do interesting things with seccomp, too]
Post by Kees Cook
Post by Andy Lutomirski
Post by Kees Cook
Post by Andy Lutomirski
Post by Chris Metcalf
Post by Andy Lutomirski
Post by Andy Lutomirski
Let task isolation users who want to detect when they screw up and do
a syscall do it with seccomp.
Can you give me more details on what you're imagining here? Remember
that a key use case is that these applications can remove the syscall
prohibition voluntarily; it's only there to prevent unintended uses
(by third party libraries or just straight-up programming bugs).
As far as I can tell, seccomp does not allow you to go from "less
permissive" to "more permissive" settings at all, which means that as
it exists, it's not a good solution for this use case.
Or were you thinking about a new seccomp API that allows this?
I was. This is at least the second time I've wanted a way to ask
seccomp to allow a layer to be removed.
Andy,
Please take a look at this draft patch that intends to enable seccomp
as something that task isolation can use.
Kees, this sounds like it may solve your self-instrumentation problem.
Want to take a look?
Errrr... I'm pretty uncomfortable with this. I really would like to
keep the basic semantics of seccomp as simple as possible: filtering
only gets more restricted.
The other problem is that this won't work if the third-party code
actually uses seccomp itself... this isn't composable as-is.
It kind of is. Set it up to trap to SIGSYS on any unexpected
seccomp() call. Then emulate it.

To make this slightly cleaner, there could be a variant of the flag in
which a poppable seccomp filter is set to pop itself if it generates
SIGSYS. Then you'd make sure to always trap seccomp and sigaction --
you'd have to emulate those two for composability.

Presumably, for sanity, it would be illegal to have any filter at all
stacked on top of this type of filter. We'd also probably want to
prevent installation of a non-poppable filter on top of a poppable
filter -- that wouldn't make much sense.

Just to muddy the waters, there's another possible use case for this:
a sandbox program could mprotect all its critical data structures
readonly or even PROT_NONE, set up sigaltstack and a very carefully
written SIGSYS handler, install a self-popping seccomp filter that
turns *everything* into SIGSYS, and then jump to untrusted code. Now
we can finally have a trusted in-process supervisor that can't be
tampered with because it's only privileged if it's entered through a
special gate (i.e. SIGSYS).
Post by Kees Cook
Post by Andy Lutomirski
Post by Kees Cook
This doesn't really solve my self-instrumentation desires since I
still can't sanely deliver signals. I would need a lot more
convincing. :)
I think you could do it by adding a filter that turns all the unknown
things into SIGSYS, allows sigreturn, and allows the seccomp syscall,
at least in the pop-off-the-filter variant. Then you add this
removably.
In the SIGSYS handler, you pop off the filter, do your bookkeeping,
update the filter, and push it back on.
No, this won't let the original syscall through. I wanted to be able
to document the syscalls as they happened without needing audit or a
ptrace monitor. I am currently convinced that my desire for this is no
good, and it should just be done with a ptrace monitor...
It can let the original through -- just do the arch-specific restart
incantation before you return. On x86, reload EAX/RAX and backtrack
IP by 2. As of Linux 4.4, the code for this is sane and has a good
test case. On older 64-bit kernels on AMD running 32-bit code, there
could be odd side effects.

If I ever factor out my SIGSYS decoder, I'll add a restart helper.

--Andy
Chris Metcalf
2016-03-02 20:10:53 UTC
Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
state machine that can be difficult to reason about due to duplicated
code and a large number of branch targets.

This patch factors the common logic out into the existing
do_notify_resume function, converting the code to C in the process,
making the code more legible.

This patch tries to closely mirror the existing behaviour while using
the usual C control flow primitives. As local_irq_{disable,enable} may
be instrumented, we balance exception entry (where we will most
likely enable IRQs) with a call to trace_hardirqs_on just before the
return to userspace.

Signed-off-by: Chris Metcalf <***@ezchip.com>
---
arch/arm64/kernel/entry.S | 12 ++++--------
arch/arm64/kernel/signal.c | 30 ++++++++++++++++++++++--------
2 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 1f7f5a2b61bf..966d0d4308f2 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -674,18 +674,13 @@ ret_fast_syscall_trace:
* Ok, we need to do extra processing, enter the slow path.
*/
work_pending:
- tbnz x1, #TIF_NEED_RESCHED, work_resched
- /* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
mov x0, sp // 'regs'
- enable_irq // enable interrupts for do_notify_resume()
bl do_notify_resume
- b ret_to_user
-work_resched:
#ifdef CONFIG_TRACE_IRQFLAGS
- bl trace_hardirqs_off // the IRQs are off here, inform the tracing code
+ bl trace_hardirqs_on // enabled while in userspace
#endif
- bl schedule
-
+ ldr x1, [tsk, #TI_FLAGS] // re-check for single-step
+ b finish_ret_to_user
/*
* "slow" syscall return path.
*/
@@ -694,6 +689,7 @@ ret_to_user:
ldr x1, [tsk, #TI_FLAGS]
and x2, x1, #_TIF_WORK_MASK
cbnz x2, work_pending
+finish_ret_to_user:
enable_step_tsk x1, x2
kernel_exit 0
ENDPROC(ret_to_user)
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index e18c48cb6db1..3432e14b7d6e 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -402,15 +402,29 @@ static void do_signal(struct pt_regs *regs)
asmlinkage void do_notify_resume(struct pt_regs *regs,
unsigned int thread_flags)
{
- if (thread_flags & _TIF_SIGPENDING)
- do_signal(regs);
+ while (true) {

- if (thread_flags & _TIF_NOTIFY_RESUME) {
- clear_thread_flag(TIF_NOTIFY_RESUME);
- tracehook_notify_resume(regs);
- }
+ if (thread_flags & _TIF_NEED_RESCHED) {
+ schedule();
+ } else {
+ local_irq_enable();
+
+ if (thread_flags & _TIF_SIGPENDING)
+ do_signal(regs);

- if (thread_flags & _TIF_FOREIGN_FPSTATE)
- fpsimd_restore_current_state();
+ if (thread_flags & _TIF_NOTIFY_RESUME) {
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ tracehook_notify_resume(regs);
+ }
+
+ if (thread_flags & _TIF_FOREIGN_FPSTATE)
+ fpsimd_restore_current_state();
+ }

+ local_irq_disable();
+
+ thread_flags = READ_ONCE(current_thread_info()->flags);
+ if (!(thread_flags & _TIF_WORK_MASK))
+ break;
+ }
}
--
2.1.2
Will Deacon
2016-03-04 16:39:36 UTC
Hi Chris,
Post by Chris Metcalf
Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
state machine that can be difficult to reason about due to duplicated
code and a large number of branch targets.
This patch factors the common logic out into the existing
do_notify_resume function, converting the code to C in the process,
making the code more legible.
This patch tries to closely mirror the existing behaviour while using
the usual C control flow primitives. As local_irq_{disable,enable} may
be instrumented, we balance exception entry (where we will almost most
likely enable IRQs) with a call to trace_hardirqs_on just before the
return to userspace.
[...]
Post by Chris Metcalf
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 1f7f5a2b61bf..966d0d4308f2 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
* Ok, we need to do extra processing, enter the slow path.
*/
- tbnz x1, #TIF_NEED_RESCHED, work_resched
- /* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
mov x0, sp // 'regs'
- enable_irq // enable interrupts for do_notify_resume()
bl do_notify_resume
- b ret_to_user
#ifdef CONFIG_TRACE_IRQFLAGS
- bl trace_hardirqs_off // the IRQs are off here, inform the tracing code
+ bl trace_hardirqs_on // enabled while in userspace
This doesn't look right to me. We only get here after running
do_notify_resume, which returns with interrupts disabled.

Do we not instead need to inform the tracing code that interrupts are
disabled prior to calling do_notify_resume?
Post by Chris Metcalf
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index e18c48cb6db1..3432e14b7d6e 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -402,15 +402,29 @@ static void do_signal(struct pt_regs *regs)
asmlinkage void do_notify_resume(struct pt_regs *regs,
unsigned int thread_flags)
{
- if (thread_flags & _TIF_SIGPENDING)
- do_signal(regs);
+ while (true) {
- if (thread_flags & _TIF_NOTIFY_RESUME) {
- clear_thread_flag(TIF_NOTIFY_RESUME);
- tracehook_notify_resume(regs);
- }
+ if (thread_flags & _TIF_NEED_RESCHED) {
+ schedule();
+ } else {
+ local_irq_enable();
+
+ if (thread_flags & _TIF_SIGPENDING)
+ do_signal(regs);
- if (thread_flags & _TIF_FOREIGN_FPSTATE)
- fpsimd_restore_current_state();
+ if (thread_flags & _TIF_NOTIFY_RESUME) {
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ tracehook_notify_resume(regs);
+ }
+
+ if (thread_flags & _TIF_FOREIGN_FPSTATE)
+ fpsimd_restore_current_state();
+ }
+ local_irq_disable();
+
+ thread_flags = READ_ONCE(current_thread_info()->flags);
+ if (!(thread_flags & _TIF_WORK_MASK))
+ break;
+ }
This might be easier to read as a do { ... } while.

Will
Chris Metcalf
2016-03-04 20:03:16 UTC
Post by Will Deacon
Hi Chris,
Post by Chris Metcalf
Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
state machine that can be difficult to reason about due to duplicated
code and a large number of branch targets.
This patch factors the common logic out into the existing
do_notify_resume function, converting the code to C in the process,
making the code more legible.
This patch tries to closely mirror the existing behaviour while using
the usual C control flow primitives. As local_irq_{disable,enable} may
be instrumented, we balance exception entry (where we will almost most
likely enable IRQs) with a call to trace_hardirqs_on just before the
return to userspace.
[...]
Post by Chris Metcalf
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 1f7f5a2b61bf..966d0d4308f2 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
* Ok, we need to do extra processing, enter the slow path.
*/
- tbnz x1, #TIF_NEED_RESCHED, work_resched
- /* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
mov x0, sp // 'regs'
- enable_irq // enable interrupts for do_notify_resume()
bl do_notify_resume
- b ret_to_user
#ifdef CONFIG_TRACE_IRQFLAGS
- bl trace_hardirqs_off // the IRQs are off here, inform the tracing code
+ bl trace_hardirqs_on // enabled while in userspace
This doesn't look right to me. We only get here after running
do_notify_resume, which returns with interrupts disabled.
Do we not instead need to inform the tracing code that interrupts are
disabled prior to calling do_notify_resume?
I think you are right about the trace_hardirqs_off prior to
calling into do_notify_resume, given Catalin's recent commit to
add it. I dropped it since I was moving schedule() into C code,
but I suspect we'll see the same problem that Catalin saw with
CONFIG_TRACE_IRQFLAGS without it. I'll copy the arch/arm approach
and add a trace_hardirqs_off() at the top of do_notify_resume().

I copied the trace_hardirqs_on call from Mark Rutland's earlier patch:

http://permalink.gmane.org/gmane.linux.ports.arm.kernel/467781

I don't know if it's necessary to flag that interrupts are enabled
prior to returning to userspace; it may well not be. Mark, can you
comment on what led you to add that trace_hardirqs_on?

For now I've left both of them in there.
Post by Will Deacon
Post by Chris Metcalf
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index e18c48cb6db1..3432e14b7d6e 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -402,15 +402,29 @@ static void do_signal(struct pt_regs *regs)
asmlinkage void do_notify_resume(struct pt_regs *regs,
unsigned int thread_flags)
{
- if (thread_flags & _TIF_SIGPENDING)
- do_signal(regs);
+ while (true) {
[...]
+ }
This might be easier to read as a do { ... } while.
Yes, and in fact that's how I did it for arch/tile, as the maintainer.
I picked up the arch/x86 version as more canonical to copy. But I'm
more than happy to do it the other way :-). Fixed.
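For reference, the do { ... } while form Will suggests would look roughly like this (a sketch of the same loop from the patch above, not a tested change):

```c
asmlinkage void do_notify_resume(struct pt_regs *regs,
				 unsigned int thread_flags)
{
	do {
		if (thread_flags & _TIF_NEED_RESCHED) {
			schedule();
		} else {
			local_irq_enable();

			if (thread_flags & _TIF_SIGPENDING)
				do_signal(regs);

			if (thread_flags & _TIF_NOTIFY_RESUME) {
				clear_thread_flag(TIF_NOTIFY_RESUME);
				tracehook_notify_resume(regs);
			}

			if (thread_flags & _TIF_FOREIGN_FPSTATE)
				fpsimd_restore_current_state();
		}

		local_irq_disable();
		thread_flags = READ_ONCE(current_thread_info()->flags);
	} while (thread_flags & _TIF_WORK_MASK);
}
```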
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
Mark Rutland
2016-03-14 10:29:52 UTC
Hi,
Post by Chris Metcalf
Post by Will Deacon
Hi Chris,
Post by Chris Metcalf
Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
state machine that can be difficult to reason about due to duplicated
code and a large number of branch targets.
This patch factors the common logic out into the existing
do_notify_resume function, converting the code to C in the process,
making the code more legible.
This patch tries to closely mirror the existing behaviour while using
the usual C control flow primitives. As local_irq_{disable,enable} may
be instrumented, we balance exception entry (where we will most
likely enable IRQs) with a call to trace_hardirqs_on just before the
return to userspace.
[...]
Post by Chris Metcalf
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 1f7f5a2b61bf..966d0d4308f2 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
* Ok, we need to do extra processing, enter the slow path.
*/
- tbnz x1, #TIF_NEED_RESCHED, work_resched
- /* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
mov x0, sp // 'regs'
- enable_irq // enable interrupts for do_notify_resume()
bl do_notify_resume
- b ret_to_user
#ifdef CONFIG_TRACE_IRQFLAGS
- bl trace_hardirqs_off // the IRQs are off here, inform the tracing code
+ bl trace_hardirqs_on // enabled while in userspace
This doesn't look right to me. We only get here after running
do_notify_resume, which returns with interrupts disabled.
Do we not instead need to inform the tracing code that interrupts are
disabled prior to calling do_notify_resume?
I think you are right about the trace_hardirqs_off prior to
calling into do_notify_resume, given Catalin's recent commit to
add it. I dropped it since I was moving schedule() into C code,
but I suspect we'll see the same problem that Catalin saw with
CONFIG_TRACE_IRQFLAGS without it. I'll copy the arch/arm approach
and add a trace_hardirqs_off() at the top of do_notify_resume().
http://permalink.gmane.org/gmane.linux.ports.arm.kernel/467781
I don't know if it's necessary to flag that interrupts are enabled
prior to returning to userspace; it may well not be. Mark, can you
comment on what led you to add that trace_hardirqs_on?
From what I recall, we didn't properly trace enabling IRQs in all the
asm entry paths from userspace, and doing this made things appear
balanced to the tracing code (as the existing behaviour of masking IRQs
in assembly did).

It was more expedient / simpler than fixing all the entry assembly to
update the IRQ tracing state correctly, which I had expected to rework
if/when moving the rest to C.

Thanks,
Mark.

Chris Metcalf
2016-03-02 20:10:57 UTC
In do_notify_resume(), call task_isolation_ready() when we are
checking the thread-info flags; and after we've handled the other
work, call task_isolation_enter() unconditionally. To ensure we
always call task_isolation_enter() when returning to userspace,
modify _TIF_WORK_MASK to be _TIF_NOHZ (which is set in every task)
when we build with TASK_ISOLATION configured.

We tweak syscall_trace_enter() slightly to carry the "flags"
value from current_thread_info()->flags for each of the tests,
rather than doing a volatile read from memory for each one. This
avoids a small overhead for each test, and in particular avoids
that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_cross_call() routine so that it checks for
isolated tasks and generates a suitable warning if we are about
to disturb one of them in strict or debug mode.

Finally, add an explicit check for STRICT mode in do_mem_abort()
to handle the case of page faults.

Signed-off-by: Chris Metcalf <***@ezchip.com>
---
arch/arm64/include/asm/thread_info.h | 8 +++++++-
arch/arm64/kernel/ptrace.c | 12 +++++++++---
arch/arm64/kernel/signal.c | 6 +++++-
arch/arm64/kernel/smp.c | 2 ++
arch/arm64/mm/fault.c | 4 ++++
5 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index abd64bd1f6d9..89c72888cb54 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -131,9 +131,15 @@ static inline struct thread_info *current_thread_info(void)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_32BIT (1 << TIF_32BIT)

-#define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
+#define _TIF_WORK_LOOP_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE)

+#ifdef CONFIG_TASK_ISOLATION
+# define _TIF_WORK_MASK _TIF_NOHZ /* always set */
+#else
+# define _TIF_WORK_MASK _TIF_WORK_LOOP_MASK
+#endif
+
#define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
_TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
_TIF_NOHZ)
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index ff7f13239515..43aa6d016f46 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
#include <linux/regset.h>
#include <linux/tracehook.h>
#include <linux/elf.h>
+#include <linux/isolation.h>

#include <asm/compat.h>
#include <asm/debug-monitors.h>
@@ -1246,14 +1247,19 @@ static void tracehook_report_syscall(struct pt_regs *regs,

asmlinkage int syscall_trace_enter(struct pt_regs *regs)
{
- /* Do the secure computing check first; failures should be fast. */
+ unsigned long work = ACCESS_ONCE(current_thread_info()->flags);
+
+ if ((work & _TIF_NOHZ) && task_isolation_check_syscall(regs->syscallno))
+ return -1;
+
+ /* Do the secure computing check early; failures should be fast. */
if (secure_computing() == -1)
return -1;

- if (test_thread_flag(TIF_SYSCALL_TRACE))
+ if (work & _TIF_SYSCALL_TRACE)
tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);

- if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+ if (work & _TIF_SYSCALL_TRACEPOINT)
trace_sys_enter(regs, regs->syscallno);

audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 3432e14b7d6e..53fcd6c305d6 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -25,6 +25,7 @@
#include <linux/uaccess.h>
#include <linux/tracehook.h>
#include <linux/ratelimit.h>
+#include <linux/isolation.h>

#include <asm/debug-monitors.h>
#include <asm/elf.h>
@@ -419,12 +420,15 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,

if (thread_flags & _TIF_FOREIGN_FPSTATE)
fpsimd_restore_current_state();
+
+ task_isolation_enter();
}

local_irq_disable();

thread_flags = READ_ONCE(current_thread_info()->flags);
- if (!(thread_flags & _TIF_WORK_MASK))
+ if (!(thread_flags & _TIF_WORK_LOOP_MASK) &&
+ task_isolation_ready())
break;
}
}
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index b1adc51b2c2e..dcb3282d04a2 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -37,6 +37,7 @@
#include <linux/completion.h>
#include <linux/of.h>
#include <linux/irq_work.h>
+#include <linux/isolation.h>

#include <asm/alternative.h>
#include <asm/atomic.h>
@@ -632,6 +633,7 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
{
trace_ipi_raise(target, ipi_types[ipinr]);
+ task_isolation_debug_cpumask(target);
__smp_cross_call(target, ipinr);
}

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index abe2a9542b3a..644cd634dd1d 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -29,6 +29,7 @@
#include <linux/sched.h>
#include <linux/highmem.h>
#include <linux/perf_event.h>
+#include <linux/isolation.h>

#include <asm/cpufeature.h>
#include <asm/exception.h>
@@ -473,6 +474,9 @@ asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
const struct fault_info *inf = fault_info + (esr & 63);
struct siginfo info;

+ if (user_mode(regs))
+ task_isolation_check_exception("%s at %#lx", inf->name, addr);
+
if (!inf->fn(addr, esr, regs))
return;
--
2.1.2
Chris Metcalf
2016-03-02 20:11:45 UTC
The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode. Such processes should get no
interrupts from the kernel, and if they do, when this boot flag is
specified a kernel stack dump on the console is generated.

It's possible to use ftrace to simply detect whether a task_isolation
core has unexpectedly entered the kernel. But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code what remote core and context
is preparing to deliver an interrupt to a task_isolation core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
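From userspace, the intended usage with this series applied would look roughly as follows. Note that PR_SET_TASK_ISOLATION and its flag values come from this patch series, not mainline; the numeric values below are placeholders for illustration only:

```c
#include <stdio.h>
#include <sys/prctl.h>

/* Proposed by this series; not in mainline <linux/prctl.h>.
 * The values here are illustrative placeholders. */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	1
#define PR_TASK_ISOLATION_STRICT	2
#endif

int main(void)
{
	/* The caller must already be affinitized to a task_isolation core
	 * (see the task_isolation= boot argument). */
	if (prctl(PR_SET_TASK_ISOLATION,
		  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT,
		  0, 0, 0) < 0) {
		perror("prctl(PR_SET_TASK_ISOLATION)");
		return 1;
	}
	/* ... enter the isolated busy loop; with task_isolation_debug
	 * booted, any kernel interruption of this task now generates a
	 * console backtrace. */
	return 0;
}
```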

Signed-off-by: Chris Metcalf <***@ezchip.com>
---
Documentation/kernel-parameters.txt | 8 ++++
include/linux/context_tracking_state.h | 6 +++
include/linux/isolation.h | 5 +++
kernel/irq_work.c | 5 ++-
kernel/isolation.c | 77 ++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 18 ++++++++
kernel/signal.c | 5 +++
kernel/smp.c | 6 ++-
kernel/softirq.c | 33 +++++++++++++++
9 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index c8d0b42d984a..ea0434fa906e 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3755,6 +3755,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
also sets up nohz_full and isolcpus mode for the
listed set of cpus.

+ task_isolation_debug [KNL]
+ In kernels built with CONFIG_TASK_ISOLATION
+ and booted in task_isolation= mode, this
+ setting will generate console backtraces when
+ the kernel is about to interrupt a task that
+ has requested PR_TASK_ISOLATION_ENABLE and is
+ running on a task_isolation core.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 1d34fe68f48a..4e2c4b900b82 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,8 +39,14 @@ static inline bool context_tracking_in_user(void)
{
return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
}
+
+static inline bool context_tracking_cpu_in_user(int cpu)
+{
+ return per_cpu(context_tracking.state, cpu) == CONTEXT_USER;
+}
#else
static inline bool context_tracking_in_user(void) { return false; }
+static inline bool context_tracking_cpu_in_user(int cpu) { return false; }
static inline bool context_tracking_active(void) { return false; }
static inline bool context_tracking_is_enabled(void) { return false; }
static inline bool context_tracking_cpu_is_enabled(void) { return false; }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index ba6c4d510db8..f1ae7b663746 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -45,6 +45,9 @@ static inline void task_isolation_enter(void)
extern bool task_isolation_syscall(int nr);
extern void task_isolation_exception(const char *fmt, ...);
extern void task_isolation_interrupt(struct task_struct *, const char *buf);
+extern void task_isolation_debug(int cpu);
+extern void task_isolation_debug_cpumask(const struct cpumask *);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p);

static inline bool task_isolation_strict(void)
{
@@ -73,6 +76,8 @@ static inline bool task_isolation_ready(void) { return true; }
static inline void task_isolation_enter(void) { }
static inline bool task_isolation_check_syscall(int nr) { return false; }
static inline void task_isolation_check_exception(const char *fmt, ...) { }
+static inline void task_isolation_debug(int cpu) { }
+#define task_isolation_debug_cpumask(mask) do {} while (0)
#endif

#endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..a9b95ce00667 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/smp.h>
+#include <linux/isolation.h>
#include <asm/processor.h>


@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ task_isolation_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 5621fdf15b17..6ac35988b49a 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
#include <linux/vmstat.h>
#include <linux/isolation.h>
#include <linux/syscalls.h>
+#include <linux/ratelimit.h>
#include <asm/unistd.h>
#include "time/tick-sched.h"

@@ -182,3 +183,79 @@ bool task_isolation_syscall(int syscall)
task_isolation_exception("syscall %d", syscall);
return true;
}
+
+/* Enable debugging of any interrupts of task_isolation cores. */
+static int task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+ task_isolation_debug_flag = true;
+ return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug_task(int cpu, struct task_struct *p)
+{
+ static DEFINE_RATELIMIT_STATE(console_output, HZ, 1);
+ bool force_debug = false;
+
+ /*
+ * Our caller made sure the task was running on a task isolation
+ * core, but make sure the task has enabled isolation.
+ */
+ if (!(p->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+ return;
+
+ /*
+ * Ensure the task is actually in userspace; if it is in kernel
+ * mode, it is expected that it may receive interrupts, and in
+ * any case they don't affect the isolation. Note that there
+ * is a race condition here as a task may have committed
+ * to returning to user space but not yet set the context
+ * tracking state to reflect it, and the check here is before
+ * we trigger the interrupt, so we might fail to warn about a
+ * legitimate interrupt. However, the race window is narrow
+ * and hitting it does not cause any incorrect behavior other
+ * than failing to send the warning.
+ */
+ if (!context_tracking_cpu_in_user(cpu))
+ return;
+
+ /*
+ * If the task was in strict mode, deliver a signal to it.
+ * We disable task isolation mode when we deliver a signal
+ * so we won't end up recursing back here again.
+ * If we are in an NMI, we don't try delivering the signal
+ * and instead just treat it as if "debug" mode was enabled,
+ * since that's pretty much all we can do.
+ */
+ if (p->task_isolation_flags & PR_TASK_ISOLATION_STRICT) {
+ if (in_nmi())
+ force_debug = true;
+ else
+ task_isolation_interrupt(p, "interrupt");
+ }
+
+ /*
+ * If (for example) the timer interrupt starts ticking
+ * unexpectedly, we will get an unmanageable flow of output,
+ * so limit to one backtrace per second.
+ */
+ if (force_debug ||
+ (task_isolation_debug_flag && __ratelimit(&console_output))) {
+ pr_err("Interrupt detected for task_isolation cpu %d, %s/%d\n",
+ cpu, p->comm, p->pid);
+ dump_stack();
+ }
+}
+
+void task_isolation_debug_cpumask(const struct cpumask *mask)
+{
+ int cpu, thiscpu = get_cpu();
+
+ /* No need to report on this cpu since we're already in the kernel. */
+ for_each_cpu(cpu, mask)
+ if (cpu != thiscpu)
+ task_isolation_debug(cpu);
+
+ put_cpu();
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9503d590e5ef..f53e1417aa75 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
#include <linux/binfmts.h>
#include <linux/context_tracking.h>
#include <linux/compiler.h>
+#include <linux/isolation.h>

#include <asm/switch_to.h>
#include <asm/tlb.h>
@@ -746,6 +747,23 @@ bool sched_can_stop_tick(void)
}
#endif /* CONFIG_NO_HZ_FULL */

+#ifdef CONFIG_TASK_ISOLATION
+void task_isolation_debug(int cpu)
+{
+ struct task_struct *p;
+
+ if (!task_isolation_possible(cpu))
+ return;
+
+ rcu_read_lock();
+ p = cpu_curr(cpu);
+ get_task_struct(p);
+ rcu_read_unlock();
+ task_isolation_debug_task(cpu, p);
+ put_task_struct(p);
+}
+#endif
+
void sched_avg_update(struct rq *rq)
{
s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 0508544c8ced..b35ce19c01c4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -638,6 +638,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+#ifdef CONFIG_TASK_ISOLATION
+ /* If the task is being killed, don't complain about task_isolation. */
+ if (state & TASK_WAKEKILL)
+ t->task_isolation_flags = 0;
+#endif
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index d903c02223af..a61894409645 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
+#include <linux/isolation.h>

#include "smpboot.h"

@@ -178,8 +179,10 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
- if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+ if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+ task_isolation_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+ }

return 0;
}
@@ -457,6 +460,7 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ task_isolation_debug_cpumask(cfd->cpumask);
arch_send_call_function_ipi_mask(cfd->cpumask);

if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..f249b71cddf4 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
#include <linux/smpboot.h>
#include <linux/tick.h>
#include <linux/irq.h>
+#include <linux/isolation.h>

#define CREATE_TRACE_POINTS
#include <trace/events/irq.h>
@@ -319,6 +320,37 @@ asmlinkage __visible void do_softirq(void)
local_irq_restore(flags);
}

+/* Determine whether this IRQ is something task isolation cares about. */
+static void task_isolation_irq(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+ struct pt_regs *regs;
+
+ if (!context_tracking_cpu_is_enabled())
+ return;
+
+ /*
+ * We have not yet called __irq_enter() and so we haven't
+ * adjusted the hardirq count. This test will allow us to
+ * avoid false positives for nested IRQs.
+ */
+ if (in_interrupt())
+ return;
+
+ /*
+ * If we were already in the kernel, not from an irq but from
+ * a syscall or synchronous exception/fault, this test should
+ * avoid a false positive as well. Note that this requires
+ * architecture support for calling set_irq_regs() prior to
+ * calling irq_enter(), and if it's not done consistently, we
+ * will not consistently avoid false positives here.
+ */
+ regs = get_irq_regs();
+ if (regs && user_mode(regs))
+ task_isolation_debug(smp_processor_id());
+#endif
+}
+
/*
* Enter an interrupt context.
*/
@@ -335,6 +367,7 @@ void irq_enter(void)
_local_bh_enable();
}

+ task_isolation_irq();
__irq_enter();
}
--
2.1.2
Peter Zijlstra
2016-03-02 20:38:01 UTC
Post by Chris Metcalf
+void task_isolation_debug(int cpu)
+{
+ struct task_struct *p;
+
+ if (!task_isolation_possible(cpu))
+ return;
+
+ rcu_read_lock();
+ p = cpu_curr(cpu);
+ get_task_struct(p);
As I think Oleg keeps reminding me, this is not actually a safe thing to
do.
Post by Chris Metcalf
+ rcu_read_unlock();
+ task_isolation_debug_task(cpu, p);
+ put_task_struct(p);
+}
Chris Metcalf
2016-03-02 20:56:45 UTC
Post by Peter Zijlstra
Post by Chris Metcalf
+void task_isolation_debug(int cpu)
+{
+ struct task_struct *p;
+
+ if (!task_isolation_possible(cpu))
+ return;
+
+ rcu_read_lock();
+ p = cpu_curr(cpu);
+ get_task_struct(p);
As I think Oleg keeps reminding me, this is not actually a safe thing to
do.
So what's the right solution? The fast path in task_isolation_debug_task basically
just uses the new "task_isolation_flags", and "pid" and "comm". I would think those
would all have to be safe because of the get_task_struct().

The piece that might be problematic is the eventual call to send_sig_info() using the
task_struct pointer (called via task_isolation_debug_task -> task_isolation_interrupt).
Clearly this is safe at some level, since that's more or less what sys_kill() does, and the
process could similarly evaporate halfway through sending the signal.

Suggestions? Thanks!
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Chris Metcalf
2016-03-02 20:12:36 UTC
This per-cpu check was being done in the loop in lru_add_drain_all(),
but having it be callable for a particular cpu is helpful for the
task-isolation patches.

Signed-off-by: Chris Metcalf <***@ezchip.com>
---
include/linux/swap.h | 1 +
mm/swap.c | 15 ++++++++++-----
2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d18b65c53dbb..da21f5240702 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -304,6 +304,7 @@ extern void activate_page(struct page *);
extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
+extern bool lru_add_drain_needed(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_file_page(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 09fe5e97714a..bdcdfa21094c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -653,6 +653,15 @@ void deactivate_page(struct page *page)
}
}

+bool lru_add_drain_needed(int cpu)
+{
+ return (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+ pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+ need_activate_page_drain(cpu));
+}
+
void lru_add_drain(void)
{
lru_add_drain_cpu(get_cpu());
@@ -679,11 +688,7 @@ void lru_add_drain_all(void)
for_each_online_cpu(cpu) {
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

- if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
- pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
- pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
- pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
- need_activate_page_drain(cpu)) {
+ if (lru_add_drain_needed(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
schedule_work_on(cpu, work);
cpumask_set_cpu(cpu, &has_work);
--
2.1.2
Chris Metcalf
2016-03-02 20:12:50 UTC
This option, similar to NO_HZ_FULL_ALL, simplifies configuring
a system to boot by default with all cores except the boot core
running in task isolation mode.
---
init/Kconfig | 10 ++++++++++
kernel/isolation.c | 6 ++++++
2 files changed, 16 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 6cab348fe454..314b09347fba 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -802,6 +802,16 @@ config TASK_ISOLATION
You should say "N" unless you are intending to run a
high-performance userspace driver or similar task.

+config TASK_ISOLATION_ALL
+ bool "Provide task isolation on all CPUs by default (except CPU 0)"
+ depends on TASK_ISOLATION
+ help
+ If the user doesn't pass the task_isolation boot option to
+ define the range of task isolation CPUs, consider that all
+ CPUs in the system are task isolation by default.
+ Note the boot CPU will still be kept outside the range to
+ handle timekeeping duty, etc.
+
config BUILD_BIN2C
bool
default n
diff --git a/kernel/isolation.c b/kernel/isolation.c
index e954afd8cce8..42ad7a746a1e 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -40,8 +40,14 @@ int __init task_isolation_init(void)
{
/* For offstack cpumask, ensure we allocate an empty cpumask early. */
if (!saw_boot_arg) {
+#ifdef CONFIG_TASK_ISOLATION_ALL
+ alloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+ cpumask_copy(task_isolation_map, cpu_possible_mask);
+ cpumask_clear_cpu(smp_processor_id(), task_isolation_map);
+#else
zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
return 0;
+#endif
}

/*
--
2.1.2
Andi Kleen
2016-03-03 18:34:25 UTC
Post by Chris Metcalf
+config TASK_ISOLATION_ALL
+ bool "Provide task isolation on all CPUs by default (except CPU 0)"
+ depends on TASK_ISOLATION
+ help
+ If the user doesn't pass the task_isolation boot option to
+ define the range of task isolation CPUs, consider that all
+ CPUs in the system are task isolation by default.
+ Note the boot CPU will still be kept outside the range to
+ handle timekeeping duty, etc.
That seems like a very dangerous Kconfig option.
"CONFIG_BREAK_EVERYTHING"
If someone sets that by default they will have a lot of trouble.

I wouldn't add that, make it a run time option only.

-Andi
--
***@linux.intel.com -- Speaking for myself only
Chris Metcalf
2016-03-03 19:40:44 UTC
Post by Andi Kleen
Post by Chris Metcalf
+config TASK_ISOLATION_ALL
+ bool "Provide task isolation on all CPUs by default (except CPU 0)"
+ depends on TASK_ISOLATION
+ help
+ If the user doesn't pass the task_isolation boot option to
+ define the range of task isolation CPUs, consider that all
+ CPUs in the system are task isolation by default.
+ Note the boot CPU will still be kept outside the range to
+ handle timekeeping duty, etc.
That seems like a very dangerous Kconfig option.
"CONFIG_BREAK_EVERYTHING"
If someone sets that by default they will have a lot of trouble.
I wouldn't add that, make it a run time option only.
So you were thinking, allow a special boot syntax "task_isolation=all",
which puts all the cores into task isolation mode except the boot core?

My original argument was that it was so parallel to the existing
CONFIG_NO_HZ_FULL_ALL option that it just made sense to do it,
and some testers complained about having to specify the precise
cpu range, so this seemed like an easy fix.

The commit comment for NO_HZ_FULL_ALL (f98823ac758ba) reads:

nohz: New option to default all CPUs in full dynticks range

Provide a new kernel config that defaults all CPUs to be part
of the full dynticks range, except the boot one for timekeeping.

This default setting is overriden by the nohz_full= boot option
if passed by the user.

This is helpful for those who don't need a finegrained range
of full dynticks CPU and also for automated testing.

The same arguments would seem to apply to TASK_ISOLATION_ALL;
note that applications don't actually go into task isolation mode
without issuing the appropriate prctl(), so it shouldn't be too
dangerous if users enable it by mistake. There will be some
extra checks at kernel entry and exit, that's all.

So on balance it still seems like a reasonable choice. Thoughts?
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Andi Kleen
2016-03-03 20:04:55 UTC
Post by Chris Metcalf
The same arguments would seem to apply to TASK_ISOLATION_ALL;
note that applications don't actually go into task isolation mode
without issuing the appropriate prctl(), so it shouldn't be too
That's a fair point. If it's entirely opt-in it's probably ok.

-Andi
Ingo Molnar
2016-03-05 12:32:19 UTC
Post by Chris Metcalf
Post by Andi Kleen
Post by Chris Metcalf
+config TASK_ISOLATION_ALL
+ bool "Provide task isolation on all CPUs by default (except CPU 0)"
+ depends on TASK_ISOLATION
+ help
+ If the user doesn't pass the task_isolation boot option to
+ define the range of task isolation CPUs, consider that all
+ CPUs in the system are task isolation by default.
+ Note the boot CPU will still be kept outside the range to
+ handle timekeeping duty, etc.
That seems like a very dangerous Kconfig option.
"CONFIG_BREAK_EVERYTHING"
If someone sets that by default they will have a lot of trouble.
I wouldn't add that, make it a run time option only.
So you were thinking, allow a special boot syntax "task_isolation=all",
which puts all the cores into task isolation mode except the boot core?
My original argument was that it was so parallel to the existing
CONFIG_NO_HZ_FULL_ALL option that it just made sense to do it,
and some testers complained about having to specify the precise
cpu range, so this seemed like an easy fix.
Yes, it's absolutely legitimate to offer boot options as Kconfig options as well -
in fact that will let things like randconfig bootups stumble upon them and do some
free testing for you. Just ignore Andi's nonsensical objection.

One day we'll have a unified boot parameter/Kconfig/sysctl mechanism, so that it
will be possible to say things like this on the boot command line:

CONFIG_NO_HZ_FULL_ALL=y

.. which will eliminate quite a bit of the current schism between Kconfig and
boot-time parameters.

Thanks,

Ingo
Chris Metcalf
2016-03-02 20:26:09 UTC
This function checks that no vmstat worker is running and that the
vmstat diffs don't require an update. The function is called from
the task-isolation code to see if we need to actually do some work
to quiet vmstat.

Acked-by: Christoph Lameter <***@linux.com>
Signed-off-by: Chris Metcalf <***@ezchip.com>
---
include/linux/vmstat.h | 2 ++
mm/vmstat.c | 12 ++++++++++++
2 files changed, 14 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 43b2f1c33266..504ebd1fdf33 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -191,6 +191,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);

void quiet_vmstat(void);
void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
void cpu_vm_stats_fold(int cpu);
void refresh_zone_stat_thresholds(void);

@@ -253,6 +254,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
static inline void cpu_vm_stats_fold(int cpu) { }
static inline void quiet_vmstat(void) { }
static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }

static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ff569246fcb5..e647628e84c0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1480,6 +1480,18 @@ void quiet_vmstat_sync(void)
}

/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+ int cpu = smp_processor_id();
+ return cpumask_test_cpu(cpu, cpu_stat_off) &&
+ !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+ !need_update(cpu);
+}
+
+/*
* Shepherd worker thread that checks the
* differentials of processors that have their worker
* threads for vm statistics updates disabled because of
--
2.1.2
Chris Metcalf
2016-03-02 20:26:31 UTC
We add the necessary call to task_isolation_enter() in the
prepare_exit_to_usermode() routine. We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

We add calls to task_isolation_check_exception() in places
where exceptions may not generate signals to the application.

Signed-off-by: Chris Metcalf <***@ezchip.com>
---
arch/tile/kernel/process.c | 6 +++++-
arch/tile/kernel/ptrace.c | 6 ++++++
arch/tile/kernel/single_step.c | 5 +++++
arch/tile/kernel/smp.c | 28 ++++++++++++++++------------
arch/tile/kernel/unaligned.c | 3 +++
arch/tile/mm/fault.c | 3 +++
arch/tile/mm/homecache.c | 2 ++
7 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 6594df5fed53..8f739f78780c 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -29,6 +29,7 @@
#include <linux/signal.h>
#include <linux/delay.h>
#include <linux/context_tracking.h>
+#include <linux/isolation.h>
#include <asm/stack.h>
#include <asm/switch_to.h>
#include <asm/homecache.h>
@@ -495,10 +496,13 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
tracehook_notify_resume(regs);
}

+ task_isolation_enter();
+
local_irq_disable();
thread_info_flags = READ_ONCE(current_thread_info()->flags);

- } while (thread_info_flags & _TIF_WORK_MASK);
+ } while ((thread_info_flags & _TIF_WORK_MASK) ||
+ !task_isolation_ready());

if (thread_info_flags & _TIF_SINGLESTEP) {
single_step_once(regs);
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index 54e7b723db99..f76f2d8b8923 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -23,6 +23,7 @@
#include <linux/elf.h>
#include <linux/tracehook.h>
#include <linux/context_tracking.h>
+#include <linux/isolation.h>
#include <asm/traps.h>
#include <arch/chip.h>

@@ -255,6 +256,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
{
u32 work = ACCESS_ONCE(current_thread_info()->flags);

+ if (work & _TIF_NOHZ) {
+ if (task_isolation_check_syscall(regs->regs[TREG_SYSCALL_NR]))
+ return -1;
+ }
+
if (secure_computing() == -1)
return -1;

diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index 862973074bf9..ba01eacde7a3 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,6 +23,7 @@
#include <linux/types.h>
#include <linux/err.h>
#include <linux/prctl.h>
+#include <linux/isolation.h>
#include <asm/cacheflush.h>
#include <asm/traps.h>
#include <asm/uaccess.h>
@@ -320,6 +321,8 @@ void single_step_once(struct pt_regs *regs)
int size = 0, sign_ext = 0; /* happy compiler */
int align_ctl;

+ task_isolation_check_exception("single step at %#lx", regs->pc);
+
align_ctl = unaligned_fixup;
switch (task_thread_info(current)->align_ctl) {
case PR_UNALIGN_NOPRINT:
@@ -767,6 +770,8 @@ void single_step_once(struct pt_regs *regs)
unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
unsigned long control = __insn_mfspr(SPR_SINGLE_STEP_CONTROL_K);

+ task_isolation_check_exception("single step at %#lx", regs->pc);
+
*ss_pc = regs->pc;
control |= SPR_SINGLE_STEP_CONTROL_1__CANCELED_MASK;
control |= SPR_SINGLE_STEP_CONTROL_1__INHIBIT_MASK;
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index 07e3ff5cc740..da1eb240fc57 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -20,6 +20,7 @@
#include <linux/irq.h>
#include <linux/irq_work.h>
#include <linux/module.h>
+#include <linux/isolation.h>
#include <asm/cacheflush.h>
#include <asm/homecache.h>

@@ -67,6 +68,7 @@ void send_IPI_single(int cpu, int tag)
.y = cpu / smp_width,
.x = cpu % smp_width,
.state = HV_TO_BE_SENT
};
+ task_isolation_debug(cpu);
__send_IPI_many(&recip, 1, tag);
}

@@ -84,6 +86,7 @@ void send_IPI_many(const struct cpumask *mask, int tag)
r->y = cpu / smp_width;
r->x = cpu % smp_width;
r->state = HV_TO_BE_SENT;
}
+ task_isolation_debug_cpumask(mask);
__send_IPI_many(recip, nrecip, tag);
}

@@ -181,10 +184,11 @@ void flush_icache_range(unsigned long start, unsigned long end)
struct ipi_flush flush = { start, end };

/* If invoked with irqs disabled, we can not issue IPIs. */
- if (irqs_disabled())
+ if (irqs_disabled()) {
+ task_isolation_debug_cpumask(task_isolation_map);
flush_remote(0, HV_FLUSH_EVICT_L1I, NULL, 0, 0, 0,
NULL, NULL, 0);
- else {
+ } else {
preempt_disable();
on_each_cpu(ipi_flush_icache_range, &flush, 1);
preempt_enable();
@@ -258,10 +262,8 @@ void __init ipi_init(void)

#if CHIP_HAS_IPI()

-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
{
- WARN_ON(cpu_is_offline(cpu));
-
/*
* We just want to do an MMIO store. The traditional writeq()
* functions aren't really correct here, since they're always
@@ -273,15 +275,17 @@ void smp_send_reschedule(int cpu)

#else

-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
{
- HV_Coord coord;
-
- WARN_ON(cpu_is_offline(cpu));
-
- coord.y = cpu_y(cpu);
- coord.x = cpu_x(cpu);
+ HV_Coord coord = { .y = cpu_y(cpu), .x = cpu_x(cpu) };
hv_trigger_ipi(coord, IRQ_RESCHEDULE);
}

#endif /* CHIP_HAS_IPI() */
+
+void smp_send_reschedule(int cpu)
+{
+ WARN_ON(cpu_is_offline(cpu));
+ task_isolation_debug(cpu);
+ __smp_send_reschedule(cpu);
+}
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index 0db5f7c9d9e5..b1e229a1ff62 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -25,6 +25,7 @@
#include <linux/module.h>
#include <linux/compat.h>
#include <linux/prctl.h>
+#include <linux/isolation.h>
#include <asm/cacheflush.h>
#include <asm/traps.h>
#include <asm/uaccess.h>
@@ -1545,6 +1546,8 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
return;
}

+ task_isolation_check_exception("unaligned JIT at %#lx", regs->pc);
+
if (!info->unalign_jit_base) {
void __user *user_page;

diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 26734214818c..1dee18d3ffbd 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -35,6 +35,7 @@
#include <linux/syscalls.h>
#include <linux/uaccess.h>
#include <linux/kdebug.h>
+#include <linux/isolation.h>

#include <asm/pgalloc.h>
#include <asm/sections.h>
@@ -844,6 +845,8 @@ static inline void __do_page_fault(struct pt_regs *regs, int fault_num,
void do_page_fault(struct pt_regs *regs, int fault_num,
unsigned long address, unsigned long write)
{
+ task_isolation_check_exception("page fault interrupt %d at %#lx (%#lx)",
+ fault_num, regs->pc, address);
__do_page_fault(regs, fault_num, address, write);
}

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..e044e8dd8372 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/isolation.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -83,6 +84,7 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
* Don't bother to update atomically; losing a count
* here is not that critical.
*/
+ task_isolation_debug_cpumask(&mask);
for_each_cpu(cpu, &mask)
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
}
--
2.1.2