[v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
Joseph Salisbury
2016-10-07 19:38:38 UTC
Hello Peter,

A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit resolved this bug:

commit 3d30544f02120b884bba2a9466c87dba980e3be5
Author: Peter Zijlstra <***@infradead.org>
Date: Tue Jun 21 14:27:50 2016 +0200

sched/fair: Apply more PELT fixes

The regression was introduced as of v4.8-rc1. The bug can be reproduced
on an X1 Carbon with the following:
stress -c $your_total_cpu_cores

I was hoping to get your feedback, since you are the patch author. Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?


Thanks,

Joe


[0] http://pad.lv/1627108
Linus Torvalds
2016-10-07 19:58:58 UTC
On Fri, Oct 7, 2016 at 12:38 PM, Joseph Salisbury
Post by Joseph Salisbury
A kernel bug report was opened against Ubuntu [0]. After a kernel
Hmm. Interesting, and it sounds like we should revert that unless
somebody figures out *why* following the rules wrt cfs updates causes
problems. But I also wonder what the Ubuntu kernel config is. Does
Ubuntu enable CONFIG_SCHED_AUTOGROUP=y, for example? Because
regardless of any other scheduler issues, autogrouping *should* mean
that when you run some CPU hog in one session, the scheduler should
still balance CPU time with other sessions.

I'm not seeing anything odd on my xps13, which should have a similar
CPU to the X1 Carbon.

Linus
Joseph Salisbury
2016-10-07 20:22:38 UTC
Post by Linus Torvalds
On Fri, Oct 7, 2016 at 12:38 PM, Joseph Salisbury
Post by Joseph Salisbury
A kernel bug report was opened against Ubuntu [0]. After a kernel
Hmm. Interesting, and it sounds like we should revert that unless
somebody figures out *why* following the rules wrt cfq updates causes
problems. But I also wonder what the Ubuntu kernel config is. Does
Ubuntu enable CONFIG_SCHED_AUTOGROUP=y, for example? Because
regardless of any other scheduler issues, autogrouping *should* mean
that when you run some CPU hogger in one session, that should still
balance all CPU time with other sessions..
I'm not seeing anything odd on my xps13, which should have a similar
CPU to the X1 Carbon.
Linus
Hi Linus,

Yes, CONFIG_SCHED_AUTOGROUP is enabled in the Ubuntu kernel. However,
that config was also enabled in the Ubuntu 4.4 kernels without seeing
this issue. I can try disabling the config in the 4.8-based kernel and
see if that changes things.

Thanks,

Joe
Linus Torvalds
2016-10-07 20:37:36 UTC
On Fri, Oct 7, 2016 at 1:22 PM, Joseph Salisbury
Post by Joseph Salisbury
Yes, CONFIG_SCHED_AUTOGROUP is enabled in the Ubuntu kernel. However,
that config was also enabled in the Ubuntu 4.4 kernels without seeing
this issue. I can try disabling the config in the 4.8-based kernel and
see if that changes things.
No, that wouldn't make any sense. I just wanted to know that the
option was enabled, because that option really *should* help buffer
other processes from one session that is a CPU hog.

So something is seriously wrong in that situation if other things get
very choppy. Of course, the fact that it apparently happens on one
particular machine only means that it's hard to figure out what
triggers it. Maybe some unlucky combination of cpufreq and thermal
throttling by the hardware, coupled with the scheduler change.

Peter?

Linus
Peter Zijlstra
2016-10-08 08:00:43 UTC
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
That commit doesn't revert cleanly; did you take out more?
Post by Joseph Salisbury
The regression was introduced as of v4.8-rc1. The bug can be reproduced
stress -c $your_total_cpu_cores
I was hoping to get your feedback, since you are the patch author. Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?
I'll try and have a look asap, but I'm traveling next week so it might be a
tad slower than normal.

If you could provide a /proc/sched_debug dump while the thing is running,
that would be useful.

Thanks!
Ingo Molnar
2016-10-08 08:40:43 UTC
Post by Peter Zijlstra
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
That commit doesn't revert cleanly, did you take out more?
Note that it reverts cleanly from v4.8 - while it does not revert cleanly
from current upstream, which has more changes in that area.

I suspect Joseph tested a v4.8-ish kernel.
Post by Peter Zijlstra
I'll try and have a look asap, but I'm traveling next week so it might a
tad slower than normal.
If you could provide a /proc/sched_debug dump while the thing is running
that'd might be useful.
Also, running the latest scheduler bits would be useful:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

... just on the off chance that another change in this area fixed the bug, plus to
make it easier to send test patches.

Upstream merge commit af79ad2b1f33 should also be pretty safe to try - it's v4.8
with the v4.9 scheduler bits applied.

Thanks,

Ingo
Vincent Guittot
2016-10-08 11:38:37 UTC
Post by Ingo Molnar
Post by Peter Zijlstra
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of the task group load in order to
reflect the new load balance, but it should not change the final value
and, as a result, the final behavior. I will try to reproduce it on my
target later today.
Post by Ingo Molnar
Post by Peter Zijlstra
That commit doesn't revert cleanly, did you take out more?
Note that it reverts cleanly from v4.8 - while it does to revert from current
upstream that did more changes in that area.
I suspect Josheph tested a v4.8-ish kernel.
Post by Peter Zijlstra
I'll try and have a look asap, but I'm traveling next week so it might a
tad slower than normal.
If you could provide a /proc/sched_debug dump while the thing is running
that'd might be useful.
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
... just on the off chance that another change in this area fixed the bug, plus to
make it easier to send test patches.
Upstream merge commit af79ad2b1f33 should also be pretty safe to try - it's v4.8
with the v4.9 scheduler bits applied.
Thanks,
Ingo
Mike Galbraith
2016-10-08 11:50:02 UTC
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed to reproduce it, with and without autogroup, on 4.8
and master.

-Mike
Vincent Guittot
2016-10-12 12:26:23 UTC
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too

Is it possible to get some dump of /proc/sched_debug while the problem occurs?

Vincent
Post by Mike Galbraith
-Mike
Joseph Salisbury
2016-10-12 15:35:23 UTC
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Yes, I requested that data from the bug reporter. I also built a test
kernel from the repo Ingo pointed me at and requested testing. We should
have an update shortly.

Joe
Post by Vincent Guittot
Vincent
Post by Mike Galbraith
-Mike
Joseph Salisbury
2016-10-12 16:21:37 UTC
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
The output from /proc/sched_debug can be seen here:
http://paste.ubuntu.com/23312351/

Ingo, the latest scheduler bits also still exhibit the bug:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
Vincent Guittot
2016-10-13 11:06:10 UTC
Hi,

On 12 October 2016 at 18:21, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
http://paste.ubuntu.com/23312351/
I have looked at the dump and there is something very odd with the
system.slice task group, where the display manager is running.
system.slice->tg_load_avg is around 381697, but tg_load_avg is
normally equal to the sum of system.slice[cpu]->tg_load_avg_contrib,
whereas the sum of system.slice[cpu]->tg_load_avg_contrib is only 1013 in
our case. We can see some differences because the dump of
/proc/sched_debug is not atomic and some changes can happen in between,
but nothing like this difference.

The main effect of this abnormally high value is that the weight/prio of
the sched_entity that represents system.slice in the root cfs_rq is very
low (lower than that of a task with the lowest nice prio), so the
system.slice task group rarely gets the CPU compared to the user.slice
task group: less than 1% for system.slice, where lightDM and Xorg are
running, compared to 99% for user.slice, where the stress tasks are
running. This is confirmed by the se->avg.util_avg values of the task
groups, which reflect how much time each task group effectively runs
on a CPU:
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
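
As a quick sanity check of that claim, one can plug the dump's numbers into
roughly what calc_cfs_shares() computes, shares ~= tg->shares * cfs_rq_load /
tg_load_avg; the default cpu.shares of 1024 and the per-CPU load of ~253
(1013 spread over 4 CPUs) are my assumptions, not values quoted in the thread:

#include <stdio.h>

int main(void)
{
	long tg_shares = 1024;     /* assumed default cpu.shares for system.slice */
	long cpu_load = 1013 / 4;  /* per-CPU slice of the summed contribs in the dump */
	long tg_load_ok = 1013;    /* what tg_load_avg should be */
	long tg_load_bad = 381697; /* what the dump actually shows */

	/* weight the system.slice entity would get on that CPU's root cfs_rq */
	printf("expected weight ~ %ld\n", tg_shares * cpu_load / tg_load_ok);  /* ~255 */
	printf("buggy weight    ~ %ld\n", tg_shares * cpu_load / tg_load_bad); /* 0, i.e. clamped to the minimum */
	return 0;
}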

This difference in weight/priority explains why the system becomes
unresponsive. What I can't explain for now is why
system.slice->tg_load_avg = 381697 whereas it should be around 1013,
and how the patch can generate this situation.
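
For readers following along, here is a minimal user-space sketch of the
bookkeeping described above; it is a toy model of the idea behind
update_tg_load_avg(), not the kernel code, and the structures and numbers
are purely illustrative:

#include <stdio.h>

#define NR_CPUS 4

/* Each per-CPU cfs_rq remembers, in tg_load_avg_contrib, how much of its
 * load it has already folded into the shared tg->load_avg. */
struct cfs_rq { long load_avg, tg_load_avg_contrib; };
struct task_group { long load_avg; struct cfs_rq cfs_rq[NR_CPUS]; };

/* Publish only the delta since the last update, so tg->load_avg stays
 * equal to the sum of the per-CPU contribs. */
static void update_tg_load_avg(struct task_group *tg, int cpu)
{
	struct cfs_rq *cfs_rq = &tg->cfs_rq[cpu];
	long delta = cfs_rq->load_avg - cfs_rq->tg_load_avg_contrib;

	tg->load_avg += delta;
	cfs_rq->tg_load_avg_contrib = cfs_rq->load_avg;
}

int main(void)
{
	struct task_group tg = { 0 };
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		tg.cfs_rq[cpu].load_avg = 1024;	/* a new entity was attached */
		update_tg_load_avg(&tg, cpu);
	}
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		tg.cfs_rq[cpu].load_avg = 253;	/* its load has since decayed */
		update_tg_load_avg(&tg, cpu);
	}
	/* The invariant holds: tg.load_avg == 4 * 253. If any path bumped
	 * tg->load_avg without recording it in a contrib, that amount could
	 * never be subtracted again - which is what a stuck 381697 looks like. */
	printf("tg->load_avg=%ld, sum(contrib)=%ld\n",
	       tg.load_avg, (long)NR_CPUS * tg.cfs_rq[0].tg_load_avg_contrib);
	return 0;
}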

Is it possible to have a dump of /proc/sched_debug from before the
stress command is started? That would show whether the problem is there
from the beginning but goes unnoticed because the system is not
overloaded, or whether it only appears when the user starts to load the
system.

Thanks,
Post by Joseph Salisbury
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
Joseph Salisbury
2016-10-13 15:53:06 UTC
Post by Vincent Guittot
Hi,
On 12 October 2016 at 18:21, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
http://paste.ubuntu.com/23312351/
I have looked at the dump and there is something very odd for
system.slice task group where the display manager is running.
system.slice->tg_load_avg is around 381697 but tg_load_avg is
normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
case. We can have some differences because the dump of
/proc/shed_debug is not atomic and some changes can happen but nothing
like this difference.
The main effect of this quite high value is that the weight/prio of
the sched_entity that represents system.slice in root cfs_rq is very
low (lower than task with the smallest nice prio) so the system.slice
task group will not get the CPU quite often compared to the user.slice
task group: less than 1% for the system.slice where lightDM and xorg
are running compared 99% for the user.slice where the stress tasks are
running. This is confirmed by the se->avg.util_avg value of the task
groups which reflect how much time each task group is effectively
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
This difference of weight/priority explains why the system becomes
unresponsive. For now, I can't explain is why
system.slice->tg_load_avg = 381697 whereas is should be around 1013
and how the patch can generate this situation.
Is it possible to have a dump of /proc/sched_debug before starting
stress command ? to check if the problem is there from the beginning
but not seen because not overloaded. Or if it the problem comes when
user starts to load the system
Here is the dump before stress is started:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy

Here it is after:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
Post by Vincent Guittot
Thanks,
Post by Joseph Salisbury
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
Vincent Guittot
2016-10-13 17:05:35 UTC
On 13 October 2016 at 17:52, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Hi,
On 12 October 2016 at 18:21, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
http://paste.ubuntu.com/23312351/
I have looked at the dump and there is something very odd for
system.slice task group where the display manager is running.
system.slice->tg_load_avg is around 381697 but tg_load_avg is
normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
case. We can have some differences because the dump of
/proc/shed_debug is not atomic and some changes can happen but nothing
like this difference.
The main effect of this quite high value is that the weight/prio of
the sched_entity that represents system.slice in root cfs_rq is very
low (lower than task with the smallest nice prio) so the system.slice
task group will not get the CPU quite often compared to the user.slice
task group: less than 1% for the system.slice where lightDM and xorg
are running compared 99% for the user.slice where the stress tasks are
running. This is confirmed by the se->avg.util_avg value of the task
groups which reflect how much time each task group is effectively
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
This difference of weight/priority explains why the system becomes
unresponsive. For now, I can't explain is why
system.slice->tg_load_avg = 381697 whereas is should be around 1013
and how the patch can generate this situation.
Is it possible to have a dump of /proc/sched_debug before starting
stress command ? to check if the problem is there from the beginning
but not seen because not overloaded. Or if it the problem comes when
user starts to load the system
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
This one is ok.
The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
#12~lp1627108Commit3d30544Reverted,
so this is without the culprit commit.
Post by Joseph Salisbury
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
This one has the exact same odd value for system.slice->tg_load_avg
as the first dump that you sent yesterday.
The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu,
so this dump was taken with a different kernel than the dump above.
As I can't find any stress task in the dump, I tend to believe that
the dump was taken before the stress tasks were started, not after.
Can you confirm?

If I'm right, it means that the problem was already there before
the stress tasks were started.
Post by Joseph Salisbury
Post by Vincent Guittot
Thanks,
Post by Joseph Salisbury
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
Dietmar Eggemann
2016-10-13 18:50:24 UTC
Post by Vincent Guittot
On 13 October 2016 at 17:52, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Hi,
On 12 October 2016 at 18:21, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
http://paste.ubuntu.com/23312351/
I have looked at the dump and there is something very odd for
system.slice task group where the display manager is running.
system.slice->tg_load_avg is around 381697 but tg_load_avg is
normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
case. We can have some differences because the dump of
/proc/shed_debug is not atomic and some changes can happen but nothing
like this difference.
The main effect of this quite high value is that the weight/prio of
the sched_entity that represents system.slice in root cfs_rq is very
low (lower than task with the smallest nice prio) so the system.slice
task group will not get the CPU quite often compared to the user.slice
task group: less than 1% for the system.slice where lightDM and xorg
are running compared 99% for the user.slice where the stress tasks are
running. This is confirmed by the se->avg.util_avg value of the task
groups which reflect how much time each task group is effectively
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
This difference of weight/priority explains why the system becomes
unresponsive. For now, I can't explain is why
system.slice->tg_load_avg = 381697 whereas is should be around 1013
and how the patch can generate this situation.
Is it possible to have a dump of /proc/sched_debug before starting
stress command ? to check if the problem is there from the beginning
but not seen because not overloaded. Or if it the problem comes when
user starts to load the system
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
This one is ok.
The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
#12~lp1627108Commit3d30544Reverted
so this is without the culprit commit
Post by Joseph Salisbury
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
This one has the exact same odds values for system.slice->tg_load_avg
than the 1st dump that you sent yesterday
The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
So this dump has been done with a different kernel than for the dump above.
As I can't find any stress task in the dump, i tend to believe that
the dump has been done before starting the stress tasks and not after
starting them. Can you confirm ?
If i'm right, it mean that the problem was already there before
starting stress tasks.
Could it be a problem I'm also seeing on my ARM64 Juno (6 logical CPUs) w/o systemd
and w/o autogroup (tip/sched/core 447976ef4fd0)?

When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, tg_x->load_avg
becomes > 6*1024 before any task has run in it.

tg_x : 0xffff800975800d80
tg_y_1 : 0xffff800975800c00
tg_y_2 : 0xffff80097543d200

mkdir-2177 [002] 117.235241: bprint: sched_online_group: tg=0xffff800975800d80 tg->parent=0xffff000008fd0300
mkdir-2177 [002] 117.235244: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=0
mkdir-2177 [002] 117.235247: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=1
mkdir-2177 [002] 117.235249: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=2
mkdir-2177 [002] 117.235251: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=3
mkdir-2177 [002] 117.235253: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=4
mkdir-2177 [002] 117.235255: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=5

mkdir-2181 [002] 117.353155: bprint: sched_online_group: tg=0xffff800975800c00 tg->parent=0xffff800975800d80
mkdir-2181 [002] 117.353158: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=0
mkdir-2181 [002] 117.353162: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=0 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353164: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=1
mkdir-2181 [002] 117.353167: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=1024 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353168: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=2
mkdir-2181 [002] 117.353171: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=2048 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353173: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=3
mkdir-2181 [002] 117.353175: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=3072 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353177: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=4
mkdir-2181 [002] 117.353179: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=4096 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353180: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=5
mkdir-2181 [002] 117.353183: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=5120 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024

mkdir-2185 [001] 117.502980: bprint: sched_online_group: tg=0xffff80097543d200 tg->parent=0xffff800975800d80
mkdir-2185 [001] 117.502982: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=0
mkdir-2185 [001] 117.502987: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6144 cfs_rq->avg.load_avg=1068 cfs_rq->tg_load_avg_contrib=1024 delta=44
mkdir-2185 [001] 117.502988: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=1
mkdir-2185 [001] 117.502992: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6188 cfs_rq->avg.load_avg=1058 cfs_rq->tg_load_avg_contrib=1024 delta=34
mkdir-2185 [001] 117.502993: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=2
mkdir-2185 [001] 117.502996: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6222 cfs_rq->avg.load_avg=1092 cfs_rq->tg_load_avg_contrib=1024 delta=68
mkdir-2185 [001] 117.502998: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=3
mkdir-2185 [001] 117.503001: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6290 cfs_rq->avg.load_avg=1069 cfs_rq->tg_load_avg_contrib=1024 delta=45
mkdir-2185 [001] 117.503002: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=4
mkdir-2185 [001] 117.503005: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6335 cfs_rq->avg.load_avg=1064 cfs_rq->tg_load_avg_contrib=1024 delta=40
mkdir-2185 [001] 117.503006: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=5
mkdir-2185 [001] 117.503009: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=*6375* cfs_rq->avg.load_avg=1061 cfs_rq->tg_load_avg_contrib=1024 delta=37

We add delta=1024 to tg_x->load_avg during tg_root/tg_x/tg_y_1 initialization but only
delta=~40 for tg_root/tg_x/tg_y_2.

system.slice->tg_load_avg = 381697 is still much higher than that, and AFAICS it's an
i5-5300U CPU, so only 4 logical CPUs.

How many system.slice/foo's actually exist in this system? I only see user.slice-related
cfs_rq[x]:/autogroups-xxx on my Ubuntu 16.04 desktop system.

[...]
Vincent Guittot
2016-10-13 21:35:30 UTC
Post by Dietmar Eggemann
Post by Vincent Guittot
On 13 October 2016 at 17:52, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Hi,
On 12 October 2016 at 18:21, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
http://paste.ubuntu.com/23312351/
I have looked at the dump and there is something very odd for
system.slice task group where the display manager is running.
system.slice->tg_load_avg is around 381697 but tg_load_avg is
normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
case. We can have some differences because the dump of
/proc/shed_debug is not atomic and some changes can happen but nothing
like this difference.
The main effect of this quite high value is that the weight/prio of
the sched_entity that represents system.slice in root cfs_rq is very
low (lower than task with the smallest nice prio) so the system.slice
task group will not get the CPU quite often compared to the user.slice
task group: less than 1% for the system.slice where lightDM and xorg
are running compared 99% for the user.slice where the stress tasks are
running. This is confirmed by the se->avg.util_avg value of the task
groups which reflect how much time each task group is effectively
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
This difference of weight/priority explains why the system becomes
unresponsive. For now, I can't explain is why
system.slice->tg_load_avg = 381697 whereas is should be around 1013
and how the patch can generate this situation.
Is it possible to have a dump of /proc/sched_debug before starting
stress command ? to check if the problem is there from the beginning
but not seen because not overloaded. Or if it the problem comes when
user starts to load the system
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
This one is ok.
The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
#12~lp1627108Commit3d30544Reverted
so this is without the culprit commit
Post by Joseph Salisbury
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
This one has the exact same odds values for system.slice->tg_load_avg
than the 1st dump that you sent yesterday
The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
So this dump has been done with a different kernel than for the dump above.
As I can't find any stress task in the dump, i tend to believe that
the dump has been done before starting the stress tasks and not after
starting them. Can you confirm ?
If i'm right, it mean that the problem was already there before
starting stress tasks.
Could it be a problem I'm also seeing on my ARM64 Juno (6 logical cpus) w/o systemd
When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the tg_x->load_avg
becomes > 6*1024 before any tasks ran in it.
This is normal, as se->avg.load_avg is initialized to
scale_load_down(se->load.weight), and this se->avg.load_avg will be
added to tg_x[cpu]->cfs_rq->avg.load_avg when the entity is attached to
the cfs_rq.
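
A tiny stand-alone illustration of that initial value; the 2^10 scaling and
the NICE_0 weight of 1024 are the usual 64-bit values, and the macros are
simplified stand-ins for the kernel's scale_load()/scale_load_down():

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10	/* 64-bit kernels scale load weights by 2^10 */
#define scale_load(w)		((unsigned long)(w) << SCHED_FIXEDPOINT_SHIFT)
#define scale_load_down(w)	((unsigned long)(w) >> SCHED_FIXEDPOINT_SHIFT)

int main(void)
{
	/* A freshly created group entity starts out with the NICE_0 weight... */
	unsigned long se_weight = scale_load(1024);

	/* ...so its initial avg.load_avg is scale_load_down(weight) = 1024,
	 * matching the delta=1024 per CPU seen in the trace above. */
	printf("initial se->avg.load_avg = %lu\n", scale_load_down(se_weight));
	return 0;
}
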
Post by Dietmar Eggemann
tg_x : 0xffff800975800d80
tg_y_1 : 0xffff800975800c00
tg_y_2 : 0xffff80097543d200
mkdir-2177 [002] 117.235241: bprint: sched_online_group: tg=0xffff800975800d80 tg->parent=0xffff000008fd0300
mkdir-2177 [002] 117.235244: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=0
mkdir-2177 [002] 117.235247: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=1
mkdir-2177 [002] 117.235249: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=2
mkdir-2177 [002] 117.235251: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=3
mkdir-2177 [002] 117.235253: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=4
mkdir-2177 [002] 117.235255: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=5
mkdir-2181 [002] 117.353155: bprint: sched_online_group: tg=0xffff800975800c00 tg->parent=0xffff800975800d80
mkdir-2181 [002] 117.353158: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=0
mkdir-2181 [002] 117.353162: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=0 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353164: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=1
mkdir-2181 [002] 117.353167: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=1024 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353168: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=2
mkdir-2181 [002] 117.353171: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=2048 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353173: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=3
mkdir-2181 [002] 117.353175: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=3072 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353177: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=4
mkdir-2181 [002] 117.353179: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=4096 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353180: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=5
mkdir-2181 [002] 117.353183: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=5120 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
This looks coherent, as tg_x->load_avg = 5120 == the sum of
cfs_rq[cpu]->load_avg, and cfs_rq->tg_load_avg_contrib will be set to
cfs_rq[cpu]->load_avg at the end of the function call.
In the buggy case, tg->load_avg = 381697 but the sum of
cfs_rq[cpu]->tg_load_avg_contrib = 1013 = the sum of
cfs_rq[cpu]->avg.load_avg, so there is no way to ever remove the 381697.
Post by Dietmar Eggemann
mkdir-2185 [001] 117.502980: bprint: sched_online_group: tg=0xffff80097543d200 tg->parent=0xffff800975800d80
mkdir-2185 [001] 117.502982: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=0
mkdir-2185 [001] 117.502987: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6144 cfs_rq->avg.load_avg=1068 cfs_rq->tg_load_avg_contrib=1024 delta=44
mkdir-2185 [001] 117.502988: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=1
mkdir-2185 [001] 117.502992: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6188 cfs_rq->avg.load_avg=1058 cfs_rq->tg_load_avg_contrib=1024 delta=34
mkdir-2185 [001] 117.502993: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=2
mkdir-2185 [001] 117.502996: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6222 cfs_rq->avg.load_avg=1092 cfs_rq->tg_load_avg_contrib=1024 delta=68
mkdir-2185 [001] 117.502998: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=3
mkdir-2185 [001] 117.503001: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6290 cfs_rq->avg.load_avg=1069 cfs_rq->tg_load_avg_contrib=1024 delta=45
mkdir-2185 [001] 117.503002: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=4
mkdir-2185 [001] 117.503005: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6335 cfs_rq->avg.load_avg=1064 cfs_rq->tg_load_avg_contrib=1024 delta=40
mkdir-2185 [001] 117.503006: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=5
mkdir-2185 [001] 117.503009: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=*6375* cfs_rq->avg.load_avg=1061 cfs_rq->tg_load_avg_contrib=1024 delta=37
We add delta=1024 to tg_x->load_avg during tg_root/tg_x/tg_y_1 initialization but only
delta=~40 for tg_root/tg_x/tg_y_2.
IMO, this comes from the decay of tg_y_2->se[cpu]->avg.load_avg before
it is attached to tg_x.

But I don't think it's the same issue, as tg_x->load_avg is
still coherent with the sum of cfs_rq->avg.load_avg and the sum of
cfs_rq->tg_load_avg_contrib.
Post by Dietmar Eggemann
system.slice->tg_load_avg = 381697 is still pretty much higher and AFAICS, it's an
i5-5300U CPU so only 4 logical cpus.
How many system.slice/foo's actually exists in this system? I only see user.slice
There are several dozen children in system.slice, but all
Post by Dietmar Eggemann
related cfs_rq[x]:/autogroups-xxx on my Ubuntu-16.04 Desktop system.
[...]
Vincent Guittot
2016-10-14 08:24:49 UTC
Post by Vincent Guittot
Post by Dietmar Eggemann
Post by Vincent Guittot
On 13 October 2016 at 17:52, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Hi,
On 12 October 2016 at 18:21, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
http://paste.ubuntu.com/23312351/
I have looked at the dump and there is something very odd for
system.slice task group where the display manager is running.
system.slice->tg_load_avg is around 381697 but tg_load_avg is
normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
case. We can have some differences because the dump of
/proc/shed_debug is not atomic and some changes can happen but nothing
like this difference.
The main effect of this quite high value is that the weight/prio of
the sched_entity that represents system.slice in root cfs_rq is very
low (lower than task with the smallest nice prio) so the system.slice
task group will not get the CPU quite often compared to the user.slice
task group: less than 1% for the system.slice where lightDM and xorg
are running compared 99% for the user.slice where the stress tasks are
running. This is confirmed by the se->avg.util_avg value of the task
groups which reflect how much time each task group is effectively
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
This difference of weight/priority explains why the system becomes
unresponsive. For now, I can't explain is why
system.slice->tg_load_avg = 381697 whereas is should be around 1013
and how the patch can generate this situation.
Is it possible to have a dump of /proc/sched_debug before starting
stress command ? to check if the problem is there from the beginning
but not seen because not overloaded. Or if it the problem comes when
user starts to load the system
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
This one is ok.
The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
#12~lp1627108Commit3d30544Reverted
so this is without the culprit commit
Post by Joseph Salisbury
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
This one has the exact same odds values for system.slice->tg_load_avg
than the 1st dump that you sent yesterday
The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
So this dump has been done with a different kernel than for the dump above.
As I can't find any stress task in the dump, i tend to believe that
the dump has been done before starting the stress tasks and not after
starting them. Can you confirm ?
If i'm right, it mean that the problem was already there before
starting stress tasks.
Could it be a problem I'm also seeing on my ARM64 Juno (6 logical cpus) w/o systemd
When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the tg_x->load_avg
becomes > 6*1024 before any tasks ran in it.
This is normal as se->avg.load_avg is initialized to
scale_load_down(se->load.weight) and this se->avg.load_avg will be
added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq
Post by Dietmar Eggemann
tg_x : 0xffff800975800d80
tg_y_1 : 0xffff800975800c00
tg_y_2 : 0xffff80097543d200
mkdir-2177 [002] 117.235241: bprint: sched_online_group: tg=0xffff800975800d80 tg->parent=0xffff000008fd0300
mkdir-2177 [002] 117.235244: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=0
mkdir-2177 [002] 117.235247: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=1
mkdir-2177 [002] 117.235249: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=2
mkdir-2177 [002] 117.235251: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=3
mkdir-2177 [002] 117.235253: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=4
mkdir-2177 [002] 117.235255: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=5
mkdir-2181 [002] 117.353155: bprint: sched_online_group: tg=0xffff800975800c00 tg->parent=0xffff800975800d80
mkdir-2181 [002] 117.353158: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=0
mkdir-2181 [002] 117.353162: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=0 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353164: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=1
mkdir-2181 [002] 117.353167: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=1024 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353168: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=2
mkdir-2181 [002] 117.353171: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=2048 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353173: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=3
mkdir-2181 [002] 117.353175: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=3072 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353177: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=4
mkdir-2181 [002] 117.353179: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=4096 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353180: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=5
mkdir-2181 [002] 117.353183: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=5120 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
This looks coherent as tg_x->load_avg = 5120 == sum of
cfs_rq[cpu]->load_avg and cfs_rq->tg_load_avg_contrib will be set to
cfs_rq[cpu]->load_avg at the end of the function call
In the buggy case, tg->load_avg = 381697 but sum of
cfs_rq[cpu]->tg_load_avg_contrib = 1013 = cfs_rq[cpu]->avg.load_avg so
there is no way to remove the 381697
Post by Dietmar Eggemann
mkdir-2185 [001] 117.502980: bprint: sched_online_group: tg=0xffff80097543d200 tg->parent=0xffff800975800d80
mkdir-2185 [001] 117.502982: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=0
mkdir-2185 [001] 117.502987: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6144 cfs_rq->avg.load_avg=1068 cfs_rq->tg_load_avg_contrib=1024 delta=44
mkdir-2185 [001] 117.502988: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=1
mkdir-2185 [001] 117.502992: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6188 cfs_rq->avg.load_avg=1058 cfs_rq->tg_load_avg_contrib=1024 delta=34
mkdir-2185 [001] 117.502993: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=2
mkdir-2185 [001] 117.502996: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6222 cfs_rq->avg.load_avg=1092 cfs_rq->tg_load_avg_contrib=1024 delta=68
mkdir-2185 [001] 117.502998: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=3
mkdir-2185 [001] 117.503001: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6290 cfs_rq->avg.load_avg=1069 cfs_rq->tg_load_avg_contrib=1024 delta=45
mkdir-2185 [001] 117.503002: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=4
mkdir-2185 [001] 117.503005: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6335 cfs_rq->avg.load_avg=1064 cfs_rq->tg_load_avg_contrib=1024 delta=40
mkdir-2185 [001] 117.503006: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=5
mkdir-2185 [001] 117.503009: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=*6375* cfs_rq->avg.load_avg=1061 cfs_rq->tg_load_avg_contrib=1024 delta=37
We add delta=1024 to tg_x->load_avg during tg_root/tg_x/tg_y_1 initialization but only
delta=~40 for tg_root/tg_x/tg_y_2.
IMO, this comes from the decay of tg_y_2->se[cpu]->avg.load_avg before
being attached to tg_x
But I don't think that it's the same issue as the tg_x->load_avg is
still coherent with the sum of cfs_rq->avg.load_avg and the sume of
cfs_rq->tg_load_avg_contrib
Post by Dietmar Eggemann
system.slice->tg_load_avg = 381697 is still pretty much higher and AFAICS, it's an
i5-5300U CPU so only 4 logical cpus.
How many system.slice/foo's actually exists in this system? I only see user.slice
There is several dozen of childs in system.slice but all
but all of the children have null load_avg and load_avg_contrib.
Post by Vincent Guittot
Post by Dietmar Eggemann
related cfs_rq[x]:/autogroups-xxx on my Ubuntu-16.04 Desktop system.
[...]
Dietmar Eggemann
2016-10-14 13:10:23 UTC
Post by Vincent Guittot
Post by Dietmar Eggemann
Post by Vincent Guittot
On 13 October 2016 at 17:52, Joseph Salisbury
Post by Vincent Guittot
Hi,
On 12 October 2016 at 18:21, Joseph Salisbury
[...]
Post by Vincent Guittot
Post by Dietmar Eggemann
When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the tg_x->load_avg
becomes > 6*1024 before any tasks ran in it.
This is normal as se->avg.load_avg is initialized to
scale_load_down(se->load.weight) and this se->avg.load_avg will be
added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq
Yeah, you're right; even when I've created 50 second-level groups,
tg_x->load_avg is ~6800.

Could it have something to do with the fact that se->load.weight = 2
for all these task groups on a 64-bit system?

When we then call __update_load_avg(..., se->on_rq *
scale_load_down(se->load.weight), ...), we pass a weight argument of 0
for these se's.

Does not happen with:

- if (shares < MIN_SHARES)
- shares = MIN_SHARES;
+ if (shares < scale_load(MIN_SHARES))
+ shares = scale_load(MIN_SHARES);

in calc_cfs_shares().
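
To make the effect of that hunk concrete, here is a stand-alone sketch;
MIN_SHARES=2 and the 2^10 scaling match the kernel as far as I know, while
the helper and the example share value are hypothetical:

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define scale_load(w)		((unsigned long)(w) << SCHED_FIXEDPOINT_SHIFT)
#define scale_load_down(w)	((unsigned long)(w) >> SCHED_FIXEDPOINT_SHIFT)
#define MIN_SHARES		2UL

/* Hypothetical helper mirroring only the lower clamp in calc_cfs_shares(). */
static unsigned long clamp_shares(unsigned long shares, int scaled_clamp)
{
	unsigned long lo = scaled_clamp ? scale_load(MIN_SHARES) : MIN_SHARES;

	return shares < lo ? lo : shares;
}

int main(void)
{
	unsigned long tiny = 1;	/* a group whose computed share is almost nothing */

	/* Old clamp: weight 2, which scale_load_down() turns into 0 for PELT. */
	unsigned long old_w = clamp_shares(tiny, 0);
	/* Patched clamp: weight 2048, which survives scale_load_down() as 2. */
	unsigned long new_w = clamp_shares(tiny, 1);

	printf("old: se->load.weight=%lu, PELT weight=%lu\n", old_w, scale_load_down(old_w));
	printf("new: se->load.weight=%lu, PELT weight=%lu\n", new_w, scale_load_down(new_w));
	return 0;
}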

[...]
