[v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
Joseph Salisbury
2016-10-07 19:38:38 UTC
Hello Peter,

A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit resolved this bug:

commit 3d30544f02120b884bba2a9466c87dba980e3be5
Author: Peter Zijlstra <***@infradead.org>
Date: Tue Jun 21 14:27:50 2016 +0200

sched/fair: Apply more PELT fixes

The regression was introduced as of v4.8-rc1. The bug can be reproduced
on an X1 Carbon with the following:
stress -c $your_total_cpu_cores

I was hoping to get your feedback, since you are the patch author. Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?


Thanks,

Joe


[0] http://pad.lv/1627108
Linus Torvalds
2016-10-07 19:58:58 UTC
On Fri, Oct 7, 2016 at 12:38 PM, Joseph Salisbury
Post by Joseph Salisbury
A kernel bug report was opened against Ubuntu [0]. After a kernel
Hmm. Interesting, and it sounds like we should revert that unless
somebody figures out *why* following the rules wrt cfs updates causes
problems. But I also wonder what the Ubuntu kernel config is. Does
Ubuntu enable CONFIG_SCHED_AUTOGROUP=y, for example? Because
regardless of any other scheduler issues, autogrouping *should* mean
that when you run some CPU hog in one session, the scheduler should
still balance CPU time with other sessions.

I'm not seeing anything odd on my xps13, which should have a similar
CPU to the X1 Carbon.

Linus
Joseph Salisbury
2016-10-07 20:22:38 UTC
Post by Linus Torvalds
On Fri, Oct 7, 2016 at 12:38 PM, Joseph Salisbury
Post by Joseph Salisbury
A kernel bug report was opened against Ubuntu [0]. After a kernel
Hmm. Interesting, and it sounds like we should revert that unless
somebody figures out *why* following the rules wrt cfq updates causes
problems. But I also wonder what the Ubuntu kernel config is. Does
Ubuntu enable CONFIG_SCHED_AUTOGROUP=y, for example? Because
regardless of any other scheduler issues, autogrouping *should* mean
that when you run some CPU hogger in one session, that should still
balance all CPU time with other sessions..
I'm not seeing anything odd on my xps13, which should have a similar
CPU to the X1 Carbon.
Linus
Hi Linus,

Yes, CONFIG_SCHED_AUTOGROUP is enabled in the Ubuntu kernel. However,
that config was also enabled in the Ubuntu 4.4 kernels without seeing
this issue. I can try disabling the config in the 4.8-based kernel and
see if that changes things.

Thanks,

Joe
Linus Torvalds
2016-10-07 20:37:36 UTC
On Fri, Oct 7, 2016 at 1:22 PM, Joseph Salisbury
Post by Joseph Salisbury
Yes, CONFIG_SCHED_AUTOGROUP is enabled in the Ubuntu kernel. However,
that config was also enabled in the Ubuntu 4.4 kernels without seeing
this issue. I can try disabling the config in the 4.8-based kernel and
see if that changes things.
No, that wouldn't make any sense. I just wanted to know that the
option was enabled, because that option really *should* help buffer
other processes from one session that is a CPU hog.

So something is seriously wrong in that situation if other things get
very choppy. Of course, the fact that it apparently happens on one
particular machine only means that it's hard to figure out what
triggers it. Maybe some unlucky combination of cpufreq and thermal
throttling by the hardware, coupled with the scheduler change.

Peter?

Linus
Peter Zijlstra
2016-10-08 08:00:43 UTC
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
That commit doesn't revert cleanly; did you take out more?
Post by Joseph Salisbury
The regression was introduced as of v4.8-rc1. The bug can be reproduced
stress -c $your_total_cpu_cores
I was hoping to get your feedback, since you are the patch author. Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?
I'll try and have a look asap, but I'm traveling next week so it might be a
tad slower than normal.

If you could provide a /proc/sched_debug dump while the thing is running,
that would be useful.

Thanks!
Ingo Molnar
2016-10-08 08:40:43 UTC
Post by Peter Zijlstra
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
That commit doesn't revert cleanly, did you take out more?
Note that it reverts cleanly from v4.8 - while it does not revert cleanly
from current upstream, which has more changes in that area.

I suspect Joseph tested a v4.8-ish kernel.
Post by Peter Zijlstra
I'll try and have a look asap, but I'm traveling next week so it might a
tad slower than normal.
If you could provide a /proc/sched_debug dump while the thing is running
that'd might be useful.
Also, running the latest scheduler bits would be useful:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

... just on the off chance that another change in this area fixed the bug, plus to
make it easier to send test patches.

Upstream merge commit af79ad2b1f33 should also be pretty safe to try - it's v4.8
with the v4.9 scheduler bits applied.

Thanks,

Ingo
Vincent Guittot
2016-10-08 11:38:37 UTC
Post by Ingo Molnar
Post by Peter Zijlstra
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of the task group load in order to
reflect the new load balance, but it should not change the final value
and, as a result, the final behavior. I will try to reproduce it on my
target later today.
Post by Ingo Molnar
Post by Peter Zijlstra
That commit doesn't revert cleanly, did you take out more?
Note that it reverts cleanly from v4.8 - while it does to revert from current
upstream that did more changes in that area.
I suspect Josheph tested a v4.8-ish kernel.
Post by Peter Zijlstra
I'll try and have a look asap, but I'm traveling next week so it might a
tad slower than normal.
If you could provide a /proc/sched_debug dump while the thing is running
that'd might be useful.
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
... just on the off chance that another change in this area fixed the bug, plus to
make it easier to send test patches.
Upstream merge commit af79ad2b1f33 should also be pretty safe to try - it's v4.8
with the v4.9 scheduler bits applied.
Thanks,
Ingo
Mike Galbraith
2016-10-08 11:50:02 UTC
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed to reproduce it, with and without autogroup, on 4.8
and master.

-Mike
Vincent Guittot
2016-10-12 12:26:23 UTC
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too

Is it possible to get some dump of /proc/sched_debug while the problem occurs?

Vincent
Post by Mike Galbraith
-Mike
Joseph Salisbury
2016-10-12 15:35:23 UTC
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Yes, I requested that data from the bug reporter. I also built a test
kernel from the repo Ingo pointed me at and requested testing. We should
have an update shortly.

Joe
Post by Vincent Guittot
Vincent
Post by Mike Galbraith
-Mike
Joseph Salisbury
2016-10-12 16:21:37 UTC
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
The output from /proc/sched_debug can be seen here:
http://paste.ubuntu.com/23312351/

Ingo, the latest scheduler bits also still exhibit the bug:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
Vincent Guittot
2016-10-13 11:06:10 UTC
Hi,

On 12 October 2016 at 18:21, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
http://paste.ubuntu.com/23312351/
I have looked at the dump and there is something very odd with the
system.slice task group, where the display manager is running.
system.slice->tg_load_avg is around 381697, but tg_load_avg is
normally equal to the sum of system.slice[cpu]->tg_load_avg_contrib,
whereas the sum of system.slice[cpu]->tg_load_avg_contrib is only 1013 in
our case. We can see some differences because the dump of
/proc/sched_debug is not atomic and some changes can happen in between,
but nothing like this difference.

The main effect of this abnormally high value is that the weight/prio of
the sched_entity that represents system.slice in the root cfs_rq is very
low (lower than that of a task with the lowest nice prio), so the
system.slice task group rarely gets the CPU compared to the user.slice
task group: less than 1% for system.slice, where lightDM and Xorg are
running, compared to 99% for user.slice, where the stress tasks are
running. This is confirmed by the se->avg.util_avg values of the task
groups, which reflect how much time each task group effectively runs
on a CPU:
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
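
As a quick sanity check of that claim, one can plug the dump's numbers into
roughly what calc_cfs_shares() computes, shares ~= tg->shares * cfs_rq_load /
tg_load_avg; the default cpu.shares of 1024 and the per-CPU load of ~253
(1013 spread over 4 CPUs) are my assumptions, not values quoted in the thread:

#include <stdio.h>

int main(void)
{
	long tg_shares = 1024;     /* assumed default cpu.shares for system.slice */
	long cpu_load = 1013 / 4;  /* per-CPU slice of the summed contribs in the dump */
	long tg_load_ok = 1013;    /* what tg_load_avg should be */
	long tg_load_bad = 381697; /* what the dump actually shows */

	/* weight the system.slice entity would get on that CPU's root cfs_rq */
	printf("expected weight ~ %ld\n", tg_shares * cpu_load / tg_load_ok);  /* ~255 */
	printf("buggy weight    ~ %ld\n", tg_shares * cpu_load / tg_load_bad); /* 0, i.e. clamped to the minimum */
	return 0;
}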

This difference in weight/priority explains why the system becomes
unresponsive. What I can't explain for now is why
system.slice->tg_load_avg = 381697 whereas it should be around 1013,
and how the patch can generate this situation.
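
For readers following along, here is a minimal user-space sketch of the
bookkeeping described above; it is a toy model of the idea behind
update_tg_load_avg(), not the kernel code, and the structures and numbers
are purely illustrative:

#include <stdio.h>

#define NR_CPUS 4

/* Each per-CPU cfs_rq remembers, in tg_load_avg_contrib, how much of its
 * load it has already folded into the shared tg->load_avg. */
struct cfs_rq { long load_avg, tg_load_avg_contrib; };
struct task_group { long load_avg; struct cfs_rq cfs_rq[NR_CPUS]; };

/* Publish only the delta since the last update, so tg->load_avg stays
 * equal to the sum of the per-CPU contribs. */
static void update_tg_load_avg(struct task_group *tg, int cpu)
{
	struct cfs_rq *cfs_rq = &tg->cfs_rq[cpu];
	long delta = cfs_rq->load_avg - cfs_rq->tg_load_avg_contrib;

	tg->load_avg += delta;
	cfs_rq->tg_load_avg_contrib = cfs_rq->load_avg;
}

int main(void)
{
	struct task_group tg = { 0 };
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		tg.cfs_rq[cpu].load_avg = 1024;	/* a new entity was attached */
		update_tg_load_avg(&tg, cpu);
	}
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		tg.cfs_rq[cpu].load_avg = 253;	/* its load has since decayed */
		update_tg_load_avg(&tg, cpu);
	}
	/* The invariant holds: tg.load_avg == 4 * 253. If any path bumped
	 * tg->load_avg without recording it in a contrib, that amount could
	 * never be subtracted again - which is what a stuck 381697 looks like. */
	printf("tg->load_avg=%ld, sum(contrib)=%ld\n",
	       tg.load_avg, (long)NR_CPUS * tg.cfs_rq[0].tg_load_avg_contrib);
	return 0;
}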

Is it possible to have a dump of /proc/sched_debug from before the
stress command is started? That would show whether the problem is there
from the beginning but goes unnoticed because the system is not
overloaded, or whether it only appears when the user starts to load the
system.

Thanks,
Post by Joseph Salisbury
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
Joseph Salisbury
2016-10-13 15:53:06 UTC
Post by Vincent Guittot
Hi,
On 12 October 2016 at 18:21, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
http://paste.ubuntu.com/23312351/
I have looked at the dump and there is something very odd for
system.slice task group where the display manager is running.
system.slice->tg_load_avg is around 381697 but tg_load_avg is
normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
case. We can have some differences because the dump of
/proc/shed_debug is not atomic and some changes can happen but nothing
like this difference.
The main effect of this quite high value is that the weight/prio of
the sched_entity that represents system.slice in root cfs_rq is very
low (lower than task with the smallest nice prio) so the system.slice
task group will not get the CPU quite often compared to the user.slice
task group: less than 1% for the system.slice where lightDM and xorg
are running compared 99% for the user.slice where the stress tasks are
running. This is confirmed by the se->avg.util_avg value of the task
groups which reflect how much time each task group is effectively
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
This difference of weight/priority explains why the system becomes
unresponsive. For now, I can't explain is why
system.slice->tg_load_avg = 381697 whereas is should be around 1013
and how the patch can generate this situation.
Is it possible to have a dump of /proc/sched_debug before starting
stress command ? to check if the problem is there from the beginning
but not seen because not overloaded. Or if it the problem comes when
user starts to load the system
Here is the dump before stress is started:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy

Here it is after:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
Post by Vincent Guittot
Thanks,
Post by Joseph Salisbury
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
Vincent Guittot
2016-10-13 17:05:35 UTC
On 13 October 2016 at 17:52, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Hi,
On 12 October 2016 at 18:21, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
http://paste.ubuntu.com/23312351/
I have looked at the dump and there is something very odd for
system.slice task group where the display manager is running.
system.slice->tg_load_avg is around 381697 but tg_load_avg is
normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
case. We can have some differences because the dump of
/proc/shed_debug is not atomic and some changes can happen but nothing
like this difference.
The main effect of this quite high value is that the weight/prio of
the sched_entity that represents system.slice in root cfs_rq is very
low (lower than task with the smallest nice prio) so the system.slice
task group will not get the CPU quite often compared to the user.slice
task group: less than 1% for the system.slice where lightDM and xorg
are running compared 99% for the user.slice where the stress tasks are
running. This is confirmed by the se->avg.util_avg value of the task
groups which reflect how much time each task group is effectively
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
This difference of weight/priority explains why the system becomes
unresponsive. For now, I can't explain is why
system.slice->tg_load_avg = 381697 whereas is should be around 1013
and how the patch can generate this situation.
Is it possible to have a dump of /proc/sched_debug before starting
stress command ? to check if the problem is there from the beginning
but not seen because not overloaded. Or if it the problem comes when
user starts to load the system
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
This one is ok.
The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
#12~lp1627108Commit3d30544Reverted,
so this is without the culprit commit.
Post by Joseph Salisbury
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
This one has the exact same odd value for system.slice->tg_load_avg
as the first dump that you sent yesterday.
The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu,
so this dump was taken with a different kernel than the dump above.
As I can't find any stress task in the dump, I tend to believe that
the dump was taken before the stress tasks were started, not after.
Can you confirm?

If I'm right, it means that the problem was already there before
the stress tasks were started.
Post by Joseph Salisbury
Post by Vincent Guittot
Thanks,
Post by Joseph Salisbury
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
Dietmar Eggemann
2016-10-13 18:50:24 UTC
Post by Vincent Guittot
On 13 October 2016 at 17:52, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Hi,
On 12 October 2016 at 18:21, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
http://paste.ubuntu.com/23312351/
I have looked at the dump and there is something very odd for
system.slice task group where the display manager is running.
system.slice->tg_load_avg is around 381697 but tg_load_avg is
normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
case. We can have some differences because the dump of
/proc/shed_debug is not atomic and some changes can happen but nothing
like this difference.
The main effect of this quite high value is that the weight/prio of
the sched_entity that represents system.slice in root cfs_rq is very
low (lower than task with the smallest nice prio) so the system.slice
task group will not get the CPU quite often compared to the user.slice
task group: less than 1% for the system.slice where lightDM and xorg
are running compared 99% for the user.slice where the stress tasks are
running. This is confirmed by the se->avg.util_avg value of the task
groups which reflect how much time each task group is effectively
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
This difference of weight/priority explains why the system becomes
unresponsive. For now, I can't explain is why
system.slice->tg_load_avg = 381697 whereas is should be around 1013
and how the patch can generate this situation.
Is it possible to have a dump of /proc/sched_debug before starting
stress command ? to check if the problem is there from the beginning
but not seen because not overloaded. Or if it the problem comes when
user starts to load the system
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
This one is ok.
The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
#12~lp1627108Commit3d30544Reverted
so this is without the culprit commit
Post by Joseph Salisbury
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
This one has the exact same odds values for system.slice->tg_load_avg
than the 1st dump that you sent yesterday
The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
So this dump has been done with a different kernel than for the dump above.
As I can't find any stress task in the dump, i tend to believe that
the dump has been done before starting the stress tasks and not after
starting them. Can you confirm ?
If i'm right, it mean that the problem was already there before
starting stress tasks.
Could it be a problem I'm also seeing on my ARM64 Juno (6 logical CPUs) w/o systemd
and w/o autogroup (tip/sched/core 447976ef4fd0)?

When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, tg_x->load_avg
becomes > 6*1024 before any task has run in it.

tg_x : 0xffff800975800d80
tg_y_1 : 0xffff800975800c00
tg_y_2 : 0xffff80097543d200

mkdir-2177 [002] 117.235241: bprint: sched_online_group: tg=0xffff800975800d80 tg->parent=0xffff000008fd0300
mkdir-2177 [002] 117.235244: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=0
mkdir-2177 [002] 117.235247: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=1
mkdir-2177 [002] 117.235249: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=2
mkdir-2177 [002] 117.235251: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=3
mkdir-2177 [002] 117.235253: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=4
mkdir-2177 [002] 117.235255: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=5

mkdir-2181 [002] 117.353155: bprint: sched_online_group: tg=0xffff800975800c00 tg->parent=0xffff800975800d80
mkdir-2181 [002] 117.353158: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=0
mkdir-2181 [002] 117.353162: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=0 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353164: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=1
mkdir-2181 [002] 117.353167: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=1024 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353168: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=2
mkdir-2181 [002] 117.353171: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=2048 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353173: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=3
mkdir-2181 [002] 117.353175: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=3072 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353177: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=4
mkdir-2181 [002] 117.353179: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=4096 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353180: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=5
mkdir-2181 [002] 117.353183: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=5120 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024

mkdir-2185 [001] 117.502980: bprint: sched_online_group: tg=0xffff80097543d200 tg->parent=0xffff800975800d80
mkdir-2185 [001] 117.502982: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=0
mkdir-2185 [001] 117.502987: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6144 cfs_rq->avg.load_avg=1068 cfs_rq->tg_load_avg_contrib=1024 delta=44
mkdir-2185 [001] 117.502988: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=1
mkdir-2185 [001] 117.502992: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6188 cfs_rq->avg.load_avg=1058 cfs_rq->tg_load_avg_contrib=1024 delta=34
mkdir-2185 [001] 117.502993: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=2
mkdir-2185 [001] 117.502996: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6222 cfs_rq->avg.load_avg=1092 cfs_rq->tg_load_avg_contrib=1024 delta=68
mkdir-2185 [001] 117.502998: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=3
mkdir-2185 [001] 117.503001: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6290 cfs_rq->avg.load_avg=1069 cfs_rq->tg_load_avg_contrib=1024 delta=45
mkdir-2185 [001] 117.503002: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=4
mkdir-2185 [001] 117.503005: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6335 cfs_rq->avg.load_avg=1064 cfs_rq->tg_load_avg_contrib=1024 delta=40
mkdir-2185 [001] 117.503006: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=5
mkdir-2185 [001] 117.503009: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=*6375* cfs_rq->avg.load_avg=1061 cfs_rq->tg_load_avg_contrib=1024 delta=37

We add delta=1024 to tg_x->load_avg during tg_root/tg_x/tg_y_1 initialization but only
delta=~40 for tg_root/tg_x/tg_y_2.

system.slice->tg_load_avg = 381697 is still much higher than that, and AFAICS it's an
i5-5300U CPU, so only 4 logical CPUs.

How many system.slice/foo's actually exist in this system? I only see user.slice-related
cfs_rq[x]:/autogroups-xxx on my Ubuntu 16.04 desktop system.

[...]
Vincent Guittot
2016-10-13 21:35:30 UTC
Post by Dietmar Eggemann
Post by Vincent Guittot
On 13 October 2016 at 17:52, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Hi,
On 12 October 2016 at 18:21, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
http://paste.ubuntu.com/23312351/
I have looked at the dump and there is something very odd for
system.slice task group where the display manager is running.
system.slice->tg_load_avg is around 381697 but tg_load_avg is
normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
case. We can have some differences because the dump of
/proc/shed_debug is not atomic and some changes can happen but nothing
like this difference.
The main effect of this quite high value is that the weight/prio of
the sched_entity that represents system.slice in root cfs_rq is very
low (lower than task with the smallest nice prio) so the system.slice
task group will not get the CPU quite often compared to the user.slice
task group: less than 1% for the system.slice where lightDM and xorg
are running compared 99% for the user.slice where the stress tasks are
running. This is confirmed by the se->avg.util_avg value of the task
groups which reflect how much time each task group is effectively
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
This difference of weight/priority explains why the system becomes
unresponsive. For now, I can't explain is why
system.slice->tg_load_avg = 381697 whereas is should be around 1013
and how the patch can generate this situation.
Is it possible to have a dump of /proc/sched_debug before starting
stress command ? to check if the problem is there from the beginning
but not seen because not overloaded. Or if it the problem comes when
user starts to load the system
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
This one is ok.
The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
#12~lp1627108Commit3d30544Reverted
so this is without the culprit commit
Post by Joseph Salisbury
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
This one has the exact same odds values for system.slice->tg_load_avg
than the 1st dump that you sent yesterday
The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
So this dump has been done with a different kernel than for the dump above.
As I can't find any stress task in the dump, i tend to believe that
the dump has been done before starting the stress tasks and not after
starting them. Can you confirm ?
If i'm right, it mean that the problem was already there before
starting stress tasks.
Could it be a problem I'm also seeing on my ARM64 Juno (6 logical cpus) w/o systemd
When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the tg_x->load_avg
becomes > 6*1024 before any tasks ran in it.
This is normal, as se->avg.load_avg is initialized to
scale_load_down(se->load.weight), and this se->avg.load_avg will be
added to tg_x[cpu]->cfs_rq->avg.load_avg when the entity is attached to
the cfs_rq.
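
A tiny stand-alone illustration of that initial value; the 2^10 scaling and
the NICE_0 weight of 1024 are the usual 64-bit values, and the macros are
simplified stand-ins for the kernel's scale_load()/scale_load_down():

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10	/* 64-bit kernels scale load weights by 2^10 */
#define scale_load(w)		((unsigned long)(w) << SCHED_FIXEDPOINT_SHIFT)
#define scale_load_down(w)	((unsigned long)(w) >> SCHED_FIXEDPOINT_SHIFT)

int main(void)
{
	/* A freshly created group entity starts out with the NICE_0 weight... */
	unsigned long se_weight = scale_load(1024);

	/* ...so its initial avg.load_avg is scale_load_down(weight) = 1024,
	 * matching the delta=1024 per CPU seen in the trace above. */
	printf("initial se->avg.load_avg = %lu\n", scale_load_down(se_weight));
	return 0;
}
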
Post by Dietmar Eggemann
tg_x : 0xffff800975800d80
tg_y_1 : 0xffff800975800c00
tg_y_2 : 0xffff80097543d200
mkdir-2177 [002] 117.235241: bprint: sched_online_group: tg=0xffff800975800d80 tg->parent=0xffff000008fd0300
mkdir-2177 [002] 117.235244: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=0
mkdir-2177 [002] 117.235247: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=1
mkdir-2177 [002] 117.235249: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=2
mkdir-2177 [002] 117.235251: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=3
mkdir-2177 [002] 117.235253: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=4
mkdir-2177 [002] 117.235255: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=5
mkdir-2181 [002] 117.353155: bprint: sched_online_group: tg=0xffff800975800c00 tg->parent=0xffff800975800d80
mkdir-2181 [002] 117.353158: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=0
mkdir-2181 [002] 117.353162: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=0 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353164: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=1
mkdir-2181 [002] 117.353167: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=1024 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353168: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=2
mkdir-2181 [002] 117.353171: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=2048 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353173: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=3
mkdir-2181 [002] 117.353175: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=3072 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353177: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=4
mkdir-2181 [002] 117.353179: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=4096 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353180: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=5
mkdir-2181 [002] 117.353183: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=5120 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
This looks coherent, as tg_x->load_avg = 5120 == the sum of
cfs_rq[cpu]->load_avg, and cfs_rq->tg_load_avg_contrib will be set to
cfs_rq[cpu]->load_avg at the end of the function call.
In the buggy case, tg->load_avg = 381697 but the sum of
cfs_rq[cpu]->tg_load_avg_contrib = 1013 = the sum of
cfs_rq[cpu]->avg.load_avg, so there is no way to ever remove the 381697.
Post by Dietmar Eggemann
mkdir-2185 [001] 117.502980: bprint: sched_online_group: tg=0xffff80097543d200 tg->parent=0xffff800975800d80
mkdir-2185 [001] 117.502982: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=0
mkdir-2185 [001] 117.502987: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6144 cfs_rq->avg.load_avg=1068 cfs_rq->tg_load_avg_contrib=1024 delta=44
mkdir-2185 [001] 117.502988: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=1
mkdir-2185 [001] 117.502992: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6188 cfs_rq->avg.load_avg=1058 cfs_rq->tg_load_avg_contrib=1024 delta=34
mkdir-2185 [001] 117.502993: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=2
mkdir-2185 [001] 117.502996: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6222 cfs_rq->avg.load_avg=1092 cfs_rq->tg_load_avg_contrib=1024 delta=68
mkdir-2185 [001] 117.502998: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=3
mkdir-2185 [001] 117.503001: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6290 cfs_rq->avg.load_avg=1069 cfs_rq->tg_load_avg_contrib=1024 delta=45
mkdir-2185 [001] 117.503002: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=4
mkdir-2185 [001] 117.503005: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6335 cfs_rq->avg.load_avg=1064 cfs_rq->tg_load_avg_contrib=1024 delta=40
mkdir-2185 [001] 117.503006: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=5
mkdir-2185 [001] 117.503009: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=*6375* cfs_rq->avg.load_avg=1061 cfs_rq->tg_load_avg_contrib=1024 delta=37
We add delta=1024 to tg_x->load_avg during tg_root/tg_x/tg_y_1 initialization but only
delta=~40 for tg_root/tg_x/tg_y_2.
IMO, this comes from the decay of tg_y_2->se[cpu]->avg.load_avg before
it is attached to tg_x.

But I don't think it's the same issue, as tg_x->load_avg is
still coherent with the sum of cfs_rq->avg.load_avg and the sum of
cfs_rq->tg_load_avg_contrib.
Post by Dietmar Eggemann
system.slice->tg_load_avg = 381697 is still pretty much higher and AFAICS, it's an
i5-5300U CPU so only 4 logical cpus.
How many system.slice/foo's actually exists in this system? I only see user.slice
There are several dozen children in system.slice, but all
Post by Dietmar Eggemann
related cfs_rq[x]:/autogroups-xxx on my Ubuntu-16.04 Desktop system.
[...]
Vincent Guittot
2016-10-14 08:24:49 UTC
Post by Vincent Guittot
Post by Dietmar Eggemann
Post by Vincent Guittot
On 13 October 2016 at 17:52, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Hi,
On 12 October 2016 at 18:21, Joseph Salisbury
Post by Joseph Salisbury
Post by Vincent Guittot
Post by Mike Galbraith
Post by Vincent Guittot
Post by Joseph Salisbury
Hello Peter,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit
commit 3d30544f02120b884bba2a9466c87dba980e3be5
Date: Tue Jun 21 14:27:50 2016 +0200
sched/fair: Apply more PELT fixes
This patch only speeds up the update of task group load in order to
reflect the new load balance but It should not change the final value
and as a result the final behavior. I will try to reproduce it in my
target later today
FWIW, I tried and failed w/wo autogroup on 4.8 and master.
Me too
Is it possible to get some dump of /proc/sched_debug while the problem occurs ?
Vincent
Post by Mike Galbraith
-Mike
http://paste.ubuntu.com/23312351/
I have looked at the dump and there is something very odd for
system.slice task group where the display manager is running.
system.slice->tg_load_avg is around 381697 but tg_load_avg is
normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
case. We can have some differences because the dump of
/proc/shed_debug is not atomic and some changes can happen but nothing
like this difference.
The main effect of this quite high value is that the weight/prio of
the sched_entity that represents system.slice in root cfs_rq is very
low (lower than task with the smallest nice prio) so the system.slice
task group will not get the CPU quite often compared to the user.slice
task group: less than 1% for the system.slice where lightDM and xorg
are running compared 99% for the user.slice where the stress tasks are
running. This is confirmed by the se->avg.util_avg value of the task
groups which reflect how much time each task group is effectively
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991
This difference of weight/priority explains why the system becomes
unresponsive. For now, I can't explain is why
system.slice->tg_load_avg = 381697 whereas is should be around 1013
and how the patch can generate this situation.
Is it possible to have a dump of /proc/sched_debug before starting
stress command ? to check if the problem is there from the beginning
but not seen because not overloaded. Or if it the problem comes when
user starts to load the system
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
This one is ok.
The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
#12~lp1627108Commit3d30544Reverted
so this is without the culprit commit
Post by Joseph Salisbury
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
This one has the exact same odds values for system.slice->tg_load_avg
than the 1st dump that you sent yesterday
The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
So this dump has been done with a different kernel than for the dump above.
As I can't find any stress task in the dump, i tend to believe that
the dump has been done before starting the stress tasks and not after
starting them. Can you confirm ?
If i'm right, it mean that the problem was already there before
starting stress tasks.
Could it be a problem I'm also seeing on my ARM64 Juno (6 logical cpus) w/o systemd
When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the tg_x->load_avg
becomes > 6*1024 before any tasks ran in it.
This is normal as se->avg.load_avg is initialized to
scale_load_down(se->load.weight) and this se->avg.load_avg will be
added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq
Post by Dietmar Eggemann
tg_x : 0xffff800975800d80
tg_y_1 : 0xffff800975800c00
tg_y_2 : 0xffff80097543d200
mkdir-2177 [002] 117.235241: bprint: sched_online_group: tg=0xffff800975800d80 tg->parent=0xffff000008fd0300
mkdir-2177 [002] 117.235244: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=0
mkdir-2177 [002] 117.235247: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=1
mkdir-2177 [002] 117.235249: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=2
mkdir-2177 [002] 117.235251: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=3
mkdir-2177 [002] 117.235253: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=4
mkdir-2177 [002] 117.235255: bprint: online_fair_sched_group: tg=0xffff800975800d80 cpu=5
mkdir-2181 [002] 117.353155: bprint: sched_online_group: tg=0xffff800975800c00 tg->parent=0xffff800975800d80
mkdir-2181 [002] 117.353158: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=0
mkdir-2181 [002] 117.353162: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=0 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353164: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=1
mkdir-2181 [002] 117.353167: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=1024 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353168: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=2
mkdir-2181 [002] 117.353171: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=2048 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353173: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=3
mkdir-2181 [002] 117.353175: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=3072 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353177: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=4
mkdir-2181 [002] 117.353179: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=4096 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
mkdir-2181 [002] 117.353180: bprint: online_fair_sched_group: tg=0xffff800975800c00 cpu=5
mkdir-2181 [002] 117.353183: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=5120 cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
This looks coherent as tg_x->load_avg = 5120 == sum of
cfs_rq[cpu]->load_avg and cfs_rq->tg_load_avg_contrib will be set to
cfs_rq[cpu]->load_avg at the end of the function call
In the buggy case, tg->load_avg = 381697 but sum of
cfs_rq[cpu]->tg_load_avg_contrib = 1013 = cfs_rq[cpu]->avg.load_avg so
there is no way to remove the 381697
Post by Dietmar Eggemann
mkdir-2185 [001] 117.502980: bprint: sched_online_group: tg=0xffff80097543d200 tg->parent=0xffff800975800d80
mkdir-2185 [001] 117.502982: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=0
mkdir-2185 [001] 117.502987: bprint: post_init_entity_util_avg: cpu=0 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6144 cfs_rq->avg.load_avg=1068 cfs_rq->tg_load_avg_contrib=1024 delta=44
mkdir-2185 [001] 117.502988: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=1
mkdir-2185 [001] 117.502992: bprint: post_init_entity_util_avg: cpu=1 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6188 cfs_rq->avg.load_avg=1058 cfs_rq->tg_load_avg_contrib=1024 delta=34
mkdir-2185 [001] 117.502993: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=2
mkdir-2185 [001] 117.502996: bprint: post_init_entity_util_avg: cpu=2 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6222 cfs_rq->avg.load_avg=1092 cfs_rq->tg_load_avg_contrib=1024 delta=68
mkdir-2185 [001] 117.502998: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=3
mkdir-2185 [001] 117.503001: bprint: post_init_entity_util_avg: cpu=3 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6290 cfs_rq->avg.load_avg=1069 cfs_rq->tg_load_avg_contrib=1024 delta=45
mkdir-2185 [001] 117.503002: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=4
mkdir-2185 [001] 117.503005: bprint: post_init_entity_util_avg: cpu=4 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6335 cfs_rq->avg.load_avg=1064 cfs_rq->tg_load_avg_contrib=1024 delta=40
mkdir-2185 [001] 117.503006: bprint: online_fair_sched_group: tg=0xffff80097543d200 cpu=5
mkdir-2185 [001] 117.503009: bprint: post_init_entity_util_avg: cpu=5 tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=*6375* cfs_rq->avg.load_avg=1061 cfs_rq->tg_load_avg_contrib=1024 delta=37
We add delta=1024 to tg_x->load_avg during tg_root/tg_x/tg_y_1 initialization but only
delta=~40 for tg_root/tg_x/tg_y_2.
IMO, this comes from the decay of tg_y_2->se[cpu]->avg.load_avg before
being attached to tg_x
But I don't think that it's the same issue as the tg_x->load_avg is
still coherent with the sum of cfs_rq->avg.load_avg and the sume of
cfs_rq->tg_load_avg_contrib
Post by Dietmar Eggemann
system.slice->tg_load_avg = 381697 is still pretty much higher and AFAICS, it's an
i5-5300U CPU so only 4 logical cpus.
How many system.slice/foo's actually exists in this system? I only see user.slice
There is several dozen of childs in system.slice but all
but all of the children have null load_avg and load_avg_contrib.
Post by Vincent Guittot
Post by Dietmar Eggemann
related cfs_rq[x]:/autogroups-xxx on my Ubuntu-16.04 Desktop system.
[...]
Dietmar Eggemann
2016-10-14 13:10:23 UTC
Post by Vincent Guittot
Post by Dietmar Eggemann
Post by Vincent Guittot
On 13 October 2016 at 17:52, Joseph Salisbury
Post by Vincent Guittot
Hi,
On 12 October 2016 at 18:21, Joseph Salisbury
[...]
Post by Vincent Guittot
Post by Dietmar Eggemann
When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the tg_x->load_avg
becomes > 6*1024 before any tasks ran in it.
This is normal as se->avg.load_avg is initialized to
scale_load_down(se->load.weight) and this se->avg.load_avg will be
added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq
Yeah, you're right; even when I've created 50 second-level groups,
tg_x->load_avg is ~6800.

Could it have something to do with the fact that se->load.weight = 2
for all these task groups on a 64-bit system?

When we then call __update_load_avg(..., se->on_rq *
scale_load_down(se->load.weight), ...), we pass a weight argument of 0
for these se's.

Does not happen with:

- if (shares < MIN_SHARES)
- shares = MIN_SHARES;
+ if (shares < scale_load(MIN_SHARES))
+ shares = scale_load(MIN_SHARES);

in calc_cfs_shares().
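
To make the effect of that hunk concrete, here is a stand-alone sketch;
MIN_SHARES=2 and the 2^10 scaling match the kernel as far as I know, while
the helper and the example share value are hypothetical:

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define scale_load(w)		((unsigned long)(w) << SCHED_FIXEDPOINT_SHIFT)
#define scale_load_down(w)	((unsigned long)(w) >> SCHED_FIXEDPOINT_SHIFT)
#define MIN_SHARES		2UL

/* Hypothetical helper mirroring only the lower clamp in calc_cfs_shares(). */
static unsigned long clamp_shares(unsigned long shares, int scaled_clamp)
{
	unsigned long lo = scaled_clamp ? scale_load(MIN_SHARES) : MIN_SHARES;

	return shares < lo ? lo : shares;
}

int main(void)
{
	unsigned long tiny = 1;	/* a group whose computed share is almost nothing */

	/* Old clamp: weight 2, which scale_load_down() turns into 0 for PELT. */
	unsigned long old_w = clamp_shares(tiny, 0);
	/* Patched clamp: weight 2048, which survives scale_load_down() as 2. */
	unsigned long new_w = clamp_shares(tiny, 1);

	printf("old: se->load.weight=%lu, PELT weight=%lu\n", old_w, scale_load_down(old_w));
	printf("new: se->load.weight=%lu, PELT weight=%lu\n", new_w, scale_load_down(new_w));
	return 0;
}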

[...]
