Discussion:
[PATCH 00/10] implement zsmalloc shrinking
Dan Streetman
2014-09-11 20:54:47 UTC
Now that zswap can use zsmalloc as a storage pool via zpool, it will
try to shrink its zsmalloc zs_pool once it reaches its max_pool_percent
limit. These patches implement zsmalloc shrinking. The pool is shrunk by
finding a zspage and reclaiming it, evicting each of its in-use objects.

Without these patches zswap, and any other future user of zpool/zsmalloc
that attempts to shrink the zpool/zs_pool, will only get errors and will
be unable to shrink its zpool/zs_pool. With the ability to shrink, zswap
can keep the most recent compressed pages in memory.

Note that the design of zsmalloc makes it impossible to actually find the
LRU zspage, so each class and fullness group is searched in round-robin
fashion to find the next zspage to reclaim. Each fullness group keeps its
zspages in LRU order, so the oldest zspage in each fullness group is used.
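
For context, here is a minimal sketch (hypothetical function name; greatly
simplified from what zswap actually does) of how a zpool user might drive
shrinking once its pool grows past a size limit, using the existing
zpool_shrink() and zpool_get_total_size() interfaces. With zsmalloc as the
backend, zpool_shrink() only starts succeeding with this series:

#include <linux/zpool.h>

/* Sketch only: shrink "pool" back under "limit_bytes" one step at a time.
 * my_shrink_to_limit() is a hypothetical caller, not part of this series.
 */
static void my_shrink_to_limit(struct zpool *pool, u64 limit_bytes)
{
	unsigned int reclaimed;

	while (zpool_get_total_size(pool) > limit_bytes) {
		/* ask the backing allocator to reclaim at least one page */
		if (zpool_shrink(pool, 1, &reclaimed))
			break;	/* e.g. -EINVAL before this series */
	}
}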

---

This patch set applies to linux-next.

Dan Streetman (10):
zsmalloc: fix init_zspage free obj linking
zsmalloc: add fullness group list for ZS_FULL zspages
zsmalloc: always update lru ordering of each zspage
zsmalloc: move zspage obj freeing to separate function
zsmalloc: add atomic index to find zspage to reclaim
zsmalloc: add zs_ops to zs_pool
zsmalloc: add obj_handle_is_free()
zsmalloc: add reclaim_zspage()
zsmalloc: add zs_shrink()
zsmalloc: implement zs_zpool_shrink() with zs_shrink()

drivers/block/zram/zram_drv.c | 2 +-
include/linux/zsmalloc.h | 7 +-
mm/zsmalloc.c | 314 +++++++++++++++++++++++++++++++++++++-----
3 files changed, 290 insertions(+), 33 deletions(-)
--
1.8.3.1

Dan Streetman
2014-09-11 20:54:53 UTC
When zsmalloc creates a new zspage, it initializes each object it contains
with a link to the next object, so that the zspage has a singly-linked list
of its free objects. However, the logic that sets up the links is wrong,
and in the case of objects that are precisely aligned with the page boundaries
(e.g. a zspage with objects that are 1/2 PAGE_SIZE) the first object on the
next page is skipped, due to incrementing the offset twice. The logic can be
simplified, as it doesn't need to calculate how many objects can fit on the
current page; simply checking the offset for each object is enough.

Change zsmalloc init_zspage() logic to iterate through each object on
each of its pages, checking the offset to verify the object is on the
current page before linking it into the zspage.

Signed-off-by: Dan Streetman <***@ieee.org>
Cc: Minchan Kim <***@kernel.org>
---
mm/zsmalloc.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index c4a9157..03aa72f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -628,7 +628,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
while (page) {
struct page *next_page;
struct link_free *link;
- unsigned int i, objs_on_page;
+ unsigned int i = 1;

/*
* page->index stores offset of first object starting
@@ -641,14 +641,10 @@ static void init_zspage(struct page *first_page, struct size_class *class)

link = (struct link_free *)kmap_atomic(page) +
off / sizeof(*link);
- objs_on_page = (PAGE_SIZE - off) / class->size;

- for (i = 1; i <= objs_on_page; i++) {
- off += class->size;
- if (off < PAGE_SIZE) {
- link->next = obj_location_to_handle(page, i);
- link += class->size / sizeof(*link);
- }
+ while ((off += class->size) < PAGE_SIZE) {
+ link->next = obj_location_to_handle(page, i++);
+ link += class->size / sizeof(*link);
}

/*
@@ -660,7 +656,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
link->next = obj_location_to_handle(next_page, 0);
kunmap_atomic(link);
page = next_page;
- off = (off + class->size) % PAGE_SIZE;
+ off %= PAGE_SIZE;
}
}
--
1.8.3.1
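
To make the double increment concrete, here is a small stand-alone sketch
(user-space, with a hypothetical 4K page and an object size of PAGE_SIZE/2;
not kernel code) comparing the next-page starting offset computed by the
old and new loops:

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
	unsigned long size = PAGE_SIZE / 2;	/* class->size, page-aligned case */
	unsigned long old_off = 0, new_off = 0, objs_on_page;
	unsigned int i;

	/* old logic: loop a fixed count, then increment once more at the end */
	objs_on_page = (PAGE_SIZE - old_off) / size;
	for (i = 1; i <= objs_on_page; i++)
		old_off += size;
	old_off = (old_off + size) % PAGE_SIZE;

	/* new logic: advance until the offset leaves the page, then wrap */
	while ((new_off += size) < PAGE_SIZE)
		;
	new_off %= PAGE_SIZE;

	/* prints 2048 (object 0 of the next page skipped) vs 0 */
	printf("old: %lu  new: %lu\n", old_off, new_off);
	return 0;
}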

Seth Jennings
2014-09-12 03:16:56 UTC
Post by Dan Streetman
When zsmalloc creates a new zspage, it initializes each object it contains
with a link to the next object, so that the zspage has a singly-linked list
of its free objects. However, the logic that sets up the links is wrong,
and in the case of objects that are precisely aligned with the page boundries
(e.g. a zspage with objects that are 1/2 PAGE_SIZE) the first object on the
next page is skipped, due to incrementing the offset twice. The logic can be
simplified, as it doesn't need to calculate how many objects can fit on the
current page; simply checking the offset for each object is enough.
Change zsmalloc init_zspage() logic to iterate through each object on
each of its pages, checking the offset to verify the object is on the
current page before linking it into the zspage.
This one stands on its own as a bugfix.
Post by Dan Streetman
---
mm/zsmalloc.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index c4a9157..03aa72f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -628,7 +628,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
while (page) {
struct page *next_page;
struct link_free *link;
- unsigned int i, objs_on_page;
+ unsigned int i = 1;
/*
* page->index stores offset of first object starting
@@ -641,14 +641,10 @@ static void init_zspage(struct page *first_page, struct size_class *class)
link = (struct link_free *)kmap_atomic(page) +
off / sizeof(*link);
- objs_on_page = (PAGE_SIZE - off) / class->size;
- for (i = 1; i <= objs_on_page; i++) {
- off += class->size;
- if (off < PAGE_SIZE) {
- link->next = obj_location_to_handle(page, i);
- link += class->size / sizeof(*link);
- }
+ while ((off += class->size) < PAGE_SIZE) {
+ link->next = obj_location_to_handle(page, i++);
+ link += class->size / sizeof(*link);
}
/*
@@ -660,7 +656,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
link->next = obj_location_to_handle(next_page, 0);
kunmap_atomic(link);
page = next_page;
- off = (off + class->size) % PAGE_SIZE;
+ off %= PAGE_SIZE;
}
}
--
1.8.3.1
Minchan Kim
2014-09-12 04:59:26 UTC
Post by Dan Streetman
When zsmalloc creates a new zspage, it initializes each object it contains
with a link to the next object, so that the zspage has a singly-linked list
of its free objects. However, the logic that sets up the links is wrong,
and in the case of objects that are precisely aligned with the page boundries
(e.g. a zspage with objects that are 1/2 PAGE_SIZE) the first object on the
next page is skipped, due to incrementing the offset twice. The logic can be
simplified, as it doesn't need to calculate how many objects can fit on the
current page; simply checking the offset for each object is enough.
If objects are precisely aligned with the page boundary, pages_per_zspage
should be 1 so there is no next page.
Post by Dan Streetman
Change zsmalloc init_zspage() logic to iterate through each object on
each of its pages, checking the offset to verify the object is on the
current page before linking it into the zspage.
---
mm/zsmalloc.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index c4a9157..03aa72f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -628,7 +628,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
while (page) {
struct page *next_page;
struct link_free *link;
- unsigned int i, objs_on_page;
+ unsigned int i = 1;
/*
* page->index stores offset of first object starting
@@ -641,14 +641,10 @@ static void init_zspage(struct page *first_page, struct size_class *class)
link = (struct link_free *)kmap_atomic(page) +
off / sizeof(*link);
- objs_on_page = (PAGE_SIZE - off) / class->size;
- for (i = 1; i <= objs_on_page; i++) {
- off += class->size;
- if (off < PAGE_SIZE) {
- link->next = obj_location_to_handle(page, i);
- link += class->size / sizeof(*link);
- }
+ while ((off += class->size) < PAGE_SIZE) {
+ link->next = obj_location_to_handle(page, i++);
+ link += class->size / sizeof(*link);
}
/*
@@ -660,7 +656,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
link->next = obj_location_to_handle(next_page, 0);
kunmap_atomic(link);
page = next_page;
- off = (off + class->size) % PAGE_SIZE;
+ off %= PAGE_SIZE;
}
}
--
1.8.3.1
--
Kind regards,
Minchan Kim
Dan Streetman
2014-09-12 16:43:52 UTC
Post by Minchan Kim
Post by Dan Streetman
When zsmalloc creates a new zspage, it initializes each object it contains
with a link to the next object, so that the zspage has a singly-linked list
of its free objects. However, the logic that sets up the links is wrong,
and in the case of objects that are precisely aligned with the page boundries
(e.g. a zspage with objects that are 1/2 PAGE_SIZE) the first object on the
next page is skipped, due to incrementing the offset twice. The logic can be
simplified, as it doesn't need to calculate how many objects can fit on the
current page; simply checking the offset for each object is enough.
If objects are precisely aligned with the page boundary, pages_per_zspage
should be 1 so there is no next page.
ah, ok. I wonder if it should be changed anyway so it doesn't rely on
that detail, in case that's ever changed in the future. It's not
obvious the existing logic relies on that for correct operation. And
this simplifies the logic too.
Post by Minchan Kim
Post by Dan Streetman
Change zsmalloc init_zspage() logic to iterate through each object on
each of its pages, checking the offset to verify the object is on the
current page before linking it into the zspage.
---
mm/zsmalloc.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index c4a9157..03aa72f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -628,7 +628,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
while (page) {
struct page *next_page;
struct link_free *link;
- unsigned int i, objs_on_page;
+ unsigned int i = 1;
/*
* page->index stores offset of first object starting
@@ -641,14 +641,10 @@ static void init_zspage(struct page *first_page, struct size_class *class)
link = (struct link_free *)kmap_atomic(page) +
off / sizeof(*link);
- objs_on_page = (PAGE_SIZE - off) / class->size;
- for (i = 1; i <= objs_on_page; i++) {
- off += class->size;
- if (off < PAGE_SIZE) {
- link->next = obj_location_to_handle(page, i);
- link += class->size / sizeof(*link);
- }
+ while ((off += class->size) < PAGE_SIZE) {
+ link->next = obj_location_to_handle(page, i++);
+ link += class->size / sizeof(*link);
}
/*
@@ -660,7 +656,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
link->next = obj_location_to_handle(next_page, 0);
kunmap_atomic(link);
page = next_page;
- off = (off + class->size) % PAGE_SIZE;
+ off %= PAGE_SIZE;
}
}
--
1.8.3.1
--
Kind regards,
Minchan Kim
Minchan Kim
2014-09-14 23:24:25 UTC
Post by Dan Streetman
Post by Minchan Kim
Post by Dan Streetman
When zsmalloc creates a new zspage, it initializes each object it contains
with a link to the next object, so that the zspage has a singly-linked list
of its free objects. However, the logic that sets up the links is wrong,
and in the case of objects that are precisely aligned with the page boundries
(e.g. a zspage with objects that are 1/2 PAGE_SIZE) the first object on the
next page is skipped, due to incrementing the offset twice. The logic can be
simplified, as it doesn't need to calculate how many objects can fit on the
current page; simply checking the offset for each object is enough.
If objects are precisely aligned with the page boundary, pages_per_zspage
should be 1 so there is no next page.
ah, ok. I wonder if it should be changed anyway so it doesn't rely on
that detail, in case that's ever changed in the future. It's not
obvious the existing logic relies on that for correct operation. And
this simplifies the logic too.
Correct description and resend if you want.
--
Kind regards,
Minchan Kim
Dan Streetman
2014-09-15 20:59:27 UTC
Change zsmalloc init_zspage() logic to iterate through each object on
each of its pages, checking the offset to verify the object is on the
current page before linking it into the zspage.

The current zsmalloc init_zspage free object linking code has logic
that relies on there only being one page per zspage when PAGE_SIZE
is a multiple of class->size. It calculates the number of objects
for the current page, and iterates through all of them plus one,
to account for the assumed partial object at the end of the page.
While this currently works, the logic can be simplified to just
link the object at each successive offset until the offset is larger
than PAGE_SIZE, which does not rely on PAGE_SIZE being a multiple of
class->size.

Signed-off-by: Dan Streetman <***@ieee.org>
Cc: Minchan Kim <***@kernel.org>
---
mm/zsmalloc.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index c4a9157..03aa72f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -628,7 +628,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
while (page) {
struct page *next_page;
struct link_free *link;
- unsigned int i, objs_on_page;
+ unsigned int i = 1;

/*
* page->index stores offset of first object starting
@@ -641,14 +641,10 @@ static void init_zspage(struct page *first_page, struct size_class *class)

link = (struct link_free *)kmap_atomic(page) +
off / sizeof(*link);
- objs_on_page = (PAGE_SIZE - off) / class->size;

- for (i = 1; i <= objs_on_page; i++) {
- off += class->size;
- if (off < PAGE_SIZE) {
- link->next = obj_location_to_handle(page, i);
- link += class->size / sizeof(*link);
- }
+ while ((off += class->size) < PAGE_SIZE) {
+ link->next = obj_location_to_handle(page, i++);
+ link += class->size / sizeof(*link);
}

/*
@@ -660,7 +656,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
link->next = obj_location_to_handle(next_page, 0);
kunmap_atomic(link);
page = next_page;
- off = (off + class->size) % PAGE_SIZE;
+ off %= PAGE_SIZE;
}
}
--
1.8.3.1

Minchan Kim
2014-09-16 02:41:03 UTC
Post by Dan Streetman
Change zsmalloc init_zspage() logic to iterate through each object on
each of its pages, checking the offset to verify the object is on the
current page before linking it into the zspage.
The current zsmalloc init_zspage free object linking code has logic
that relies on there only being one page per zspage when PAGE_SIZE
is a multiple of class->size. It calculates the number of objects
for the current page, and iterates through all of them plus one,
to account for the assumed partial object at the end of the page.
While this currently works, the logic can be simplified to just
link the object at each successive offset until the offset is larger
than PAGE_SIZE, which does not rely on PAGE_SIZE being a multiple of
class->size.
Acked-by: Minchan Kim <***@kernel.org>
--
Kind regards,
Minchan Kim
Dan Streetman
2014-09-11 20:54:57 UTC
Move ZS_FULL into the section of fullness_group entries that are tracked in
the class fullness_lists. Without this change, full zspages are untracked
by zsmalloc; they are only moved back onto one of the tracked lists
(ZS_ALMOST_FULL or ZS_ALMOST_EMPTY) when a zsmalloc user frees one or more
of its contained objects.

This is required for zsmalloc shrinking, which needs to be able to search
all zspages in a zsmalloc pool, to find one to shrink.

Signed-off-by: Dan Streetman <***@ieee.org>
Cc: Minchan Kim <***@kernel.org>
---
mm/zsmalloc.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 03aa72f..fedb70f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -159,16 +159,19 @@
ZS_SIZE_CLASS_DELTA + 1)

/*
- * We do not maintain any list for completely empty or full pages
+ * We do not maintain any list for completely empty zspages,
+ * since a zspage is freed when it becomes empty.
*/
enum fullness_group {
ZS_ALMOST_FULL,
ZS_ALMOST_EMPTY,
+ ZS_FULL,
+
_ZS_NR_FULLNESS_GROUPS,

ZS_EMPTY,
- ZS_FULL
};
+#define _ZS_NR_AVAILABLE_FULLNESS_GROUPS ZS_FULL

/*
* We assign a page to ZS_ALMOST_EMPTY fullness group when:
@@ -722,12 +725,12 @@ cleanup:
return first_page;
}

-static struct page *find_get_zspage(struct size_class *class)
+static struct page *find_available_zspage(struct size_class *class)
{
int i;
struct page *page;

- for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
+ for (i = 0; i < _ZS_NR_AVAILABLE_FULLNESS_GROUPS; i++) {
page = class->fullness_list[i];
if (page)
break;
@@ -1013,7 +1016,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
BUG_ON(class_idx != class->index);

spin_lock(&class->lock);
- first_page = find_get_zspage(class);
+ first_page = find_available_zspage(class);

if (!first_page) {
spin_unlock(&class->lock);
--
1.8.3.1

Dan Streetman
2014-09-11 20:55:01 UTC
Update ordering of a changed zspage in its fullness group LRU list,
even if it has not moved to a different fullness group.

This is needed by zsmalloc shrinking, which partially relies on each
class fullness group list to be kept in LRU order, so the oldest can
be reclaimed first. Currently, LRU ordering is only updated when
a zspage changes fullness groups.

Signed-off-by: Dan Streetman <***@ieee.org>
Cc: Minchan Kim <***@kernel.org>
---
mm/zsmalloc.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index fedb70f..51db622 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -467,16 +467,14 @@ static enum fullness_group fix_fullness_group(struct zs_pool *pool,
BUG_ON(!is_first_page(page));

get_zspage_mapping(page, &class_idx, &currfg);
- newfg = get_fullness_group(page);
- if (newfg == currfg)
- goto out;
-
class = &pool->size_class[class_idx];
+ newfg = get_fullness_group(page);
+ /* Need to do this even if currfg == newfg, to update lru */
remove_zspage(page, class, currfg);
insert_zspage(page, class, newfg);
- set_zspage_mapping(page, class_idx, newfg);
+ if (currfg != newfg)
+ set_zspage_mapping(page, class_idx, newfg);

-out:
return newfg;
}
--
1.8.3.1

Seth Jennings
2014-09-12 03:20:40 UTC
Post by Dan Streetman
Update ordering of a changed zspage in its fullness group LRU list,
even if it has not moved to a different fullness group.
This is needed by zsmalloc shrinking, which partially relies on each
class fullness group list to be kept in LRU order, so the oldest can
be reclaimed first. Currently, LRU ordering is only updated when
a zspage changes fullness groups.
Just something I saw.

fix_fullness_group() is called from zs_free(), which means that removing
an object from a zspage moves it to the front of the LRU. Not sure if
that is what we want. If anything, that makes it a _better_ candidate
for reclaim, as the zspage now contains fewer objects that we'll have
to decompress and write back.

Seth
Post by Dan Streetman
---
mm/zsmalloc.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index fedb70f..51db622 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -467,16 +467,14 @@ static enum fullness_group fix_fullness_group(struct zs_pool *pool,
BUG_ON(!is_first_page(page));
get_zspage_mapping(page, &class_idx, &currfg);
- newfg = get_fullness_group(page);
- if (newfg == currfg)
- goto out;
-
class = &pool->size_class[class_idx];
+ newfg = get_fullness_group(page);
+ /* Need to do this even if currfg == newfg, to update lru */
remove_zspage(page, class, currfg);
insert_zspage(page, class, newfg);
- set_zspage_mapping(page, class_idx, newfg);
+ if (currfg != newfg)
+ set_zspage_mapping(page, class_idx, newfg);
return newfg;
}
--
1.8.3.1
Dan Streetman
2014-09-11 20:55:06 UTC
Move the code that frees a zspage object out of the zs_free()
function and into its own obj_free() function.

This is required by zsmalloc shrinking, which will also need to
free objects during zspage reclaiming.

Signed-off-by: Dan Streetman <***@ieee.org>
Cc: Minchan Kim <***@kernel.org>
---
mm/zsmalloc.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 51db622..cff8935 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -584,6 +584,21 @@ static unsigned long obj_idx_to_offset(struct page *page,
return off + obj_idx * class_size;
}

+static void obj_free(unsigned long obj, struct page *page, unsigned long offset)
+{
+ struct page *first_page = get_first_page(page);
+ struct link_free *link;
+
+ /* Insert this object in containing zspage's freelist */
+ link = (struct link_free *)((unsigned char *)kmap_atomic(page)
+ + offset);
+ link->next = first_page->freelist;
+ kunmap_atomic(link);
+ first_page->freelist = (void *)obj;
+
+ first_page->inuse--;
+}
+
static void reset_page(struct page *page)
{
clear_bit(PG_private, &page->flags);
@@ -1049,7 +1064,6 @@ EXPORT_SYMBOL_GPL(zs_malloc);

void zs_free(struct zs_pool *pool, unsigned long obj)
{
- struct link_free *link;
struct page *first_page, *f_page;
unsigned long f_objidx, f_offset;

@@ -1069,14 +1083,8 @@ void zs_free(struct zs_pool *pool, unsigned long obj)

spin_lock(&class->lock);

- /* Insert this object in containing zspage's freelist */
- link = (struct link_free *)((unsigned char *)kmap_atomic(f_page)
- + f_offset);
- link->next = first_page->freelist;
- kunmap_atomic(link);
- first_page->freelist = (void *)obj;
+ obj_free(obj, f_page, f_offset);

- first_page->inuse--;
fullness = fix_fullness_group(pool, first_page);
spin_unlock(&class->lock);
--
1.8.3.1

Dan Streetman
2014-09-11 20:55:15 UTC
Add struct zs_ops with an evict() callback function. Add documentation
to the zs_free() function clarifying that it must not be called on a
zs_pool handle after that handle has been successfully evicted; since
evict() calls into a function provided by the zs_pool creator, the
creator is therefore responsible for ensuring this requirement.

This is required to implement zsmalloc shrinking.

Signed-off-by: Dan Streetman <***@ieee.org>
Cc: Minchan Kim <***@kernel.org>
---
drivers/block/zram/zram_drv.c | 2 +-
include/linux/zsmalloc.h | 6 +++++-
mm/zsmalloc.c | 26 ++++++++++++++++++++++++--
3 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index bc20fe1..31ba9c7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -328,7 +328,7 @@ static struct zram_meta *zram_meta_alloc(u64 disksize)
goto free_meta;
}

- meta->mem_pool = zs_create_pool(GFP_NOIO | __GFP_HIGHMEM);
+ meta->mem_pool = zs_create_pool(GFP_NOIO | __GFP_HIGHMEM, NULL);
if (!meta->mem_pool) {
pr_err("Error creating memory pool\n");
goto free_table;
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 05c2147..2c341d4 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -36,7 +36,11 @@ enum zs_mapmode {

struct zs_pool;

-struct zs_pool *zs_create_pool(gfp_t flags);
+struct zs_ops {
+ int (*evict)(struct zs_pool *pool, unsigned long handle);
+};
+
+struct zs_pool *zs_create_pool(gfp_t flags, struct zs_ops *ops);
void zs_destroy_pool(struct zs_pool *pool);

unsigned long zs_malloc(struct zs_pool *pool, size_t size);
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index a2e417b..3dc7dae 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -221,6 +221,8 @@ struct zs_pool {

gfp_t flags; /* allocation flags used when growing pool */
atomic_long_t pages_allocated;
+
+ struct zs_ops *ops;
};

/*
@@ -256,9 +258,18 @@ static enum fullness_group lru_fg[] = {

#ifdef CONFIG_ZPOOL

+static int zs_zpool_evict(struct zs_pool *pool, unsigned long handle)
+{
+ return zpool_evict(pool, handle);
+}
+
+static struct zs_ops zs_zpool_ops = {
+ .evict = zs_zpool_evict
+};
+
static void *zs_zpool_create(gfp_t gfp, struct zpool_ops *zpool_ops)
{
- return zs_create_pool(gfp);
+ return zs_create_pool(gfp, &zs_zpool_ops);
}

static void zs_zpool_destroy(void *pool)
@@ -1019,7 +1030,7 @@ fail:
* On success, a pointer to the newly created pool is returned,
* otherwise NULL.
*/
-struct zs_pool *zs_create_pool(gfp_t flags)
+struct zs_pool *zs_create_pool(gfp_t flags, struct zs_ops *ops)
{
int i, ovhd_size;
struct zs_pool *pool;
@@ -1046,6 +1057,7 @@ struct zs_pool *zs_create_pool(gfp_t flags)
}

pool->flags = flags;
+ pool->ops = ops;

return pool;
}
@@ -1130,6 +1142,16 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
}
EXPORT_SYMBOL_GPL(zs_malloc);

+/**
+ * zs_free - Free the handle from this pool.
+ * @pool: pool containing the handle
+ * @obj: the handle to free
+ *
+ * The caller must provide a valid handle that is contained
+ * in the provided pool. The caller must ensure this is
+ * not called after evict() has returned successfully for the
+ * handle.
+ */
void zs_free(struct zs_pool *pool, unsigned long obj)
{
struct page *first_page, *f_page;
--
1.8.3.1
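
As a usage illustration, here is a minimal sketch (hypothetical names, not
taken from zswap or zram) of a zsmalloc user registering an evict callback
through the new zs_ops interface:

#include <linux/zsmalloc.h>

/* Sketch only: the pool creator supplies ops->evict, which must move the
 * object's data elsewhere. On success, reclaim frees the object itself,
 * so the creator must not call zs_free() on that handle afterwards.
 */
static int my_evict(struct zs_pool *pool, unsigned long handle)
{
	/* map the handle, write its data back, drop any references ... */
	return 0;	/* 0 on success, -errno to abort reclaim of this zspage */
}

static struct zs_ops my_zs_ops = {
	.evict = my_evict,
};

static struct zs_pool *my_create_pool(void)
{
	return zs_create_pool(GFP_NOIO | __GFP_HIGHMEM, &my_zs_ops);
}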

Dan Streetman
2014-09-11 20:55:20 UTC
Add function zs_shrink() to implement zs_pool shrinking. This allows
the pool owner to reduce the size of the zs_pool by one zspage, which
contains one or more struct pages. Once the zs_pool is shrunk, the
freed pages are available for system use.

This is used by zswap to limit its total system memory usage to a
user-defined amount, while attempting to keep the most recently stored
pages compressed in memory; the oldest (or at least older) pages are
evicted from the zsmalloc zs_pool and written out to the swap disk.

Signed-off-by: Dan Streetman <***@ieee.org>
Cc: Minchan Kim <***@kernel.org>
---
include/linux/zsmalloc.h | 1 +
mm/zsmalloc.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 36 insertions(+)

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 2c341d4..07fe84d 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -45,6 +45,7 @@ void zs_destroy_pool(struct zs_pool *pool);

unsigned long zs_malloc(struct zs_pool *pool, size_t size);
void zs_free(struct zs_pool *pool, unsigned long obj);
+int zs_shrink(struct zs_pool *pool);

void *zs_map_object(struct zs_pool *pool, unsigned long handle,
enum zs_mapmode mm);
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 60fd23e..f769c21 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1296,6 +1296,41 @@ void zs_free(struct zs_pool *pool, unsigned long obj)
EXPORT_SYMBOL_GPL(zs_free);

/**
+ * zs_shrink - Shrink the pool
+ * @pool: pool to shrink
+ *
+ * The pool will be shrunk by one zspage, which is some
+ * number of pages in size. On success, the number of freed
+ * pages is returned. On failure, the error is returned.
+ */
+int zs_shrink(struct zs_pool *pool)
+{
+ struct size_class *class;
+ enum fullness_group fullness;
+ struct page *page;
+ int class_idx, ret;
+
+ if (!pool->ops || !pool->ops->evict)
+ return -EINVAL;
+
+ /* if a page is found, the class is locked */
+ page = find_lru_zspage(pool);
+ if (!page)
+ return -ENOENT;
+
+ get_zspage_mapping(page, &class_idx, &fullness);
+ class = &pool->size_class[class_idx];
+
+ /* reclaim_zspage unlocks the class lock */
+ ret = reclaim_zspage(pool, page);
+ if (ret)
+ return ret;
+
+ return class->pages_per_zspage;
+}
+EXPORT_SYMBOL_GPL(zs_shrink);
+
+/**
* zs_map_object - get address of allocated object from handle.
* @pool: pool from which the object was allocated
* @handle: handle returned from zs_malloc
--
1.8.3.1
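
For illustration, a minimal sketch (hypothetical caller; essentially the
loop the later zs_zpool_shrink() patch uses) of a pool owner freeing at
least a target number of pages with zs_shrink():

/* Sketch only: repeatedly shrink "pool" until at least "target" pages
 * have been freed. zs_shrink() returns pages freed, or a negative errno.
 */
static int my_shrink_by(struct zs_pool *pool, unsigned int target)
{
	unsigned int freed = 0;
	int ret;

	while (freed < target) {
		ret = zs_shrink(pool);
		if (ret < 0)
			return freed ? 0 : ret;	/* partial progress counts */
		freed += ret;
	}
	return 0;
}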

Dan Streetman
2014-09-11 20:55:27 UTC
Add function reclaim_zspage() to evict each in-use object in the provided
zspage, so that the zspage can be freed. This is required to be able to
shrink the zs_pool. In zs_free(), check whether the handle's zspage is in
the reclaim fullness group, and if so ignore it, since it will be freed
during reclaim.

Signed-off-by: Dan Streetman <***@ieee.org>
Cc: Minchan Kim <***@kernel.org>
---
mm/zsmalloc.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 82 insertions(+)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index ab72390..60fd23e 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -170,6 +170,7 @@ enum fullness_group {
_ZS_NR_FULLNESS_GROUPS,

ZS_EMPTY,
+ ZS_RECLAIM
};
#define _ZS_NR_AVAILABLE_FULLNESS_GROUPS ZS_FULL

@@ -786,6 +787,80 @@ cleanup:
return first_page;
}

+/*
+ * This tries to reclaim all the provided zspage's objects by calling the
+ * zs_pool's ops->evict function for each object in use. This requires
+ * the zspage's class lock to be held when calling this function. Since
+ * the evict function may sleep, this drops the class lock before evicting
+ * any objects. No other locks should be held when calling this function.
+ * This will return with the class lock unlocked.
+ *
+ * If there is no zs_pool->ops or ops->evict function, this returns error.
+ *
+ * This returns 0 on success, -err on failure. On failure, some of the
+ * objects may have been freed, but not all. On success, the entire zspage
+ * has been freed and should not be used anymore.
+ */
+static int reclaim_zspage(struct zs_pool *pool, struct page *first_page)
+{
+ struct size_class *class;
+ enum fullness_group fullness;
+ struct page *page = first_page;
+ unsigned long handle;
+ int class_idx, ret = 0;
+
+ BUG_ON(!is_first_page(first_page));
+
+ get_zspage_mapping(first_page, &class_idx, &fullness);
+ class = &pool->size_class[class_idx];
+
+ assert_spin_locked(&class->lock);
+
+ if (!pool->ops || !pool->ops->evict) {
+ spin_unlock(&class->lock);
+ return -EINVAL;
+ }
+
+ /* move the zspage into the reclaim fullness group,
+ * so it's not available for use by zs_malloc,
+ * and won't be freed by zs_free
+ */
+ remove_zspage(first_page, class, fullness);
+ set_zspage_mapping(first_page, class_idx, ZS_RECLAIM);
+
+ spin_unlock(&class->lock);
+
+ might_sleep();
+
+ while (page) {
+ unsigned long offset, idx = 0;
+
+ while ((offset = obj_idx_to_offset(page, idx, class->size))
+ < PAGE_SIZE) {
+ handle = (unsigned long)obj_location_to_handle(page,
+ idx++);
+ if (obj_handle_is_free(first_page, class, handle))
+ continue;
+ ret = pool->ops->evict(pool, handle);
+ if (ret) {
+ spin_lock(&class->lock);
+ fix_fullness_group(pool, first_page);
+ spin_unlock(&class->lock);
+ return ret;
+ }
+ obj_free(handle, page, offset);
+ }
+
+ page = get_next_page(page);
+ }
+
+ free_zspage(first_page);
+
+ atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
+
+ return 0;
+}
+
static struct page *find_available_zspage(struct size_class *class)
{
int i;
@@ -1200,6 +1275,13 @@ void zs_free(struct zs_pool *pool, unsigned long obj)

spin_lock(&class->lock);

+ /* must re-check fullness after taking class lock */
+ get_zspage_mapping(first_page, &class_idx, &fullness);
+ if (fullness == ZS_RECLAIM) {
+ spin_unlock(&class->lock);
+ return; /* will be freed during reclaim */
+ }
+
obj_free(obj, f_page, f_offset);

fullness = fix_fullness_group(pool, first_page);
--
1.8.3.1

Dan Streetman
2014-09-11 20:56:06 UTC
Implement the zs_zpool_shrink() function, which previously just returned
-EINVAL, by calling the zs_shrink() function to shrink the zs_pool by one
zspage. The zs_shrink() function is called in a loop until the requested
number of pages have been reclaimed, or an error occurs.

Signed-off-by: Dan Streetman <***@ieee.org>
Cc: Minchan Kim <***@kernel.org>
---
mm/zsmalloc.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index f769c21..4937b2b 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -292,7 +292,20 @@ static void zs_zpool_free(void *pool, unsigned long handle)
static int zs_zpool_shrink(void *pool, unsigned int pages,
unsigned int *reclaimed)
{
- return -EINVAL;
+ int total = 0, ret = 0;
+
+ while (total < pages) {
+ ret = zs_shrink(pool);
+ WARN_ON(!ret);
+ if (ret <= 0)
+ break;
+ total += ret;
+ ret = 0;
+ }
+
+ if (reclaimed)
+ *reclaimed = total;
+ return ret;
}

static void *zs_zpool_map(void *pool, unsigned long handle,
--
1.8.3.1

Dan Streetman
2014-09-11 20:56:41 UTC
Add function obj_handle_is_free() which scans through the entire
singly-linked list of free objects inside the provided zspage to
determine if the provided object handle is free or not. This is
required by zspage reclaiming, which needs to evict each object
that is currently in use by the zs_pool owner, but has no other
way to determine if an object is in use.

Signed-off-by: Dan Streetman <***@ieee.org>
Cc: Minchan Kim <***@kernel.org>
---
mm/zsmalloc.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 3dc7dae..ab72390 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -605,6 +605,33 @@ static unsigned long obj_idx_to_offset(struct page *page,
return off + obj_idx * class_size;
}

+static bool obj_handle_is_free(struct page *first_page,
+ struct size_class *class, unsigned long handle)
+{
+ unsigned long obj, idx, offset;
+ struct page *page;
+ struct link_free *link;
+
+ BUG_ON(!is_first_page(first_page));
+
+ obj = (unsigned long)first_page->freelist;
+
+ while (obj) {
+ if (obj == handle)
+ return true;
+
+ obj_handle_to_location(obj, &page, &idx);
+ offset = obj_idx_to_offset(page, idx, class->size);
+
+ link = (struct link_free *)kmap_atomic(page) +
+ offset / sizeof(*link);
+ obj = (unsigned long)link->next;
+ kunmap_atomic(link);
+ }
+
+ return false;
+}
+
static void obj_free(unsigned long obj, struct page *page, unsigned long offset)
{
struct page *first_page = get_first_page(page);
--
1.8.3.1

Dan Streetman
2014-09-11 20:56:57 UTC
Add an atomic index that allows multiple threads to concurrently and
sequentially iterate through the zspages in all classes and fullness
groups. Add a function find_next_lru_class_fg() to find the next class
fullness group to check for a zspage. Add a function find_lru_zspage()
which calls find_next_lru_class_fg() until a fullness group with an
available zspage is found. This is required to implement zsmalloc pool
shrinking, which needs to be able to find a zspage to reclaim.

Since zsmalloc categorizes its zspages in arrays of fullness groups, which
are themselves inside arrays of classes, there is no (simple) way to
determine the LRU order of all a zsmalloc pool's zspages. But to implement
shrinking, there must be some way to select the zspage to reclaim. This
can't use a simple iteration through all classes, since any failure to
reclaim a zspage would cause each subsequent reclaim attempt to retry the
same zspage, which would likely result in repeated failures to shrink the
pool.

Signed-off-by: Dan Streetman <***@ieee.org>
Cc: Minchan Kim <***@kernel.org>
---
mm/zsmalloc.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index cff8935..a2e417b 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -242,6 +242,16 @@ struct mapping_area {
enum zs_mapmode vm_mm; /* mapping mode */
};

+/* atomic counter indicating which class/fg to reclaim from */
+static atomic_t lru_class_fg;
+/* specific order of fg we want to reclaim from */
+static enum fullness_group lru_fg[] = {
+ ZS_ALMOST_EMPTY,
+ ZS_ALMOST_FULL,
+ ZS_FULL
+};
+#define _ZS_NR_LRU_CLASS_FG (ZS_SIZE_CLASSES * ARRAY_SIZE(lru_fg))
+
/* zpool driver */

#ifdef CONFIG_ZPOOL
@@ -752,6 +762,64 @@ static struct page *find_available_zspage(struct size_class *class)
return page;
}

+/* this simply iterates atomically through all classes,
+ * using a specific fullness group. At the end, it starts
+ * over using the next fullness group, and so on. The
+ * fullness groups are used in a specific order, from
+ * least to most full.
+ */
+static void find_next_lru_class_fg(struct zs_pool *pool,
+ struct size_class **class, enum fullness_group *fg)
+{
+ int i = atomic_inc_return(&lru_class_fg);
+
+ if (i >= _ZS_NR_LRU_CLASS_FG) {
+ int orig = i;
+
+ i %= _ZS_NR_LRU_CLASS_FG;
+ /* only need to try once, since if we don't
+ * succeed whoever changed it will also try
+ * and eventually someone will reset it
+ */
+ atomic_cmpxchg(&lru_class_fg, orig, i);
+ }
+ *class = &pool->size_class[i % ZS_SIZE_CLASSES];
+ *fg = lru_fg[i / ZS_SIZE_CLASSES];
+}
+
+/*
+ * This attempts to find the LRU zspage, but that's not really possible
+ * because zspages are not contained in a single LRU list, they're
+ * contained inside fullness groups which are themselves contained
+ * inside classes. So this simply iterates through the classes and
+ * fullness groups to find the next non-empty fullness group, and
+ * uses the LRU zspage there.
+ *
+ * On success, the zspage is returned with its class locked.
+ * On failure, NULL is returned.
+ */
+static struct page *find_lru_zspage(struct zs_pool *pool)
+{
+ struct size_class *class;
+ struct page *page;
+ enum fullness_group fg;
+ int tries = 0;
+
+ while (tries++ < _ZS_NR_LRU_CLASS_FG) {
+ find_next_lru_class_fg(pool, &class, &fg);
+
+ spin_lock(&class->lock);
+
+ page = class->fullness_list[fg];
+ if (page)
+ return list_prev_entry(page, lru);
+
+ spin_unlock(&class->lock);
+ }
+
+ return NULL;
+}
+
#ifdef CONFIG_PGTABLE_MAPPING
static inline int __zs_cpu_up(struct mapping_area *area)
{
--
1.8.3.1

Seth Jennings
2014-09-12 03:15:15 UTC
Post by Dan Streetman
Now that zswap can use zsmalloc as a storage pool via zpool, it will
try to shrink its zsmalloc zs_pool once it reaches its max_pool_percent
limit. These patches implement zsmalloc shrinking. The way the pool is
shrunk is by finding a zspage and reclaiming it, by evicting each of its
objects that is in use.
Without these patches zswap, and any other future user of zpool/zsmalloc
that attempts to shrink the zpool/zs_pool, will only get errors and will
be unable to shrink its zpool/zs_pool. With the ability to shrink, zswap
can keep the most recent compressed pages in memory.
Note that the design of zsmalloc makes it impossible to actually find the
LRU zspage, so each class and fullness group is searched in a round-robin
method to find the next zspage to reclaim. Each fullness group orders its
zspages in LRU order, so the oldest zspage is used for each fullness group.
After a quick inspection, the code looks reasonable. Thanks!

I do wonder if this actually works well in practice though.

Have you run any tests that overflow the zsmalloc pool? What does
performance look like at that point? I would expect it would be worse
than allowing the overflow pages to go straight to swap, since, in
almost every case, you would be writing back more than one page. In
some cases, MANY more than one page (up to 255 for a full zspage in the
minimum class size).

There have always been two sticking points with shrinking in zsmalloc
(one of which you have mentioned):

1) Low LRU locality among objects in a zspage. zsmalloc values density
over reclaim ordering so it is hard to make good reclaim selection
decisions.

2) Writeback storm. If you try to reclaim a zspage with lots of objects
(i.e. small class size in fullness group ZS_FULL) you can create a ton
of memory pressure by uncompressing objects and adding them to the swap
cache.

A few reclaim models:

- Reclaim zspage with fewest objects:

This reduces writeback storm but would likely reclaim more recently
allocated zspages that contain more recently used (added) objects.

- Reclaim zspage with largest class size:

This also reduces writeback storm as zspages with larger objects
(poorly compressible) are written back first. This is not LRU though.
This is the best of the options IMHO. I'm not saying that it is good.

- Reclaim LRU round-robin through the fullness groups (approach used):

The LRU here is limited since, as the number of objects in the zspage
increases, it is LRU only wrt the most recently added object in the
zspage. It also has high risk of a writeback storm since it will
eventually try to reclaim from the ZS_FULL group of the minimum class
size.

There is also the point that writing back objects might not be the best
way to reclaim from zsmalloc at all. Maybe compaction is the way to go.
This was recently discussed on the list.

http://marc.info/?l=linux-mm&m=140917577412645&w=2

As mentioned in that thread, it would require zsmalloc to add a
layer of indirection so that the objects could be relocated without
notifying the user. The compaction mechanism would also be "fun" to
design I imagine. But, in my mind, compaction is really needed,
regardless of whether or not zsmalloc is capable of writeback, and would
be more beneficial.

tl;dr version:

I would really need to see some evidence (and try it myself) that this
didn't run off a cliff when you overflow the zsmalloc pool. It seems
like additional risk and complexity to avoid LRU inversion _after_ the
pool overflows. And by "avoid" I mean "maybe avoid" as the reclaim
selection is just slightly more LRUish than random selection.

Thanks,
Seth
Post by Dan Streetman
---
This patch set applies to linux-next.
zsmalloc: fix init_zspage free obj linking
zsmalloc: add fullness group list for ZS_FULL zspages
zsmalloc: always update lru ordering of each zspage
zsmalloc: move zspage obj freeing to separate function
zsmalloc: add atomic index to find zspage to reclaim
zsmalloc: add zs_ops to zs_pool
zsmalloc: add obj_handle_is_free()
zsmalloc: add reclaim_zspage()
zsmalloc: add zs_shrink()
zsmalloc: implement zs_zpool_shrink() with zs_shrink()
drivers/block/zram/zram_drv.c | 2 +-
include/linux/zsmalloc.h | 7 +-
mm/zsmalloc.c | 314 +++++++++++++++++++++++++++++++++++++-----
3 files changed, 290 insertions(+), 33 deletions(-)
--
1.8.3.1
Minchan Kim
2014-09-12 05:46:48 UTC
Post by Dan Streetman
Now that zswap can use zsmalloc as a storage pool via zpool, it will
try to shrink its zsmalloc zs_pool once it reaches its max_pool_percent
limit. These patches implement zsmalloc shrinking. The way the pool is
shrunk is by finding a zspage and reclaiming it, by evicting each of its
objects that is in use.
Without these patches zswap, and any other future user of zpool/zsmalloc
that attempts to shrink the zpool/zs_pool, will only get errors and will
be unable to shrink its zpool/zs_pool. With the ability to shrink, zswap
can keep the most recent compressed pages in memory.
Note that the design of zsmalloc makes it impossible to actually find the
LRU zspage, so each class and fullness group is searched in a round-robin
method to find the next zspage to reclaim. Each fullness group orders its
zspages in LRU order, so the oldest zspage is used for each fullness group.
1. Please Cc Mel, who was strongly against zswap with zsmalloc.
2. I don't think LRU stuff should be in the allocator layer. Especially
since it's really hard to make it work well in the zsmalloc design.
3. If you want to add another writeback, make zswap writeback sane first.
The current implementation (zswap store -> zbud reclaim -> zswap writeback,
even) is really ugly.
4. Don't make zsmalloc complicated without any data (benefit, regression).
I will never ack this if you don't give numbers and a real use case.
Post by Dan Streetman
---
This patch set applies to linux-next.
zsmalloc: fix init_zspage free obj linking
zsmalloc: add fullness group list for ZS_FULL zspages
zsmalloc: always update lru ordering of each zspage
zsmalloc: move zspage obj freeing to separate function
zsmalloc: add atomic index to find zspage to reclaim
zsmalloc: add zs_ops to zs_pool
zsmalloc: add obj_handle_is_free()
zsmalloc: add reclaim_zspage()
zsmalloc: add zs_shrink()
zsmalloc: implement zs_zpool_shrink() with zs_shrink()
drivers/block/zram/zram_drv.c | 2 +-
include/linux/zsmalloc.h | 7 +-
mm/zsmalloc.c | 314 +++++++++++++++++++++++++++++++++++++-----
3 files changed, 290 insertions(+), 33 deletions(-)
--
1.8.3.1
--
Kind regards,
Minchan Kim
Dan Streetman
2014-09-12 17:05:39 UTC
Post by Minchan Kim
Post by Dan Streetman
Now that zswap can use zsmalloc as a storage pool via zpool, it will
try to shrink its zsmalloc zs_pool once it reaches its max_pool_percent
limit. These patches implement zsmalloc shrinking. The way the pool is
shrunk is by finding a zspage and reclaiming it, by evicting each of its
objects that is in use.
Without these patches zswap, and any other future user of zpool/zsmalloc
that attempts to shrink the zpool/zs_pool, will only get errors and will
be unable to shrink its zpool/zs_pool. With the ability to shrink, zswap
can keep the most recent compressed pages in memory.
Note that the design of zsmalloc makes it impossible to actually find the
LRU zspage, so each class and fullness group is searched in a round-robin
method to find the next zspage to reclaim. Each fullness group orders its
zspages in LRU order, so the oldest zspage is used for each fullness group.
1. Pz, Cc Mel who was strong against zswap with zsmalloc.
2. I don't think LRU stuff should be in allocator layer. Exp, it's really
hard to work well in zsmalloc design.
I didn't add any LRU - the existing fullness group LRU ordering is
already there. And yes, the zsmalloc design prevents any real LRU
ordering, beyond per-fullness-group LRU ordering.
Post by Minchan Kim
3. If you want to add another writeback, make zswap writeback sane first.
current implemenation(zswap store -> zbud reclaim -> zswap writeback,
even) is really ugly.
why what's wrong with that? how else can zbud/zsmalloc evict stored objects?
Post by Minchan Kim
4. Don't make zsmalloc complicated without any data(benefit, regression)
I will never ack if you don't give any number and real usecase.
ok, i'll run performance tests then, but let me know if you see any
technical problems with any of the patches before then.

thanks!
Post by Minchan Kim
Post by Dan Streetman
---
This patch set applies to linux-next.
zsmalloc: fix init_zspage free obj linking
zsmalloc: add fullness group list for ZS_FULL zspages
zsmalloc: always update lru ordering of each zspage
zsmalloc: move zspage obj freeing to separate function
zsmalloc: add atomic index to find zspage to reclaim
zsmalloc: add zs_ops to zs_pool
zsmalloc: add obj_handle_is_free()
zsmalloc: add reclaim_zspage()
zsmalloc: add zs_shrink()
zsmalloc: implement zs_zpool_shrink() with zs_shrink()
drivers/block/zram/zram_drv.c | 2 +-
include/linux/zsmalloc.h | 7 +-
mm/zsmalloc.c | 314 +++++++++++++++++++++++++++++++++++++-----
3 files changed, 290 insertions(+), 33 deletions(-)
--
1.8.3.1
--
Kind regards,
Minchan Kim
Minchan Kim
2014-09-15 00:01:14 UTC
Post by Dan Streetman
Post by Minchan Kim
Post by Dan Streetman
Now that zswap can use zsmalloc as a storage pool via zpool, it will
try to shrink its zsmalloc zs_pool once it reaches its max_pool_percent
limit. These patches implement zsmalloc shrinking. The way the pool is
shrunk is by finding a zspage and reclaiming it, by evicting each of its
objects that is in use.
Without these patches zswap, and any other future user of zpool/zsmalloc
that attempts to shrink the zpool/zs_pool, will only get errors and will
be unable to shrink its zpool/zs_pool. With the ability to shrink, zswap
can keep the most recent compressed pages in memory.
Note that the design of zsmalloc makes it impossible to actually find the
LRU zspage, so each class and fullness group is searched in a round-robin
method to find the next zspage to reclaim. Each fullness group orders its
zspages in LRU order, so the oldest zspage is used for each fullness group.
1. Pz, Cc Mel who was strong against zswap with zsmalloc.
2. I don't think LRU stuff should be in allocator layer. Exp, it's really
hard to work well in zsmalloc design.
I didn't add any LRU - the existing fullness group LRU ordering is
already there. And yes, the zsmalloc design prevents any real LRU
I don't think it's an LRU for reclaiming; it's just a simple linked list
for finding a free slot.
Post by Dan Streetman
ordering, beyond per-fullness-group LRU ordering.
Yes.
Post by Dan Streetman
Post by Minchan Kim
3. If you want to add another writeback, make zswap writeback sane first.
current implemenation(zswap store -> zbud reclaim -> zswap writeback,
even) is really ugly.
why what's wrong with that? how else can zbud/zsmalloc evict stored objects?
You can refer to Mel's suggestions on the zswap/zsmalloc writeback problem.

http://www.spinics.net/lists/linux-mm/msg61601.html
http://lkml.iu.edu//hypermail/linux/kernel/1304.1/04334.html

I think LRU/writeback should be in an upper layer, not in the allocator
itself. Please don't force every allocator to implement it only for zswap.
Post by Dan Streetman
Post by Minchan Kim
4. Don't make zsmalloc complicated without any data(benefit, regression)
I will never ack if you don't give any number and real usecase.
ok, i'll run performance tests then, but let me know if you see any
technical problems with any of the patches before then.
thanks!
Post by Minchan Kim
Post by Dan Streetman
---
This patch set applies to linux-next.
zsmalloc: fix init_zspage free obj linking
zsmalloc: add fullness group list for ZS_FULL zspages
zsmalloc: always update lru ordering of each zspage
zsmalloc: move zspage obj freeing to separate function
zsmalloc: add atomic index to find zspage to reclaim
zsmalloc: add zs_ops to zs_pool
zsmalloc: add obj_handle_is_free()
zsmalloc: add reclaim_zspage()
zsmalloc: add zs_shrink()
zsmalloc: implement zs_zpool_shrink() with zs_shrink()
drivers/block/zram/zram_drv.c | 2 +-
include/linux/zsmalloc.h | 7 +-
mm/zsmalloc.c | 314 +++++++++++++++++++++++++++++++++++++-----
3 files changed, 290 insertions(+), 33 deletions(-)
--
1.8.3.1
--
Kind regards,
Minchan Kim
--
Kind regards,
Minchan Kim
Dan Streetman
2014-09-15 14:29:36 UTC
Post by Minchan Kim
Post by Dan Streetman
Post by Minchan Kim
Post by Dan Streetman
Now that zswap can use zsmalloc as a storage pool via zpool, it will
try to shrink its zsmalloc zs_pool once it reaches its max_pool_percent
limit. These patches implement zsmalloc shrinking. The way the pool is
shrunk is by finding a zspage and reclaiming it, by evicting each of its
objects that is in use.
Without these patches zswap, and any other future user of zpool/zsmalloc
that attempts to shrink the zpool/zs_pool, will only get errors and will
be unable to shrink its zpool/zs_pool. With the ability to shrink, zswap
can keep the most recent compressed pages in memory.
Note that the design of zsmalloc makes it impossible to actually find the
LRU zspage, so each class and fullness group is searched in a round-robin
method to find the next zspage to reclaim. Each fullness group orders its
zspages in LRU order, so the oldest zspage is used for each fullness group.
1. Pz, Cc Mel who was strong against zswap with zsmalloc.
2. I don't think LRU stuff should be in allocator layer. Exp, it's really
hard to work well in zsmalloc design.
I didn't add any LRU - the existing fullness group LRU ordering is
already there. And yes, the zsmalloc design prevents any real LRU
I don't think It's not LRU for reclaiming but just simple linked list for
finding free slot.
Yes, but it does happen to be in LRU order. However, the LRU ordering
here isn't as important as being able to free the smallest number of
pages needed to shrink the zs_pool.
Post by Minchan Kim
Post by Dan Streetman
ordering, beyond per-fullness-group LRU ordering.
Yes.
Post by Dan Streetman
Post by Minchan Kim
3. If you want to add another writeback, make zswap writeback sane first.
current implemenation(zswap store -> zbud reclaim -> zswap writeback,
even) is really ugly.
why what's wrong with that? how else can zbud/zsmalloc evict stored objects?
You can refer Mel's suggestion for zswap/zsmalloc and writeback problem.
As far as I can tell Mel's complaint was with zswap synchronously
shrinking the pool one-page-at-a-time during store(), which will block
anything trying to write a swap page - although, Mel, correct me if I'm
misunderstanding it.

That's not relevant for the reclaim->evict portion of zbud and
zsmalloc. Once zbud or zsmalloc have been asked to shrink their size,
they must reclaim a zbud page or zspage, respectively, and to do that
they have to clear out any used handles from it. Asking the handle
owner (zswap in this case) to evict it is the only option I can see.
And I don't see what is wrong with that, from the perspective of
zbud/zsmalloc.

Updating zswap to pre-emptively shrink its pool before filling up is
something that could be done, but would be entirely in zswap, and
doesn't affect how zbud or zsmalloc work w.r.t shrinking, reclaim, or
evict. It's not related to adding shrinking to zsmalloc.
Post by Minchan Kim
http://www.spinics.net/lists/linux-mm/msg61601.html
http://lkml.iu.edu//hypermail/linux/kernel/1304.1/04334.html
I think LRU/writeback should be upper layer, not allocator itself.
Please, don't force every allocator to implement it for only zswap.
zswap could maintain an LRU list, but what it can't do is know which
of its stored pages are grouped together by the allocator. So freeing
pages in a strictly LRU order would almost certainly 1) require
evicting many more pages than the allocator would have just to shrink
the pool by 1 page, and 2) increase the allocator's fragmentation,
possibly quite badly depending on how much shrinking is needed.

At the point when zswap needs to start shrinking its pool, there is
clearly a lot of memory pressure, and it doesn't make sense to evict
more pages than it needs to, nor does it make sense to increase
fragmentation in the storage pool, and waste memory.
Post by Minchan Kim
Post by Dan Streetman
Post by Minchan Kim
4. Don't make zsmalloc complicated without any data(benefit, regression)
I will never ack if you don't give any number and real usecase.
ok, i'll run performance tests then, but let me know if you see any
technical problems with any of the patches before then.
thanks!
Post by Minchan Kim
Post by Dan Streetman
---
This patch set applies to linux-next.
zsmalloc: fix init_zspage free obj linking
zsmalloc: add fullness group list for ZS_FULL zspages
zsmalloc: always update lru ordering of each zspage
zsmalloc: move zspage obj freeing to separate function
zsmalloc: add atomic index to find zspage to reclaim
zsmalloc: add zs_ops to zs_pool
zsmalloc: add obj_handle_is_free()
zsmalloc: add reclaim_zspage()
zsmalloc: add zs_shrink()
zsmalloc: implement zs_zpool_shrink() with zs_shrink()
drivers/block/zram/zram_drv.c | 2 +-
include/linux/zsmalloc.h | 7 +-
mm/zsmalloc.c | 314 +++++++++++++++++++++++++++++++++++++-----
3 files changed, 290 insertions(+), 33 deletions(-)
--
1.8.3.1
--
Kind regards,
Minchan Kim
--
Kind regards,
Minchan Kim