We need to hold the mmap_sem for write to initiatate mlock()/munlock()
because we may need to merge/split vmas. However, this can lead to very
long lock hold times attempting to fault in a large memory region to mlock
it into memory. This can hold off other faults against the mm
[multithreaded tasks] and other scans of the mm, such as via /proc. To
alleviate this, downgrade the mmap_sem to read mode during the population
of the region for locking. This is especially the case if we need to
reclaim memory to lock down the region. We [probably?] don't need to do
this for unlocking as all of the pages should be resident--they're already
mlocked.
Now, the caller's of the mlock functions [mlock_fixup() and
mlock_vma_pages_range()] expect the mmap_sem to be returned in write mode.
Changing all callers appears to be way too much effort at this point.
So, restore write mode before returning. Note that this opens a window
where the mmap list could change in a multithreaded process. So, at least
for mlock_fixup(), where we could be called in a loop over multiple vmas,
we check that a vma still exists at the start address and that vma still
covers the page range [start,end). If not, we return an error, -EAGAIN,
and let the caller deal with it.
Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup() if
the vma at 'start' disappears or changes so that the page range
[start,end) is no longer contained in the vma. Again, let the caller deal
with it. Looks like only sys_remap_file_pages() [via mmap_region()]
should actually care.
With this patch, I no longer see processes like ps(1) blocked for seconds
or minutes at a time waiting for a large [multiple gigabyte] region to be
locked down. However, I occassionally see delays while unlocking or
unmapping a large mlocked region. Should we also downgrade the mmap_sem
for the unlock path?
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>