Make reiserfs write via generic routines.
The original reiserfs write path, optimized for big writes, is deadlock prone.
Signed-off-by: Vladimir Saveliev <vs@namesys.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rework the generic block "cont" routines to handle the new aops. Supporting
cont_prepare_write as well would take quite a lot of code, so remove it
instead (and we later convert all filesystems to use the reworked routines).
write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
generic_cont_expand, so filesystems can avoid the old hacks they used.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Various fixes and improvements
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement new aops for some of the simpler filesystems.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).
[mark.fasheh@oracle.com: API design contributions, code review and fixes]
[akpm@linux-foundation.org: various fixes]
[dmonakhov@sw.ru: new aop block_write_begin fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
New buffers against uptodate pages are simply marked uptodate, while the
buffer_new bit remains set. This causes error-case code to zero out parts of
those buffers because it thinks they contain stale data: wrong, they are
actually uptodate so this is a data loss situation.
Fix this by actually clearing buffer_new and marking the buffer dirty. It
makes sense to always clear buffer_new before setting a buffer uptodate.
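The failure mode can be modelled in userspace. The following is only a sketch: the sk_ names, the flag values and the 8-byte payload are made up for illustration and are not the kernel's buffer_head API. It shows how an error path that zeroes every buffer still flagged "new" destroys valid data when the flag is left set, and how clearing it first avoids that.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag bits; the kernel keeps these in bh->b_state. */
enum { SK_UPTODATE = 1 << 0, SK_NEW = 1 << 1, SK_DIRTY = 1 << 2 };

struct sk_buffer {
	unsigned flags;
	char data[8];
};

/* Old behaviour: mark the buffer uptodate but leave the new bit set. */
static void sk_mark_uptodate_buggy(struct sk_buffer *b)
{
	b->flags |= SK_UPTODATE;
}

/* Fixed behaviour: clear the new bit, then mark dirty and uptodate. */
static void sk_mark_uptodate_fixed(struct sk_buffer *b)
{
	b->flags &= ~(unsigned)SK_NEW;
	b->flags |= SK_DIRTY | SK_UPTODATE;
}

/* Error path: buffers still flagged new are assumed stale and zeroed. */
static void sk_error_path(struct sk_buffer *b)
{
	if (b->flags & SK_NEW) {
		memset(b->data, 0, sizeof(b->data));
		b->flags &= ~(unsigned)SK_NEW;
	}
}
```

With the buggy helper, running the error path wipes the payload of a buffer that was in fact uptodate. With the fixed helper the error path leaves it alone.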
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Quite a bit of code is used in maintaining these "cached pages" that are
probably pretty unlikely to get used. It would require a narrow race where
the page is inserted concurrently while this process is allocating a page
in order to create the spare page, followed by a multi-page write into an
uncached part of the file to make use of it.
Next, the buffered write path (and others) uses its own LRU pagevec when it
should be just using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).
[this gets rid of some cond_resched() calls in readahead.c and mpage.c due
to clashes in -mm. What put them there, and why? ]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nobh mode error handling is not just pretty slack, it's wrong.
One cannot zero out the whole page to ensure new blocks are zeroed, because
it just brings the whole page "uptodate" with zeroes even if that may not
be the correct uptodate data. Also, other parts of the page may already
contain dirty data which would get lost by zeroing it out. Thirdly, the
writeback of zeroes to the new blocks will also erase existing blocks. All
these conditions are pagecache and/or filesystem corruption.
The problem comes about because we didn't keep track of which buffers
actually are new or old. However it is not enough just to keep only this
state, because at the point we start dirtying parts of the page (new
blocks, with zeroes), the handling of IO errors becomes impossible without
buffers because the page may only be partially uptodate, in which case the
page flags alone cannot capture the state of the parts of the page.
So allocate all buffers for the page upfront, but leave them unattached so
that they don't pick up any other references and can be freed when we're
done. If the error path is hit, then zero the new buffers as the regular
buffer path does, then attach the buffers to the page so that it can
actually be written out correctly and be subject to the normal IO error
handling paths.
As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
systems.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The commit b5810039a5 contains the note:
    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
    (and thus mapcounted and count towards shared rss). These writes to
    the struct page could cause excessive cacheline bouncing on big
    systems. There are a number of ways this could be addressed if it is
    an issue.
And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).
There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely
I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.
Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).
As a sanity check -- measuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.
When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.
Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.
The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
much use to them except on benchmarks. All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
Combine the file_ra_state members
    unsigned long prev_index
    unsigned int prev_offset
into
    loff_t prev_pos
It is more consistent and better supports huge files.
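A minimal userspace sketch of the idea follows. It assumes 4K pages, and the sketch_/SKETCH_ names and helper functions are hypothetical stand-ins, not the kernel's definitions. The page index and in-page offset fold into a single byte position and can be recovered from it. The index is widened before the shift so that a large index cannot overflow a 32-bit unsigned long, which is presumably the kind of problem the shift overflow fix below addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4K pages; the real PAGE_SHIFT depends on the architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

typedef int64_t sketch_loff_t;	/* stand-in for the kernel's loff_t */

/* Old layout: two fields describe the previous read position. */
struct old_ra_state {
	unsigned long prev_index;	/* page index of the last read */
	unsigned int prev_offset;	/* byte offset within that page */
};

/* New layout: one byte position carries the same information. */
struct new_ra_state {
	sketch_loff_t prev_pos;
};

static struct new_ra_state ra_combine(struct old_ra_state old)
{
	struct new_ra_state ra;

	assert(old.prev_offset < SKETCH_PAGE_SIZE);
	/* Cast first, shift second: the shift happens in 64 bits. */
	ra.prev_pos = ((sketch_loff_t)old.prev_index << SKETCH_PAGE_SHIFT)
		      | old.prev_offset;
	return ra;
}

static unsigned long ra_prev_index(struct new_ra_state ra)
{
	return (unsigned long)(ra.prev_pos >> SKETCH_PAGE_SHIFT);
}

static unsigned int ra_prev_offset(struct new_ra_state ra)
{
	return (unsigned int)(ra.prev_pos & (SKETCH_PAGE_SIZE - 1));
}
```

An index of 0x200000 with 4K pages already corresponds to a byte position of 8GB, which does not fit in 32 bits.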
Thanks to Peter for the nice proposal!
[akpm@linux-foundation.org: fix shift overflow]
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because we cherrypicked SGI-Modid xfs-linux-melb:xfs-kern:29675a
and it depended on the sgi mod which removed io_vnode (which was
not cherrypicked in 23) it was hand modified.
This fixes things back up (to the original mod) now that we have moved
on again.
Reviewed-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Removes STATIC on xfs_freeze function which was not manually
applied for SGI-Modid: xfs-linux-melb:xfs-kern:29504a.
Reviewed-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Put back the QUEUE_ORDERED_NONE test which caused us grief in sles when it
was taken out as, IIRC, it allowed md/lvm to be thought of as supporting
barriers when they weren't in some configurations. This patch will be
reverting what went in as part of a change for the SGI-pv 964544
(SGI-Modid: xfs-linux-melb:xfs-kern:28568a).
SGI-PV: 971783
SGI-Modid: xfs-linux-melb:xfs-kern:29882a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
In xfs_fs_sync_super() treat a sync the same as a filesystem freeze. This
is needed to force the log to disk for inodes which are not marked dirty
in the Linux inode (the inodes are marked dirty on completion of the log
I/O) and so sync_inodes() will not flush them.
In xfs_fs_write_inode() a synchronous flush will not get an EAGAIN from
xfs_inode_flush() and if an asynchronous flush returns EAGAIN we should
pass it on to the caller. If we get an error while flushing the inode then
re-dirty it so we can try again later.
SGI-PV: 971670
SGI-Modid: xfs-linux-melb:xfs-kern:29860a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Here 'agino' increments through the inodes in an allocation group. At the
end of the innermost 'for' loop it will hold the value of the next inode
to look at (ie the first inode in the next cluster/chunk). Assigning
'lastino' to 'agino' resets it to the last inode in the last inode cluster
we just looked at. This causes us to look up the very same cluster and
examine all the inodes all over again, and again, and again...
We also want to set 'lastino' for the cases when we're not interested in
the inode so that the next call to bulkstat won't re-examine the same
uninteresting inodes.
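The loop behaviour described above can be reproduced with a toy model. This is only a sketch, not the xfs_bulkstat() code: the chunk size of 64, the sk_walk() helper and the lookup cap are assumptions made for illustration. It counts chunk lookups with and without the assignment of 'lastino' back to 'agino'.

```c
#include <assert.h>

/* Assumed chunk size for the model. */
#define SK_INODES_PER_CHUNK 64u

/*
 * Walk nchunks chunks of a model allocation group and return how many
 * chunk lookups were performed, giving up after max_lookups.
 */
static unsigned sk_walk(unsigned nchunks, int reset_to_lastino,
			unsigned max_lookups)
{
	unsigned end = nchunks * SK_INODES_PER_CHUNK;
	unsigned agino = 0, lastino = 0, lookups = 0;

	while (agino < end && lookups < max_lookups) {
		/* Look up the chunk that contains agino. */
		unsigned chunk_start = agino - agino % SK_INODES_PER_CHUNK;

		lookups++;

		/* Inner loop: examine every inode, interesting or not. */
		for (agino = chunk_start;
		     agino < chunk_start + SK_INODES_PER_CHUNK; agino++)
			lastino = agino;

		/* agino now names the first inode of the next chunk. */
		if (reset_to_lastino)
			agino = lastino;	/* bug: back into the same chunk */
	}
	return lookups;
}
```

Without the reset, four chunks cost four lookups. With it, the walk never leaves the first chunk and only stops at the lookup cap.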
SGI-PV: 971064
SGI-Modid: xfs-linux-melb:xfs-kern:29840a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Simplify the prototype for xfs_create/xfs_mkdir/xfs_symlink by not passing
down a bhv_vattr_t that just hogs stack space. Instead pass down the mode
in a mode_t and in case of xfs_create the rdev as a scalar type as well.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29794a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
No need to call into xfs_getattr and put a big bhv_vattr_t on the stack
just to get a little information from the XFS inode.
Add a helper called xfs_ioc_fsgetxattr instead that deals with retrieving
the information in a clean way.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29780a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
In the following scenario xfs_bulkstat() returns incorrect stale inode
state:
1. File_A is created and its inode synced to disk.
2. File_A is unlinked and doesn't exist anymore.
3. Filesystem sync is invoked.
4. File_B is created. File_B happens to reclaim File_A's inode.
5. xfs_bulkstat() is called and detects File_B but reports the incorrect
   File_A inode state.
Explanation for the incorrect inode state is that inodes are not
immediately synced on file create for performance reasons. This leaves the
on-disk inode buffer uninitialized (or with old state from a previous
generation inode) and this is what xfs_bulkstat() would report.
The patch marks the on-disk inode buffer "dirty" on unlink. When the inode
is reclaimed (by a new file create), xfs_bulkstat() would filter this
inode by the "dirty" mark. Once the inode is flushed to disk, the on-disk
buffer "dirty" mark is automatically removed and a following
xfs_bulkstat() would return the correct inode state.
Marking the on-disk inode buffer "dirty" on unlink is achieved by setting
the on-disk di_nlink field to 0. Note that the in-core di_nlink has
already been set to 0 and a corresponding transaction logged by
xfs_droplink(). This is an exception from the rule that any on-disk inode
buffer changes have to be followed by a disk write (inode flush).
Synchronizing the in-core to on-disk di_nlink values in advance (before
the actual inode flush to disk) should be fine in this case because the
inode is already unlinked and it would never change its di_nlink again for
this inode generation.
SGI-PV: 970842
SGI-Modid: xfs-linux-melb:xfs-kern:29757a
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Mark Goodwin <markgw@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Fix for a regression caused by a recent patch
that moved the DMAPI mount option processing inside xfs_parseargs(). The
DMAPI mount option used to be processed in the DMAPI module loaded before
xfs_parseargs() was invoked.
SGI-PV: 970451
SGI-Modid: xfs-linux-melb:xfs-kern:29683a
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Synchronous writes currently log inode changes before syncing pages to
disk. Since the file size is updated on I/O completion we won't be writing
out the updated file size and if we crash the file will have the wrong
size. This change moves the logging after the syncing of the pages to
ensure we log the correct file size.
SGI-PV: 970334
SGI-Modid: xfs-linux-melb:xfs-kern:29649a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
m_growlock only needs plain binary mutex semantics, so use a struct mutex
instead of a semaphore for it.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29512a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
... or in the case of XLOG_TIC_ADD_OPHDR remove a useless macro entirely.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29511a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Now that struct bhv_vfs doesn't have any members left we can kill it and
go directly from the super_block to the xfs_mount everywhere.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29509a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
All flags are added to xfs_mount's m_flag instead. Note that the 32bit
inode flag was duplicated in both of them, but only cleared in the mount
when it was not necessary due to the filesystem being small enough. Two
flags are still required here - one to indicate the mount option setting,
and one to indicate if it applies or not.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29507a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
vfs_altfsid was just a pointer to mp->m_fixedfsid so we can trivially
replace it with the latter. vfs_fsid also was identical to m_fixedfsid
through rather obfuscated ways so we can kill it as well and simplify its
only user.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29506a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Also remove the now dead behavior code.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29505a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
All vfs ops now take struct xfs_mount pointers and the behaviour related
glue is split out into methods of its own.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29504a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Mount options are now parsed by the main XFS module and rejected if quota
support is not available, and there are some new quota operations for the
quotactl syscall and calls to quota in the mount, unmount and sync
callchains.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29503a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Mount options are now parsed by the main XFS module and rejected if dmapi
support is not available, and there is a new dm operation to send the
mount event.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29502a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
In the next patch we need to look at the mount structure until just before
it's freed, so we need to be able to free it as the very last thing in
xfs_unmount.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29501a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Now that struct bhv_vnode is empty we can just kill it. Retain bhv_vnode_t
as a typedef for struct inode for the time being until all the fallout is
cleaned up.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29500a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
It's entirely unused except for ignored arguments in the mrlock
initialization, so remove it.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29499a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>