Message ID | 20181215054840.5960-2-xiaoguang.wang@linux.alibaba.com |
---|---|
State | New |
Headers | show |
Series | [v2,1/2] ext4: try to merge unwritten extents who are also not under io | expand |
On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote: > With "nodelalloc", blocks are allocated at the time of writing, and with > "dioread_nolock", these allocated blocks are marked as unwritten as well, > so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped. I've been looking at your patches, and it seems that a simpler way, perhaps more maintainable approach in the long term is to change how we write to newly allocated blocks. Today, we have two ways of doing this: 1) In the dioread_nolock case, we allocate blocks, insert an entry in the extent tree with the blocks marked uninitialized, write the blocks, and then mark the blocks initialized. 2) In the !dioread_nolock case, we allocate blocks, insert an entry to the extent tree, write the blocks --- and if we start a commit, we write out all dirty pages associated with that inode (in the default data=writeback case) to avoid stale writes. So what if we change the dioread_nolock case to do write the blocks first, and *then* insert the entry into the extent tree? This avoids stale data getting exposed, either by a direct I/O read, or after a crash (which means we avoid needing to do the force write-out). So what we would need to do is to pass a flag to ext4_map_blocks() which causes it to *not* make any on-disk changes. Instead, it would track the fact that blocks have be reserved in the buddy bitmap (this is how we prevent blocks from being preallocated after they are deleted, but before the transaction has been committed), and the location of the assigned blocks in the extent_status tree. Since no on-disk changes are being made, we wouldn't need to hold the transaction open. Then in the callback after the blocks are written, using the starting logical block number stored in the io_end structure, we either convert the unwritten extents or actually insert the newly allocated blocks in the extent tree and update the on-disk bitmap allocation bitmaps. Once we get this working, it should be easy to make dioread_nolock for 1k block sizes; it keeps the time that the handle open very short; and it completely obviates the need for data=writeback. What do folks think? - Ted
On Wed 23-01-19 01:27:38, Theodore Y. Ts'o wrote: > On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote: > > With "nodelalloc", blocks are allocated at the time of writing, and with > > "dioread_nolock", these allocated blocks are marked as unwritten as well, > > so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped. > > I've been looking at your patches, and it seems that a simpler way, > perhaps more maintainable approach in the long term is to change how > we write to newly allocated blocks. Today, we have two ways of doing > this: > > 1) In the dioread_nolock case, we allocate blocks, insert an entry in > the extent tree with the blocks marked uninitialized, write the > blocks, and then mark the blocks initialized. > > 2) In the !dioread_nolock case, we allocate blocks, insert an entry to > the extent tree, write the blocks --- and if we start a commit, we > write out all dirty pages associated with that inode (in the default > data=writeback case) to avoid stale writes. > > So what if we change the dioread_nolock case to do write the blocks > first, and *then* insert the entry into the extent tree? This avoids > stale data getting exposed, either by a direct I/O read, or after a > crash (which means we avoid needing to do the force write-out). > > So what we would need to do is to pass a flag to ext4_map_blocks() > which causes it to *not* make any on-disk changes. Instead, it would > track the fact that blocks have be reserved in the buddy bitmap (this > is how we prevent blocks from being preallocated after they are > deleted, but before the transaction has been committed), and the > location of the assigned blocks in the extent_status tree. Since no > on-disk changes are being made, we wouldn't need to hold the > transaction open. > > Then in the callback after the blocks are written, using the starting > logical block number stored in the io_end structure, we either convert > the unwritten extents or actually insert the newly allocated blocks in > the extent tree and update the on-disk bitmap allocation bitmaps. > > Once we get this working, it should be easy to make dioread_nolock for > 1k block sizes; it keeps the time that the handle open very short; and > it completely obviates the need for data=writeback. > > What do folks think? Hum, so there is the problem that adding extent to the extent tree may need some block allocations for metadata. So we'd have to carry over delalloc reservations upto the io-end time. But that should be doable, just needs some work. Also in the dioread_nolock case we don't have problems with page lock & page writeback vs transaction start deadlocks as I've described in my another email regarding ext4_writepages(). So I don't see any hole in this and the performance should be good. I like this! Honza
On Tue, Jan 22, 2019 at 10:30 PM Theodore Y. Ts'o <tytso@mit.edu> wrote: > > On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote: > > With "nodelalloc", blocks are allocated at the time of writing, and with > > "dioread_nolock", these allocated blocks are marked as unwritten as well, > > so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped. > > I've been looking at your patches, and it seems that a simpler way, > perhaps more maintainable approach in the long term is to change how > we write to newly allocated blocks. Today, we have two ways of doing > this: > > 1) In the dioread_nolock case, we allocate blocks, insert an entry in > the extent tree with the blocks marked uninitialized, write the > blocks, and then mark the blocks initialized. > > 2) In the !dioread_nolock case, we allocate blocks, insert an entry to > the extent tree, write the blocks --- and if we start a commit, we > write out all dirty pages associated with that inode (in the default > data=writeback case) to avoid stale writes. > > So what if we change the dioread_nolock case to do write the blocks > first, and *then* insert the entry into the extent tree? This avoids > stale data getting exposed, either by a direct I/O read, or after a > crash (which means we avoid needing to do the force write-out). > > So what we would need to do is to pass a flag to ext4_map_blocks() > which causes it to *not* make any on-disk changes. Instead, it would > track the fact that blocks have be reserved in the buddy bitmap (this > is how we prevent blocks from being preallocated after they are > deleted, but before the transaction has been committed), and the > location of the assigned blocks in the extent_status tree. Since no > on-disk changes are being made, we wouldn't need to hold the > transaction open. > > Then in the callback after the blocks are written, using the starting > logical block number stored in the io_end structure, we either convert > the unwritten extents or actually insert the newly allocated blocks in > the extent tree and update the on-disk bitmap allocation bitmaps. > > Once we get this working, it should be easy to make dioread_nolock for > 1k block sizes; it keeps the time that the handle open very short; and > it completely obviates the need for data=writeback. > > What do folks think? > So that (reserve, write, insert extent records) is basically what btrfs is doing and I feel like it will work better than the current way. My only concern is performance since metadata reservation for delalloc now becomes more and needs to be carried until endio, a perf. spike would appear if the foreground writer needs to wait for flushing dirty pages to reclaim metadata credits. thanks, liubo
hi, > On Wed 23-01-19 01:27:38, Theodore Y. Ts'o wrote: >> On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote: >>> With "nodelalloc", blocks are allocated at the time of writing, and with >>> "dioread_nolock", these allocated blocks are marked as unwritten as well, >>> so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped. >> >> I've been looking at your patches, and it seems that a simpler way, >> perhaps more maintainable approach in the long term is to change how >> we write to newly allocated blocks. Today, we have two ways of doing >> this: >> >> 1) In the dioread_nolock case, we allocate blocks, insert an entry in >> the extent tree with the blocks marked uninitialized, write the >> blocks, and then mark the blocks initialized. >> >> 2) In the !dioread_nolock case, we allocate blocks, insert an entry to >> the extent tree, write the blocks --- and if we start a commit, we >> write out all dirty pages associated with that inode (in the default >> data=writeback case) to avoid stale writes. >> >> So what if we change the dioread_nolock case to do write the blocks >> first, and *then* insert the entry into the extent tree? This avoids >> stale data getting exposed, either by a direct I/O read, or after a >> crash (which means we avoid needing to do the force write-out). >> >> So what we would need to do is to pass a flag to ext4_map_blocks() >> which causes it to *not* make any on-disk changes. Instead, it would >> track the fact that blocks have be reserved in the buddy bitmap (this >> is how we prevent blocks from being preallocated after they are >> deleted, but before the transaction has been committed), and the >> location of the assigned blocks in the extent_status tree. Since no >> on-disk changes are being made, we wouldn't need to hold the >> transaction open. >> >> Then in the callback after the blocks are written, using the starting >> logical block number stored in the io_end structure, we either convert >> the unwritten extents or actually insert the newly allocated blocks in >> the extent tree and update the on-disk bitmap allocation bitmaps. >> >> Once we get this working, it should be easy to make dioread_nolock for >> 1k block sizes; it keeps the time that the handle open very short; and >> it completely obviates the need for data=writeback. >> >> What do folks think? > > Hum, so there is the problem that adding extent to the extent tree may need > some block allocations for metadata. So we'd have to carry over delalloc > reservations upto the io-end time. But that should be doable, just needs > some work. Also in the dioread_nolock case we don't have problems with > page lock & page writeback vs transaction start deadlocks as I've described > in my another email regarding ext4_writepages(). So I don't see any hole in > this and the performance should be good. I like this! First sorry for late response, I was busy with some internal work. I really like your ideas, thanks both of you. Writing dirty pages while holding open jbd2 handle will result in some issues, for example, system load will be high, many tasks in the D state, some of them is stuck in wait_transaction_locked(). Currently I have tried some methods to fix this issue, for example, change MAX_WRITEPAGES_EXTENT_LEN to 256, then we can make sure that one bio will be generated at most, and this bio will be submitted after journal stop: if (!ext4_handle_valid(handle) || handle->h_sync == 0) { ext4_journal_stop(handle); handle = NULL; mpd.do_map = 0; } /* Submit prepared bio */ ext4_io_submit(&mpd.io_submit); After this change, the number of tasks in D state will decrease much, Of course, my method is not good, I just say writing diry pages while holding open jbd2 handle is not good :) and I believe that your ideas can fix this kind of problems, thanks again. Regards, Xiaoguang Wang Regards, Xiaoguang Wang > > Honza >
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 444c739470a5..de73b0152892 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4038,9 +4038,13 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode, /* * repeat fallocate creation request * we already have an unwritten extent + * + * With nodelalloc + dioread_nolock, write can also come here, + * so make sure map is set with new to avoid exposing stale + * data to reads. */ if (flags & EXT4_GET_BLOCKS_UNWRIT_EXT) { - map->m_flags |= EXT4_MAP_UNWRITTEN; + map->m_flags |= EXT4_MAP_UNWRITTEN | EXT4_MAP_NEW; goto map_out; } diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index ba557a731081..4dbb43ab9d6e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -822,7 +822,7 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock, ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n", inode->i_ino, create); return _ext4_get_block(inode, iblock, bh_result, - EXT4_GET_BLOCKS_IO_CREATE_EXT); + EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT); } /* Maximum number of blocks we map for direct IO at once. */