


Based on kernel version 2.6.36.

I/O Barriers
============
Tejun Heo <htejun@gmail.com>, July 22 2005

I/O barrier requests are used to guarantee ordering around the barrier
request.  Unless you're crazy enough to use disk drives for
implementing synchronization constructs (wow, sounds interesting...),
the ordering is meaningful only for write requests for things like
journal checkpoints.  All requests queued before a barrier request
must be finished (made it to the physical medium) before the barrier
request is started, and all requests queued after the barrier request
must be started only after the barrier request is finished (again,
made it to the physical medium).

In other words, I/O barrier requests have the following two properties.

1. Request ordering

Requests cannot pass the barrier request.  Preceding requests are
processed before the barrier and following requests after.

Depending on what features a drive supports, this can be done in one
of the following three ways.

i. For devices which have queue depth greater than 1 (TCQ devices) and
support ordered tags, the block layer can just issue the barrier as an
ordered request, and the lower level driver, controller and drive
itself are responsible for making sure that the ordering constraint is
met.  Most modern SCSI controllers/drives should support this.

NOTE: SCSI ordered tag isn't currently used due to a limitation in the
      SCSI midlayer; see the random notes section below.

ii. For devices which have queue depth greater than 1 but don't
support ordered tags, the block layer ensures that the requests
preceding a barrier request finish before issuing the barrier request.
It also defers requests following the barrier until the barrier
request is finished.  Older SCSI controllers/drives and SATA drives
fall in this category.

iii. Devices which have a queue depth of 1.  This is a degenerate case
of ii; just keeping issue order suffices.  Ancient SCSI
controllers/drives and IDE drives are in this category.

2. Forced flushing to physical medium

Again, if you're not gonna do synchronization with disk drives (dang,
it sounds even more appealing now!), the reason you use I/O barriers
is mainly to protect filesystem integrity when a power failure or some
other event abruptly stops the drive from operating and possibly makes
the drive lose data in its cache.  So, I/O barriers need to guarantee
that requests actually get written to non-volatile medium in order.

There are four cases:

i. No write-back cache.  Keeping requests ordered is enough.

ii. Write-back cache but no flush operation.  There's no way to
guarantee physical-medium commit order.  This kind of device can't do
I/O barriers.

iii. Write-back cache and flush operation but no FUA (forced unit
access).  We need two cache flushes - before and after the barrier
request.

iv. Write-back cache, flush operation and FUA.  We still need one
flush to make sure requests preceding a barrier are written to the
medium, but the post-barrier flush can be avoided by using a FUA write
for the barrier itself.
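
From the submitter's side (e.g. a filesystem journal), a barrier is
requested simply by flagging a write request.  What follows is a
minimal sketch, assuming bh is the buffer_head of a journal commit
block, of how such a barrier write can be issued in kernels of this
vintage; the fallback policy on failure is up to the filesystem.

	int err = 0;

	lock_buffer(bh);
	get_bh(bh);
	bh->b_end_io = end_buffer_write_sync;
	/* WRITE_BARRIER is WRITE plus the barrier flag */
	submit_bh(WRITE_BARRIER, bh);
	wait_on_buffer(bh);
	if (buffer_eopnotsupp(bh)) {
		/* the device can't do barriers; clear the flag
		 * and let the caller retry as a plain write */
		clear_buffer_eopnotsupp(bh);
		err = -EOPNOTSUPP;
	} else if (!buffer_uptodate(bh))
		err = -EIO;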


How to support barrier requests in drivers
------------------------------------------

All barrier handling is done inside the block layer proper.  All a
low level driver has to do is implement its prepare_flush_fn and call
blk_queue_ordered() below to indicate what barrier type it supports
and how to prepare flush requests.  Note that the term 'ordered' is
used to indicate the whole sequence of performing barrier requests,
including draining and flushing.

typedef void (prepare_flush_fn)(struct request_queue *q, struct request *rq);

int blk_queue_ordered(struct request_queue *q, unsigned ordered,
                      prepare_flush_fn *prepare_flush_fn);

@q                 : the queue in question
@ordered           : the ordered mode the driver/device supports
@prepare_flush_fn  : this function should prepare @rq such that it
                     flushes the cache to the physical medium when
                     executed

For example, the SCSI disk driver's prepare_flush_fn looks like the
following.

static void sd_prepare_flush(struct request_queue *q, struct request *rq)
{
	/* turn @rq into a SYNCHRONIZE CACHE (10) SCSI command */
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->cmd_type = REQ_TYPE_BLOCK_PC;
	rq->timeout = SD_TIMEOUT;
	rq->cmd[0] = SYNCHRONIZE_CACHE;
	rq->cmd_len = 10;	/* SYNCHRONIZE CACHE (10) is a 10-byte CDB */
}
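
The matching blk_queue_ordered() call is made when the disk's cache
settings are probed.  Below is a condensed sketch of the mode
selection logic in sd.c of this era; the helper name
sd_select_ordered() is hypothetical (in sd.c this logic sits inline
in sd_revalidate_disk()).  sdkp->WCE and sdkp->DPOFUA are the cached
'write cache enabled' and 'DPO/FUA supported' bits from MODE SENSE,
and the QUEUE_ORDERED_* modes are described below.

static void sd_select_ordered(struct scsi_disk *sdkp)
{
	unsigned ordered;

	if (sdkp->WCE)
		/* write-back cache: flushes needed; use FUA if possible */
		ordered = sdkp->DPOFUA ? QUEUE_ORDERED_DRAIN_FUA
				       : QUEUE_ORDERED_DRAIN_FLUSH;
	else
		/* write-through: draining alone is enough */
		ordered = QUEUE_ORDERED_DRAIN;

	blk_queue_ordered(sdkp->disk->queue, ordered, sd_prepare_flush);
}

Note that sd uses the DRAIN variants rather than the TAG ones; see the
random notes section for why ordered tags aren't used by SCSI yet.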

The following seven ordered modes are supported.  The following table
shows which mode should be used depending on what features a
device/driver supports.  In the leftmost column of the table, the
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.

The table is followed by a description of each mode.  Note that in the
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
used for QUEUE_ORDERED_TAG* descriptions.  '=>' indicates that the
preceding step must be complete before proceeding to the next step.
'->' indicates that the next step can start as soon as the previous
step is issued.

                write-back cache    ordered tag    flush    FUA
------------------------------------------------------------------
NONE                 yes/no             N/A         no      N/A
DRAIN                  no               no          N/A     N/A
DRAIN_FLUSH           yes               no          yes     no
DRAIN_FUA             yes               no          yes     yes
TAG                    no               yes         N/A     N/A
TAG_FLUSH             yes               yes         yes     no
TAG_FUA               yes               yes         yes     yes


QUEUE_ORDERED_NONE
	I/O barriers are not needed and/or supported.

	Sequence: N/A

QUEUE_ORDERED_DRAIN
	Requests are ordered by draining the request queue and cache
	flushing isn't needed.

	Sequence: drain => barrier

QUEUE_ORDERED_DRAIN_FLUSH
	Requests are ordered by draining the request queue and both
	pre-barrier and post-barrier cache flushes are needed.

	Sequence: drain => preflush => barrier => postflush

QUEUE_ORDERED_DRAIN_FUA
	Requests are ordered by draining the request queue and a
	pre-barrier cache flush is needed.  By using FUA on the
	barrier request, the post-barrier flush can be skipped.

	Sequence: drain => preflush => barrier

QUEUE_ORDERED_TAG
	Requests are ordered by ordered tags and cache flushing isn't
	needed.

	Sequence: barrier

QUEUE_ORDERED_TAG_FLUSH
	Requests are ordered by ordered tags and both pre-barrier and
	post-barrier cache flushes are needed.

	Sequence: preflush -> barrier -> postflush

QUEUE_ORDERED_TAG_FUA
	Requests are ordered by ordered tags and a pre-barrier cache
	flush is needed.  By using FUA on the barrier request, the
	post-barrier flush can be skipped.

	Sequence: preflush -> barrier


Random notes/caveats
--------------------

* The SCSI layer currently can't use TAG ordering even if the drive,
controller and driver support it.  The problem is that the SCSI
midlayer request dispatch function is not atomic.  It releases the
queue lock and switches to the SCSI host lock during issue, so it is
possible, and over time likely, that requests change their relative
positions.  Once this problem is solved, TAG ordering can be enabled.

* Currently, no matter which ordered mode is used, there can be only
one barrier request in progress.  All I/O barriers are held off by
the block layer until the previous I/O barrier is complete.  This
doesn't make any difference for DRAIN ordered devices, but, for TAG
ordered devices with very high command latency, passing multiple I/O
barriers to the low level driver *might* be helpful if they are very
frequent.  Well, this certainly is a non-issue.  I'm writing this just
to make clear that no two I/O barriers are ever passed to the
low-level driver at once.

* Completion order.  Requests in an ordered sequence are issued in
order but are not required to finish in order.  The barrier
implementation can handle out-of-order completion of an ordered
sequence.  IOW, the requests MUST be processed in order but the
hardware/software completion paths are allowed to reorder completion
notifications - eg. the current SCSI midlayer doesn't preserve
completion order during error handling.

* Requeueing order.  Low-level drivers are free to requeue any request
after removing it from the request queue with
blkdev_dequeue_request().  As a barrier sequence should be kept in
order when requeued, the generic elevator code takes care of putting
requests in order around the barrier.  See blk_ordered_req_seq() and
the ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for
details; a sketch of the driver side follows the note below.

Note that block drivers must not requeue preceding requests while
completing later requests in an ordered sequence.  Currently, no
error checking is done against this.
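
To illustrate the requeue path, here is a minimal sketch of a
request_fn using the dequeue/requeue calls named above.
mydrv_queue_to_hw() is a hypothetical hardware-submission helper, and
exact API names vary between kernel versions.

static void mydrv_request_fn(struct request_queue *q)
{
	struct request *rq;

	while ((rq = elv_next_request(q)) != NULL) {
		blkdev_dequeue_request(rq);
		if (mydrv_queue_to_hw(rq) != 0) {
			/* hardware busy: put the request back; the
			 * elevator re-inserts it so that the ordered
			 * (barrier) sequence stays intact */
			blk_requeue_request(q, rq);
			break;
		}
	}
}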

* Error handling.  Currently, the block layer will report an error to
the upper layer if any of the requests in an ordered sequence fails.
Unfortunately, this doesn't seem to be enough.  Look at the following
request flow, with QUEUE_ORDERED_TAG_FLUSH in use.

 [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
                                          still in elevator

Let's say requests [2] and [3] are write requests to update file
system metadata (journal or whatever) and [barrier] is used to mark
that those updates are valid.  Consider the following sequence.

 i.    Requests [0] ~ [post] leave the request queue and enter the
       low-level driver.
 ii.   After a while, unfortunately, something goes wrong and the
       drive fails [2].  Note that any of [0], [1] and [3] could have
       completed by this time, but [pre] couldn't have been finished
       as the drive must process it in order and it failed before
       processing that command.
 iii.  Error handling kicks in and determines that the error is
       unrecoverable, fails [2], and resumes operation.
 iv.   [pre] [barrier] [post] get processed.
 v.    *BOOM* power fails

The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that.  Sadly, that
isn't true in this case anymore.  IOW, the success of an I/O barrier
should also be dependent on the success of some of the preceding
requests, where only the upper layer (filesystem) knows what 'some'
is.

This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request, and by
making lower level drivers resume operation on error only after the
block layer tells them to do so.

As the probability of this happening is very low and the drive would
already have to be faulty, implementing the fix is probably overkill.
But, still, it's there.

* In previous drafts of the barrier implementation, there was a
fallback mechanism such that, if FUA or ordered TAG failed, a less
fancy ordered mode could be selected and the failed barrier request
retried automatically.  The rationale for this feature was that, as
FUA is pretty new in the ATA world and ordered tags were never used
widely, there could be devices which report support for those features
but choke when actually given such requests.

This was removed for two reasons: 1. it's overkill; 2. it's impossible
to implement properly when TAG ordering is used, as low level drivers
resume after an error automatically.  If it's ever needed, adding it
back and modifying low level drivers accordingly shouldn't be
difficult.

