Based on kernel version 2.6.32. Page generated on 2009-12-11 16:22 EST.
Anticipatory IO scheduler
-------------------------
Nick Piggin <piggin[AT]cyberone.com[DOT]au>    13 Sep 2003

Attention! Database servers, especially those using "TCQ" disks, should
investigate performance with the 'deadline' IO scheduler. Any system with high
disk performance requirements should do so, in fact.

If you see unusual performance characteristics of your disk systems, or you
see big performance regressions versus the deadline scheduler, please email
me. Database users, don't bother unless you're willing to test a lot of patches
from me ;) it's a known issue.

Also, users with hardware RAID controllers doing striping may find
highly variable performance results when using the as-iosched. The
as-iosched anticipatory implementation is based on the notion that a disk
device has only one physical seeking head. A striped RAID controller
actually has a head for each physical device in the logical RAID device.

However, setting antic_expire to zero (see tunable parameters below) produces
very similar behavior to the deadline IO scheduler.

Selecting IO schedulers
-----------------------
Refer to Documentation/block/switching-sched.txt for information on
selecting an IO scheduler on a per-device basis.

Anticipatory IO scheduler Policies
----------------------------------
The as-iosched implementation implements several layers of policies
to determine when an IO request is dispatched to the disk controller.
The policies are outlined here, in order of application.

1. One-way elevator algorithm.

The elevator algorithm is similar to that used in the deadline scheduler, with
the addition that it allows limited backward movement of the elevator
(i.e. seeks backwards). A seek backwards can occur when choosing between
two IO requests where one is behind the elevator's current position, and
the other is in front of the elevator's position.
If the seek distance to the request behind the elevator is less than half the
seek distance to the request in front of the elevator, then the request behind
can be chosen.
Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
This favors forward movement of the elevator, while allowing opportunistic
"short" backward seeks.

2. FIFO expiration times for reads and for writes.

This is again very similar to the deadline IO scheduler. The expiration
times for requests on these lists are tunable using the parameters read_expire
and write_expire discussed below. When a read or a write expires in this way,
the IO scheduler will interrupt its current elevator sweep or read anticipation
to service the expired request.

3. Read and write request batching.

A batch is a collection of read requests or a collection of write
requests. The as scheduler alternates dispatching read and write batches
to the driver. In the case of a read batch, the scheduler submits read
requests to the driver as long as there are read requests to submit, and
the read batch time limit has not been exceeded (read_batch_expire).
The read batch time limit begins counting down only when there are
competing write requests pending.

In the case of a write batch, the scheduler submits write requests to
the driver as long as there are write requests available, and the
write batch time limit has not been exceeded (write_batch_expire).
However, the length of write batches will be gradually shortened
when read batches frequently exceed their time limit.

When changing between batch types, the scheduler waits for all requests
from the previous batch to complete before scheduling requests for the
next batch.

The read and write FIFO expiration times described in policy 2 above
are checked only when scheduling IO of a batch for the corresponding
(read/write) type.
So for example, the read FIFO timeout values are
tested only during read batches. Likewise, the write FIFO timeout
values are tested only during write batches. For this reason,
it is generally not recommended for the read batch time
to be longer than the write expiration time, nor for the write batch
time to exceed the read expiration time (see tunable parameters below).

When the IO scheduler changes from a read to a write batch,
it begins the elevator from the request that is at the head of the
write expiration FIFO. Likewise, when changing from a write batch to
a read batch, the scheduler begins the elevator from the first entry
on the read expiration FIFO.

4. Read anticipation.

Read anticipation occurs only when scheduling a read batch.
This implementation of read anticipation allows only one read request
to be dispatched to the disk controller at a time. In
contrast, many write requests may be dispatched to the disk controller
at a time during a write batch. It is this characteristic that can make
the anticipatory scheduler perform anomalously with controllers supporting
TCQ, or with hardware striped RAID devices. Setting the antic_expire
queue parameter (see below) to zero disables this behavior, and the
anticipatory scheduler behaves essentially like the deadline scheduler.

When read anticipation is enabled (antic_expire is not zero), reads
are dispatched to the disk controller one at a time.
At the end of each read request, the IO scheduler examines its next
candidate read request from its sorted read list. If that next request
is from the same process as the request that just completed,
or if the next request in the queue is "very close" to the
just completed request, it is dispatched immediately. Otherwise,
statistics (average think time, average seek distance) on the process
that submitted the just completed request are examined.
If it seems likely that the process will submit another request soon, and that
request is likely to be near the just completed request, then the IO
scheduler will stop dispatching more read requests for up to (antic_expire)
milliseconds, hoping that process will submit a new request near the one
that just completed. If such a request is made, then it is dispatched
immediately. If the antic_expire wait time expires, then the IO scheduler
will dispatch the next read request from the sorted read queue.

To decide whether an anticipatory wait is worthwhile, the scheduler
maintains statistics for each process that can be used to compute
mean "think time" (the time between read requests), and mean seek
distance for that process. Note that these statistics are kept per
process, not per IO device. So for example, if a process is doing IO
on several file systems on separate devices, the statistics will be
a combination of IO behavior from all those devices.


Tuning the anticipatory IO scheduler
------------------------------------
When using 'as', the anticipatory IO scheduler, there are 5 parameters under
/sys/block/*/queue/iosched/. All are in units of milliseconds.

The parameters are:
* read_expire
    Controls how long until a read request becomes "expired". It also controls the
    interval between which expired requests are served, so with it set to 50, a
    request might take anywhere up to 100ms to be serviced _if_ it is the next on
    the expired list. Obviously request expiration strategies won't make the disk
    go faster. The result basically equates to the timeslice a single reader
    gets in the presence of other IO. 100/((seek time / read_expire) + 1) is
    very roughly the % streaming read efficiency your disk should get with
    multiple readers.

* read_batch_expire
    Controls how much time a batch of reads is given before pending writes are
    served. A higher value is more efficient. This might be set below read_expire
    if writes are to be given higher priority than reads, but reads are to be
    as efficient as possible when there are no writes. Generally though, it
    should be some multiple of read_expire.

* write_expire, and
* write_batch_expire are equivalent to the above, for writes.

* antic_expire
    Controls the maximum amount of time we can anticipate a good read (one
    with a short seek distance from the most recently completed request) before
    giving up. Many other factors may cause anticipation to be stopped early,
    or some processes will not be "anticipated" at all. Should be a bit higher
    for devices with long seek times, though the relationship is not linear:
    most processes have only a few ms of think time.

In addition to the tunables above there is a read-only file named est_time
which, when read, will show:

    - The probability of a task exiting without a cooperating task
      submitting an anticipated IO.

    - The current mean think time.

    - The seek distance used to determine if an incoming IO is better.
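
As a rough illustration of the tuning notes above, the following Python sketch
reads the five tunables from an iosched directory, estimates streaming read
efficiency, and flags batch limits that exceed the opposite expiration time.
The helper names (read_tunables, streaming_read_efficiency,
check_batch_limits) are hypothetical and not part of the kernel. The sketch
assumes each sysfs file holds a single integer in milliseconds, and it takes
the efficiency estimate to be the share of each read_expire timeslice that is
not spent seeking, which is one reading of the read_expire note rather than a
measured result.

```python
from pathlib import Path

# The five tunables named in this document (all in milliseconds).
AS_TUNABLES = (
    "read_expire",
    "read_batch_expire",
    "write_expire",
    "write_batch_expire",
    "antic_expire",
)


def read_tunables(iosched_dir):
    """Return {name: milliseconds} for each tunable file found in iosched_dir.

    Assumption: every file contains one integer followed by a newline.
    """
    values = {}
    for name in AS_TUNABLES:
        path = Path(iosched_dir) / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values


def streaming_read_efficiency(seek_ms, read_expire_ms):
    """Very rough % streaming efficiency with multiple readers.

    Assumption: each reader gets a read_expire timeslice and pays one seek
    when the scheduler switches to it, so the useful share of the time is
    read_expire / (read_expire + seek).
    """
    return 100.0 / ((seek_ms / read_expire_ms) + 1.0)


def check_batch_limits(tunables):
    """Return warnings for settings the policy section advises against.

    Expects all four expire/batch tunables to be present in the mapping.
    """
    warnings = []
    if tunables["read_batch_expire"] > tunables["write_expire"]:
        warnings.append("read_batch_expire exceeds write_expire")
    if tunables["write_batch_expire"] > tunables["read_expire"]:
        warnings.append("write_batch_expire exceeds read_expire")
    return warnings
```

For example, with a hypothetical device name sda,
read_tunables("/sys/block/sda/queue/iosched") would return the current
settings when 'as' is the active scheduler on that device. Writing new values
requires root and is not shown here.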