About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Documentation / mtd / nand_ecc.txt


Based on kernel version 4.16.1. Page generated on 2018-04-09 11:53 EST.

1	Introduction
2	============
3	
4	Having looked at the linux mtd/nand driver and more specific at nand_ecc.c
5	I felt there was room for optimisation. I bashed the code for a few hours
6	performing tricks like table lookup removing superfluous code etc.
7	After that the speed was increased by 35-40%.
8	Still I was not too happy as I felt there was additional room for improvement.
9	
10	Bad! I was hooked.
11	I decided to annotate my steps in this file. Perhaps it is useful to someone
12	or someone learns something from it.
13	
14	
15	The problem
16	===========
17	
18	NAND flash (at least SLC one) typically has sectors of 256 bytes.
19	However NAND flash is not extremely reliable so some error detection
20	(and sometimes correction) is needed.
21	
22	This is done by means of a Hamming code. I'll try to explain it in
23	laymans terms (and apologies to all the pro's in the field in case I do
24	not use the right terminology, my coding theory class was almost 30
25	years ago, and I must admit it was not one of my favourites).
26	
27	As I said before the ecc calculation is performed on sectors of 256
28	bytes. This is done by calculating several parity bits over the rows and
29	columns. The parity used is even parity which means that the parity bit = 1
30	if the data over which the parity is calculated is 1 and the parity bit = 0
31	if the data over which the parity is calculated is 0. So the total
32	number of bits over the data over which the parity is calculated + the
33	parity bit is even. (see wikipedia if you can't follow this).
34	Parity is often calculated by means of an exclusive or operation,
35	sometimes also referred to as xor. In C the operator for xor is ^
36	
37	Back to ecc.
38	Let's give a small figure:
39	
40	byte   0:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp4 ... rp14
41	byte   1:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp2 rp4 ... rp14
42	byte   2:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp4 ... rp14
43	byte   3:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp4 ... rp14
44	byte   4:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp5 ... rp14
45	....
46	byte 254:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp5 ... rp15
47	byte 255:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp5 ... rp15
48	           cp1  cp0  cp1  cp0  cp1  cp0  cp1  cp0
49	           cp3  cp3  cp2  cp2  cp3  cp3  cp2  cp2
50	           cp5  cp5  cp5  cp5  cp4  cp4  cp4  cp4
51	
52	This figure represents a sector of 256 bytes.
53	cp is my abbreviation for column parity, rp for row parity.
54	
55	Let's start to explain column parity.
56	cp0 is the parity that belongs to all bit0, bit2, bit4, bit6.
57	so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even.
58	Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7.
59	cp2 is the parity over bit0, bit1, bit4 and bit5
60	cp3 is the parity over bit2, bit3, bit6 and bit7.
61	cp4 is the parity over bit0, bit1, bit2 and bit3.
62	cp5 is the parity over bit4, bit5, bit6 and bit7.
63	Note that each of cp0 .. cp5 is exactly one bit.
64	
65	Row parity actually works almost the same.
66	rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254)
67	rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255)
68	rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ...
69	(so handle two bytes, then skip 2 bytes).
70	rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...)
71	for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc.
72	so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...)
73	and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, ..
74	The story now becomes quite boring. I guess you get the idea.
75	rp6 covers 8 bytes then skips 8 etc
76	rp7 skips 8 bytes then covers 8 etc
77	rp8 covers 16 bytes then skips 16 etc
78	rp9 skips 16 bytes then covers 16 etc
79	rp10 covers 32 bytes then skips 32 etc
80	rp11 skips 32 bytes then covers 32 etc
81	rp12 covers 64 bytes then skips 64 etc
82	rp13 skips 64 bytes then covers 64 etc
83	rp14 covers 128 bytes then skips 128
84	rp15 skips 128 bytes then covers 128
85	
86	In the end the parity bits are grouped together in three bytes as
87	follows:
88	ECC    Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
89	ECC 0   rp07  rp06  rp05  rp04  rp03  rp02  rp01  rp00
90	ECC 1   rp15  rp14  rp13  rp12  rp11  rp10  rp09  rp08
91	ECC 2   cp5   cp4   cp3   cp2   cp1   cp0      1     1
92	
93	I detected after writing this that ST application note AN1823
94	(http://www.st.com/stonline/) gives a much
95	nicer picture.(but they use line parity as term where I use row parity)
96	Oh well, I'm graphically challenged, so suffer with me for a moment :-)
97	And I could not reuse the ST picture anyway for copyright reasons.
98	
99	
100	Attempt 0
101	=========
102	
103	Implementing the parity calculation is pretty simple.
104	In C pseudocode:
105	for (i = 0; i < 256; i++)
106	{
107	    if (i & 0x01)
108	       rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
109	    else
110	       rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp0;
111	    if (i & 0x02)
112	       rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3;
113	    else
114	       rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2;
115	    if (i & 0x04)
116	      rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5;
117	    else
118	      rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4;
119	    if (i & 0x08)
120	      rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7;
121	    else
122	      rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6;
123	    if (i & 0x10)
124	      rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9;
125	    else
126	      rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8;
127	    if (i & 0x20)
128	      rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11;
129	    else
130	      rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10;
131	    if (i & 0x40)
132	      rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13;
133	    else
134	      rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12;
135	    if (i & 0x80)
136	      rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15;
137	    else
138	      rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14;
139	    cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0;
140	    cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1;
141	    cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2;
142	    cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3
143	    cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4
144	    cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5
145	}
146	
147	
148	Analysis 0
149	==========
150	
151	C does have bitwise operators but not really operators to do the above
152	efficiently (and most hardware has no such instructions either).
153	Therefore without implementing this it was clear that the code above was
154	not going to bring me a Nobel prize :-)
155	
156	Fortunately the exclusive or operation is commutative, so we can combine
157	the values in any order. So instead of calculating all the bits
158	individually, let us try to rearrange things.
159	For the column parity this is easy. We can just xor the bytes and in the
160	end filter out the relevant bits. This is pretty nice as it will bring
161	all cp calculation out of the for loop.
162	
163	Similarly we can first xor the bytes for the various rows.
164	This leads to:
165	
166	
167	Attempt 1
168	=========
169	
170	const char parity[256] = {
171	    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
172	    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
173	    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
174	    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
175	    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
176	    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
177	    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
178	    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
179	    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
180	    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
181	    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
182	    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
183	    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
184	    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
185	    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
186	    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0
187	};
188	
189	void ecc1(const unsigned char *buf, unsigned char *code)
190	{
191	    int i;
192	    const unsigned char *bp = buf;
193	    unsigned char cur;
194	    unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
195	    unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
196	    unsigned char par;
197	
198	    par = 0;
199	    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
200	    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
201	    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
202	    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
203	
204	    for (i = 0; i < 256; i++)
205	    {
206	        cur = *bp++;
207	        par ^= cur;
208	        if (i & 0x01) rp1 ^= cur; else rp0 ^= cur;
209	        if (i & 0x02) rp3 ^= cur; else rp2 ^= cur;
210	        if (i & 0x04) rp5 ^= cur; else rp4 ^= cur;
211	        if (i & 0x08) rp7 ^= cur; else rp6 ^= cur;
212	        if (i & 0x10) rp9 ^= cur; else rp8 ^= cur;
213	        if (i & 0x20) rp11 ^= cur; else rp10 ^= cur;
214	        if (i & 0x40) rp13 ^= cur; else rp12 ^= cur;
215	        if (i & 0x80) rp15 ^= cur; else rp14 ^= cur;
216	    }
217	    code[0] =
218	        (parity[rp7] << 7) |
219	        (parity[rp6] << 6) |
220	        (parity[rp5] << 5) |
221	        (parity[rp4] << 4) |
222	        (parity[rp3] << 3) |
223	        (parity[rp2] << 2) |
224	        (parity[rp1] << 1) |
225	        (parity[rp0]);
226	    code[1] =
227	        (parity[rp15] << 7) |
228	        (parity[rp14] << 6) |
229	        (parity[rp13] << 5) |
230	        (parity[rp12] << 4) |
231	        (parity[rp11] << 3) |
232	        (parity[rp10] << 2) |
233	        (parity[rp9]  << 1) |
234	        (parity[rp8]);
235	    code[2] =
236	        (parity[par & 0xf0] << 7) |
237	        (parity[par & 0x0f] << 6) |
238	        (parity[par & 0xcc] << 5) |
239	        (parity[par & 0x33] << 4) |
240	        (parity[par & 0xaa] << 3) |
241	        (parity[par & 0x55] << 2);
242	    code[0] = ~code[0];
243	    code[1] = ~code[1];
244	    code[2] = ~code[2];
245	}
246	
247	Still pretty straightforward. The last three invert statements are there to
248	give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash
249	all data is 0xff, so the checksum then matches.
250	
251	I also introduced the parity lookup. I expected this to be the fastest
252	way to calculate the parity, but I will investigate alternatives later
253	on.
254	
255	
256	Analysis 1
257	==========
258	
259	The code works, but is not terribly efficient. On my system it took
260	almost 4 times as much time as the linux driver code. But hey, if it was
261	*that* easy this would have been done long before.
262	No pain. no gain.
263	
264	Fortunately there is plenty of room for improvement.
265	
266	In step 1 we moved from bit-wise calculation to byte-wise calculation.
267	However in C we can also use the unsigned long data type and virtually
268	every modern microprocessor supports 32 bit operations, so why not try
269	to write our code in such a way that we process data in 32 bit chunks.
270	
271	Of course this means some modification as the row parity is byte by
272	byte. A quick analysis:
273	for the column parity we use the par variable. When extending to 32 bits
274	we can in the end easily calculate rp0 and rp1 from it.
275	(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0
276	respectively, from MSB to LSB)
277	also rp2 and rp3 can be easily retrieved from par as rp3 covers the
278	first two MSBs and rp2 covers the last two LSBs.
279	
280	Note that of course now the loop is executed only 64 times (256/4).
281	And note that care must taken wrt byte ordering. The way bytes are
282	ordered in a long is machine dependent, and might affect us.
283	Anyway, if there is an issue: this code is developed on x86 (to be
284	precise: a DELL PC with a D920 Intel CPU)
285	
286	And of course the performance might depend on alignment, but I expect
287	that the I/O buffers in the nand driver are aligned properly (and
288	otherwise that should be fixed to get maximum performance).
289	
290	Let's give it a try...
291	
292	
293	Attempt 2
294	=========
295	
296	extern const char parity[256];
297	
298	void ecc2(const unsigned char *buf, unsigned char *code)
299	{
300	    int i;
301	    const unsigned long *bp = (unsigned long *)buf;
302	    unsigned long cur;
303	    unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
304	    unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
305	    unsigned long par;
306	
307	    par = 0;
308	    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
309	    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
310	    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
311	    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
312	
313	    for (i = 0; i < 64; i++)
314	    {
315	        cur = *bp++;
316	        par ^= cur;
317	        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
318	        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
319	        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
320	        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
321	        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
322	        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
323	    }
324	    /*
325	       we need to adapt the code generation for the fact that rp vars are now
326	       long; also the column parity calculation needs to be changed.
327	       we'll bring rp4 to 15 back to single byte entities by shifting and
328	       xoring
329	    */
330	    rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff;
331	    rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff;
332	    rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff;
333	    rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff;
334	    rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff;
335	    rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff;
336	    rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff;
337	    rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff;
338	    rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff;
339	    rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff;
340	    rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff;
341	    rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff;
342	    rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff;
343	    rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff;
344	    par ^= (par >> 16);
345	    rp1 = (par >> 8); rp1 &= 0xff;
346	    rp0 = (par & 0xff);
347	    par ^= (par >> 8); par &= 0xff;
348	
349	    code[0] =
350	        (parity[rp7] << 7) |
351	        (parity[rp6] << 6) |
352	        (parity[rp5] << 5) |
353	        (parity[rp4] << 4) |
354	        (parity[rp3] << 3) |
355	        (parity[rp2] << 2) |
356	        (parity[rp1] << 1) |
357	        (parity[rp0]);
358	    code[1] =
359	        (parity[rp15] << 7) |
360	        (parity[rp14] << 6) |
361	        (parity[rp13] << 5) |
362	        (parity[rp12] << 4) |
363	        (parity[rp11] << 3) |
364	        (parity[rp10] << 2) |
365	        (parity[rp9]  << 1) |
366	        (parity[rp8]);
367	    code[2] =
368	        (parity[par & 0xf0] << 7) |
369	        (parity[par & 0x0f] << 6) |
370	        (parity[par & 0xcc] << 5) |
371	        (parity[par & 0x33] << 4) |
372	        (parity[par & 0xaa] << 3) |
373	        (parity[par & 0x55] << 2);
374	    code[0] = ~code[0];
375	    code[1] = ~code[1];
376	    code[2] = ~code[2];
377	}
378	
379	The parity array is not shown any more. Note also that for these
380	examples I kinda deviated from my regular programming style by allowing
381	multiple statements on a line, not using { } in then and else blocks
382	with only a single statement and by using operators like ^=
383	
384	
385	Analysis 2
386	==========
387	
388	The code (of course) works, and hurray: we are a little bit faster than
389	the linux driver code (about 15%). But wait, don't cheer too quickly.
390	There is more to be gained.
391	If we look at e.g. rp14 and rp15 we see that we either xor our data with
392	rp14 or with rp15. However we also have par which goes over all data.
393	This means there is no need to calculate rp14 as it can be calculated from
394	rp15 through rp14 = par ^ rp15, because par = rp14 ^ rp15;
395	(or if desired we can avoid calculating rp15 and calculate it from
396	rp14).  That is why some places refer to inverse parity.
397	Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13.
398	Effectively this means we can eliminate the else clause from the if
399	statements. Also we can optimise the calculation in the end a little bit
400	by going from long to byte first. Actually we can even avoid the table
401	lookups
402	
403	Attempt 3
404	=========
405	
406	Odd replaced:
407	        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
408	        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
409	        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
410	        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
411	        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
412	        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
413	with
414	        if (i & 0x01) rp5 ^= cur;
415	        if (i & 0x02) rp7 ^= cur;
416	        if (i & 0x04) rp9 ^= cur;
417	        if (i & 0x08) rp11 ^= cur;
418	        if (i & 0x10) rp13 ^= cur;
419	        if (i & 0x20) rp15 ^= cur;
420	
421	        and outside the loop added:
422	        rp4  = par ^ rp5;
423	        rp6  = par ^ rp7;
424	        rp8  = par ^ rp9;
425	        rp10  = par ^ rp11;
426	        rp12  = par ^ rp13;
427	        rp14  = par ^ rp15;
428	
429	And after that the code takes about 30% more time, although the number of
430	statements is reduced. This is also reflected in the assembly code.
431	
432	
433	Analysis 3
434	==========
435	
436	Very weird. Guess it has to do with caching or instruction parallellism
437	or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting
438	observation was that this one is only 30% slower (according to time)
439	executing the code as my 3Ghz D920 processor.
440	
441	Well, it was expected not to be easy so maybe instead move to a
442	different track: let's move back to the code from attempt2 and do some
443	loop unrolling. This will eliminate a few if statements. I'll try
444	different amounts of unrolling to see what works best.
445	
446	
447	Attempt 4
448	=========
449	
450	Unrolled the loop 1, 2, 3 and 4 times.
451	For 4 the code starts with:
452	
453	    for (i = 0; i < 4; i++)
454	    {
455	        cur = *bp++;
456	        par ^= cur;
457	        rp4 ^= cur;
458	        rp6 ^= cur;
459	        rp8 ^= cur;
460	        rp10 ^= cur;
461	        if (i & 0x1) rp13 ^= cur; else rp12 ^= cur;
462	        if (i & 0x2) rp15 ^= cur; else rp14 ^= cur;
463	        cur = *bp++;
464	        par ^= cur;
465	        rp5 ^= cur;
466	        rp6 ^= cur;
467	        ...
468	
469	
470	Analysis 4
471	==========
472	
473	Unrolling once gains about 15%
474	Unrolling twice keeps the gain at about 15%
475	Unrolling three times gives a gain of 30% compared to attempt 2.
476	Unrolling four times gives a marginal improvement compared to unrolling
477	three times.
478	
479	I decided to proceed with a four time unrolled loop anyway. It was my gut
480	feeling that in the next steps I would obtain additional gain from it.
481	
482	The next step was triggered by the fact that par contains the xor of all
483	bytes and rp4 and rp5 each contain the xor of half of the bytes.
484	So in effect par = rp4 ^ rp5. But as xor is commutative we can also say
485	that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can
486	eliminate rp5 (or rp4, but I already foresaw another optimisation).
487	The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15.
488	
489	
490	Attempt 5
491	=========
492	
493	Effectively so all odd digit rp assignments in the loop were removed.
494	This included the else clause of the if statements.
495	Of course after the loop we need to correct things by adding code like:
496	    rp5 = par ^ rp4;
497	Also the initial assignments (rp5 = 0; etc) could be removed.
498	Along the line I also removed the initialisation of rp0/1/2/3.
499	
500	
501	Analysis 5
502	==========
503	
504	Measurements showed this was a good move. The run-time roughly halved
505	compared with attempt 4 with 4 times unrolled, and we only require 1/3rd
506	of the processor time compared to the current code in the linux kernel.
507	
508	However, still I thought there was more. I didn't like all the if
509	statements. Why not keep a running parity and only keep the last if
510	statement. Time for yet another version!
511	
512	
513	Attempt 6
514	=========
515	
516	THe code within the for loop was changed to:
517	
518	    for (i = 0; i < 4; i++)
519	    {
520	        cur = *bp++; tmppar  = cur; rp4 ^= cur;
521	        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
522	        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
523	        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
524	
525	        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
526	        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
527	        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
528	        cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
529	
530	        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur;
531	        cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur;
532	        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur;
533	        cur = *bp++; tmppar ^= cur; rp8 ^= cur;
534	
535	        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
536	        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
537	        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
538	        cur = *bp++; tmppar ^= cur;
539	
540	        par ^= tmppar;
541	        if ((i & 0x1) == 0) rp12 ^= tmppar;
542	        if ((i & 0x2) == 0) rp14 ^= tmppar;
543	    }
544	
545	As you can see tmppar is used to accumulate the parity within a for
546	iteration. In the last 3 statements is added to par and, if needed,
547	to rp12 and rp14.
548	
549	While making the changes I also found that I could exploit that tmppar
550	contains the running parity for this iteration. So instead of having:
551	rp4 ^= cur; rp6 ^= cur;
552	I removed the rp6 ^= cur; statement and did rp6 ^= tmppar; on next
553	statement. A similar change was done for rp8 and rp10
554	
555	
556	Analysis 6
557	==========
558	
559	Measuring this code again showed big gain. When executing the original
560	linux code 1 million times, this took about 1 second on my system.
561	(using time to measure the performance). After this iteration I was back
562	to 0.075 sec. Actually I had to decide to start measuring over 10
563	million iterations in order not to lose too much accuracy. This one
564	definitely seemed to be the jackpot!
565	
566	There is a little bit more room for improvement though. There are three
567	places with statements:
568	rp4 ^= cur; rp6 ^= cur;
569	It seems more efficient to also maintain a variable rp4_6 in the while
570	loop; This eliminates 3 statements per loop. Of course after the loop we
571	need to correct by adding:
572	    rp4 ^= rp4_6;
573	    rp6 ^= rp4_6
574	Furthermore there are 4 sequential assignments to rp8. This can be
575	encoded slightly more efficiently by saving tmppar before those 4 lines
576	and later do rp8 = rp8 ^ tmppar ^ notrp8;
577	(where notrp8 is the value of rp8 before those 4 lines).
578	Again a use of the commutative property of xor.
579	Time for a new test!
580	
581	
582	Attempt 7
583	=========
584	
585	The new code now looks like:
586	
587	    for (i = 0; i < 4; i++)
588	    {
589	        cur = *bp++; tmppar  = cur; rp4 ^= cur;
590	        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
591	        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
592	        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
593	
594	        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
595	        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
596	        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
597	        cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
598	
599	        notrp8 = tmppar;
600	        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
601	        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
602	        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
603	        cur = *bp++; tmppar ^= cur;
604	        rp8 = rp8 ^ tmppar ^ notrp8;
605	
606	        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
607	        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
608	        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
609	        cur = *bp++; tmppar ^= cur;
610	
611	        par ^= tmppar;
612	        if ((i & 0x1) == 0) rp12 ^= tmppar;
613	        if ((i & 0x2) == 0) rp14 ^= tmppar;
614	    }
615	    rp4 ^= rp4_6;
616	    rp6 ^= rp4_6;
617	
618	
619	Not a big change, but every penny counts :-)
620	
621	
622	Analysis 7
623	==========
624	
625	Actually this made things worse. Not very much, but I don't want to move
626	into the wrong direction. Maybe something to investigate later. Could
627	have to do with caching again.
628	
629	Guess that is what there is to win within the loop. Maybe unrolling one
630	more time will help. I'll keep the optimisations from 7 for now.
631	
632	
633	Attempt 8
634	=========
635	
636	Unrolled the loop one more time.
637	
638	
639	Analysis 8
640	==========
641	
642	This makes things worse. Let's stick with attempt 6 and continue from there.
643	Although it seems that the code within the loop cannot be optimised
644	further there is still room to optimize the generation of the ecc codes.
645	We can simply calculate the total parity. If this is 0 then rp4 = rp5
646	etc. If the parity is 1, then rp4 = !rp5;
647	But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits
648	in the result byte and then do something like
649	    code[0] |= (code[0] << 1);
650	Lets test this.
651	
652	
653	Attempt 9
654	=========
655	
656	Changed the code but again this slightly degrades performance. Tried all
657	kind of other things, like having dedicated parity arrays to avoid the
658	shift after parity[rp7] << 7; No gain.
659	Change the lookup using the parity array by using shift operators (e.g.
660	replace parity[rp7] << 7 with:
661	rp7 ^= (rp7 << 4);
662	rp7 ^= (rp7 << 2);
663	rp7 ^= (rp7 << 1);
664	rp7 &= 0x80;
665	No gain.
666	
667	The only marginal change was inverting the parity bits, so we can remove
668	the last three invert statements.
669	
670	Ah well, pity this does not deliver more. Then again 10 million
671	iterations using the linux driver code takes between 13 and 13.5
672	seconds, whereas my code now takes about 0.73 seconds for those 10
673	million iterations. So basically I've improved the performance by a
674	factor 18 on my system. Not that bad. Of course on different hardware
675	you will get different results. No warranties!
676	
677	But of course there is no such thing as a free lunch. The codesize almost
678	tripled (from 562 bytes to 1434 bytes). Then again, it is not that much.
679	
680	
681	Correcting errors
682	=================
683	
684	For correcting errors I again used the ST application note as a starter,
685	but I also peeked at the existing code.
686	The algorithm itself is pretty straightforward. Just xor the given and
687	the calculated ecc. If all bytes are 0 there is no problem. If 11 bits
688	are 1 we have one correctable bit error. If there is 1 bit 1, we have an
689	error in the given ecc code.
690	It proved to be fastest to do some table lookups. Performance gain
691	introduced by this is about a factor 2 on my system when a repair had to
692	be done, and 1% or so if no repair had to be done.
693	Code size increased from 330 bytes to 686 bytes for this function.
694	(gcc 4.2, -O3)
695	
696	
697	Conclusion
698	==========
699	
700	The gain when calculating the ecc is tremendous. Om my development hardware
701	a speedup of a factor of 18 for ecc calculation was achieved. On a test on an
702	embedded system with a MIPS core a factor 7 was obtained.
703	On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor
704	5 (big endian mode, gcc 4.1.2, -O3)
705	For correction not much gain could be obtained (as bitflips are rare). Then
706	again there are also much less cycles spent there.
707	
708	It seems there is not much more gain possible in this, at least when
709	programmed in C. Of course it might be possible to squeeze something more
710	out of it with an assembler program, but due to pipeline behaviour etc
711	this is very tricky (at least for intel hw).
712	
713	Author: Frans Meulenbroeks
714	Copyright (C) 2008 Koninklijke Philips Electronics NV.
Hide Line Numbers


About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog