To IEEE or to MMX?

Recently, Tom's Hardware Guide published a test of Intel's brand-new Pentium 4. The test included a FlaskMpeg run, and the finding was that the P4 performed significantly better than any AMD CPU. Soon afterwards Tom wrote a follow-up on the issue. One of his readers had suggested using the IEEE-1180 iDCT instead of the default MMX iDCT. Tom ran another test, and this time the performance conclusions were quite different: thanks to the Athlon's superior FPU, the Intel chip scored pretty badly. Effectively, the 300MHz faster P4 didn't stand a chance against any of the Athlon chips in the test.

Immediately after the article, rumors started flying in the ripping scene. People started wondering whether they'd been wrong all along to use the MMX iDCT, as suggested in every FlaskMpeg guide I know of. Clearly, this was reason enough for me to run my own tests, and here are my findings.

Test Setup

Unlike Tom, I will provide you with the full specs of my tests so that you can reproduce my results.

I used the R1 release of "The Matrix". I ripped chapters 28-30 as separate files using SmartRipper. Chapter 28 is the interrogation of Morpheus in the police building; it has a lot of noise in the gray walls and is clearly a low-motion scene. Chapter 29 is the lobby battle, probably the most famous scene in the movie. This scene involves heavy action and pushes the codec to its limits. Chapter 30 includes the fight on the roof and the explosion of the elevator, the most difficult scene in the whole movie.

I ran 3 tests at the following settings:

All tests were performed using FlaskMpeg 0.594, the AVI Output plugin 0.591 and the DivX 0.311alpha low-motion codec. I set crispness to 100. The resizing algorithm used in FlaskMpeg was HQ bicubic, with audio processing disabled. I ran each test twice, once using the MMX iDCT and once using the IEEE-1180 reference iDCT.

The results

First of all, let's take a look at the FlaskMpeg readme file. It states:

"The video information inside MPEG files is stored in the frequency domain rather than in the spatial domain (the images we see). That way, the information gets compacted and that compactation can be used to compress (reduce) the amount of information you have to send over the transmission channel. MPEG uses the DCT (Discrete Cosine Transform) to translate spatial information into frequency information. To bring back the spatial information from the MPEG stream you have to apply the iDCT, that is, the Inverse Discrete Cosine Transform, that undoes the DCT that was used during encoding.

Although MPEG is almost deterministic (given a MPEG stream the output should be identical in all decoders), the standard has a degree of freedom when choosing the iDCT to use. That way, the decoder can be more easily implemented depending on the hardware below it. What the standard requires from the decoder is that the iDCT meets IEEE-1180 specs, or in plain words, that the error from the iDCT doesn't go beyond that the ones pointed out in the IEEE-1180.

Right now, FlasK MPEG has three algorithms to perform the iDCT, all IEEE-1180 compliant. A MMX one, an integer based one and one using floating point numbers. Even when all are IEEE compliant, the floating point one is more accurate but it takes a lot more CPU time. The integer one should be enough for almost everybody without MMX and the MMX iDCT should be the default option for almost everyone."

Or in other words: all 3 algorithms are IEEE-1180 compliant, and Flasky himself recommends using the MMX iDCT.
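To make the readme's point a bit more concrete, here is a minimal Python sketch of an 8x8 iDCT done two ways: once in floating point and once with the cosine table rounded to fixed-point integers. This is not FlaskMpeg's code and it is not the MMX routine. The function names, the 11-bit table precision and the gradient test block are all assumptions made for illustration. It only shows how an integer iDCT can land within one gray level of a floating point one.

```python
import math

N = 8  # MPEG works on 8x8 blocks


def dct_basis(n=N):
    """Orthonormal DCT-II basis C, so that coeffs = C * X * C^T."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


C = dct_basis()
CT = transpose(C)


def dct2(block):
    """Forward 2D DCT, with the coefficients rounded to integers."""
    return [[round(v) for v in row] for row in matmul(matmul(C, block), CT)]


def idct2_float(coeffs):
    """Floating point iDCT: X = C^T * Y * C, rounded to whole pixel values."""
    return [[round(v) for v in row] for row in matmul(matmul(CT, coeffs), C)]


def idct2_fixed(coeffs, bits=11):
    """Integer-only iDCT: the basis is rounded to `bits` fractional bits."""
    ci = [[round(v * (1 << bits)) for v in row] for row in C]
    cti = transpose(ci)
    scaled = matmul(matmul(cti, coeffs), ci)  # carries a factor of 2^(2*bits)
    half = 1 << (2 * bits - 1)                # added so the shift rounds to nearest
    return [[(v + half) >> (2 * bits) for v in row] for row in scaled]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# A smooth gradient block (hypothetical data standing in for real picture content).
block = [[16 + 8 * i + 4 * j for j in range(N)] for i in range(N)]
coeffs = dct2(block)
print(max_abs_diff(idct2_float(coeffs), idct2_fixed(coeffs)))
```

For this block the two outputs differ by at most one gray level per pixel, which is the kind of per-pixel tolerance IEEE-1180 is concerned with. A real compliance test runs many random blocks and checks several error statistics, which this sketch does not attempt.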

So far so good.

The speed difference was quite noticeable: In the 1800kbit/s clip FlaskMpeg encoded at 6.82fps on my P3-550 whereas the same clip using the reference quality iDCT got down to 2.08fps. Is the quality difference really worth the extra encoding time?
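To put those two frame rates into perspective, here is a small back-of-the-envelope calculation. The fps figures are the ones measured above on the P3-550. The movie length and the 24 frames per second are assumptions for illustration only, and real encoding speed varies from scene to scene, so treat the hours as a rough estimate.

```python
def speed_ratio(fast_fps, slow_fps):
    """How many times longer the slow encode takes for the same clip."""
    return fast_fps / slow_fps


def encode_hours(frame_count, fps):
    """Wall-clock hours needed to encode frame_count frames at fps."""
    return frame_count / fps / 3600.0


MMX_FPS = 6.82    # measured above with the MMX iDCT
IEEE_FPS = 2.08   # measured above with the IEEE-1180 reference iDCT

ratio = speed_ratio(MMX_FPS, IEEE_FPS)

# Assumption for illustration only: a 2-hour movie at 24 frames per second.
frames = 2 * 3600 * 24

print(f"reference iDCT takes {ratio:.2f}x as long")
print(f"MMX:  {encode_hours(frames, MMX_FPS):.1f} h")
print(f"IEEE: {encode_hours(frames, IEEE_FPS):.1f} h")
```

The ratio works out to roughly 3.3, so an encode that finishes overnight with the MMX iDCT would keep the machine busy for about a full day with the reference iDCT.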

mmx idct

ieee reference

So which one is which? That's the big question. I decided not to spoil the fun for you. If you let the cursor rest over the image, you'll see a small note telling you. Below I've put some larger shots showing the area marked in red.

mmx idct | ieee reference

It's quite hard to tell which one is which, right? I must add that the source frame had a lot of noise in the walls, which is why there are so many macroblocks. Even at high bitrates some of these blocks will stay. That's just the way DivX compresses the image. I also tested TMPG in 2-pass mode, and it recreated the noise instead of being blocky. I'll leave it up to you to decide which one is better; let's move on to the next collection of screenshots.

ieee reference

mmx idct

ieee reference | mmx idct

Are these pictures really not from the same encode? I assure you they're not. One file took more than 3 times longer to encode than the other.

Let's take a look at the explosion scene a couple of seconds later:

mmx idct

ieee reference

And the zoomed version:

mmx idct | ieee reference

And to round things off, yet another series of screenshots:

mmx idct

ieee reference | mmx idct

 

Conclusions

By now you might be wondering: What is this all about? Well... it's really useful to illustrate my point.

I watched these clips several times and failed to see any noticeable difference. Of course I watched them without sound and with all the lights in my room turned off, to get the full movie feeling. I also upscaled the movies to full screen (1152x864) in order to spot errors more easily. From previous tests and my bitrate tips you might already know that when it comes to quality I tolerate no compromise. But here, as hard as I tried to spot a difference, I didn't succeed.

As you can see from the screenshots, the difference between the MMX iDCT and the IEEE-1180 reference iDCT is not noticeable, neither in the 1500kbit/s clips the screenshots were taken from, nor at lower or higher bitrates.
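If you'd rather have a number than trust my eyes, one common way to compare two frames is the peak signal-to-noise ratio (PSNR). The following is a minimal sketch, assuming you have already exported the same frame from both encodes as grayscale pixel values; how you get the pixels out of the AVI is not covered here, and the tiny frames below are made-up values, not my screenshots.

```python
import math


def mse(frame_a, frame_b):
    """Mean squared error between two equally sized grayscale frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def psnr(frame_a, frame_b, peak=255):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    error = mse(frame_a, frame_b)
    if error == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / error)


# Toy 2x3 "frames" (hypothetical values) that differ by one gray level everywhere.
frame_mmx = [[100, 120, 140], [90, 110, 130]]
frame_ieee = [[101, 121, 141], [91, 111, 131]]
print(f"{psnr(frame_mmx, frame_ieee):.2f} dB")
```

The higher the value, the closer the two frames are. Keep in mind that PSNR only measures pixel differences and doesn't always agree with what the eye sees.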

On Tom's Hardware Guide the following was written:

"However, this tends to produce a lot of artifacts in the final MPEG-4 video because all the pixel values of the decoded frames are approximations. Thus when a second DCT transform is applied to convert it to MPEG-4 it tends to approximate again and produce really horrible artifacts in some cases.

Using the IEEE decode eliminates most of these artifacts and produces an output that rivals most DVDs when set to about 20% of the original bit rate (1.5mbps for a 7.5mbps DVD like Matrix)."

That's exactly what I did. I encoded Matrix at 1500kbit/s. And yet: THERE IS NO DIFFERENCE. Sorry Toby, but you're wrong. My sight is still very good, and you can believe me that I spot encoding errors where others won't notice any flaws. But in this case the IEEE reference algorithm had no noticeable effect apart from making the encode take more than 3 times longer.
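For the record, the "20% of the original bit rate" rule from the quote is simple arithmetic. Here is a short sketch using the 7.5mbps figure from the quote; the helper names are made up for this example, and the size estimate counts 1 MB as 1000 kB and covers video only.

```python
def target_kbps(source_kbps, percent):
    """Bitrate that is `percent` percent of the source bitrate."""
    return source_kbps * percent / 100


def megabytes_per_minute(kbps):
    """Video size per minute of footage, counting 1 MB as 1000 kB."""
    return kbps * 60 / 8 / 1000


rate = target_kbps(7500, 20)
print(rate)                        # 1500.0
print(megabytes_per_minute(rate))  # 11.25
```

So 20% of a 7.5mbps DVD is exactly the 1500kbit/s I encoded at, which comes to a bit over 11 MB of video per minute.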

Or the short version: You can sleep well again in the knowledge that you won't have to reencode all your DVDs and that you won't have to buy a 1.2GHz Athlon just to be able to encode DivX at a lousy 6fps.

What the article on Tom's Hardware Guide illustrates is only this: AMD has a much better FPU than Intel. Maybe that will have an impact on MPEG-2 encoding, but it sure does not have one on MPEG-4 encoding. I think the reference quality iDCT is simply taken from the reference implementation by the MPEG Software Simulation Group. Usually these implementations are good in terms of quality, but they suck pretty badly when it comes to speed. If a software DVD player used the reference iDCT you'd be watching slideshows for the next 5 years. All useful iDCT algorithms have been at least MMX optimized, which doesn't necessarily mean that the one used in FlaskMpeg is the best; there certainly are better ones.

Let's also consider this: the initial article about DivX contained a couple of nasty errors on Tom's part. Blight (www.inmatrix.com) pointed out most of these. Among them was the suggestion to use nearest neighbor resizing, which results in really bad encodes. That has been amended since, but they still write that for small file sizes you should use that kind of filtering. But believe me, you don't want to use it; it looks terrible. And I don't think that any IEEE reference algorithm can fix what this kind of filtering destroyed.
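To see why throwing pixels away hurts, here is a minimal sketch on a single row of pixels. It compares nearest neighbor downscaling with a simple box (averaging) filter. The box filter is only a stand-in chosen for brevity, not the HQ bicubic filter FlaskMpeg offers, and both function names are made up for this example.

```python
def downscale_nearest(row, factor):
    """Keep one sample out of every `factor` and discard the rest."""
    return [row[i * factor] for i in range(len(row) // factor)]


def downscale_box(row, factor):
    """Average each group of `factor` samples into one output pixel."""
    return [sum(row[i * factor:(i + 1) * factor]) / factor
            for i in range(len(row) // factor)]


stripes = [0, 255] * 4  # fine black/white detail, 8 pixels wide

print(downscale_nearest(stripes, 2))  # [0, 0, 0, 0]
print(downscale_box(stripes, 2))      # [127.5, 127.5, 127.5, 127.5]
```

Nearest neighbor turns the striped row into solid black because every white pixel happens to be skipped, while the averaging filter at least keeps the overall brightness. On real footage the skipped pixels change from frame to frame, which shows up as jagged edges and shimmering that the codec then has to spend bits on.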