[D] Is accurately estimating image quality even possible?
I wanted to create something that could take a dataset of images and filter out the low quality images. It sounded easy but I'm now convinced it's not yet possible.
I created a paired dataset of YouTube video frames: 30k images at 480p and 30k matching images at 1080p, taking 5 evenly spaced frames from each of 6,000 videos.
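(For context, the frame extraction itself was nothing fancy. A minimal sketch of how you might sample 5 evenly spaced frames per video with OpenCV; the directory layout and filenames here are just placeholders, not my exact setup:)

```python
import cv2
from pathlib import Path

def extract_frames(video_path, out_dir, n_frames=5):
    """Save n_frames evenly spaced frames from a video as PNGs."""
    cap = cv2.VideoCapture(str(video_path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # pick indices spread evenly across the video, skipping the very first/last frame
    indices = [int(total * (i + 1) / (n_frames + 1)) for i in range(n_frames)]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(str(out_dir / f"{video_path.stem}_f{i}.png"), frame)
    cap.release()

# hypothetical layout: one folder of 1080p downloads, another of the matching 480p ones
for vid in Path("videos/1080p").glob("*.mp4"):
    extract_frames(vid, "frames/1080p")
```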
My first idea was to use LPIPS, a metric that uses the activations of a pretrained network to measure perceptual similarity between two images. If the LPIPS distance between the 480p frame upscaled to 1080p and the original 1080p frame was high, I assumed the 1080p frame was genuinely high quality and not just an enlarged copy of the 480p frame (not all 1080p videos are created equal!).
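In case it helps to be concrete, the scoring looked roughly like this, assuming the `lpips` pip package and made-up file paths:

```python
import numpy as np
import torch
import torch.nn.functional as F
import lpips
from PIL import Image

def to_tensor(path):
    """Load an image as a (1, 3, H, W) float tensor scaled to [-1, 1], as LPIPS expects."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    t = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)
    return t * 2.0 - 1.0

loss_fn = lpips.LPIPS(net="alex")  # pretrained AlexNet backbone

hi = to_tensor("frames/1080p/vid0001_f2.png")   # hypothetical paths
lo = to_tensor("frames/480p/vid0001_f2.png")

# upscale the 480p frame to the 1080p resolution before comparing
lo_up = F.interpolate(lo, size=hi.shape[-2:], mode="bilinear", align_corners=False)

with torch.no_grad():
    dist = loss_fn(hi, lo_up)  # higher distance -> the 1080p frame adds more detail (in theory)
print(float(dist))
```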
This turned out to be pretty much just a frequency detector and didn't correlate all that well with visually perceived quality. Any image with high-frequency textures was ranked as high quality, and images with large low-frequency areas (like a solid colored wall) were ranked as low quality.
That made sense in hindsight, so I tried another approach: training a CNN classifier to predict whether a patch of a frame came from a 480p or a 1080p video. Interestingly, this ended up doing the opposite. Outdoor scenes, or anything with high-frequency texture, were scored as low quality regardless of their actual quality. The reason is that downscaling a 1080p image to 480p packs the same detail into fewer pixels, which raises its spatial frequency, so frequency becomes the easiest feature for discriminating between the two classes.

I trained it again, this time upscaling all the 480p images to 1080p so that the only remaining difference should be quality. I got 100% accuracy on validation data and couldn't believe it. It turned out the network had simply learned to detect whether an image had been resized: any resized image gets a low quality score. You could take a gold-standard image, upscale it, and it would be flagged as low quality.
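For reference, by "patch classifier" I mean something like the sketch below (the architecture, patch size, and names are placeholders, not my exact setup):

```python
import torch
import torch.nn as nn

class PatchQualityNet(nn.Module):
    """Tiny CNN that classifies a fixed-size patch as coming from a 480p (0) or 1080p (1) source."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, 2)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = PatchQualityNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# dummy batch: (B, 3, 96, 96) crops; labels 0 = from a 480p video, 1 = from a 1080p video
patches = torch.rand(8, 3, 96, 96)
labels = torch.randint(0, 2, (8,))

loss = criterion(model(patches), labels)
loss.backward()
optimizer.step()
```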
At this point I did some googling to see whether there's a state of the art for this kind of thing. I found BRISQUE, and its results may be slightly better, but it still ranked basically any high-frequency texture as high quality. What's worse, it consistently ranks a 480p frame as higher quality than its 1080p counterpart, so it too is essentially a frequency detector. The problem is that frequency is not the same thing as human-perceived quality: some objects or scenes simply don't have high-frequency textures, but they should still score well if they were captured with a good camera.
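If anyone wants to reproduce the BRISQUE comparison, here's one way to score frames, assuming the `piq` package (file paths are hypothetical; lower BRISQUE is supposed to mean better quality):

```python
import numpy as np
import torch
import piq
from PIL import Image

def load01(path):
    """Load an image as a (1, 3, H, W) float tensor in [0, 1]."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    return torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)

score_480 = piq.brisque(load01("frames/480p/vid0001_f2.png"), data_range=1.0)
score_1080 = piq.brisque(load01("frames/1080p/vid0001_f2.png"), data_range=1.0)

# in my experience the 480p frame routinely comes out with the "better" (lower) score
print(float(score_480), float(score_1080))
```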
I'd be interested if anyone knows another technique or has an idea worth trying, since I spent a lot of time building this dataset.