otp124 7 years ago

I used to roll my eyes at crime television shows, whenever they said "Enhance" for a low quality image.

Now it seems the possibility of that becoming realistic is increasing at a steady clip, based on this paper and other enhancement techniques I've seen posted here.

  • ACow_Adonis 7 years ago

    Except, and this is really the fundamental catch, it's not so much "enhance" as it is "project a believable substitute/interpretation".

    You fundamentally can't get back information that has been destroyed or never captured in the first place.

    What you can do is fill in the gaps/information with plausible values.

    I don't know whether this sounds like I'm splitting hairs, but it's really important that the general public not think we're extracting information in these procedures; we're interpolating or projecting information that is not there.

    Very useful for artificially generating skins for each shoe on a shoe rack in a computer game or simulation, potentially disastrous if the general public starts to think it's applicable to security camera footage or admissible as evidence...

    • ZeroGravitas 7 years ago

      To give specific examples from their test data, it added stubble to people who didn't have stubble, gave them differently shaped glasses, changed the color of cats, and changed the color and brand of a sports shoe.

      And even then, I'm a little suspicious of how close some of the images got to the original without being given color information.

      It appears that info was either hidden in the original in a way not apparent to humans or was implicit in their data set in some way that would make it fail on photos of people with different skin tones.

      • omtinez 7 years ago

        I haven't read the paper in full detail, but reading between the lines I'm guessing that there's a significant portion of manual processing and hand waving involved. From the abstract, emphasis mine:

        > the second stage uses a pixel-wise nearest neighbor method to map the smoothed output to multiple high-quality, high-frequency outputs in a controllable manner.

        My interpretation is that they select training data by hand and generate a bunch of outputs, repeating the process until they like the final result. From the paper:

        > we allow a user to have an arbitrarily-fine level of control through on-the-fly editing of the exemplar set (E.g., “resynthesize an image using the eye from this image and the nose from that one”).

        • WhitneyLand 7 years ago

          There's nothing weak or negative about that; it's exactly what you'd expect. Obviously for a given input there will be multiple plausible outputs. With any such system it would make sense to allow some control in choosing among the outputs.

        • IshKebab 7 years ago

          Could be pretty great for police sketch artists. (Although pretty misleading for juries too.)

          • adrianN 7 years ago

            Just train the model with the suspect's Facebook photostream and presto you have convincing evidence.

    • sweezyjeezy 7 years ago

      > Except, and this is really the fundamental catch, it's not so much "enhance" as it is "project a believable substitute/interpretation".

      I would argue that this is a form of enhancement though, and in some cases will be enough to completely reconstruct the original image. For example, if I give you a scanned PDF, and you know for a fact that it was size 12 black Arial text on a white background, this can feasibly let you reconstruct the original image perfectly. The 'prior' that has been encoded by the model from the large amount of other images increases the mutual information between grainy image and high-res. The catch is that uncertainty cannot be removed entirely, and you need to know that the target image comes from roughly the same distribution as the training set. But knowing this gives you information that is not encoded in the pixels themselves, so you can't necessarily argue that some enhancement is impossible. For example with celebrity images, if the model is able to figure out who is in the picture, this massively decreases the set of plausible outputs.
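
      To make the known-font argument concrete, here is a toy sketch (assuming Pillow/NumPy and a locally available arial.ttf; this illustrates the prior-matching idea, not the paper's method): render every candidate character, downsample it to the scan's resolution, and pick the candidate whose low-res rendering best matches each glyph patch.

          import numpy as np
          from PIL import Image, ImageDraw, ImageFont

          FONT = ImageFont.truetype("arial.ttf", 48)   # assumed font file
          CANDIDATES = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

          def render_lowres(ch, size=(8, 8)):
              # Render one character cleanly, then downsample to mimic the scan.
              img = Image.new("L", (48, 48), 255)
              ImageDraw.Draw(img).text((4, 0), ch, font=FONT, fill=0)
              return np.asarray(img.resize(size), dtype=float)

          TEMPLATES = {ch: render_lowres(ch) for ch in CANDIDATES}

          def best_match(lowres_patch):
              # lowres_patch: an 8x8 float array cut from the grainy scan.
              return min(TEMPLATES,
                         key=lambda ch: np.sum((TEMPLATES[ch] - lowres_patch) ** 2))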

      • trevyn 7 years ago

        > The catch is that you need to know that the target image comes from roughly the same distribution as the training set.

        When humans think about "enhance", they imagine extracting subtle details that were not obvious from the original, which implies that they know very little about what distribution the original image comes from. If they did, they wouldn't have a need for "enhance" 99% of the time -- the remaining 1% is for artistic purposes, which this is indeed suited for.

        It'll be interesting to see how society copes with the removal of the "photographs = evidence" prior.

        > when enhancing celebrity images, if the model is able to figure out who is in the picture this massively decreases the set of plausible outputs.

        This is an excellent insight.

        • ZenPsycho 7 years ago

          Do you think knowing which state the license plate is from is enough prior knowledge?

    • usrusr 7 years ago

      Yeah, replace the training set with cartoon characters and the crime show dialog goes like this:

      "Zoom! Enhance! Zoom! Enhance! Enhance! Oh my god it's full of Smurfs..."

    • 7952 7 years ago

      The benefit depends on how predictable the phenomenon you are interpolating from is. Sometimes it will be quantitatively better than a low resolution version, sometimes not.

      A good example is compression algorithms for media. They work because the sound or image is predictable, and they become ineffective when the input is more unpredictable. If the compressed output is all you have, then running the decompression will probably be better than just reading the raw compressed data, but you have to be aware of the limitations.

    • matt4077 7 years ago

      > You fundamentally can't get back information that has been destroyed or never captured in the first place.

      I love this cliché. I've seen it thousands of times, and probably written it myself a few times. We all repeat stuff like that ad nauseam, without ever thinking.

      Because it's fundamentally flawed, especially in the context that it has usually been applied to, namely criticising the CSI:XYZ trope of "enhancing images".

      The truth is that there is a lot more information in a low-res image than meets the eye.

      Even if you can't read the letters on a license plate, an algorithm may be able to recover them. If the Empire State Building is in the background, it's likely to be a US license plate. Maybe only some letters would result in the photo's low-res pattern. If you only see part of a letter, knowing the font may allow you to rule out many letters or numbers, etc.

      It's similar to that guy who used Photoshop's swirl effect to hide his face, not knowing that the effect is deterministic, and can easily be undone.

      The error mostly appears to be in assuming that the information has been destroyed, when in reality it's often just obscured. And neural nets are excellent at squeezing all the information out of noisy data.

      • amelius 7 years ago

        > It's similar to that guy who used Photoshop's swirl effect to hide his face, not knowing that the effect is deterministic, and can easily be undone.

        The effect needs to be not only deterministic but also invertible.

        A low-res image has multiple "inverses" (yikes), supposedly each with an associated probability (if you were to model it that way). So it would be more honest if the algorithm showed them all.

      • asfdsfggtfd 7 years ago

        >> You fundamentally can't get back information that has been destroyed or never captured in the first place.

        > I love this cliché. I've seen it thousands of times, and probably written it myself a few times. We all repeat stuff like that ad nauseam, without ever thinking.

        It is not a cliché; it is an absolute truth. Information not present cannot be retrieved. There may be more information present than is immediately obvious.

        > Neural nets are excellent at squeezing all the information out of noisy data

        Maybe, but they are also good at overfitting onto noisy data (the original article is an example of such overfitting).

      • rootw0rm 7 years ago

        It's not cliché, it's true. You fundamentally can't get back information that has been destroyed or never captured in the first place.

        Yes, a low-res image has lots of information. You can process that information in many ways. Missing data can't just be magically blinked into existence though.

        Copy/pasting bits of guessed data is NOT getting back information that has been destroyed or never captured. Obscured data is very different from non-existent data. Could the software recreate a destroyed painting of mine based on a simple sketch? Of course not, because it would have to invent details it knows nothing about.

        I think it's almost dangerous to call this line of thinking cliché. It should be celebrated, not ridiculed.

    • scarface74 7 years ago

      What you can do though, in limited circumstances, is create a still picture with more detail from a lower quality video.

      https://photo.stackexchange.com/questions/17098/csi-image-re...
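
      The basic trick is to register the frames and combine them, since each frame samples the scene at slightly different offsets. A minimal "shift and add" sketch (assuming a list of grayscale NumPy frames and using SciPy/scikit-image; real pipelines estimate sub-pixel shifts and do proper reconstruction):

          import numpy as np
          from scipy.ndimage import shift as nd_shift, zoom
          from skimage.registration import phase_cross_correlation

          def shift_and_add(frames, scale=4):
              # Upsample each frame, align it to the first one, then average.
              up = [zoom(f.astype(float), scale, order=1) for f in frames]
              ref = up[0]
              aligned = [ref]
              for f in up[1:]:
                  offset, _, _ = phase_cross_correlation(ref, f)
                  aligned.append(nd_shift(f, offset))
              # Averaging many aligned frames suppresses noise and can recover
              # some detail beyond a single frame, within the limits discussed
              # in the sibling comment.
              return np.mean(aligned, axis=0)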

      • murkle 7 years ago

        It's a well-known technique in astronomy, eg https://www.aanda.org/articles/aa/ps/2005/22/aa2320-04.ps.gz

        • mcbits 7 years ago

          For anyone put off by the .ps.gz, it's actually just a normal web page that links to the full article in HTML and PDF. Not sure what they were thinking with that URL. I almost didn't bother to look. (Maybe that's what they were thinking?)

      • dahart 7 years ago

        I seem to remember from my computer vision class way back when that there's a fundamental theoretical limit to the amount of detail you can get out of a moving sequence. Recovering frequencies a little higher than the pixel sampling is definitely possible, but I feel like it was maybe something like 10x theoretical maximum. I also get the feeling, from looking around at available software, that in practice achieving 2-3x is the most you can get in ideal conditions, and most video is far from ideal.

    • k__ 7 years ago

      On the other hand, this is what the brain does all the time.

      • eternalban 7 years ago

        Wouldn't it be ironic if a mystified and superstitious GAI emerged out of all these efforts?

    • throwaway613834 7 years ago

      > I don't know whether this sounds like I'm splitting hairs

      Somewhat no, but somewhat yes. Thing is, while there can be lots of input images that generate the same output, it could be that only one (or a handful) of them would occur in reality. If this happens to sometimes be the case, and if you could somehow guarantee this was the case in some particular scenario, it could very well make sense to admit it as evidence. Of course, the issue is that figuring this out may not be possible...

    • jonathanstrange 7 years ago

      The white shoe output vs black shoe output illustrates this fairly well.

    • WhitneyLand 7 years ago

      >we're interpolating or projecting information that is not there

      But that's not fully accurate either. Sometimes the result really will be a more accurate representation of reality than the blurred image. Maybe it could be described as an educated guess, sometimes wrong, sometimes invaluable.

      It would be interesting to see the results starting with higher quality images. With camera quality increasing, there should often be more data to start with.

      • phkahler 7 years ago

        >> Maybe it could be described as an educated guess, sometimes wrong, sometimes invaluable.

        When is a guess invaluable?

        • WhitneyLand 7 years ago

          When it identifies an established terrorist and prevents a mass casualty event.

    • gus_massa 7 years ago

      In the comparison of output vs. original, it is clear that the skin color is not accurate.

    • kevin_thibedeau 7 years ago

      "Ladies and gentlemen of the jury, we will definitively prove that the black smudge captured on camera was in fact a gun"

      Already being done today with DNA.

    • jopsen 7 years ago

      Sometimes the US justice system seems very "approximate", so why not convict people based on interpolated evidence?

      - I'm joking of course :) hehe

    • gambiting 7 years ago

      No, but think of these blurred images as a "hash" - in an ideal situation, you only have one value that encodes to a certain hash value, right? So if you are given a hash X, you technically can work out that it was derived from value Y. You're not getting back information that was lost - in a way it was merely encoded into the blurred image, and it should be possible to produce a real image which, when blurred, will match what you have.

      Don't get me wrong, I think we're still very far off from a situation where we can do this reliably, but I can see how you could get the actual face out of a blurred image.
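
      A toy sketch of that idea, assuming a candidate set of known high-res faces and a known degradation (here a Gaussian blur plus downsampling, using NumPy/SciPy): re-degrade each candidate and rank them by how well they match the observed low-res image. This narrows a candidate list; it does not conjure a face from nothing.

          import numpy as np
          from scipy.ndimage import gaussian_filter, zoom

          def degrade(img, sigma=3.0, factor=8):
              # Simulate the blur/downsampling pipeline we believe produced
              # the low-res observation.
              blurred = gaussian_filter(img.astype(float), sigma)
              return zoom(blurred, 1.0 / factor, order=1)

          def rank_candidates(observed_lowres, candidates):
              # candidates: dict of name -> high-res grayscale array (same size)
              scores = {name: float(np.sum((degrade(img) - observed_lowres) ** 2))
                        for name, img in candidates.items()}
              return sorted(scores, key=scores.get)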

      • ComputerGuru 7 years ago

        > you only have one value that encodes to a certain hash value, right?

        Errr wrong. A perfect hash, yes. But they're never perfect. You have a collision domain and you hope that you don't have enough inputs to trigger a birthday paradox.

        Look at the pictures in the article. It's an outline of the shoe. That's your hash. ANY shoe with that general outline resolves to that same hash.

        If your input is objects found in the Oxford English Dictionary, you'll have few collisions. An elephant doesn't hash to that outline. But if your input is the Kohl's catalog, you'll have an unacceptable collision rate.

        Hashes are attempts at creating a _truncated_ "unique" representation of an input. They throw away data (bits) they hope isn't necessary to distinguish between possible inputs. A perfect hash for all possible 32-bit values is 32 bits. You can't even have a collision-free 31-bit hash.

        So back to the blurry security camera footage of a license plate or a face. Sure, that "hash" can reliably tell you that it wasn't a sasquatch that committed the robbery, but it literally doesn't contain the data necessary to _ever_ prove it was the suspect in question, even if the techs _can_ prove that the suspect hashes to the image in the footage.

        • chrismorgan 7 years ago

          FYI (not because it’s particularly relevant to the sort of hashing that is being talked about, but because it’s a useful piece of info that might interest people, and corrects what I think is a misunderstanding in the parent comment): perfect hash functions are a thing, and are useful: https://en.wikipedia.org/wiki/Perfect_hash_function. So long as you’re dealing with a known, finite set of values, you can craft a useful perfect hash function. As an example of how this can be useful, there’s a set of crates in Rust that make it easy to generate efficient string lookup tables using the magic of perfect hash functions: https://github.com/sfackler/rust-phf#phf_macros. (A regular hash map for such a thing would be substantially less efficient.)

          Crafting a perfect hash function with keys being the set of words from the OED is perfectly reasonable. It’ll take a short while to produce it, but it’ll work just fine. (rust-phf says that it “can generate a 100,000 entry map in roughly .4 seconds when compiling with optimizations”, and the OED word count is in the hundreds of thousands.)

          • ComputerGuru 7 years ago

            Yeah, I debated bringing it up but since we were in the context of not knowing set members ahead of time, decided not to.

            Thanks for the rust-phf link. I'm bookmarking for my next project!

        • jaclaz 7 years ago

          >So back to the blurry security camera footage of a license plate or a face. Sure, that "hash" can reliably tell you that it wasn't a sasquatch that committed the robbery, but it literally doesn't contain the data necessary to _ever_ prove it was the suspect in question, even if the techs _can_ prove that the suspect hashes to the image in the footage.

          For a face, sure; but for printed text/license plates there are effective deblurring algorithms that in some cases can rebuild a readable image.

          A good piece of software (IMHO) is this one (it was freeware, now it is commercial; this is the last freeware version):

          https://github.com/Y-Vladimir/SmartDeblur/downloads

          You can try it (just for the fun of it) on the two images from this article:

          https://articles.forensicfocus.com/2014/10/08/can-you-get-th...

          https://forensicfocus.files.wordpress.com/2014/09/out-of-foc...

          https://forensicfocus.files.wordpress.com/2014/09/moving-car...

          For the first, choose "Out of Focus Blur" and play with the values; you should get a decent image at roughly Radius 8, Smooth 40%, Correction Strength 0%, Edge Feather 10%.

          For the second, choose "Motion Blur" and play with the values; you should get a decent image at roughly Length 14, Angle 34, Smooth 50%.

        • consp 7 years ago

          Fortunately there is a practical limit: the universe. You cannot encode all of its states in a hash, as that would require more states than you are encoding, as you already mentioned (pigeonhole). But representing macroscopic data like text (or basically anything bigger than atomic scale) uniquely can be done with 128+ bits. Double that and you are likely safe from collisions, assuming the method you use is uniform and not biased toward some input.

          If you want easy collision examples, take a look at people using CRC32 as a hash/digest. It is notoriously prone to collisions (since it is only 32 bits).
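
          To see how quickly a 32-bit digest collides in practice (the birthday bound is around 2^16 inputs), something like this finds a collision in well under a second:

              import os, zlib

              seen = {}
              count = 0
              while True:
                  s = os.urandom(8)
                  h = zlib.crc32(s)
                  if h in seen and seen[h] != s:
                      print(f"CRC32 collision after {count} inputs: "
                            f"{seen[h].hex()} vs {s.hex()}")
                      break
                  seen[h] = s
                  count += 1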

      • IncRnd 7 years ago

        That won't work. A lot of people have tried to create systems that they claim always compress movies or files or something else. Yet, none of those systems ever come to market. They get backers to give them cash, then they disappear. The reason they don't come to market is that they don't exist. Look up the pigeon-hole principle. It's the very first principle of data compression.

        You can't compress a file by repeatedly storing a series of hashes, then hashes of those hashes, down into smaller and smaller representations. The reason that you cannot do this is that you cannot create a lossless file smaller than the original entropy. If you could happen to do so, however, you would get down to ever smaller files, until you had one byte left. But, you could never decompress such a file, because there is no single correct interpretation of such a decompression. In other words, your decompression is not the original file.

      • ACow_Adonis 7 years ago

        Without getting too technical because I hate typing on a phone, you're technically right in the sense of a theoretical hash.

        But in real life there's collisions.

        And real-life image or sound compression, blur, artifacts, and resolution loss fundamentally destroy information in practice. It is no longer the comparatively difficult but theoretically possible task of reversing a perfect hash, but more like mapping a name to the characters/bucket RXXHXXXX, where X could be anything.

        There are lots of values we can replace X with which are plausible, but without an outside source of information we can't know what the real values in the original name were.

  • dispo001 7 years ago

    Out of sheer curiosity I had a go at manually enhancing the Roundhay Garden Scene by dramatically enlarging the frames, stacking them, aligning them, erasing the most blurred ones and the obvious artifacts.

    It went from this:

    https://media.giphy.com/media/pUf3YfamV7BV6/giphy.gif

    To this:

    http://img.go-here.nl/Roundhay_Garden_Scene.gif

    The funniest part was that the resolution really goes up if you make 1 px into 40 and align the frames accurately (then adjust opacity to the level of blur).

    The crime television thing would be possible if you have enough frames of the gangster.

  • thaw13579 7 years ago

    Approaches like these are hallucinating the high resolution images though--not something that we'd ever want being used for police work. That said, I wonder if it would perform better than eyewitness testimony...

    • smallnamespace 7 years ago

      > hallucinating the high resolution images though

      To play devil's advocate though, modern neuroscience and neuropsychology basically tell us that our brains reconstruct and recreate our memories every time we try to remember them. Our memories are highly malleable and prone to false implantation... and yet witness testimony is still the gold standard in courts.

      • gvx 7 years ago

        And experts have been calling for a long time to at least limit the power of witness testimony, precisely for those reasons.

    • smelterdemon 7 years ago

      I wouldn't want to see it used as evidence in court (and I doubt it would be allowed anyway, but IANAL), but I could see this being useful in certain circumstances for generating the photo-realistic equivalent of a police sketch, e.g. if you had low-res security footage of a suspect and an eyewitness to guide the output.

    • netsharc 7 years ago

      It would be useful to reduce the number of suspects... calculate possible combinations, match them against the mugshots database, and investigate/interrogate those people. Or if you're the NSA/KGB, you can match against the social media pictures database, and then ask the social media company to tell you where these users were at the time of the crime (since the social media app on the phone tracks its users' location...).

    • xyzzy_plugh 7 years ago

      You could, e.g., ostensibly produce valid license plates, which could be further narrowed down by matching the car color and model, to produce a small set of valid records.

      • gambiting 7 years ago

        Sure, but if we go by how the police work now, they will take a plate produced by the computer as a 100% given and arrest/shoot the owner of that plate because "the computer said so".

      • IncRnd 7 years ago

        Such an algorithm would likely get the state wrong. This is error prone and fraught with real world difficulties that could get people shot.

      • asfdsfggtfd 7 years ago

        You could also just pick a random license plate. It would be just as accurate.

  • oever 7 years ago

    This image from the article shows that the original image and the fantasy image are not alike at all. The faces look to have different ages. The computer even fantasized a beauty mark.

    http://www.cs.cmu.edu/~aayushb/pixelNN/freq_analysis.png

    The computer is fantasizing.

    • O1111OOO 7 years ago

      > This image from the article shows that the original image and the fantasy image are not alike at all.

      This is another avenue that could be further explored, which I quite like. That is, a non-artist can doodle images and create a completely new photo-realistic image based on the line drawings.

      I was modifying a few images (from link on another comment here: https://affinelayer.com/pixsrv/ ) and the end results were interesting.

  • c12 7 years ago

    The low resolution to high resolution image synthesis reminds me of the unblur tool that Adobe demoed during Adobe MAX in 2011. Here is the relevant clip if you're interested https://www.youtube.com/watch?v=xxjiQoTp864

    • ajnin 7 years ago

      That demo was quite impressive, but the technique is completely different. Adobe uses deconvolution to recover information and details that are actually in the picture but not visible (unintuitively, blurring is a mathematically reversible transformation: if you know the characteristics of the blur, then you can reverse it; in fact most of the Adobe demo's magic comes from knowing the blur kernel and path in advance, and I'm not sure how it works in practice for real photos). But the neural net demoed in this post just "makes up" the missing info using examples from photos it learned from; there is no information recovery.
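
      For the "known kernel" case, classic Wiener deconvolution is a few lines (a sketch using SciPy/scikit-image with an assumed 5x5 box-blur kernel; on real photos the kernel has to be estimated first, which is the hard part):

          import numpy as np
          from scipy.signal import convolve2d
          from skimage import data, restoration

          image = data.camera() / 255.0                # grayscale test image
          psf = np.ones((5, 5)) / 25.0                 # the assumed blur kernel
          blurred = convolve2d(image, psf, mode="same", boundary="wrap")
          blurred += 0.001 * np.random.randn(*blurred.shape)   # a bit of sensor noise

          # Wiener deconvolution inverts the blur in the frequency domain,
          # regularized so the noise isn't amplified without bound.
          restored = restoration.wiener(blurred, psf, balance=0.01)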

  • seanmcdirmid 7 years ago

    You'll get something that looks plausible for sure, maybe not what was originally there though. In the future, someone will be falsely convicted of a crime because a DNN enhance decided to put their picture in some fuzzy context.

  • jlebrech 7 years ago

    It can give possible matches; I don't think it would be admissible in court, but they could still trick a confession out of someone using that image.

    • ZeroGravitas 7 years ago

      You don't specify, but presumably you mean a true confession.

      It could also be used to generate a false confession. If the prosecutor says "We have proof you were there at the scene" and shows you some generated image, then you as an innocent person have to weigh the chances of the jury being fooled by the image (and even if it's not admissible in court, it may be enough to convince the investigating team that you are responsible and stop looking for the real perpetrator) against the expected sentences if you maintain your innocence vs. "admitting" your guilt.

    • KGIII 7 years ago

      It could also narrow down the list of suspects. From there, additional investigation can find more evidence. Having access to big data can help this.

      • jlebrech 7 years ago

        true, it cannot be used to "nail" a perp tho, just to help gain extra evidence.

        • KGIII 7 years ago

          Yup. In a court of law, the value as evidence is going to be weighted fairly low, even with expert testimony. It may be enough to get a warrant, or a piece in the process of deduction during the investigation phase.

  • api 7 years ago

    It's still impossible. These algorithms fill in gaps with their biases, not reality. If information is not there, it is not there.

  • mathw 7 years ago

    Yes!

    Although what we don't have is any certainty that the enhanced face actually looks like the killer.

nl 7 years ago

To paraphrase Google Brain's Vincent Vanhoucke, this appears to be another example where using context prediction from neighboring values outperforms an autoencoder approach.

If 2017 was the year of GANs, 2018 will be the year of context prediction.

maho 7 years ago

I hope some day this will generalize to video. I don't care about the exact shape of background trees in an action movie - with this approach, they could be compressed to just a few bytes, regardless of resolution.

  • stepik777 7 years ago

    Except that it can put trees somewhere where there were no trees but something similar to them. Or it can put in the face of a more popular actor instead of the actual, less popular one, because it was more often present in the training dataset. No, thanks.

    • TuringTest 7 years ago

      Well, isn't that basically how Hollywood makes blockbusters?

  • adrianN 7 years ago

    Plug in the script and some artist's impressions of the sets and generate the whole movie on the fly.

  • IncRnd 7 years ago

    That's what video compression does now.

    • roel_v 7 years ago

      No, today's compression is about compressing what's already in the one movie. But imagine that you run your training over hundreds or thousands of films, and extract just enough to represent, say, different types of trees in a few bytes. You could 'compress' a film by replacing data with markers that essentially describe some properties of a tree, and those properties + the training set are then used during 'decompression' to recreate (an approximation of) the tree.

      This would of course not give you any space savings when you want to distribute 1 movie. There would be some minimum number of movies where the training set + actual movies would be smaller than the sum of the sizes of the individual movies compressed.

      I'm not saying this would be a net space saver, or necessarily a good technique at all, but the concept is intriguing.
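
      The concept already exists in miniature as preset dictionaries: a dictionary shared out-of-band between compressor and decompressor (built from many files, i.e. the "training set") shrinks each individual payload. A sketch with zlib's zdict (the dictionary here is a toy stand-in, not anything trained):

          import zlib

          shared = b"tree foliage oak pine birch leaves branch trunk " * 50
          payload = b"a birch trunk with leaves on every branch"

          co = zlib.compressobj(zdict=shared)
          compressed = co.compress(payload) + co.flush()

          # The decompressor must hold the same dictionary, just like the
          # hypothetical shared "tree model" would ship with the player.
          dec = zlib.decompressobj(zdict=shared)
          assert dec.decompress(compressed) + dec.flush() == payload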

      • makapuf 7 years ago

        And some action movies could then be compressed to mere bytes if you basically have a virtual movie studio in your PC.

        • eesmith 7 years ago

          Might make it easy to watch the Sweded version.

laythea 7 years ago

I wonder if this could be applied to "incomplete" 3D models and the work shifted to the GPU!?

joosters 7 years ago

I don't understand how the edges-to-faces can possibly work. The inputs seem to be black & white, and yet the output pictures have light skin tones.

How can their algorithm work out the skin tone from a colourless image? Perhaps their training data only had white people in it?

  • dahart 7 years ago

    You never saw edges2cats I take it? https://affinelayer.com/pixsrv/

    > I don't understand how the edges-to-faces can possibly work. The inputs seem to be black & white, and yet the output pictures have light skin tones.

    The step you're missing is that an edge detector is run on the entire database of training images to produce a database of edge images. The input edge image is run against that corpus of edge images to find which ones match; the corresponding original color images are then sampled to synthesize a new color image.
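
    A rough sketch of that matching step (not the paper's actual two-stage pipeline; it assumes a list of training RGB images as NumPy arrays and uses scikit-image): edge-detect everything, then return the training image whose edge map is closest to the query sketch.

        import numpy as np
        from skimage import color, feature, transform

        def to_edges(rgb, size=(64, 64)):
            # The same edge detector is applied to the query and to every
            # training image, so they can be compared in "edge space".
            gray = transform.resize(color.rgb2gray(rgb), size)
            return feature.canny(gray).astype(float)

        def nearest_training_image(query_edges, training_rgbs):
            edge_db = [to_edges(img) for img in training_rgbs]
            dists = [np.sum((e - query_edges) ** 2) for e in edge_db]
            # The color image paired with the best-matching edge map is what
            # supplies the colors the sketch never contained.
            return training_rgbs[int(np.argmin(dists))]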

    • joosters 7 years ago

      Thanks for that link, I'd never seen that before. In fact, the edges2shoes sample on that page exactly summarises the issue I have: you start with what effectively appears to be a rough line-drawing sketch of a shoe, and the algorithm 'fills in' a realistic shoe to fit the sketch. The sketch never had any colour information, and so the algorithm has to pick one for it. In their example output, the algorithm has picked a black shoe, but it could just as realistically have chosen a red one. The colouring all comes from their training data (in their case, 50k shoe images from Zappos). So in short, the algorithm can't determine colour.

      But shoes and cats are one thing; reconstructing people's faces is another. I know the paper & the authors are demonstrating a technology here, rather than directly saying "you can use this technology for purpose X", but the discussion in these comments has jumped straight into enhancing images and improving existing pictures/video. But there is a very big line between 'reconstituting' or 'reconstructing' an image and 'synthesising' or 'creating' an image, and it appears many people are blurring the two together. Again, in the authors' defence, they are clear that they talk about the 'synthesis' of images, but the difference is critical.

      • dahart 7 years ago

        > So in short, the algorithm can't determine colour.

        That's right. But with the caveat that a large training set can determine plausible colors and rule out implausible ones. This is more true for faces than for shoes! The point is that there is some correlation between shape and color in real life. The color comes from the context in the training set. This is what @cbr meant nearby re: "skin color is relatively predictable from facial features (ex: nose width), it should be able to do reasonably well."

        There are CNNs trained to color images, and they do pretty well from training context: http://richzhang.github.io/colorization/

        > there is a very big line between 'reconstituting' or 'reconstructing' an image and 'synthesising' or 'creating' an image, and it appears many people are blurring the two together.

        Yep, exactly! Synthesis != enhance.

  • jtanderson 7 years ago

    I had the same thought. Maybe it's not that there were only white people in the dataset, but that it's actually taking the shape of the face into account, and it most closely matches those with white skin tones. I suggest this based on the cat one: it has stripes coming off the eyes, which suggests one of the grey striped breeds rather than, e.g., an all-black or calico cat. It's probably more than pixel-by-pixel NN interpolation, also taking into account some of the actual structure of the edges.

  • cbr 7 years ago

    Color comes from the initial neural network step. Since skin color is relatively predictable from facial features (ex: nose width), it should be able to do reasonably well.

    • joosters 7 years ago

      Really? With what accuracy? This is the kind of assumption that will get research groups into very deep water...

      Just imagine the kind of CCTV usage being discussed elsewhere in this thread. But the neural network happens to have a wrong bias towards skin colour...

      • radarsat1 7 years ago

        You're absolutely right to be concerned about this stuff, but be aware that it is generally acknowledged as a problem and that the "ethics of machine learning" is quite an interesting and active research topic.

        One of the best articles I've read on the topic, if you're interested: https://medium.com/@blaisea/physiognomys-new-clothes-f2d4b59...

      • dahart 7 years ago

        Image synthesis simply can't be used for up-rezzing CCTV imagery; the output is a fabrication, and the researchers have all said so. People imagining bad use cases shouldn't be relied on. ;) If an investigator used this to track down criminals, they are the ones getting into deep water and making assumptions.

imron 7 years ago

Seems to have a thing for beards.

jokoon 7 years ago

I have a large collection of images, many being accessible through google image search.

I wonder if there could be a way to "index" those images so I can find them again without storing the whole image, using some type of clever image histogram or hashing-like function.

I wonder if such a thing already exists. Since there are many images, and most images differ a lot in their data, could it be possible to create some kind of function that describes an image in such a way that entering its histogram leads you back to the indexed image (or the closest one)? I guess I'm lacking the math, but it sounds like some "averaging" hashing function.

  • dannyw 7 years ago

    That's perceptual hashing. Check out https://www.phash.org/

    • jokoon 7 years ago

      Is there a simpler way to implement it? This is a library, but aren't there more common ways to do this?
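
      The simplest one to roll yourself is probably the "average hash": shrink to 8x8 grayscale, threshold at the mean, and pack the bits into an integer; visually similar images then tend to have a small Hamming distance. A minimal sketch with Pillow:

          from PIL import Image

          def average_hash(path, hash_size=8):
              img = Image.open(path).convert("L").resize((hash_size, hash_size))
              pixels = list(img.getdata())
              avg = sum(pixels) / len(pixels)
              bits = "".join("1" if p > avg else "0" for p in pixels)
              return int(bits, 2)

          def hamming(a, b):
              # Small distance ~= perceptually similar images.
              return bin(a ^ b).count("1")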

    • mlevental 7 years ago

      So will this do something like image recognition? I.e., does it work as well as SURF/SIFT?

      • aub3bhat 7 years ago

        Perceptual hashing is useful for copy detection. It's not robust to changes/transformations, nor do the hashes encode any semantic information.

  • danielmorozoff 7 years ago

    This is the current approach for large-scale image retrieval: use some model to extract features and then perform distance calculations. This is usually done with hashing once speed matters and the size of the dataset becomes large.

ChuckMcM 7 years ago

Is anyone in the FX business playing with this stuff? I'm thinking of generating backdrops with groups of people/stuff/animals in them without a lot of modelling input.

XorNot 7 years ago

So is there an analogous process that would apply to audio, I wonder?

  • dahart 7 years ago

    Yes, this is a fairly similar concept: https://magenta.tensorflow.org/nsynth

    This is actually training a neural network on the Markov model, so it's very similar to the core ideas behind the OP's paper. The core idea is to model the probability of a bit of sound by breaking it into the last note and everything that comes before the last note (P(audio, note) = P(audio ∣ note) P(note)). If you sample a bunch of audio and factor it that way for any given point in time, and accumulate that data somewhere, you can then sample the accumulated data randomly to generate new music.

    There are other audio NN synthesis methods as well, pretty sure I've even seen one posted to ShowHN before.

  • magnat 7 years ago

    There kind of already is an audio equivalent: MIDI. It supplies low-resolution timing and pitch information, and it's up to the synthesizer to produce audio output matching that data.

    • jeeceebees 7 years ago

      I think the interesting part would be example-based audio synthesis. Could you replace a synthesizer with a neural network which, when fed examples, would allow you to generate sounds / explore some latent space between the examples?

      For example an approach similar to https://gauthamzz.github.io/2017/09/23/AudioStyleTransfer/ but then using the methods described in the PixelNN paper.

      • radarsat1 7 years ago

        I'll just plug my recent work on my sound synthesis "copier": http://gitlab.com/sinclairs/sounderfeit

        It more or less attempts to be what you describe. Not very polished yet, but I had some basic success in modeling the parameter space of a synth, and adding new latent spaces with regularization.

  • jerrre 7 years ago

    What would the lo-res starting point be? Low sample-rate, bit depth, ...?

tinyrick2 7 years ago

This is amazing. I especially like how the result can somewhat be interpreted by showing which image each part of the generated image was copied from (see Figure 5).

deevolution 7 years ago

Apparently you grow a beard after using their nn model?

  • XnoiVeX 7 years ago

    I noticed that too. I hope it is just a documentation error.

throwaway00100 7 years ago

No code available.

  • jszymborski 7 years ago

    Which is sadly par for the course in this field, or at least in my experience. You can always email the group...

    • sosuke 7 years ago

      I spent too long trying to get RAISR to work when that paper came out. You can try it out from some GitHub repos, but no one has been able to recreate the results Google presented. I would be hard-pressed to say my hi-res photos looked any better than the originals when scaled up on my iPhone screen.

      I do wish they would release the code AND any related training images they used to get those results.

Wildgoose 7 years ago

Very clever. I wonder if something like this could be used for other forms of sensor data as well?

  • dispo001 7 years ago

    Ah like, what do I look like I want to eat?

verytrivial 7 years ago

A pair of the inputs in the edges-to-faces examples are swapped. I have nagged an author.

  • verytrivial 7 years ago

    ... and I followed up with an annotated screenshot. I tried, I really did!

the8472 7 years ago

All those examples are fairly low-resolution. Does this approach scale or can it be applied in some tiled fashion? Or would the artifacts get worse for larger images?

tke248 7 years ago

Does anything like this exist on the web? I would like to send a blurry license plate picture through it and see what it comes up with.

kensai 7 years ago

OMG, now the "enhance" they say in investigative TV series and movies will actually be reality! :p

nathan_f77 7 years ago

This is cool, but in the comparison with Pix-to-Pix, it seems like Pix-to-Pix is the clear winner.

smrtinsert 7 years ago

"Enhance" is real. When will this stuff trickle into lower level law enforcement?

  • asfdsfggtfd 7 years ago

    Hopefully never. This does not enhance the image - it makes up a plausible imaginary image.

    EDIT: Furthermore, the number of plausible imaginary images that match a given input is huge (infinite?).

    • pc86 7 years ago

      You just need to look at the picture of Fred Armisen to see that this technique can generate a picture of a plausibly real human who bears no/very little resemblance to the original image.

    • smrtinsert 7 years ago

      Why not? A recreation that leads to an identification should be enough for a warrant that could be used for a continued investigation.

      • asfdsfggtfd 7 years ago

        We could also just pick a random person off the street and punish them - it would be similarly accurate and fair (actually probably fairer - if this is trained on pictures with a certain bias it will return pictures with that bias).

        This paper does not demonstrate an enhancement technique but a phenomenon which those using inverse methods call "overfitting".

  • TFortunato 7 years ago

    Hopefully never, but I'm sure someone will see this and try!

    (Because these kinds of techniques aren't really enhancing the images in a way that gives you new and useful information: they are taking the low-res images as input and giving you a plausible high-res image as output, based on their training data. They are NOT, however, trying to say "this is the ACTUAL high-res image that generated this low-res image".)

  • swamp40 7 years ago

    Love how Tom Cruise and Fred Armisen get transformed into completely different people...

  • nashashmi 7 years ago

    Should it? The accuracy of the pictures was dismal.

ScoutOrgo 7 years ago

Can we use this to identify the leprechaun and find where da gold at?

yazanator 7 years ago

Is there a GitHub repository link?

avian 7 years ago

I found the title somewhat misleading. I was expecting some clever application of the nearest-neighbor interpolation. But this seems to involve neural nets and appears far from "simple" to me (I'm not in the image processing field though).

  • dahart 7 years ago

    > I was expecting some clever application of the nearest-neighbor interpolation. But this seems to involve neural nets and appears far from "simple" to me (I'm not in the image processing field though).

    It's not that far off actually, but they are talking about nearest neighbor Markov chains, not interpolation. You probably already know nearest neighbor Markov chains because there are lots of text examples, and a ton of Twitter bots that are generating random text this way. The famous historical example was the usenet post that said "I spent an interesting evening recently with a grain of salt." https://en.m.wikipedia.org/wiki/Mark_V._Shaney

    This paper does use a NN to synthesize an image, which is conceptually pretty simple, even if difficult to implement well. After that they use a nearest neighbor Markov chain to fill in high frequencies. The first paper referenced is also the simplest example: http://graphics.cs.cmu.edu/people/efros/research/EfrosLeung....

    That paper fills missing parts of an image using a single example, by using a Markov chain built on the nearest neighboring pixels. That paper is also one of the only image synthesis papers (or perhaps the only paper) that can synthesize readable text from an image of text. That's really cool because the inspiration was text-based Markov chains.
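
    For anyone who hasn't played with the text version, the whole Mark V. Shaney idea fits in a few lines (a word-level, order-1 toy sketch):

        import random
        from collections import defaultdict

        def build_chain(text):
            # Table of word -> list of words observed to follow it.
            words = text.split()
            chain = defaultdict(list)
            for a, b in zip(words, words[1:]):
                chain[a].append(b)
            return chain

        def generate(chain, start, length=20):
            # Walk the chain, sampling each next word from the observed followers.
            out = [start]
            for _ in range(length):
                followers = chain.get(out[-1])
                if not followers:
                    break
                out.append(random.choice(followers))
            return " ".join(out)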

    • jampekka 7 years ago

      I don't think this method has anything to do with Markov Chains. The spatial structure isn't explicitly used at all, and the interpolation/regression is quite a vanilla nearest neighbor with some performance tricks.

      Well, of course almost anything can be interpreted as a Markov process, but I don't think it's a very useful abstraction here.

      • dahart 7 years ago

        > I don't think this method has anything to do with Markov Chains.

        Oh, it absolutely does. I think it's fair to say that Efros launched the field of nearest neighbor texture synthesis, and his abstract states: "The texture synthesis process grows a new image outward from an initial seed, one pixel at a time. A Markov random field model is assumed, and the conditional distribution of a pixel given all its neighbors synthesized so far is estimated by querying the sample image and finding all similar neighborhoods."

        This is the same Markov model that all subsequent texture synthesis papers are implicitly using, including the paper at the top of this thread. Efros' paper implemented directly is really slow, so a huge number of subsequent papers use the same conceptual framework, and are only adding methods for making the method performant and practical. (Sometimes, at the cost of some quality -- many cannot synthesize text, for example.)

        Note the inspiration for text synthesis, Shannon's paper, also describes the "Markoff Process" explicitly. http://math.harvard.edu/~ctm/home/text/others/shannon/entrop... (Efros referenced Shannon, and noted on his web page: "Special thanks goes to Prof. Joe Zachary who taught my undergrad data structures course and had us implement Shannon's text synthesis program which was the inspiration for this project.")

        > Well, of course almost anything can be interpreted as a Markov process, I don't think it's a very useful abstraction here.

        It's not an abstraction to build a conditional probability table and then sample from it repeatedly to synthesize a new output. That's what a Markov process is, and that's what the paper posted here is doing. I don't really understand why you feel it's distant and abstract, but if you want to elaborate, I am willing to listen!

        • jampekka 7 years ago

          Unless I horribly misread the paper, this is not based on Efros' quilting method, which indeed uses Markov fields. The method linked here seems to interpolate every pixel independently from its surroundings (neighbor means a close-by pixel in the training set in the feature space, not a spatially close pixel).

          And I didn't mean that Markov processes are abstract in any "distant" sense, but that they are an abstraction, ie a "perspective" from which to approach and formulate the problem.

          • dahart 7 years ago

            I was referring to Efros' "non-parametric sampling" paper, not the quilting one. Efros defined "non-parametric sampling" as another name for "Markov chain" -- almost (see my edit below). This paper (PixelNN) refers directly to "non-parametric sampling" in the same sense as Efros, and it states that they are using "nearest neighbor" to mean "non-parametric sampling". This is talking rather explicitly about a Markov-chain-like process.

            "To address these limitations, we appeal to a classic learning architecture that can naturally allow for multiple outputs and user-control: non-parametric models, or nearest-neighbors (NN). Though quite a classic approach [11, 15, 20, 24], it has largely been abandoned in recent history with the advent of deep architectures. Intuitively, NN works by requiring a large training set of pairs of (incomplete inputs, high-quality outputs), and works by simply matching the an incomplete query to the training set and returning the corresponding output. This trivially generalizes to multiple outputs through K-NN and allows for intuitive user control through on-the-fly modification of the training set..."

            Note the first reference #11 is Efros' non-parametric sampling, and that the authors state this is the "classic approach" that they apply here.

            What you call "interpolate every pixel independently from its surroundings" could be another way to describe a Markov chain, because 1: it is sampled according to the conditional probability distribution (which is what you get by using the K nearest matches), and 2: the process is repeated - one pixel (or patch) is added using the best match, then it becomes part of the neighborhood in the search for the pixel/patch next door. The name for that is "Markov process", or in the discrete case, "Markov chain", if you take an unbiased random sample from the conditional distribution. If you always choose the best sample, then it's the same as a Markov chain, but biased.

            > (neighbor means a close-by pixel in the training set in the feature space, not a spatially close pixel)

            That's right, and that's why it's misleading to talk about nearest neighbor interpolation: that phrase is a graphics term that means interpolating from spatially close pixels. Hardly anyone else calls it interpolation; they call it sampling, point sampling, and other terms.

            *EDIT:

            I'm going to relax a little bit on this. "Non-parametric sampling" is a tiny bit different from a Markov process in that a Markov process attempts to simulate a distribution in an unbiased way. By using the best match instead of a random sample from the conditional distribution, the output may produce a biased version of the original distribution. This is why it's called non-parametric sampling instead of calling it a Markov chain, but the distinction is pretty small and subtle -- texture synthesis using non parametric sampling is extremely similar to a Markov chain, but not necessarily exactly the same.

            Side note, it's really unfortunate they used the abbreviation "NN" to talk about "nearest neighbor" in a paper that also builds on "neural networks".

  • jampekka 7 years ago

    AFAIU it actually seems to be sort of "just" a clever application of the nearest-neighbor interpolation. The CNN is used to come up with the feature space for the pixels (weights of the CNN), and then each pixel is "copy-pasted" from the training set based on the nearest match.

    It seems that this could be used in theory with any feature descriptors, such as local color histograms, although the results wouldn't probably be as good.

    Edit: Being a nearest neighbor method, it probably also carries the usual computational complexity problems of the approach. If I understand it correctly, they ease this by first finding just a subset of best-matching full images using the CNN features and then doing a local nearest neighbor search only within those images.

    • dahart 7 years ago

      > The CNN is used to come up with the feature space for the pixels (weights of the CNN), and then each pixel is "copy-pasted" from the training set based on the nearest match.

      FWIW, what you just described is known as a "Markov process". It is sampling a known conditional probability distribution.

      While some interpolation of the data happens because the output represents a mixture of the training images, this is not "interpolation" at the pixel level, it's picking best matches from a search space of image fragments. (And the pixel neighbors are usually synthesized - the best match depends on previous best matches!) This is distinctly different from the kind of nearest neighbor interpolation you'd do when resizing an image.

      Note the phrase "nearest neighbor" in this paper has an overloaded double meaning. It is referring both to pixel neighbors and neighbors in the search space of images. The pixel neighbors provide spatial locality within a single image; this is how & why high frequencies are generated from the training set. Nearest neighbor is also referring to the neighborhood matches in the search space, the K nearest neighbors of a given pixel neighborhood are used to generate the next K pixel outputs in the synthesis phase.

  • SeanDav 7 years ago

    Agree. This appears to be more a clever implementation of an algorithm generating "artistic" impressions, in some cases creating artifacts which simply were not part of the original picture.

    • doomlaser 7 years ago

      The term in neural net research is 'face hallucination': https://people.csail.mit.edu/celiu/FaceHallucination/fh.html

      Take a low resolution input image, and hallucinate a higher resolution version by statistically assembling bits from similar images in a large data set of training images.

      • phkahler 7 years ago

        If anyone ever tries to use this in court I hope they call it "Face Hallucination" and not "Image Reconstruction". On the research side, I wonder what the point of this is. I find it interesting but of little practical value.

        • aristus 7 years ago

          It's a way to refine their models. A systematic model-based representation of data is basically also a generator of that data.

          Why is that? Blame Kolmogorov. There are deep connections between compression, serialization, and computation. An optimal compression scheme is a serialization and the Turing-complete program to decode it. For example: you can compress pi into a few lines of algorithm plus a starting constant like 4.
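
          If the "constant 4" refers to the Leibniz series, the point can be made concrete: the few lines below generate as many (slowly converging) digits of pi as you have patience for, so the program itself is the compressed representation of the data.

              def leibniz_pi(terms=1_000_000):
                  # pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...)
                  total = 0.0
                  for k in range(terms):
                      total += (-1) ** k / (2 * k + 1)
                  return 4 * total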

  • sctb 7 years ago

    We've updated the title from the submitted “Simple Nearest-Neighbor Approach Creates Photorealistic Image from Low-Res Image” to that of the article.

  • negamax 7 years ago

    I was impressed that the title didn't say AI.

imaginenore 7 years ago

It almost looks like they mixed training and testing data in some of the examples. The bottom-left sample in the normals-to-faces is extremely suspicious.

  • jj12345 7 years ago

    I was looking at this as well, but I'm willing to suspend my disbelief because the normal vaguely looks like it has a good deal of information (in a basic fidelity sense).

    • jameshart 7 years ago

      It seems astonishing that the normal information includes enough detail to tell you which direction the eyes are pointing, though?

Annella 7 years ago

Thanks for sharing! Very interesting!

mlwelles 7 years ago

I noticed that all of the human examples are caucasian. I'd be very interested to see how accurate it is with a more representative range of human faces, more so than how it handles animals or handbags.

Had personal experience on a project where the facial scanning engine failed spectacularly when anyone except white men like me tried to use it.

An experience that's pretty common, too:

https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&c...

http://www.telegraph.co.uk/technology/2017/08/09/faceapp-spa...

https://www.theatlantic.com/technology/archive/2016/04/the-u...

https://www.theguardian.com/technology/2017/may/28/joy-buola...