One of my favorite types of articles to create is one where I leap into something new and share the results of the endeavor with the rest of the community. Recently, I found out that it is entirely possible to run some of the “AI” image manipulation tools locally and not have to deal with the vagaries of a web-hosted service. More interestingly, I could make adjustments under the hood to suit the type of output I was looking for.

Knowing absolutely nothing, I immediately asked the internet for instructions. The internet, being the internet, gave me guides full of lies. Well, that’s uncharitable. The fact of the matter is the technology is changing rapidly, being in its formative years. So the guides I first worked from had been accurate – but were so out of date that they referenced files and sites that were already gone, replaced by the new. This still resulted in hours of frustration and annoyance as I tried to salvage the garbled setup that had been birthed by these outdated guides. This is why I won’t link to the exact guide I ultimately used, as it may soon be out of date, and totally not because I forgot to record which one it was, no siree.

All of the iterations I tried to get working were some variant of the Stable Diffusion setup? Engine? Application? I can’t call it a model, because models have a very specific meaning within the software, and none of the sites really cared to be as precise as I’d have liked. So for a lot of things, I started my internet journey anew, looking for specific answers to questions.

The first version? Distro? I’ll call it a distro. The first distro of Stable Diffusion I got running was Automatic1111, which was recommended for beginners. I was a beginner, so I thought it might be good for me. I was wrong, as I disliked the interface. I felt like I kept breaking it, or causing things to happen when I was just trying to get a tooltip. So I moved on and instead installed the other web frontend for Stable Diffusion most commonly referenced – ComfyUI. ComfyUI has the distinction of a much steeper learning curve up front, but once you get used to it, it is easier to see and control the flow of logic for the image processing. Instead of the form-style interface of Automatic1111, ComfyUI uses a field of nodes, which you plug together to set up a workflow. Each node does one specific thing, which helped me get at some of the concepts more easily than arguing with a static-looking web page.


I don’t have a picture of the default workflow as it comes out of the box, so I cleaned up my workflow to a more basic state. I left in nodes that are not currently connected in the lower left, which can be freely ignored for the time being. If you can’t trace a route that ends up at the “Save Image” node on the far right, it’s not in the logic for the active workflow. This is set up to generate an image from the loaded model using prompts. Nothing fancy, just spit out a new image. Since we’re talking concepts, I’ll at least go over the connected nodes, and why they’re there.

The workflow always starts at the checkpoint loader. That is the node at the far left. A checkpoint is a set of tagged data that associates visual elements with English keywords. You can only have one checkpoint in a workflow; you can’t start from more than one place. Of course, a checkpoint alone might not give you exactly what you’re looking for, and being able to supplement it is useful. So the idea of LoRAs was added. LoRA stands for “Low-Rank Adaptation”; it is a set of data that adjusts the checkpoint, skewing it towards a particular visual style, giving more weight to matching certain keywords with features, or adding new keywords altogether. It is fine-tuning for the checkpoint. If you add too many LoRAs, you will end up with poorer-quality images as the additional skewing of weights pulls the results hither and yon. The combined checkpoint plus LoRAs is the model. The model is used as a database of image elements for later processing. I most often have a LoRA called Dreamshaper in my workflow because it skews towards a digital painting style rather than photoreal. The photoreal images are too uncanny valley for me, so the quirks of the computer-generated remix stand out more than they would in something that looks like a piece of artwork, where errors are easier to gloss over mentally.
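For anyone who would rather read text than squint at node spaghetti, here is roughly what those first two nodes look like in the “API format” JSON that ComfyUI can export, written out as a Python dictionary. Treat it as a sketch of my understanding rather than gospel; the node class names are the stock ones as far as I can tell, and the file names are placeholders for whatever checkpoint and LoRA you actually have installed.

```python
# The first two nodes of the workflow in ComfyUI's API-format JSON, as a
# Python dict. A value like ["1", 0] means "output 0 of node 1".
model_nodes = {
    "1": {  # Load Checkpoint: outputs MODEL (0), CLIP (1), and a built-in VAE (2)
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "my_checkpoint.safetensors"},  # placeholder file name
    },
    "2": {  # Load LoRA: fine-tunes both the model weights and the CLIP text encoder
        "class_type": "LoraLoader",
        "inputs": {
            "model": ["1", 0],
            "clip": ["1", 1],
            "lora_name": "my_lora.safetensors",  # placeholder file name
            "strength_model": 1.0,  # how hard the LoRA skews the checkpoint
            "strength_clip": 1.0,
        },
    },
}
```

Everything downstream takes its model and clip connections from node 2 instead of node 1, which is really all “adding a LoRA” amounts to.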

After the “Load LoRA” node, our simple workflow splits between two “Text Encode” nodes and a “KSampler”. Since everything feeds into the sampler node, we’ll start with the Text Encode boxes. These are where you enter the prompts for the image processing engine to pull data out of the model. For the most part these are English words or phrases to be interpreted by the machine, hopefully making something that doesn’t look like nightmare fuel. There are two boxes, corresponding to the prompt and the negative prompt. The software tries to match elements from the prompt and tries to avoid elements from the negative prompt. Because it isn’t actually intelligent, it doesn’t always succeed. You can give an element more or less weight by wrapping it in parentheses and adding a colon and a number after the keyword, like (green eyes:1.2). With one being normal, these values are usually one point something or zero point something.
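In the same API-format sketch, the two prompt boxes are identical “CLIPTextEncode” nodes fed from the LoRA loader’s CLIP output; what makes one of them “negative” is simply which sampler input it gets plugged into later. The weighting syntax goes right inside the text. The prompt strings here are just short stand-ins, not recommendations.

```python
# The two prompt boxes: same node type, different destination on the sampler.
prompt_nodes = {
    "3": {  # positive prompt, with one weighted keyword
        "class_type": "CLIPTextEncode",
        "inputs": {"clip": ["2", 1],
                   "text": "knight in fantasy armor, (green eyes:1.2), digital painting"},
    },
    "4": {  # negative prompt: things the sampler should steer away from
        "class_type": "CLIPTextEncode",
        "inputs": {"clip": ["2", 1],
                   "text": "blurry, (worst quality:2), bad anatomy"},
    },
}
```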

The sampler box has a whole bunch of connections and configuration options. That’s because the sampler is where the work gets done. It takes in our model and the prompts we just discussed, but it also takes in this new input called “latent_image”. At the moment, we won’t be messing around with the latent image, but it does come into play for the logic. In this setup we’re using an Empty Latent Image node to feed the sampler, because we’re not using an existing image as an input for processing. If you take the output from the empty latent image node, skip the sampler, and just decode it, you get a brown field of nothing.

A blank latent image.

This really doesn’t matter unless you have the value of the last option in the sampler set wrong. “Denoise” is very poorly named as a configuration element. It tells the sampler how much of the incoming latent image it is allowed to ignore. At a value of one, the latent image is ignored entirely and the sampler is free to assemble whatever image it likes. We only adjust the denoise setting when we’re making changes to an existing image, to control how faithful the result is to the source image. So we’ll come back to that later.
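Continuing the sketch, here is how I understand the empty latent and sampler nodes in API form. The seed, steps, and cfg values are just plausible placeholders of my own, not recommendations.

```python
# The blank canvas and the sampler itself. The references to nodes "2", "3",
# and "4" point back at the loader and prompt nodes from the earlier snippets.
sampler_nodes = {
    "5": {  # an empty latent image (the brown field of nothing shown above)
        "class_type": "EmptyLatentImage",
        "inputs": {"width": 512, "height": 512, "batch_size": 1},
    },
    "6": {  # the KSampler, where the actual image generation happens
        "class_type": "KSampler",
        "inputs": {
            "model": ["2", 0],         # checkpoint + LoRA
            "positive": ["3", 0],      # prompt
            "negative": ["4", 0],      # negative prompt
            "latent_image": ["5", 0],  # the starting latent
            "seed": 42,
            "steps": 20,
            "cfg": 7.0,
            "sampler_name": "uni_pc",
            "scheduler": "normal",
            "denoise": 1.0,  # 1.0 = ignore the latent entirely; lower it only for img2img
        },
    },
}
```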

The output from the sampler feeds into a “VAE Decode” node that takes an input from a “VAE Loader”. So what is a VAE? According to the internet it is a “Variational Autoencoder”; the decode node is what turns the sampler’s latent output into an actual pixel image, and the particular VAE you load has something to do with filtering out junk and improving the image quality. To be honest, I don’t really understand what the VAE does beyond that. I found this orangemix one referenced in a good-looking image online, and after I switched the default to it, things looked less bad, so I kept it and haven’t explored why it works.
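The last two pieces before saving, again as I understand the API format, with a placeholder VAE file name.

```python
# The separately loaded VAE and the decode step that turns the sampler's
# latent output into pixels.
vae_nodes = {
    "7": {  # Load VAE (placeholder file name; swap in whichever one you use)
        "class_type": "VAELoader",
        "inputs": {"vae_name": "my_vae.safetensors"},
    },
    "8": {  # VAE Decode: latent in, pixels out
        "class_type": "VAEDecode",
        "inputs": {"samples": ["6", 0], "vae": ["7", 0]},
    },
}
```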

Lastly, the “Save Image” box gives us a preview of the output and writes a copy to disk in the “outputs” folder of your ComfyUI installation. So, we’ve reviewed everything we’ve gone and set up for basic image generation.
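Putting the whole sketch in one place: the full dictionary plus the couple of lines it takes to queue it against a locally running ComfyUI over its /prompt endpoint (the default address is 127.0.0.1:8188). Again, this reflects my understanding of the API format; the file names, prompt strings, and seed are placeholders.

```python
import json
import urllib.request

# The whole graph from the walkthrough above, plus the Save Image node.
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "my_checkpoint.safetensors"}},
    "2": {"class_type": "LoraLoader",
          "inputs": {"model": ["1", 0], "clip": ["1", 1],
                     "lora_name": "my_lora.safetensors",
                     "strength_model": 1.0, "strength_clip": 1.0}},
    "3": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["2", 1], "text": "knight in fantasy armor, digital painting"}},
    "4": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["2", 1], "text": "blurry, (worst quality:2), bad anatomy"}},
    "5": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 512, "height": 512, "batch_size": 1}},
    "6": {"class_type": "KSampler",
          "inputs": {"model": ["2", 0], "positive": ["3", 0], "negative": ["4", 0],
                     "latent_image": ["5", 0], "seed": 42, "steps": 20, "cfg": 7.0,
                     "sampler_name": "uni_pc", "scheduler": "normal", "denoise": 1.0}},
    "7": {"class_type": "VAELoader", "inputs": {"vae_name": "my_vae.safetensors"}},
    "8": {"class_type": "VAEDecode", "inputs": {"samples": ["6", 0], "vae": ["7", 0]}},
    "9": {"class_type": "SaveImage",
          "inputs": {"images": ["8", 0], "filename_prefix": "knight"}},
}

# Queue the workflow against the local ComfyUI instance.
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode("utf-8"))
```

The finished image ends up in the same place it would if you used the button in the browser.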

As it stands, if we hit the “Queue Prompt” button…

There are no prompts and it produces something very random.

It'll sell for $5.5 Million, right?

I suppose this might be called “artistic” by modern schools.

We have to ask it to make something. Let’s make a blond fantasy knight on foot somewhere rural. The software is, well, software, so it’s dumb as a box of rocks, and will do exactly what it is told, even if that’s not what you meant to tell it to do. Combined with my own inexperience, some of these elements are cargo-cult engineering at the level of “do this and get that”. I’m going to start with an unrefined prompt of “knight in fantasy armor, sleek, highly detailed, digital painting, realistic digital painting, outdoors, rural, bright, day, looking at viewer, caucasian, blond, green eyes, male, masculine, 1man”.

The negative prompts are also important to the equation. These are things we don’t want the software to draw for us. And to be honest, I don’t know what some of these notations do. But we’ll use “low contrast, sketch, watercolor, bad hands, photo, deformed, black and white, disfigured, modern,((blurry)), animated, cartoon, duplicate, child, childish, (worst quality:2), (low quality:2), (normal quality:2), lowres, bad anatomy, normal quality, ((monochrome)), ((grayscale)), ((text, font, logo, copyright, watermark:1)), easynegative, badhandv4, missing limb, wrong number of limbs, bad proportions”.

Let’s hit “Queue Prompt”.

With my hardware and the settings currently in use, it’s rather quick.

It’s doing something…

First attempt.

Eh… it’s okay.

Let’s refine some prompts. The colors are too bright, so let’s move “bright” down to the negative prompt and replace it with “soft outdoor lighting”. I’m also going to add some prompts: “masterpiece, best quality, smooth gradients, detailed face, realistic skin tone, youthful, strong”. Just telling the sampler you want a masterpiece won’t make it so, but it will skew towards pulling elements that the humans training the model tagged as such. Same thing with the “best quality” and “worst quality” sort of tags. While it may feel silly to me to have to tell it that, I remind myself that it’s not intelligent; it’s got chips for brains and just follows orders.

So we kick it again with new prompts… and it doesn’t improve.

Second attempt

What is bugging me is the way it’s drawing skin tones, and with the checkpoint I picked for this example, we’re not going to get it to go away with just generic prompts. You see, the checkpoint I chose was optimized for use with a particular sampler. The sampler node has been set to uni_pc this whole time. I don’t recall if that is a default or if I’d been messing around and switched to that. However, the creator of the checkpoint noted that it worked best with DPM++ 2M. In ComfyUI, that shows up as dpmpp_2m. So when I rerun the last set of prompts with the sampler changed to match the checkpoint documentation, I get something with less exaggerated rouge.
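If you are driving things from the script sketched earlier instead of the browser, that change is a single field on the KSampler node; the sampler_name strings appear to match the names in ComfyUI’s drop-down.

```python
# Swap the sampler to the checkpoint author's recommended DPM++ 2M,
# continuing the workflow dictionary from the earlier script.
workflow["6"]["inputs"]["sampler_name"] = "dpmpp_2m"  # was "uni_pc"
```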

With different sampler

The art style still makes him look too young for the battlefield, but that’s baked into the checkpoint in question. You can only get out of the machine remixer what you put in. There’s a pile of checkpoints and LoRAs out there, some with very narrow data sets. And that’s actually where the ethical quandary comes in. There’s no way for me, as the end user, to be sure that whoever trained these models only used images they had the right to use. Human artists are rightfully upset that their hard work is getting fed into some of these remix machines without their knowledge or permission. Morally, I’m on the side of the artists, and for book covers, I’m going to go pay someone to draw something new. This dalliance is just a toy and a learning experience.

As this article is getting a bit long, we’ll look at reprocessing an existing image in another installment.

PostScript – After some more futzing about, I managed to create this image, ironically by adding in a LoRA intended to create anime characters. ¯\_(ツ)_/¯

PostPostScript – As a show of my own ignorance, I’ve recently realized that DreamShaper was not a LoRA, but a checkpoint, so trying to load it as a LoRA does nothing. Its effect was entirely placebo. I’m a tad embarrassed, but will admit that I am far from an expert. Though I have decoded some of the magic words in the prompts. I’ll cover them in a future installment.