Last week I joined a webinar on fine-tuning a vision foundation model to detect cancer in pathology slides. I'm not a pathologist. I can't read a slide, and a good chunk of the biology went over my head. But the machine-learning shape of the problem is something any ML researcher can follow, and by the end I could rebuild the pipeline on paper. This is that explanation, written for people like me who work in AI but not in medicine.
First, what a pathology slide even is
When a doctor removes a piece of tissue, a lab stains it (usually with H&E, which turns cell nuclei purple and other structures pink) and mounts it on glass. A pathologist looks at it under a microscope to judge whether the cells look cancerous.
To bring AI in, the slide gets scanned into a digital file called a whole-slide image, or WSI. The first surprise: these files are enormous. A single scanned slide can run to 100,000 by 100,000 pixels. That is ten billion pixels, a multi-gigapixel image roughly 200,000 times larger than the 224-by-224 input a standard vision model takes. You cannot feed a whole slide into a network the way you feed it a photo of a cat.
Figure 1 — A gigapixel slide is cut into thousands of small tiles (patches) before any model sees it.
Figure 2 — An H&E-stained whole-slide image. Illustrative; source: Automated Tumour Detection in Whole Slide Images: An End-to-End Deep Learning Pipeline (https://balintstewart77.github.io/camelyon16-pathology/)
The gigapixel problem, and the weak-label twist
The workaround is tiling. You chop the giant slide into thousands of small patches, often 256 by 256 pixels, and treat each patch as an image the model can handle. One slide becomes ten thousand little pictures. Before that, teams run a quick tissue-detection step to throw away the blank glass, and often a stain-normalization step (methods with names like Macenko and Vahadane) so slides from different labs don't look wildly different in color.
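To make the tiling step concrete, here is a minimal sketch in plain Python. It is not the webinar's code. The function names, the grayscale input, and both thresholds in `is_tissue` are my own assumptions, and a real pipeline would read pixels from the slide file and use NumPy rather than lists.

```python
from typing import Iterator, List, Tuple


def tile_origins(width: int, height: int, tile: int = 256) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (x, y) of every full tile in a width x height image.

    Partial tiles on the right and bottom edges are dropped (a simplification).
    """
    for y in range(0, height - tile + 1, tile):
        for x in range(0, width - tile + 1, tile):
            yield x, y


def is_tissue(gray_pixels: List[int], glass_level: int = 220,
              min_fraction: float = 0.1) -> bool:
    """Keep a tile if enough of its pixels are darker than blank glass.

    gray_pixels holds 0-255 grayscale values. Both thresholds are
    illustrative guesses, not values from the webinar.
    """
    dark = sum(1 for p in gray_pixels if p < glass_level)
    return dark / len(gray_pixels) >= min_fraction


# Full grid for the 100,000 x 100,000 example: 390 tiles per side.
n_tiles = sum(1 for _ in tile_origins(100_000, 100_000))
print(n_tiles)  # 152100

# Two toy 4x4 "tiles": one blank glass, one with a dark stained corner.
blank_tile = [245] * 16
stained_tile = [245] * 12 + [120] * 4
print(is_tissue(blank_tile), is_tissue(stained_tile))  # False True
```

The full grid is the count before any filtering. How many tiles survive depends on how much of the slide is tissue and on the magnification you tile at.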
Tiling solves the size problem and creates a new one. You now have ten thousand patches per slide, and for most of them, nobody has told you which contain the cancer. The label you actually hold sits at the level of the whole slide: this patient has cancer, this one doesn't. The needle is somewhere in the haystack, and you were handed only the fact that a needle exists. In ML terms this is weak supervision, and it drives the whole design.
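The needle-in-a-haystack setup has a standard formalization in multiple-instance learning: a slide is a "bag" of patches, and the bag is positive if at least one patch is. The sketch below only illustrates that assumption. `SlideBag` and `slide_label_from_patches` are hypothetical names of mine, and the patch-level labels are shown purely to make the point that you never see them during training.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlideBag:
    """One slide as a bag: many patches, one label for the whole bag."""
    slide_id: str
    patch_ids: List[str]
    has_cancer: bool  # the only supervision you actually hold


def slide_label_from_patches(patch_is_tumor: List[bool]) -> bool:
    """Standard MIL assumption: a slide is positive if any patch is positive."""
    return any(patch_is_tumor)


# Hidden ground truth for a toy slide: one tumor patch among five.
hidden_patch_labels = [False, False, True, False, False]
bag = SlideBag(
    slide_id="slide_001",
    patch_ids=[f"patch_{i}" for i in range(len(hidden_patch_labels))],
    has_cancer=slide_label_from_patches(hidden_patch_labels),
)
print(bag.has_cancer)  # True
```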
What the foundation model brings
Here is where the vision foundation model, or VFM, comes in, and where the webinar clicked for me.
A pathology VFM is a large vision transformer already trained on an immense pile of unlabeled patches. Virchow, one of the well-known ones, is a 632-million-parameter model trained on roughly 1.5 million whole-slide images with self-supervised learning (the DINOv2 approach from general computer vision). UNI, Prov-GigaPath, and CHIEF are other examples. Self-supervised means it learned the visual structure of tissue with nobody labeling cancer versus benign, the same way a language model learns from raw text.
The payoff: this model already knows what tissue looks like. Hand it a patch and it returns a compact numerical fingerprint, an embedding, that captures the meaningful content. You didn't teach it cells, staining, or texture from scratch. Someone spent enormous compute doing that once and released the weights.
Fine-tuning: you train very little
This reframes the task, and it's the part most relevant to non-medical ML people. You are not building a cancer detector from zero. You are adapting a model that already sees tissue clearly.
The webinar laid out three levels of effort:
The lightest and most common approach freezes the foundation model completely. You run every patch through it once, collect the embeddings, and discard the pixels. Then you train a small aggregator that takes the bag of patch embeddings from one slide and produces a single slide-level prediction. Because you only have slide-level labels, this aggregator is a multiple-instance-learning head with attention: it learns which patches deserve attention and downweights the rest. The attention scores come free, and they show you where on the slide the model is looking.
A middle option adds linear probing or small adapters on top, still keeping the backbone mostly frozen.
The heaviest option fine-tunes the foundation model's own weights, usually with a parameter-efficient method like LoRA so you aren't updating all 632 million of them. It costs the most compute and needs the most labeled data. The presenters' honest take: most teams don't need it. The frozen-encoder-plus-attention-head route gets you far, and full fine-tuning mainly pays off when you have a lot of clean, task-specific data.
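A little arithmetic shows why LoRA is the usual choice for the heaviest option. The numbers below are my assumptions, not the webinar's: a ViT-Huge-sized backbone with 32 transformer layers of width 1280, LoRA applied to two square projection matrices per layer (a common choice is query and value), and rank 8.

```python
def lora_trainable_params(d_model: int, n_layers: int, rank: int,
                          matrices_per_layer: int = 2) -> int:
    """Count the weights LoRA adds to square d_model x d_model matrices.

    Each adapted matrix gains two low-rank factors, A (rank x d_model) and
    B (d_model x rank), for 2 * d_model * rank new weights in total.
    """
    return n_layers * matrices_per_layer * 2 * d_model * rank


BACKBONE_PARAMS = 632_000_000  # the parameter count quoted for Virchow

# Assumed architecture and LoRA settings (see the paragraph above).
extra = lora_trainable_params(d_model=1280, n_layers=32, rank=8)
print(extra)                             # 1310720
print(f"{extra / BACKBONE_PARAMS:.2%}")  # 0.21%
```

Under those assumptions you train about 1.3 million weights and leave the other 632 million untouched. The compute cost remains, because gradients still flow through the full backbone.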
Figure 3 — The common pipeline: the frozen VFM (blue) turns each patch into an embedding; a small attention-based aggregator (green, the only part you train) combines them into one slide-level call.
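The green box in Figure 3 is small enough to sketch by hand. Below is a forward pass of one common form of attention pooling: score each patch embedding, softmax the scores into weights, and take the weighted average as the slide vector. This is my own illustration in plain Python with made-up toy weights, not the presenters' implementation. In practice you would write it in PyTorch or JAX and learn `V`, `w`, and the classifier by gradient descent.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vector) -> Vector:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(embeddings: List[Vector], V: List[Vector],
                   w: Vector) -> Tuple[Vector, Vector]:
    """Return (slide_vector, attention_weights).

    score_k = w . tanh(V h_k), weights = softmax(scores),
    slide_vector = sum_k weights_k * h_k
    """
    scores = [dot(w, [math.tanh(dot(row, h)) for row in V]) for h in embeddings]
    weights = softmax(scores)
    dim = len(embeddings[0])
    pooled = [sum(a * h[i] for a, h in zip(weights, embeddings)) for i in range(dim)]
    return pooled, weights


def slide_probability(pooled: Vector, clf_w: Vector, clf_b: float) -> float:
    """Logistic classifier on the pooled slide vector."""
    return 1.0 / (1.0 + math.exp(-(dot(clf_w, pooled) + clf_b)))


# Toy bag: three 2-d "embeddings"; the third one stands out.
bag = [[0.1, 0.0], [0.0, 0.1], [2.0, 2.0]]
V = [[1.0, 1.0]]   # hypothetical attention projection (hidden size 1)
w = [3.0]          # hypothetical attention vector
pooled, weights = attention_pool(bag, V, w)
prob = slide_probability(pooled, clf_w=[1.0, 1.0], clf_b=-1.0)
print([round(a, 3) for a in weights], round(prob, 3))
```

The `weights` list is what becomes the heatmap: one number per patch, summing to one, mapped back to the patch's position on the slide.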
Where to get data without a hospital
One relief for outsiders: you don't need a hospital to start. Several large pathology datasets are public and openly licensed. Camelyon16/17 covers breast-cancer lymph-node slides, PANDA covers prostate, and TCGA spans many cancer types. They come with slide-level labels, which is exactly what the weak-supervision pipeline expects. OpenSlide is the standard library for reading these gigantic files.
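For a sense of what reading these files looks like, here is a minimal sketch using the `openslide-python` bindings. It assumes the package and the underlying OpenSlide library are installed and that you pass the path of a slide you have downloaded. The helper names are mine, and the import sits inside each function only so the snippet can be loaded on a machine without OpenSlide.

```python
from typing import Tuple


def slide_dimensions(slide_path: str) -> Tuple[int, int]:
    """Return (width, height) of the slide at full resolution (level 0)."""
    import openslide  # third-party: pip install openslide-python

    with openslide.OpenSlide(slide_path) as slide:
        return slide.dimensions


def read_tile(slide_path: str, x: int, y: int, size: int = 256, level: int = 0):
    """Return one size x size tile as an RGB PIL image.

    (x, y) is the tile's top-left corner in level-0 pixel coordinates.
    """
    import openslide  # third-party: pip install openslide-python

    with openslide.OpenSlide(slide_path) as slide:
        region = slide.read_region((x, y), level, (size, size))  # RGBA image
        return region.convert("RGB")
```

Opening the file once per tile is wasteful. A real loop would keep one `OpenSlide` handle open and call `read_region` repeatedly.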
The part they spent the most time on: not the model, the validation
This surprised me, and it's the most transferable lesson. The presenters spent less time on architecture than on how you check the result, because this is where pathology models quietly fail.
The headline numbers look great. Virchow reported an AUC around 0.949 for detecting cancer across seventeen cancer types, and held up on rarer ones. AUC (area under the ROC curve) measures how well the model separates positive from negative cases, where 1.0 is perfect and 0.5 is a coin flip. Read as a ranking, 0.949 means that if you pick one cancer slide and one healthy slide at random, the model scores the cancer slide higher about 95 percent of the time.
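AUC has a ranking interpretation that is easy to verify by brute force: it is the fraction of (positive, negative) pairs in which the positive case gets the higher score, with ties counting half. A minimal sketch with made-up scores follows. For real evaluation you would use a library routine such as scikit-learn's `roc_auc_score`.

```python
from typing import List


def pairwise_auc(pos_scores: List[float], neg_scores: List[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up model scores for three cancer slides and four healthy slides.
cancer = [0.9, 0.8, 0.4]
healthy = [0.7, 0.3, 0.2, 0.1]

# 11 of the 12 pairs are ordered correctly; only (0.4, 0.7) is not.
print(round(pairwise_auc(cancer, healthy), 3))  # 0.917
```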
Then came the warnings. Split your data by patient and by hospital, never by random tiles, or patches from the same slide leak between train and test and your score becomes fiction. Validate on slides from a hospital the model never trained on, because a different scanner and a different lab's staining can look foreign enough to fool it. Report more than AUC alone: sensitivity, specificity, and especially the false-negative rate at a high-sensitivity operating point, because in cancer, missing a positive is far worse than flagging an extra slide for review. And don't over-trust the pretty attention heatmap; a pathologist has to confirm the highlighted regions are biologically sensible, because a model can land on an artifact or a smudge and still score well.
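Two of those warnings translate directly into a few lines of code: split by group rather than by tile, and report rates at a chosen threshold. The sketch below is my own illustration on a made-up slide table. The field names (`patient`, `hospital`), scores, and thresholds are all assumptions.

```python
from typing import Dict, List, Set, Tuple

Slide = Dict[str, str]


def split_by_group(slides: List[Slide], key: str,
                   test_values: Set[str]) -> Tuple[List[Slide], List[Slide]]:
    """Send every slide whose `key` is in test_values to test, the rest to train."""
    train = [s for s in slides if s[key] not in test_values]
    test = [s for s in slides if s[key] in test_values]
    return train, test


def shared_patients(train: List[Slide], test: List[Slide]) -> Set[str]:
    """Patients present on both sides of the split. This should be empty."""
    return {s["patient"] for s in train} & {s["patient"] for s in test}


def rates_at_threshold(scores: List[float], labels: List[int],
                       threshold: float) -> Dict[str, float]:
    """Sensitivity, specificity, and false-negative rate when
    'score >= threshold' is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_negative_rate": fn / (tp + fn),
    }


slides = [
    {"slide": "s1", "patient": "p1", "hospital": "A"},
    {"slide": "s2", "patient": "p1", "hospital": "A"},
    {"slide": "s3", "patient": "p2", "hospital": "A"},
    {"slide": "s4", "patient": "p3", "hospital": "B"},
]
train, test = split_by_group(slides, "hospital", {"B"})  # hold out hospital B
leak = shared_patients(train, test)

scores = [0.95, 0.60, 0.30, 0.70, 0.20, 0.10]
labels = [1, 1, 1, 0, 0, 0]
default_point = rates_at_threshold(scores, labels, threshold=0.5)
cautious_point = rates_at_threshold(scores, labels, threshold=0.25)
print(leak, default_point, cautious_point)
```

Lowering the threshold from 0.5 to 0.25 catches the third cancer slide in this toy data without costing any specificity. Real data will not be that kind, which is exactly why the trade-off needs reporting.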
What I took away as a non-medical practitioner
Strip out the biology and the pipeline is familiar. A giant image gets tiled into patches. A pretrained foundation model turns each patch into an embedding. A small model learns from weak, slide-level labels to combine those embeddings into a diagnosis, and its attention doubles as an explanation a doctor can inspect. The hard part isn't the model; it's proving the model works on slides it has never seen.
The recipe travels well past cancer. Any domain with enormous images and scarce, coarse labels (satellite imagery, industrial inspection, materials science) can borrow it directly. The foundation model does the seeing. You do the smaller, more careful work of teaching it what to decide, and the even more careful work of checking that it decided for the right reasons.
One line the webinar kept returning to, which I'll pass on: none of this replaces the pathologist. The realistic goal is a second reader that flags suspicious slides and points to where it's worried, so a human spends attention where it counts. If you build in this space, that framing matters as much as the AUC.
I walked in unable to read a slide, and I still can't. But I walked out able to build the pipeline and, more usefully, able to tell a good result from a fragile one.



