Practical Exercises, IOAI Survival Kit
By Costin-Alexandru Deonise
Contact: costin.deonise@upb.ro
These are "pocket-sized" versions of the patterns discussed in the talk. The goal isn't to get impressive scores; it's to feel why the technique works.
Recommended order: run each exercise in full (including the solution), then try the suggested variations at the end.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, decomposition, cluster, neighbors, linear_model, svm, metrics

rng = np.random.default_rng(42)
%matplotlib inline
```

Exercise 1: "Lost in Hyperspace", feature engineering vs. raw baseline
Pattern: the data has a hidden geometric structure that a simple linear model can't "see" directly from the raw coordinates, but a single well-chosen feature makes it instantly visible.
Task: we have points placed on a circle (in 2D, hidden in a noisy space), and the label is their angle on the circle, generated in the interval (-pi, pi], exactly the interval that atan2 returns. This isn't a coincidence: a circular label always has a "break point" somewhere, and we choose to put it in the same place as the one in our feature. Otherwise we'd create an artificial discontinuity that no linear model could ever cross.
We compare:
- linear regression on the raw coordinates (lazy baseline)
- linear regression on a single engineered feature: the angle `atan2(y, x)`
Look at the difference in error and think about why the raw baseline can't "guess" the structure.
```python
# 1. Generate synthetic data: points on a circle plus noise, the label is the angle (in radians)
n = 400
true_angle = rng.uniform(-np.pi, np.pi, size=n)
radius = 5.0
X_clean = np.stack([radius * np.cos(true_angle), radius * np.sin(true_angle)], axis=1)
noise = rng.normal(scale=0.6, size=X_clean.shape)
X = X_clean + noise
y = true_angle  # what we want to predict

plt.figure(figsize=(4, 4))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="twilight", s=12)
plt.title("Observed points (color = true angle, in (-pi, pi])")
plt.gca().set_aspect("equal")
plt.show()
```

Output:
```text
<Figure size 400x400 with 1 Axes>
```

```python
# 2. Lazy baseline: linear regression directly on (x, y)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
baseline = linear_model.LinearRegression().fit(X_train, y_train)
pred_baseline = baseline.predict(X_test)
err_baseline = metrics.mean_absolute_error(y_test, pred_baseline)
print(f"Baseline error (raw coordinates): MAE = {err_baseline:.3f} radians")

# 3. Engineered feature: a single number, the angle computed geometrically
feat_train = np.arctan2(X_train[:, 1], X_train[:, 0]).reshape(-1, 1)
feat_test = np.arctan2(X_test[:, 1], X_test[:, 0]).reshape(-1, 1)
engineered = linear_model.LinearRegression().fit(feat_train, y_train)
pred_engineered = engineered.predict(feat_test)
err_engineered = metrics.mean_absolute_error(y_test, pred_engineered)
print(f"Error with engineered feature (atan2): MAE = {err_engineered:.3f} radians")
print(f"\n-> Error reduction: {(1 - err_engineered / err_baseline) * 100:.1f}%")
```

Output:
```text
Baseline error (raw coordinates): MAE = 0.843 radians
Error with engineered feature (atan2): MAE = 0.134 radians

-> Error reduction: 84.1%
```

Solution / what you should notice
Linear regression on raw (x, y) has a large error: a linear function of the two coordinates can't approximate the angle, which wraps around the circle. A single geometric feature (atan2(y, x)), on the other hand, turns the problem into an almost perfectly linear relation (in fact, identity plus noise), and the error drops sharply, usually by 60-80% or more (84.1% in the run above).
Why it mattered to generate the angle in (-pi, pi]: a circular label always has a "jump" somewhere, from the top edge of the interval to the bottom one. If that jump isn't in the same place as the natural jump of atan2, linear regression on the engineered feature has to cross an artificial discontinuity, and comes out, counterintuitively, worse than the raw baseline. If you're curious how that looks in practice, switch the generation interval back to [0, 2*np.pi) and run it again: you'll see exactly that "counterintuitive" effect.
Variations to try:
- Increase `noise` (the `scale=0.6` argument). At what noise level does the engineered feature lose its edge?
- Replace `LinearRegression` with `RandomForestRegressor` on the raw data. Does it catch up to the model that uses the engineered feature? At what cost (training time, data needed)?
- (advanced) Switch the angle generation interval back to `[0, 2*pi)` and watch how the engineered feature "breaks", then think about how you'd fix it without changing the interval (hint: after generating the label, you can recompute it as `np.arctan2(np.sin(y), np.cos(y))` to bring it into the same interval as the feature).
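
The hint in the last variation can be checked in isolation. The following is a minimal sketch, assuming labels drawn from [0, 2*pi); the names `rng_demo`, `y_wide` and `y_wrapped` are illustrative and not part of the notebook.

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # separate generator, so the notebook's rng is untouched

# Labels generated in [0, 2*pi), i.e. NOT the interval that atan2 returns
y_wide = rng_demo.uniform(0.0, 2 * np.pi, size=1000)

# Wrap them into the atan2 interval without changing the angle they represent
y_wrapped = np.arctan2(np.sin(y_wide), np.cos(y_wide))

# Same angles: each label differs from its wrapped version by a whole number of turns
turns = (y_wide - y_wrapped) / (2 * np.pi)
print("range after wrapping:", y_wrapped.min(), y_wrapped.max())
print("whole turns removed:", np.unique(np.round(turns)))
```

Because the wrapped label and the `atan2` feature now share the same break point, regressing one on the other should be close to the identity again.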
Exercise 2: "Help BOBAI", an embedding router for new classes
Pattern: a frozen encoder produces representations in a shared space. New classes, never seen during training, can be recognized from just a handful of examples ("few-shot") using distance in that space (KNN or nearest-centroid), with no retraining at all.
Task: we simulate an "encoder" as a fixed random projection, a stand-in for a frozen pretrained model. We train a KNN on only 4 classes, then ask it to recognize a brand new 5th class, having seen just 3 examples of it.
```python
# 1. "Frozen encoder" = a fixed random projection into a 16-dimensional embedding space
n_classes_seen = 4
n_per_class = 60
emb_dim = 16
raw_dim = 6

# class centers in the "raw" space (simulates distinct object categories)
class_centers_seen = rng.normal(scale=4.0, size=(n_classes_seen, raw_dim))
X_raw, y_seen = [], []
for c in range(n_classes_seen):
    pts = class_centers_seen[c] + rng.normal(scale=1.0, size=(n_per_class, raw_dim))
    X_raw.append(pts)
    y_seen += [c] * n_per_class
X_raw = np.vstack(X_raw)
y_seen = np.array(y_seen)

# "frozen encoder" = a fixed linear projection plus a nonlinearity (it never changes again)
W = rng.normal(size=(raw_dim, emb_dim))

def frozen_encoder(x):
    return np.tanh(x @ W)

emb_seen = frozen_encoder(X_raw)
print("Embeddings for the known classes:", emb_seen.shape)
```

Output:
```text
Embeddings for the known classes: (240, 16)
```

```python
# 2. Train a simple KNN over the embeddings (NOT over the raw data!)
router = neighbors.KNeighborsClassifier(n_neighbors=5).fit(emb_seen, y_seen)

# 3. A NEW, unseen class shows up, so we generate a few "few-shot" examples
new_center = rng.normal(scale=4.0, size=raw_dim)
X_new_support = new_center + rng.normal(scale=1.0, size=(3, raw_dim))  # only 3 examples!
X_new_query = new_center + rng.normal(scale=1.0, size=(20, raw_dim))   # the rest, to "test" on
emb_support = frozen_encoder(X_new_support)
emb_query = frozen_encoder(X_new_query)

# 4. The "no retraining" strategy: add the centroid of the new examples as a 5th "anchor"
# (the centroid is repeated 5 times so it can win a 5-neighbor majority vote on its own)
new_class_id = n_classes_seen
centroid_new = emb_support.mean(axis=0, keepdims=True)
all_anchors = np.vstack([emb_seen, np.repeat(centroid_new, 5, axis=0)])
all_labels = np.concatenate([y_seen, np.full(5, new_class_id)])
router_extended = neighbors.KNeighborsClassifier(n_neighbors=5).fit(all_anchors, all_labels)

pred = router_extended.predict(emb_query)
acc_new_class = (pred == new_class_id).mean()
print(f"Share of the NEW class examples recognized correctly, having seen only 3 of them: {acc_new_class:.0%}")
print("(the base model was never retrained, only the embedding was reused)")
```

Output:
```text
Share of the NEW class examples recognized correctly, having seen only 3 of them: 95%
(the base model was never retrained, only the embedding was reused)
```

Solution / what you should notice
Even though the "encoder" never saw the new class, the distance in embedding space between the 3 new examples and the test examples of the same class stays small. In this simulation that happens because the fixed projection maps nearby raw points to nearby embeddings; with a real pretrained encoder, it happens because the encoder learned a general representation rather than one specific to the training classes. Accuracy on the new class should sit well above chance (1/5 = 20%), usually above 70%.
Variations to try:
- Change `n_neighbors`. What happens with very few or very many neighbors?
- Use `nearest-centroid` (distance to the centroid) instead of KNN. Is it more stable with only 3 examples?
- Increase the noise `scale`. At what point does the router "break"?
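
For the nearest-centroid variation, one possible starting point is sketched below. It is written in plain NumPy and works on toy embeddings generated directly (it bypasses `frozen_encoder`); the helper `nearest_centroid_predict`, the noise scale 0.5 and all `*_demo` names are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def nearest_centroid_predict(centroids, queries):
    """Return, for each query row, the index of the closest centroid (Euclidean)."""
    # (n_queries, n_classes) matrix of distances, built by broadcasting
    dists = np.linalg.norm(queries[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

rng_demo = np.random.default_rng(0)
dim = 16

# Toy embeddings: 4 "seen" classes plus 1 new class, assumed to be well separated
centers_demo = rng_demo.normal(scale=4.0, size=(5, dim))
seen_demo = [centers_demo[c] + rng_demo.normal(scale=0.5, size=(60, dim)) for c in range(4)]
support_demo = centers_demo[4] + rng_demo.normal(scale=0.5, size=(3, dim))   # 3-shot support set
query_demo = centers_demo[4] + rng_demo.normal(scale=0.5, size=(20, dim))

# One centroid per class; adding a class means appending one row, nothing is refit
centroids_demo = np.stack([s.mean(axis=0) for s in seen_demo] + [support_demo.mean(axis=0)])
pred_demo = nearest_centroid_predict(centroids_demo, query_demo)
acc_new_demo = (pred_demo == 4).mean()
print(f"new-class accuracy with nearest-centroid: {acc_new_demo:.0%}")
```

Compared with the KNN router, there is no need to repeat the new centroid several times: each class contributes exactly one anchor, so a 3-example class and a 60-example class carry the same weight.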
Exercise 3: "Restroom", zero-shot matching through cosine similarity
Pattern: instead of training a classifier, you encode the image and the candidate label into the same space (the way CLIP does), and you pick the label whose vector sits "closest" to the image's, with no training examples for that category at all.
Task: we simulate a shared "image-text" space (a stand-in for CLIP) and do zero-shot matching between a handful of "items" and a list of candidate labels, using only cosine similarity.
```python
# 1. Simulate a shared image-text space: each "concept" has an anchor vector,
# and the images and labels tied to that concept land close to its anchor
# (CLIP's key property: different modalities, the same space)
concepts = ["cat", "car", "mountain", "cake", "robot"]
shared_dim = 32
concept_anchor = {c: rng.normal(size=shared_dim) for c in concepts}

def embed_near(anchor, scale=0.4):
    v = anchor + rng.normal(scale=scale, size=shared_dim)
    return v / np.linalg.norm(v)

# "images" (items to match), each one comes from a concept, but we don't know its label
true_concepts = rng.choice(concepts, size=8)
image_embeddings = np.stack([embed_near(concept_anchor[c]) for c in true_concepts])

# candidate "text labels", embeddings of the concept names in the same space
label_embeddings = {c: embed_near(concept_anchor[c], scale=0.1) for c in concepts}
label_names = list(label_embeddings.keys())
label_matrix = np.stack([label_embeddings[c] for c in label_names])

# 2. Zero-shot matching: cosine similarity between each image and every label
# (the vectors are already normalized, so cosine similarity = dot product)
similarities = image_embeddings @ label_matrix.T  # (n_images, n_labels)
predicted_idx = similarities.argmax(axis=1)
predicted_labels = [label_names[i] for i in predicted_idx]

correct = sum(p == t for p, t in zip(predicted_labels, true_concepts))
print(f"{correct}/{len(true_concepts)} correct matches, with NO training examples at all\n")
for i, (true_c, pred_c) in enumerate(zip(true_concepts, predicted_labels)):
    mark = "OK " if true_c == pred_c else "X  "
    print(f"{mark} item {i}: true = {true_c:8s} | predicted (zero-shot) = {pred_c}")
```

Output:
```text
8/8 correct matches, with NO training examples at all

OK  item 0: true = cat      | predicted (zero-shot) = cat
OK  item 1: true = cat      | predicted (zero-shot) = cat
OK  item 2: true = cake     | predicted (zero-shot) = cake
OK  item 3: true = cat      | predicted (zero-shot) = cat
OK  item 4: true = car      | predicted (zero-shot) = car
OK  item 5: true = mountain | predicted (zero-shot) = mountain
OK  item 6: true = car      | predicted (zero-shot) = car
OK  item 7: true = cat      | predicted (zero-shot) = cat
```

Solution / what you should notice
Without training any classifier, simply picking the label with the highest cosine similarity solves the task almost perfectly, as long as the shared space is well built (something CLIP aims for through contrastive training on image-text pairs). That's exactly the "trick" behind any "Restroom"-type task: you don't need a new category in the training set, you just need a name for it, encoded in the same space.
Variations to try:
- Increase `scale` when generating the images. At what noise level do mistakes start showing up?
- Add a "trap" candidate label that's semantically very close to another one. How often does the model mix the two up?
- Replace `argmax` with a minimum similarity threshold. How do you handle the case where "none of the labels matches well enough"?
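
The last variation can be approached as in the sketch below: keep the `argmax`, but return no label when even the best similarity is too low. The function name `match_with_reject`, the threshold `min_sim=0.5` and the hand-made 3-dimensional vectors are assumptions chosen for illustration; a sensible threshold for a real model has to be tuned on held-out data.

```python
import numpy as np

def match_with_reject(image_emb, label_emb, label_names, min_sim=0.5):
    """Zero-shot matching that answers None when no label is similar enough."""
    sims = image_emb @ label_emb.T                  # cosine similarity for unit-norm rows
    best = sims.argmax(axis=1)                      # best label index per image
    best_sim = sims[np.arange(len(sims)), best]     # similarity of that best label
    return [label_names[i] if s >= min_sim else None for i, s in zip(best, best_sim)]

# Tiny hand-made example (values chosen for illustration only)
labels_demo = ["cat", "car"]
label_emb_demo = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])
image_emb_demo = np.array([[0.9, 0.1, 0.0],    # close to "cat"
                           [0.0, 0.0, 1.0]])   # orthogonal to both labels
image_emb_demo /= np.linalg.norm(image_emb_demo, axis=1, keepdims=True)

result_demo = match_with_reject(image_emb_demo, label_emb_demo, labels_demo)
print(result_demo)  # → ['cat', None]
```

The second item is rejected because its best similarity is 0, well under the threshold; plain `argmax` would have silently labeled it "cat".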
Exercise 4: "Antique", clustering, pseudo-labels and a final classifier
Pattern: when you don't have labels, you first uncover structure through clustering, map the clusters to labels using a handful of known examples, propagate the pseudo-labels over everything else, then train a standard classifier on those pseudo-labels.
Task: unsupervised data with 3 natural groups, and we only have 5 labeled examples per group. We build the whole pipeline and measure how well it works, and what happens when the cluster-to-label mapping is wrong.
```python
# 1. Synthetic data with 3 natural clusters (think: 3 types of objects)
X_clust, y_true = datasets.make_blobs(n_samples=300, centers=3, cluster_std=1.4, random_state=7)
label_names_4 = ["amphora", "vase", "pitcher"]

plt.figure(figsize=(4, 4))
plt.scatter(X_clust[:, 0], X_clust[:, 1], c=y_true, cmap="viridis", s=14)
plt.title("Unsupervised data (color shows the true group, unknown to us)")
plt.show()
```

Output:
```text
<Figure size 400x400 with 1 Axes>
```

```python
# 2. Step 1: unsupervised clustering (we don't touch the labels!)
km = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_clust)
cluster_id = km.labels_

# 3. Step 2: map each cluster to a label, using ONLY 5 known examples per class
rng2 = np.random.default_rng(0)
known_idx = np.concatenate([
    rng2.choice(np.where(y_true == c)[0], size=5, replace=False) for c in range(3)
])
mapping = {}
for cl in range(3):
    members = known_idx[cluster_id[known_idx] == cl]
    if len(members) == 0:
        mapping[cl] = rng2.integers(0, 3)  # fallback, in case no known example lands here
        continue
    # the majority label among the known examples that landed in this cluster
    votes = y_true[members]
    mapping[cl] = np.bincount(votes, minlength=3).argmax()
print("Cluster -> label mapping:", {k: label_names_4[v] for k, v in mapping.items()})

# 4. Step 3: propagate the pseudo-labels everywhere
pseudo_labels = np.array([mapping[c] for c in cluster_id])
mapping_quality = (pseudo_labels == y_true).mean()
print(f"Mapping accuracy on the WHOLE set (you'd never see this in real life!): {mapping_quality:.0%}")

# 5. Step 4: train a final classifier on the pseudo-labels
X_train, X_test, py_train, py_test, ty_train, ty_test = train_test_split(
    X_clust, pseudo_labels, y_true, test_size=0.3, random_state=1
)
clf = svm.SVC(kernel="rbf").fit(X_train, py_train)
acc_vs_pseudo = clf.score(X_test, py_test)
acc_vs_truth = clf.score(X_test, ty_test)
print(f"Final classifier accuracy vs. pseudo-labels: {acc_vs_pseudo:.0%}")
print(f"Final classifier accuracy vs. ground truth: {acc_vs_truth:.0%}")
```

Output:
```text
Cluster -> label mapping: {0: 'amphora', 1: 'pitcher', 2: 'vase'}
Mapping accuracy on the WHOLE set (you'd never see this in real life!): 100%
Final classifier accuracy vs. pseudo-labels: 100%
Final classifier accuracy vs. ground truth: 99%
```

Solution / what you should notice
If the cluster structure reflects the real categories well, the mapping learned from just 5 examples per class is enough to propagate correct labels over hundreds of points, and the final classifier ends up close to the accuracy you'd have gotten with all the labels. But take a look at discussion question 11: if the mapping gets a single correspondence wrong (cluster X mapped to the wrong label), the error propagates systematically through everything, and the final classifier learns "with full confidence" exactly that mistake.
Variations to try:
- Intentionally force a wrong mapping (swap two entries in `mapping`) and run it again. How much does `acc_vs_truth` drop? Does `acc_vs_pseudo` stay just as high?
- Increase `cluster_std`. At what point do the clusters start overlapping and the mapping becomes unreliable?
- Replace `KMeans` with `AgglomerativeClustering` or `SpectralClustering`. Does the quality change?
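
To see the size of the effect in the first variation before touching the full pipeline, here is a small sketch under an idealized assumption: the clustering is perfect, the classes are balanced, and cluster i holds exactly class i. The names `y_true_demo`, `cluster_id_demo` and `pseudo_label_accuracy` are illustrative and not part of the notebook.

```python
import numpy as np

# Idealized setup (an assumption): perfect clustering, 100 points per class
y_true_demo = np.repeat([0, 1, 2], 100)
cluster_id_demo = y_true_demo.copy()

def pseudo_label_accuracy(mapping):
    """Accuracy of the propagated pseudo-labels against the true labels."""
    pseudo = np.array([mapping[c] for c in cluster_id_demo])
    return (pseudo == y_true_demo).mean()

good_mapping = {0: 0, 1: 1, 2: 2}
bad_mapping = {0: 1, 1: 0, 2: 2}   # two entries swapped on purpose

acc_good = pseudo_label_accuracy(good_mapping)
acc_bad = pseudo_label_accuracy(bad_mapping)
print(f"correct mapping: {acc_good:.0%}, swapped mapping: {acc_bad:.0%}")
# → correct mapping: 100%, swapped mapping: 33%
```

A single swap mislabels two whole clusters, not a few stray points. A classifier trained on those pseudo-labels would be expected to reproduce them faithfully, which is why agreement with the pseudo-labels can stay high while agreement with the ground truth collapses.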
Exercise 5: "Concepts", strict-schema generation (guided decoding, simulated)
Pattern: when you need guaranteed-valid output from a generative model (JSON, labels from a fixed list, a strict format), you constrain the generation process itself instead of hoping the model "lands on" the right format.
Task: we don't have a real LLM here, so we simulate the idea directly: a noisy "generator" produces free-form text. We compare (a) naive parsing of that free output with (b) a guided generation function that always picks from the valid set, which is roughly the role that Pydantic validation plus re-prompting would play in a real pipeline.
```python
import re
import json

# Our strict schema: { "category": one of [...], "score": an integer from 1 to 5 }
VALID_CATEGORIES = ["tool", "toy", "food", "vehicle"]

# 1. A noisy "generator", simulating an LLM that occasionally goes off script
templates_buggy = [
    '{{"category": "{cat}", "score": {score}}}',
    "I think it's a {cat}, I'd give it a score of {score}.",  # falls outside the schema
    '{{"category": "{cat}", "score": "{score} out of 5"}}',   # wrong type for the score
    '{{"category": "{cat_bad}", "score": {score}}}',          # made-up category
]

def noisy_generate(seed):
    r = np.random.default_rng(seed)
    cat = r.choice(VALID_CATEGORIES)
    cat_bad = cat + "_extra"  # a category outside the schema
    score = int(r.integers(1, 6))
    template = r.choice(templates_buggy, p=[0.4, 0.25, 0.2, 0.15])
    return template.format(cat=cat, cat_bad=cat_bad, score=score)

samples = [noisy_generate(i) for i in range(10)]
print("Sample of raw output (free generation):")
for s in samples:
    print("  ", s)
```

Output:
```text
Sample of raw output (free generation):
   {"category": "vehicle", "score": 4}
   {"category": "toy_extra", "score": 3}
   {"category": "vehicle", "score": 2}
   {"category": "vehicle", "score": 1}
   I think it's a food, I'd give it a score of 5.
   {"category": "food", "score": "5 out of 5"}
   {"category": "toy", "score": 3}
   {"category": "vehicle_extra", "score": 4}
   {"category": "food_extra", "score": 2}
   {"category": "toy", "score": 5}
```

```python
# 2a. Naive parsing: just try json.loads directly
def parse_naive(text):
    try:
        obj = json.loads(text)
        assert obj["category"] in VALID_CATEGORIES
        assert isinstance(obj["score"], int) and 1 <= obj["score"] <= 5
        return obj
    except Exception:
        return None

naive_results = [parse_naive(s) for s in samples]
naive_valid = sum(r is not None for r in naive_results)
print(f"Naive parsing: {naive_valid}/{len(samples)} responses valid against the schema")

# 2b. GUIDED generation: instead of parsing whatever comes out, we restrict the choices to the valid set
# (the practical equivalent of a grammar, or a schema validator that re-requests fields)
def guided_generate(seed):
    r = np.random.default_rng(seed)
    return {"category": str(r.choice(VALID_CATEGORIES)), "score": int(r.integers(1, 6))}

guided_results = [guided_generate(i) for i in range(10)]
guided_valid = sum(parse_naive(json.dumps(r)) is not None for r in guided_results)
print(f"Guided generation: {guided_valid}/{len(guided_results)} responses valid against the schema")
```

Output:
```text
Naive parsing: 5/10 responses valid against the schema
Guided generation: 10/10 responses valid against the schema
```

Solution / what you should notice
Naive parsing fails often: the generator's free-form output is valid as plain text, but structurally invalid (made-up categories, wrong types, extra text around the JSON). Guided generation, on the other hand, leaves no room for structural error, because it picks directly from a predefined set, exactly what a decoder constrained by a grammar or a JSON schema does on top of a real LLM.
That's also the trade-off behind discussion question 14: we gained 100% validity, but we also narrowed the model's "voice" down to a fixed list. In a real setting, the schema needs to be wide enough to let the model express its correct understanding of the task.
Variations to try:
- Add a fifth valid category to the schema, but not to `templates_buggy`. How does the naive validation rate change?
- Write a `repair(text)` function that tries to "fix" invalid responses (for example, pull the number out of a string like `"4 out of 5"`) before rejecting them. How much does the success rate go up?
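
One possible shape for the `repair(text)` variation is sketched below. It only handles the "wrong type for the score" failure and deliberately rejects prose and made-up categories rather than guessing; those choices, and the regular expression used, are assumptions of this sketch rather than the notebook's reference solution.

```python
import json
import re

VALID_CATEGORIES = ["tool", "toy", "food", "vehicle"]

def repair(text):
    """Try to coerce a response into the schema; return None if that fails."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None                          # free-form prose: not handled in this sketch
    if not isinstance(obj, dict) or obj.get("category") not in VALID_CATEGORIES:
        return None                          # made-up categories are rejected, not guessed
    score = obj.get("score")
    if isinstance(score, str):
        match = re.match(r"\s*(\d+)", score)  # leading integer, e.g. "4 out of 5"
        score = int(match.group(1)) if match else None
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return None
    return {"category": obj["category"], "score": score}

print(repair('{"category": "food", "score": "5 out of 5"}'))     # → {'category': 'food', 'score': 5}
print(repair('{"category": "toy_extra", "score": 3}'))           # → None
print(repair("I think it's a food, I'd give it a score of 5."))  # → None
```

Repair raises the success rate only for failures that keep the intended value recoverable; a made-up category carries no trustworthy signal, which is the case where constraining generation itself pays off.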
Quick recap
| Exercise | Central pattern | Associated discussion question |
|---|---|---|
| 1. Lost in Hyperspace | a good feature beats a bigger model | Q10 |
| 2. Help BOBAI | a router over embeddings, no retraining | Q7, Q8 |
| 3. Restroom | zero-shot matching through similarity | Q9 |
| 4. Antique | clustering, pseudo-labels, classifier | Q11 |
| 5. Concepts | strict-schema generation | Q14 |
Homework
The four exercises below revisit the patterns from the notebook, but ask you to write new code rather than just tweak a parameter. There's no single correct solution here; the point is to practice the reflex of "read the constraints, pick the pattern, build a quick baseline".
1. "Mirror World" (a 3D variant of Lost in Hyperspace)
Generate points on the surface of a sphere (instead of a circle), with noise, where the label is the polar angle (the angle from the pole, also called the colatitude). Build a baseline (regression on the raw x, y, z coordinates) and an engineered feature (hint: arccos(z / vector norm)). How much does the error drop? Does it work as well as in 2D, or do new complications show up?
2. "New Species" (a variant with two new classes at once)
Start from the router in Exercise 2, but add TWO new classes at the same time, each with only 2-3 examples. Does the router mix them up with each other more often than it mixes them up with the old classes? Build a confusion matrix and explain what you observe.
3. "Mystery Boxes" (choosing the number of clusters without "cheating")
Run the pipeline from Exercise 4 again, but this time pick the number of clusters k without looking at y_true, using the silhouette score or the "elbow" method. Compare the final result (acc_vs_truth) when you pick k correctly versus when you pick k wrong by plus or minus 1. How much does the choice of k actually matter?
4. "Strict Diary" (your own guided-decoding schema)
Design a JSON schema for a different domain, say a "daily journal entry": {"mood": one of [...], "hours_slept": an integer from 0 to 12, "activities": a list of strings}. Write a noisy generator, a naive parsing function, and a guided generation function for your schema, then report the validation rates. What changes compared to Exercise 5 once you have a list-type field instead of just scalar ones?
How to tell you've done it well: for each homework item, you should be able to answer in 2-3 sentences the question "what pattern did I practice, and why did it work (or not)?", the same way we did in the solutions above.