Practical Exercises - IOAI Survival Kit

By Costin-Alexandru Deonise

Contact: costin.deonise@upb.ro

These are "pocket-sized" versions of the patterns discussed in the talk. The goal isn't to get impressive scores; it's to feel why each technique works.

Recommended order: run each exercise in full (including the solution), then try the suggested variations at the end.

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, decomposition, cluster, neighbors, linear_model, svm, metrics

rng = np.random.default_rng(42)
%matplotlib inline

Exercise 1: "Lost in Hyperspace", feature engineering vs. raw baseline

Pattern: the data has a hidden geometric structure that a simple linear model can't "see" directly from the raw coordinates, but a single well-chosen feature makes it instantly visible.

Task: we have points placed on a circle (in 2D, hidden in a noisy space), and the label is their angle on the circle, generated in the interval (-pi, pi], exactly the interval that atan2 returns. This isn't a coincidence: a circular label always has a "break point" somewhere, and we choose to put it in the same place as the one in our feature. Otherwise we'd create an artificial discontinuity that no linear model could ever cross.

We compare:

linear regression on the raw coordinates (lazy baseline)
linear regression on a single engineered feature: the angle atan2(y, x)

Look at the difference in error and think about why the raw baseline can't "guess" the structure.

Python
# 1. Generate synthetic data: points on a circle plus noise, the label is the angle (in radians)
n = 400
true_angle = rng.uniform(-np.pi, np.pi, size=n)
radius = 5.0
X_clean = np.stack([radius * np.cos(true_angle), radius * np.sin(true_angle)], axis=1)
noise = rng.normal(scale=0.6, size=X_clean.shape)
X = X_clean + noise
y = true_angle  # what we want to predict

plt.figure(figsize=(4, 4))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="twilight", s=12)
plt.title("Observed points (color = true angle, in (-pi, pi])")
plt.gca().set_aspect("equal")
plt.show()

Output:

<Figure size 400x400 with 1 Axes>

Python
# 2. Lazy baseline: linear regression directly on (x, y)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

baseline = linear_model.LinearRegression().fit(X_train, y_train)
pred_baseline = baseline.predict(X_test)
err_baseline = metrics.mean_absolute_error(y_test, pred_baseline)
print(f"Baseline error (raw coordinates):        MAE = {err_baseline:.3f} radians")

# 3. Engineered feature: a single number, the angle computed geometrically
feat_train = np.arctan2(X_train[:, 1], X_train[:, 0]).reshape(-1, 1)
feat_test = np.arctan2(X_test[:, 1], X_test[:, 0]).reshape(-1, 1)

engineered = linear_model.LinearRegression().fit(feat_train, y_train)
pred_engineered = engineered.predict(feat_test)
err_engineered = metrics.mean_absolute_error(y_test, pred_engineered)
print(f"Error with engineered feature (atan2):   MAE = {err_engineered:.3f} radians")
print(f"\n-> Error reduction: {(1 - err_engineered / err_baseline) * 100:.1f}%")

Output:

Baseline error (raw coordinates):        MAE = 0.843 radians
Error with engineered feature (atan2):   MAE = 0.134 radians

-> Error reduction: 84.1%

Solution / what you should notice

Linear regression on raw (x, y) has a large error: a linear function of the two coordinates (a plane) can't approximate the angle very well. A single geometric feature, atan2(y, x), on the other hand, turns the problem into an almost perfectly linear relation (in fact, identity plus noise), and the error drops sharply (84% in the run above; 60-80% is typical).

Why it mattered to generate the angle in (-pi, pi]: a circular label always has a "jump" somewhere, from the top edge of the interval to the bottom one. If that jump isn't in the same place as the natural jump of atan2, linear regression on the engineered feature would try to cross an artificial discontinuity, and would come out, counterintuitively, worse than the raw baseline. If you're curious how that looks in practice, switch the generation interval back to [0, 2*pi), that is, rng.uniform(0, 2 * np.pi, size=n), and run it again; you'll see exactly that "counterintuitive" effect.

Variations to try:

Increase noise. At what noise level does the engineered feature lose its edge?
Replace LinearRegression with RandomForestRegressor (from sklearn.ensemble) on the raw data. Does it catch up to the linear model that uses the engineered feature? At what cost (training time, data needed)?
(advanced) Switch the angle generation interval back to [0, 2*pi) and watch how the engineered feature "breaks", then think about how you'd fix it without changing the interval (hint: you can recompute the label as np.arctan2(np.sin(y), np.cos(y)) to bring it into the same interval as the feature, after generating it).
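The hint in the last variation can be checked in isolation. The snippet below is a minimal sketch, assuming noise-free points on the unit circle and labels drawn from [0, 2*pi); the helper name wrap_to_pi and the variable names are hypothetical, not part of the notebook.

```python
import numpy as np

def wrap_to_pi(angle):
    """Map angles (radians) into the range that np.arctan2 returns."""
    return np.arctan2(np.sin(angle), np.cos(angle))

rng_demo = np.random.default_rng(0)
y_raw = rng_demo.uniform(0.0, 2.0 * np.pi, size=1000)   # labels in [0, 2*pi)
y_wrapped = wrap_to_pi(y_raw)

# Noise-free points on the unit circle and their atan2 feature
x_coord, y_coord = np.cos(y_raw), np.sin(y_raw)
feature = np.arctan2(y_coord, x_coord)

# Before wrapping, the label and the feature disagree by 2*pi on the lower half-circle
gap_raw = np.abs(y_raw - feature)
# After wrapping, label and feature share the same break point
gap_wrapped = np.abs(y_wrapped - feature)

print(f"max |label - feature| before wrapping: {gap_raw.max():.3f}")
print(f"max |label - feature| after wrapping:  {gap_wrapped.max():.3e}")
```

With noisy points the gap after wrapping is no longer zero, but the label and the feature still jump in the same place, which is what the linear model needs.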

Exercise 2: "Help BOBAI", an embedding router for new classes

Pattern: a frozen encoder produces representations in a shared space. New classes, never seen during training, can be recognized from just a handful of examples ("few-shot") using distance in that space (KNN or nearest-centroid), with no retraining at all.

Task: we simulate an "encoder" as a fixed random projection, a stand-in for a frozen pretrained model. We train a KNN on only 4 classes, then ask it to recognize a brand new 5th class, having seen just 3 examples of it.

Python
# 1. "Frozen encoder" = a fixed random projection into a 16-dimensional embedding space
n_classes_seen = 4
n_per_class = 60
emb_dim = 16
raw_dim = 6

# class centers in the "raw" space (simulates distinct object categories)
class_centers_seen = rng.normal(scale=4.0, size=(n_classes_seen, raw_dim))
X_raw, y_seen = [], []
for c in range(n_classes_seen):
    pts = class_centers_seen[c] + rng.normal(scale=1.0, size=(n_per_class, raw_dim))
    X_raw.append(pts)
    y_seen += [c] * n_per_class
X_raw = np.vstack(X_raw)
y_seen = np.array(y_seen)

# "frozen encoder" = a fixed linear projection plus a nonlinearity (it never changes again)
W = rng.normal(size=(raw_dim, emb_dim))
def frozen_encoder(x):
    return np.tanh(x @ W)

emb_seen = frozen_encoder(X_raw)
print("Embeddings for the known classes:", emb_seen.shape)

Output:

Embeddings for the known classes: (240, 16)

Python
# 2. Train a simple KNN over the embeddings (NOT over the raw data!)
router = neighbors.KNeighborsClassifier(n_neighbors=5).fit(emb_seen, y_seen)

# 3. A NEW, unseen class shows up, so we generate a few "few-shot" examples
new_center = rng.normal(scale=4.0, size=raw_dim)
X_new_support = new_center + rng.normal(scale=1.0, size=(3, raw_dim))   # only 3 examples!
X_new_query   = new_center + rng.normal(scale=1.0, size=(20, raw_dim))  # the rest, to "test" on

emb_support = frozen_encoder(X_new_support)
emb_query = frozen_encoder(X_new_query)

# 4. The "no retraining" strategy: add the centroid of the new examples as a 5th "anchor"
new_class_id = n_classes_seen
centroid_new = emb_support.mean(axis=0, keepdims=True)

all_anchors = np.vstack([emb_seen, np.repeat(centroid_new, 5, axis=0)])
all_labels = np.concatenate([y_seen, np.full(5, new_class_id)])
router_extended = neighbors.KNeighborsClassifier(n_neighbors=5).fit(all_anchors, all_labels)

pred = router_extended.predict(emb_query)
acc_new_class = (pred == new_class_id).mean()
print(f"Share of the NEW class examples recognized correctly, having seen only 3 of them: {acc_new_class:.0%}")
print("(the base model was never retrained, only the embedding was reused)")

Output:

Share of the NEW class examples recognized correctly, having seen only 3 of them: 95%
(the base model was never retrained, only the embedding was reused)

Solution / what you should notice

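Even though the "encoder" never saw the new class, the distance in embedding space between the 3 new examples and the test examples of the same class stays small, because the encoder maps nearby inputs to nearby embeddings regardless of which class they come from. Here that holds by construction, since the encoder is a fixed projection; a real pretrained encoder gets the same property from having learned a general representation rather than one specific to the training classes. Accuracy on the new class should sit well above chance (1/5 = 20%), usually above 70%; the run above reaches 95%.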

Variations to try:

Change n_neighbors. What happens with very few or very many neighbors?
Use nearest-centroid (distance to the centroid) instead of KNN. Is it more stable with only 3 examples?
Increase the noise scale. At what point does the router "break"?
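For the nearest-centroid variation, one possible starting point is sketched below. It is an illustration under stated assumptions, not the notebook's solution: the embeddings are synthetic stand-ins (Gaussian blobs in 16 dimensions), and nearest_centroid_predict, rng_demo and the scale values are hypothetical choices.

```python
import numpy as np

def nearest_centroid_predict(centroids, queries):
    """Return, for each query row, the index of the closest centroid (Euclidean)."""
    # (n_queries, n_classes) distance matrix built through broadcasting
    dists = np.linalg.norm(queries[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

rng_demo = np.random.default_rng(1)
dim = 16

# Stand-in embeddings: 4 known classes with 60 examples each, 1 new class with 3
centers = rng_demo.normal(scale=2.0, size=(5, dim))
known = [centers[c] + rng_demo.normal(scale=0.3, size=(60, dim)) for c in range(4)]
support_new = centers[4] + rng_demo.normal(scale=0.3, size=(3, dim))
query_new = centers[4] + rng_demo.normal(scale=0.3, size=(20, dim))

# One centroid per class; adding a class means appending one row, nothing is refit
centroids = np.stack([k.mean(axis=0) for k in known] + [support_new.mean(axis=0)])

pred = nearest_centroid_predict(centroids, query_new)
print(f"New-class queries routed to the new class: {(pred == 4).mean():.0%}")
```

Unlike the KNN version, there is no need to repeat the new centroid five times to win the neighbor vote: every class is represented by exactly one row.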

Exercise 3: "Restroom", zero-shot matching through cosine similarity

Pattern: instead of training a classifier, you encode the image and the candidate label into the same space (the way CLIP does), and you pick the label whose vector sits "closest" to the image's, with no training examples for that category at all.

Task: we simulate a shared "image-text" space (a stand-in for CLIP) and do zero-shot matching between a handful of "items" and a list of candidate labels, using only cosine similarity.

Python
# 1. Simulate a shared image-text space: each "concept" has an anchor vector,
#    and the images and labels tied to that concept land close to its anchor
#    (CLIP's key property: different modalities, the same space)
concepts = ["cat", "car", "mountain", "cake", "robot"]
shared_dim = 32
concept_anchor = {c: rng.normal(size=shared_dim) for c in concepts}

def embed_near(anchor, scale=0.4):
    v = anchor + rng.normal(scale=scale, size=shared_dim)
    return v / np.linalg.norm(v)

# "images" (items to match), each one comes from a concept, but we don't know its label
true_concepts = rng.choice(concepts, size=8)
image_embeddings = np.stack([embed_near(concept_anchor[c]) for c in true_concepts])

# candidate "text labels", embeddings of the concept names in the same space
label_embeddings = {c: embed_near(concept_anchor[c], scale=0.1) for c in concepts}
label_names = list(label_embeddings.keys())
label_matrix = np.stack([label_embeddings[c] for c in label_names])

Python
# 2. Zero-shot matching: cosine similarity between each image and every label
#    (the vectors are already normalized, so cosine similarity = dot product)
similarities = image_embeddings @ label_matrix.T   # (n_images, n_labels)
predicted_idx = similarities.argmax(axis=1)
predicted_labels = [label_names[i] for i in predicted_idx]

correct = sum(p == t for p, t in zip(predicted_labels, true_concepts))
print(f"{correct}/{len(true_concepts)} correct matches, with NO training examples at all\n")
for i, (true_c, pred_c) in enumerate(zip(true_concepts, predicted_labels)):
    mark = "OK " if true_c == pred_c else "X  "
    print(f"{mark} item {i}: true = {true_c:8s} | predicted (zero-shot) = {pred_c}")

Output:

8/8 correct matches, with NO training examples at all

OK  item 0: true = cat      | predicted (zero-shot) = cat
OK  item 1: true = cat      | predicted (zero-shot) = cat
OK  item 2: true = cake     | predicted (zero-shot) = cake
OK  item 3: true = cat      | predicted (zero-shot) = cat
OK  item 4: true = car      | predicted (zero-shot) = car
OK  item 5: true = mountain | predicted (zero-shot) = mountain
OK  item 6: true = car      | predicted (zero-shot) = car
OK  item 7: true = cat      | predicted (zero-shot) = cat

Solution / what you should notice

Without training any classifier, simply picking the label with the highest cosine similarity solves the task almost perfectly, as long as the shared space is well built (the property CLIP is trained for, through contrastive learning on image-text pairs). That's exactly the "trick" behind any "Restroom"-type task: you don't need a new category in the training set, you just need a name for it, encoded in the same space.

Variations to try:

Increase scale when generating the images. At what noise level do mistakes start showing up?
Add a "trap" candidate label that's semantically very close to another one. How often does the model mix the two up?
Replace argmax with a minimum similarity threshold. How do you handle the case where "none of the labels matches well enough"?
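The thresholding variation could look roughly like the sketch below. It assumes unit-norm embeddings, as in the exercise; the function name match_with_rejection, the default min_sim=0.5 and the tiny 3-dimensional vectors are illustrative assumptions, and a sensible threshold would have to be tuned on real data.

```python
import numpy as np

def match_with_rejection(image_emb, label_emb, label_names, min_sim=0.5):
    """Pick the most similar label per image, or None if no label clears min_sim.

    Both matrices are assumed to have unit-norm rows, so the dot product
    equals the cosine similarity.
    """
    sims = image_emb @ label_emb.T                   # (n_images, n_labels)
    best = sims.argmax(axis=1)
    best_sim = sims[np.arange(len(sims)), best]
    return [label_names[i] if s >= min_sim else None for i, s in zip(best, best_sim)]

# Tiny hand-made example (values chosen for illustration only)
names = ["cat", "car"]
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
images = np.array([[0.9, 0.1, 0.0],    # close to "cat"
                   [0.0, 0.0, 1.0]])   # orthogonal to both labels
images = images / np.linalg.norm(images, axis=1, keepdims=True)

print(match_with_rejection(images, labels, names))   # ['cat', None]
```

Returning None makes the "no good match" case explicit, so the caller has to decide what to do with it instead of silently receiving the least bad label.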

Exercise 4: "Antique", clustering, pseudo-labels and a final classifier

Pattern: when you don't have labels, you first uncover structure through clustering, map the clusters to labels using a handful of known examples, propagate the pseudo-labels over everything else, then train a standard classifier on those pseudo-labels.

Task: unsupervised data with 3 natural groups, and we only have 5 labeled examples per group. We build the whole pipeline and measure how well it works, and what happens when the cluster-to-label mapping is wrong.

Python
# 1. Synthetic data with 3 natural clusters (think: 3 types of objects)
X_clust, y_true = datasets.make_blobs(n_samples=300, centers=3, cluster_std=1.4,
                                       random_state=7)
label_names_4 = ["amphora", "vase", "pitcher"]

plt.figure(figsize=(4, 4))
plt.scatter(X_clust[:, 0], X_clust[:, 1], c=y_true, cmap="viridis", s=14)
plt.title("Unsupervised data (color shows the true group, unknown to us)")
plt.show()

Output:

<Figure size 400x400 with 1 Axes>

Python
# 2. Step 1: unsupervised clustering (we don't touch the labels!)
km = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_clust)
cluster_id = km.labels_

# 3. Step 2: map each cluster to a label, using ONLY 5 known examples per class
rng2 = np.random.default_rng(0)
known_idx = np.concatenate([rng2.choice(np.where(y_true == c)[0], size=5, replace=False)
                            for c in range(3)])

mapping = {}
for cl in range(3):
    members = known_idx[cluster_id[known_idx] == cl]
    if len(members) == 0:
        mapping[cl] = rng2.integers(0, 3)   # fallback, in case no known example lands here
        continue
    # the majority label among the known examples that landed in this cluster
    votes = y_true[members]
    mapping[cl] = np.bincount(votes, minlength=3).argmax()

print("Cluster -> label mapping:", {k: label_names_4[v] for k, v in mapping.items()})

# 4. Step 3: propagate the pseudo-labels everywhere
pseudo_labels = np.array([mapping[c] for c in cluster_id])
mapping_quality = (pseudo_labels == y_true).mean()
print(f"Mapping accuracy on the WHOLE set (you'd never see this in real life!): {mapping_quality:.0%}")

# 5. Step 4: train a final classifier on the pseudo-labels
X_train, X_test, py_train, py_test, ty_train, ty_test = train_test_split(
    X_clust, pseudo_labels, y_true, test_size=0.3, random_state=1)

clf = svm.SVC(kernel="rbf").fit(X_train, py_train)
acc_vs_pseudo = clf.score(X_test, py_test)
acc_vs_truth = clf.score(X_test, ty_test)
print(f"Final classifier accuracy vs. pseudo-labels: {acc_vs_pseudo:.0%}")
print(f"Final classifier accuracy vs. ground truth:  {acc_vs_truth:.0%}")

Output:

Cluster -> label mapping: {0: 'amphora', 1: 'pitcher', 2: 'vase'}
Mapping accuracy on the WHOLE set (you'd never see this in real life!): 100%
Final classifier accuracy vs. pseudo-labels: 100%
Final classifier accuracy vs. ground truth:  99%

Solution / what you should notice

If the cluster structure reflects the real categories well, the mapping learned from just 5 examples per class is enough to propagate correct labels over hundreds of points, and the final classifier ends up close to the accuracy you'd have gotten with all the labels. But take a look at discussion question 11: if the mapping gets a single correspondence wrong (cluster X mapped to the wrong label), the error propagates systematically through everything, and the final classifier learns "with full confidence" exactly that mistake.

Variations to try:

Intentionally force a wrong mapping (swap two entries in mapping) and run it again. How much does acc_vs_truth drop? Does acc_vs_pseudo stay just as high?
Increase cluster_std. At what point do the clusters start overlapping and the mapping becomes unreliable?
Replace KMeans with AgglomerativeClustering or SpectralClustering. Does the quality change?
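The effect of a wrong mapping can be isolated from the rest of the pipeline. The sketch below assumes an idealized case in which the clustering recovers the true groups perfectly, so that the only possible source of error is the cluster-to-label mapping; cluster_of_label, good_mapping, bad_mapping and propagate are hypothetical names introduced for this illustration.

```python
import numpy as np

rng_demo = np.random.default_rng(3)

# Idealized setting: clusters match the true groups exactly,
# but cluster ids are an arbitrary permutation of the label ids
y_true = rng_demo.integers(0, 3, size=300)
cluster_of_label = np.array([2, 0, 1])      # true label -> cluster id
cluster_id = cluster_of_label[y_true]

good_mapping = {2: 0, 0: 1, 1: 2}           # cluster id -> label (the correct inverse)
bad_mapping = {2: 1, 0: 0, 1: 2}            # labels of clusters 2 and 0 are swapped

def propagate(mapping, cluster_id):
    """Assign every point the label that its cluster was mapped to."""
    return np.array([mapping[int(c)] for c in cluster_id])

for name, mapping in [("correct", good_mapping), ("swapped", bad_mapping)]:
    pseudo = propagate(mapping, cluster_id)
    acc = (pseudo == y_true).mean()
    print(f"{name} mapping: pseudo-label accuracy vs. truth = {acc:.0%}")
```

With the swapped mapping, only the points of the untouched cluster keep a correct label. A classifier trained on those pseudo-labels can still score highly against them, which is why acc_vs_pseudo alone cannot reveal the mistake.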

Exercise 5: "Concepts", strict-schema generation (guided decoding, simulated)

Pattern: when you need guaranteed-valid output from a generative model (JSON, labels from a fixed list, a strict format), you constrain the generation process itself instead of hoping the model "lands on" the right format.

Task: we don't have a real LLM here, so we simulate the idea directly: a noisy "generator" produces free-form text. We compare (a) naive parsing of that free output with (b) a guided generation function that always picks from the valid set, which plays roughly the role that constrained decoding, or Pydantic validation plus re-prompting, would play in a real pipeline.

Python
import re
import json

# Our strict schema: { "category": one of [...], "score": an integer from 1 to 5 }
VALID_CATEGORIES = ["tool", "toy", "food", "vehicle"]

# 1. A noisy "generator", simulating an LLM that occasionally goes off script
templates_buggy = [
    '{{"category": "{cat}", "score": {score}}}',
    "I think it's a {cat}, I'd give it a score of {score}.",     # falls outside the schema
    '{{"category": "{cat}", "score": "{score} out of 5"}}',      # wrong type for the score
    '{{"category": "{cat_bad}", "score": {score}}}',             # made-up category
]

def noisy_generate(seed):
    r = np.random.default_rng(seed)
    cat = r.choice(VALID_CATEGORIES)
    cat_bad = cat + "_extra"  # a category outside the schema
    score = int(r.integers(1, 6))
    template = r.choice(templates_buggy, p=[0.4, 0.25, 0.2, 0.15])
    return template.format(cat=cat, cat_bad=cat_bad, score=score)

samples = [noisy_generate(i) for i in range(10)]
print("Sample of raw output (free generation):")
for s in samples:
    print(" ", s)

Output:

Sample of raw output (free generation):
  {"category": "vehicle", "score": 4}
  {"category": "toy_extra", "score": 3}
  {"category": "vehicle", "score": 2}
  {"category": "vehicle", "score": 1}
  I think it's a food, I'd give it a score of 5.
  {"category": "food", "score": "5 out of 5"}
  {"category": "toy", "score": 3}
  {"category": "vehicle_extra", "score": 4}
  {"category": "food_extra", "score": 2}
  {"category": "toy", "score": 5}

Python
# 2a. Naive parsing: just try json.loads directly
def parse_naive(text):
    try:
        obj = json.loads(text)
        assert obj["category"] in VALID_CATEGORIES
        assert isinstance(obj["score"], int) and 1 <= obj["score"] <= 5
        return obj
    except Exception:
        return None

naive_results = [parse_naive(s) for s in samples]
naive_valid = sum(r is not None for r in naive_results)
print(f"Naive parsing:    {naive_valid}/{len(samples)} responses valid against the schema")

# 2b. GUIDED generation: instead of parsing whatever comes out, we restrict the choices to the valid set
#     (the practical equivalent of a grammar, or a schema validator that re-requests fields)
def guided_generate(seed):
    r = np.random.default_rng(seed)
    return {"category": str(r.choice(VALID_CATEGORIES)), "score": int(r.integers(1, 6))}

guided_results = [guided_generate(i) for i in range(10)]
guided_valid = sum(parse_naive(json.dumps(r)) is not None for r in guided_results)
print(f"Guided generation: {guided_valid}/{len(guided_results)} responses valid against the schema")

Output:

Naive parsing:    5/10 responses valid against the schema
Guided generation: 10/10 responses valid against the schema

Solution / what you should notice

Naive parsing fails often: the generator's free-form output is valid as plain text, but structurally invalid (made-up categories, wrong types, extra text around the JSON). Guided generation, on the other hand, leaves no room for structural error, because it picks directly from a predefined set, exactly what a decoder constrained by a grammar or a JSON schema does on top of a real LLM.

That's also the trade-off behind discussion question 14: we gained 100% validity, but we also narrowed the model's "voice" down to a fixed list. In a real setting, the schema needs to be wide enough to let the model express its correct understanding of the task.
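To make the "constrain the generation itself" idea a bit more concrete, here is a one-step toy version of what a constrained decoder does: tokens outside the allowed set are masked out before the choice is made. Everything in it is an assumption for illustration (the eight-word vocabulary, the made-up scores, the name constrained_pick); real constrained decoders work on sub-word tokens and track the grammar state across many steps.

```python
import numpy as np

VOCAB = ["tool", "toy", "food", "vehicle", "gadget", "I", "think", "maybe"]
ALLOWED = {"tool", "toy", "food", "vehicle"}   # the valid categories of the schema

def constrained_pick(logits, vocab, allowed):
    """Return the highest-scoring token among the allowed ones.

    Disallowed tokens get a score of -inf, so they can never be selected,
    no matter how much the (simulated) model prefers them.
    """
    mask = np.array([tok in allowed for tok in vocab])
    masked = np.where(mask, logits, -np.inf)
    return vocab[int(masked.argmax())]

# Made-up scores: the unconstrained favourite is "gadget", which is off-schema
logits = np.array([0.2, 1.1, 0.7, 0.9, 2.5, 1.8, 0.1, 0.3])

print("unconstrained:", VOCAB[int(logits.argmax())])                  # gadget
print("constrained:  ", constrained_pick(logits, VOCAB, ALLOWED))    # toy
```

The example also shows the trade-off in miniature: the model's top preference is discarded, and the output is only as good as the best option the schema allows.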

Variations to try:

Add a fifth valid category to the schema, but not to templates_buggy. How does the naive validation rate change?
Write a repair(text) function that tries to "fix" invalid responses (for example, pull the number out of a string like "4 out of 5") before rejecting them. How much does the success rate go up?
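One possible shape for the repair(text) variation is sketched below. It is a starting point rather than a reference solution: the repair rules (keep the first {...} span, take the first integer found in a string score, never guess a category) are assumptions chosen for this illustration, and VALID_CATEGORIES is repeated so the snippet runs on its own.

```python
import json
import re

VALID_CATEGORIES = ["tool", "toy", "food", "vehicle"]

def repair(text):
    """Try to coerce a near-miss response into the schema; return None on failure."""
    # Keep only the {...} span, in case there is chatter around the JSON
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

    score = obj.get("score")
    if isinstance(score, str):
        # e.g. "4 out of 5" -> 4 (takes the first integer in the string)
        digits = re.search(r"\d+", score)
        score = int(digits.group(0)) if digits else None

    if obj.get("category") not in VALID_CATEGORIES:
        return None   # a made-up category cannot be repaired safely
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return None
    return {"category": obj["category"], "score": score}

print(repair('{"category": "food", "score": "5 out of 5"}'))    # {'category': 'food', 'score': 5}
print(repair('{"category": "toy_extra", "score": 3}'))          # None
print(repair("I think it's a food, I'd give it a score of 5."))  # None
```

The function deliberately refuses to repair made-up categories and free-form sentences: guessing there would raise the validity rate while risking wrong data, which is a different failure than a rejected response.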

Quick recap

Exercise	Central pattern	Associated discussion question
1. Lost in Hyperspace	a good feature beats a bigger model	Q10
2. Help BOBAI	a router over embeddings, no retraining	Q7, Q8
3. Restroom	zero-shot matching through similarity	Q9
4. Antique	clustering, pseudo-labels, classifier	Q11
5. Concepts	strict-schema generation	Q14

Homework

The four exercises below revisit the patterns from the notebook, but ask you to write new code rather than just tweak a parameter. There's no single correct solution here; the point is to practice the reflex of "read the constraints, pick the pattern, build a quick baseline".

1. "Mirror World" (a 3D variant of Lost in Hyperspace)
Generate points on the surface of a sphere (instead of a circle), with noise, where the label is the polar angle (the angle from the pole, also called the colatitude). Build a baseline (regression on the raw x, y, z coordinates) and an engineered feature (hint: arccos(z / vector norm)). How much does the error drop? Does it work as well as in 2D, or do new complications show up?

2. "New Species" (a variant with two new classes at once)
Start from the router in Exercise 2, but add TWO new classes at the same time, each with only 2-3 examples. Does the router mix them up with each other more often than it mixes them up with the old classes? Build a confusion matrix and explain what you observe.

3. "Mystery Boxes" (choosing the number of clusters without "cheating")
Run the pipeline from Exercise 4 again, but this time pick the number of clusters k without looking at y_true, using the silhouette score or the "elbow" method. Compare the final result (acc_vs_truth) when you pick k correctly versus when you pick k wrong by plus or minus 1. How much does the choice of k actually matter?

4. "Strict Diary" (your own guided-decoding schema)
Design a JSON schema for a different domain, say a "daily journal entry": {"mood": one of [...], "hours_slept": an integer from 0 to 12, "activities": a list of strings}. Write a noisy generator, a naive parsing function, and a guided generation function for your schema, then report the validation rates. What changes compared to Exercise 5 once you have a list-type field instead of just scalar ones?

How to tell you've done it well: for each homework item, you should be able to answer in 2-3 sentences the question "what pattern did I practice, and why did it work (or not)?", the same way we did in the solutions above.