I'm n1 - I read, write and code.

Posts

pdf

Django and manual data prefetch

Django supports prefething related data based on 1:N or M:N relations. This technique works well and saves database from frequent N+1 queries while all the data are prefetched at once and then kept in memory. But what if you have a complex class - let's say a availability calculator - that operates on many set's of data. Such datasets are related one to each other and you work with them in a bunch of loops. Is it better to fetch all the data at once? Or in batches? Or don't prefetch data at all? Let's look onto this.

Data model

Let's say you have 3 entities

  • cars
  • trips
  • defects

A car can make a trip. Cars are checked regularly and can have some defects which mechanics need to repair.

Now let's say you have a transportation company that owns 100 cars and each car makes 100 trips a month. Also, a car usually has 5 arbitrary defects a week. All of that creates an interesting dataset that with 100k trips and 2k defects a month.

Availability calculator

Let's say we need some type of calculator that takes a car or a list of cars and calculates if the car(s) are operable or not. You don't wanna broken cars transport cargo and this calculator helps you with that.

The calculator needs to work with all 3 types of entities. We know that there is 100 * 100 trips per month. Also 5 * 4 defects per car fer month. Now it's time to think about how to fetch inside the calculator class in regard later data manipulation.

All-at-once fetch

One method how to fetch data is to fetch cars and prefetch trips and defects.

Car.objects.prefetch_related("trip_set", "defect_set").all()

That would give you pretty much all you need but there are some downsides:

  • data might be big than expected thus ORM mapping might be slow
  • data are nested - trips and defects are hidden under a car
  • you cannot work with all defects and trips as a whole

The last point is especially inconvenient. You might need to query trips as a whole over all cars. That's impossible if data are nested. The same is true for defects.

Fetch in batches

Quite unusual (or one might think an oldschool) technique is to fetch all the needed cars, trips and defects. The thing here is to re-store the records in memory under a key that you are going to use the most - probably car ID.

from collections import defaultdict


cars = []  # list of car IDs
trips = defaultdict(list)

for t in Trip.objects.filter(car__in=cars):
    trips[t.car.pk].append(t)

This small piece of code gives you the possibility to fetch all trips for all the cars at once + gives you the flexibility having them all in one heap yet you can still identify then by it's relation. You can do the same for defects. To query trips or defects is simple:

car_trips = trips[car.pk]

To fetch some specific trips/defects for a car or even multiple cars I recommend using filter():

filter(trips, lamdda t: t.destination = "London")
filter(defects[car.pk], lamdda d: d.severity = "light")