I have a property column which can have a subset of the following values at any point in time: { a | b | c | d | e }
. By this, I mean that sometimes it can be any of { a | d | e }
, or at another time it can even be { x | y | z }
. How do I query the datastore, so that I can find out what subset is present at that point in time, without having to dig into each entity?
Presently I'm doing it this way:
people = Person.all().fetch(100)
city = set()
for p in people:
city.add(p.address)
I want to get the set of property values that are present at this point in time (i.e. no duplicates). For example, at one point in time all 5,000,000 people
have an address
of { Manila | Cebu | Davao }
, then I want the set(Manila, Cebu, Davao)
.
At another point in time, all 5,000,000 people
will have an address
of { Iloilo | Laoag }
, then I want the set(Iloilo, Laoag)
.
Before any query, I would not know what the set
should compose of.
My present method requires that I dig through all the entities. It's terribly inefficient, any better way?
In AppEngine, it's almost always better to generate and store what you might need, during write time.
So in your use case, every time you add or edit a person entity, you add the city they are in to another model that lists all the cities, and then store that cities entity as well.
class Cities(db.Model):
list_of_cities = db.TextProperty(default="[]") #we'll use a stringified json list of cities
#when creating a new person / or when editing
person = Person(city = city)
cities = Cities.all().get() #there's only one model that we'll use.
list_of_cities = simplejson.loads(cities.list_of_cities)
if city not in list_of_cities:
list_of_cities.append(city) #add to the list of cities
cities.list_of_cities = simplejson.dumps(list_of_cities)
db.put(cities)
person.put()
You may want to use memcache on your cities entity to speed things up a bit. If you are also expecting to add more than one person in bursts of more than 1 write / second, then you might need to also consider sharding your list of cities.
An alternative to the approach suggested by Albert is to compute these values periodically using a mapreduce. The App Engine Mapreduce library makes this fairly straightforward. Your mapper will output the city (for instance) for each record, while the reducer will output the value and the number of times it occurs for each.