Effective query with multiple conditions


0

I have a database with

books          (primary key: bookID)
characterNames (foreign key: books.bookID) 
locations      (foreign key: books.bookID)

The in-text positions of character names and locations are saved in the corresponding tables.
Now I want to write a Python script using psycopg2 that finds all occurrences of a given character name and a given location, but only in books where both occur.
At the moment, I execute 4 queries:

SELECT bookID, position FROM characterNames WHERE name='XXX';
--> result is saved in list 'charnames'

SELECT DISTINCT bookID FROM characterNames WHERE name='XXX';
--> result is saved in list 'charnamesIDs'

SELECT bookID, position FROM locations WHERE locName='YYY';
--> result is saved in list 'locs'

SELECT bookID FROM locations WHERE locName='YYY';
--> result is saved in list 'locsIDs'

Both queries could give me bookIDs where just the name OR just the location appears. So my goal is to eliminate all elements of 'charnames' with bookIDs not occurring in 'locs', and the other way round. My approach was:

for cnameTuple in charnames[:]:   # iterate over a copy so deleting is safe
    if cnameTuple[0] in locsIDs:  # bookID also has the location; keep it
        continue
    charnames.remove(cnameTuple)  # no matching location; drop the tuple

I made a corresponding loop for the tuples in locs.
This algorithm unfortunately takes a lot of time. Is there a way to perform this task more quickly?

2012-04-05 22:03
by EarlGrey
I rolled back the last edits, as those went into this separate question now (as advised) - Erwin Brandstetter 2012-04-23 01:39


3

This could be much faster and simpler with a query using JOINs.
Something like this:

SELECT b.*, c.position, l.position
FROM   books b
JOIN   characternames c USING (bookid)
JOIN   locations l USING (bookid)
WHERE  c.name = 'XXX'
AND    l.locname = 'YYY';
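
From Python, you would run this with psycopg2 placeholders rather than interpolating the strings yourself. A minimal sketch (the connection string is a placeholder, and I select bookid plus both positions instead of b.*):

import psycopg2

# Connection string is a placeholder; adjust to your environment.
conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()

# One round-trip; %s placeholders let the driver quote the search terms.
cur.execute("""
    SELECT b.bookid, c.position AS char_pos, l.position AS loc_pos
    FROM   books b
    JOIN   characternames c USING (bookid)
    JOIN   locations l USING (bookid)
    WHERE  c.name = %s
    AND    l.locname = %s;
""", ('XXX', 'YYY'))

rows = cur.fetchall()  # list of (bookid, char_pos, loc_pos) tuples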

More info after comment

"Thousands of books" are no problem at all for a RDBMS like PostgreSQL that is designed to handle millions. The key to performance with large tables are proper indexes. For the queries here the following indexes will potentially help:

CREATE INDEX books_bookid_idx ON books(bookid); -- a primary key will do, too

CREATE INDEX cn_bookid_idx ON characternames (bookid);
CREATE INDEX cn_name_idx ON characternames (name);

CREATE INDEX locations_bookid_idx ON locations (bookid);
CREATE INDEX locations_locname_idx ON locations (locname);

Multicolumn indexes may perform even better. Test with EXPLAIN ANALYZE; it will show you which indexes get used and how fast the query is. Creating indexes is very fast, so experimenting with them is easy. Just don't keep indexes you don't need. They carry a maintenance cost, too.
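
Such an experiment could look like this sketch from psycopg2 (the index names and column order are assumptions to test, not recommendations; conn and cur come from the sketch above):

# Hypothetical two-column indexes: filter column first, join column second.
cur.execute("CREATE INDEX cn_name_bookid_idx ON characternames (name, bookid);")
cur.execute("CREATE INDEX loc_locname_bookid_idx ON locations (locname, bookid);")
conn.commit()

# EXPLAIN ANALYZE reports the plan actually chosen and the real runtimes.
cur.execute("""
    EXPLAIN ANALYZE
    SELECT b.bookid, c.position, l.position
    FROM   books b
    JOIN   characternames c USING (bookid)
    JOIN   locations l USING (bookid)
    WHERE  c.name = %s AND l.locname = %s;
""", ('XXX', 'YYY'))
for (line,) in cur.fetchall():  # one text column per plan line
    print(line)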


Optimized query

I think I understand now what you are looking for. This query should be optimized to get all positions of locations or names per bookid, but only where name and location show up in the same book, and no further details per book:

WITH b AS (
    SELECT bookid
    FROM   characternames
    WHERE  name = 'XXX'
    GROUP  BY 1
    INTERSECT
    SELECT bookid
    FROM   locations
    WHERE  locname = 'YYY'
    GROUP  BY 1
    )
SELECT bookid, position, 'char' AS what
FROM   b
JOIN   characternames USING (bookid)
WHERE  name = 'XXX'
UNION  ALL
SELECT bookid, position, 'loc' AS what
FROM   b
JOIN   locations USING (bookid)
WHERE  locname = 'YYY'
ORDER  BY bookid, position;

Major points

  • The CTE (WITH query) makes sure the base query is only executed once.
  • INTERSECT picks only bookids that feature both location and name.
  • The UNION ALL in the final SELECT returns all found positions. Use UNION instead if you want to trim duplicates with the same position.
  • I order by bookid, position - guessing that is what's needed.
  • Added a column what to tag the source (location or name) of a position.
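
From psycopg2, pass the search terms as parameters. A minimal sketch, reusing conn and cur from the first sketch; each term appears twice (once in the CTE, once in the outer SELECT), so it is passed twice:

name, loc = 'XXX', 'YYY'
cur.execute("""
    WITH b AS (
        SELECT bookid FROM characternames WHERE name = %s GROUP BY 1
        INTERSECT
        SELECT bookid FROM locations WHERE locname = %s GROUP BY 1
        )
    SELECT bookid, position, 'char' AS what
    FROM   b
    JOIN   characternames USING (bookid)
    WHERE  name = %s
    UNION  ALL
    SELECT bookid, position, 'loc' AS what
    FROM   b
    JOIN   locations USING (bookid)
    WHERE  locname = %s
    ORDER  BY bookid, position;
""", (name, loc, name, loc))
rows = cur.fetchall()  # (bookid, position, what) tuples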

Further optimization

If search terms appear many times per book, you could considerably speed up the search by creating auxiliary tables with distinct entries for (bookid, term). Create a multicolumn primary key on the two columns and an additional index on just term. Create one such table for locations and another one for names. Keep them up to date with triggers if need be, but I assume the content of books does not change much. That would simplify and speed up the CTE.
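
A sketch of such an auxiliary table for names (table and index names are made up for illustration; locations work the same way):

# One row per distinct (bookid, name) pair; much smaller than the base
# table when a term repeats often within one book.
cur.execute("""
    CREATE TABLE cn_lookup AS
    SELECT DISTINCT bookid, name FROM characternames;
""")
cur.execute("ALTER TABLE cn_lookup ADD PRIMARY KEY (bookid, name);")
cur.execute("CREATE INDEX cn_lookup_name_idx ON cn_lookup (name);")
conn.commit()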

If that still isn't fast enough, look into Full Text Search.

2012-04-05 23:36
by Erwin Brandstetter
Hello, and sorry I hadn't responded yet.
Using JOIN was my first approach, but I was told it's too expensive. I'm working on a corpus of thousands of books, so a cross join can take its time. (I may be wrong, but your solution is sort of a cross join with additional conditions. Feel free to correct me if I'm wrong.) I'm fine with this, but is there a possibility to get all positions just once? As (bookID/c.position/l.position) tuples, I get e.g. 4/10/20 4/10/24 4/18/20 4/18/24. I get every (name/locname) pair per bookID, so there's also redundant data - EarlGrey 2012-04-10 05:25
@EarlGrey: A cross join would be a join without a join condition. There is no cross join here. And while a join has its cost, it is guaranteed to be faster by an order of magnitude than what you had in your question. Now, if you clearly define which position per book and which other data you want, I'll be able to supply the query. The first? Last? From characterNames or from locations? An array of all positions? Try sample data to show what you want, if you have trouble expressing it clearly. But edit your question for that. Don't put it in comments - Erwin Brandstetter 2012-04-10 05:34
I took a guess at what I think you might need - Erwin Brandstetter 2012-04-10 06:36
+1 for outstanding effort in answering. Take all of my upvotes - Li-aung Yip 2012-04-10 06:54
Yes that's exactly what I need. Sorry I hadn't made it clear enough. Thank you very much for your help, both on SQL and asking on stackoverflow - EarlGrey 2012-04-10 06:54
I really like your solution and want to use it, but the server I'm working on runs PostgreSQL 8.3, and CTEs were only introduced in 8.4. Without updating PostgreSQL, I have to substitute 'b' every time by myself, right - EarlGrey 2012-04-12 01:15
@EarlGrey: You could use a temporary table for b. Should be faster. But really, you should look into upgrading. Version 9.1 is so much better than v8.3 - which is reaching end of life soon - Erwin Brandstetter 2012-04-12 01:33
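On pre-8.4 servers, that temporary-table substitution could look like this sketch (assuming the conn and cur from the sketches above; it mirrors the CTE query):

# Materialize the qualifying bookids once per search instead of a CTE.
cur.execute("""
    CREATE TEMP TABLE b AS
    SELECT bookid FROM characternames WHERE name = %s
    INTERSECT
    SELECT bookid FROM locations WHERE locname = %s;
""", ('XXX', 'YYY'))

# Same final SELECT as the CTE version, reading from the temp table.
cur.execute("""
    SELECT bookid, position, 'char' AS what
    FROM   b JOIN characternames USING (bookid) WHERE name = %s
    UNION  ALL
    SELECT bookid, position, 'loc' AS what
    FROM   b JOIN locations USING (bookid) WHERE locname = %s
    ORDER  BY bookid, position;
""", ('XXX', 'YYY'))
rows = cur.fetchall()
cur.execute("DROP TABLE b;")  # temp tables also vanish at session end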
Updating to 9.1 is not an option at the moment, so I'm using temporary tables, and it's fast enough for now - EarlGrey 2012-04-22 12:16
I expanded the question to cover more than 2 conditions. Maybe you can have a look again? I would be happy ^ - EarlGrey 2012-04-22 16:17
@EarlGrey: I suggest you start a new question for the new problem. Link back to this one if the context is useful. It's confusing for the general public to find multiple questions in one post and answers for different stages of the question. BTW, it's CTE for Common Table Expression - Erwin Brandstetter 2012-04-22 16:31
@ErwinBrandstetter: Ok, I asked the question here - EarlGrey 2012-04-23 01:04
+1 for effort and endurance - EarlGrey 2012-04-23 01:51


0

You can use sets to see if that speeds up the operation:

>>> xxx = set([(1,'a'), (2,'b')])
>>> xxx
set([(1, 'a'), (2, 'b')])
>>> yyy = set([(1,'a'), (3,'c')])
>>> yyy
set([(1, 'a'), (3, 'c')])
>>> c = xxx.intersection(yyy)
>>> c
set([(1, 'a')])   # common between xxx and yyy
>>> xxx - c
set([(2, 'b')])
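
Applied to the question's data, a sketch might look like this (assuming charnames and locs are the (bookID, position) lists fetched by the first and third query):

# Build the sets of bookIDs once; membership tests on a set are O(1)
# on average, unlike the O(n) scans over the plain lists.
char_ids = set(bookid for bookid, _ in charnames)
loc_ids = set(bookid for bookid, _ in locs)
common = char_ids & loc_ids  # bookIDs where name AND location both occur

charnames = [t for t in charnames if t[0] in common]
locs = [t for t in locs if t[0] in common]
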
2012-04-05 22:42
by Anthony Kong