I have a database with
books (primary key: bookID)
characterNames (foreign key: books.bookID)
locations (foreign key: books.bookID)
The in-text positions of character names and locations are saved in the corresponding tables.
Now I want to write a Python script using psycopg2 that finds all occurrences of a given character name and a given location, but only in books where both occur.
At the moment, I execute 4 queries:
SELECT bookID, position FROM characterNames WHERE name='XXX';
--> result is saved in list 'charnames'
SELECT DISTINCT bookID FROM characterNames WHERE name='XXX';
--> result is saved in list 'charnamesIDs'
SELECT bookID, position FROM locations WHERE locName='YYY';
--> result is saved in list 'locs'
SELECT bookID FROM locations WHERE locName='YYY';
--> result is saved in list 'locsIDs'
Both queries could give me bookIDs where just the name OR the location appears. So my goal is to eliminate all elements of 'charnames' whose bookIDs do not occur in 'locs', and the other way round. My approach was:
for cnameTuple in charnames[:]:  # iterate over a copy so removal is safe
    if cnameTuple[0] in locsIDs:
        continue
    charnames.remove(cnameTuple)
I made a corresponding loop for the tuples in locs.
This algorithm unfortunately needs a lot of time. Is there a way to perform this task quicker?
This could be much faster and simpler with a query using JOINs.
Something like this:
SELECT b.*, c.position, l.position
FROM books b
JOIN characternames c USING (bookid)
JOIN locations l USING (bookid)
WHERE c.name = 'XXX'
AND l.locname = 'YYY';
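To sketch how this runs end to end, here is a self-contained demonstration. The sqlite3 module stands in for psycopg2 so the example is runnable without a PostgreSQL server; with psycopg2 only the connection line and the placeholder style (%s instead of ?) differ. Table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE books (bookid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE characternames (bookid INTEGER, name TEXT, position INTEGER);
    CREATE TABLE locations (bookid INTEGER, locname TEXT, position INTEGER);
    INSERT INTO books VALUES (1, 'Book One'), (2, 'Book Two');
    INSERT INTO characternames VALUES (1, 'XXX', 10), (1, 'XXX', 50), (2, 'XXX', 7);
    INSERT INTO locations VALUES (1, 'YYY', 30);  -- only book 1 has both
""")

# One JOIN query replaces the four separate queries plus the Python filtering:
# book 2 drops out automatically because it has no matching location.
cur.execute("""
    SELECT b.bookid, c.position AS char_pos, l.position AS loc_pos
    FROM books b
    JOIN characternames c USING (bookid)
    JOIN locations l USING (bookid)
    WHERE c.name = ? AND l.locname = ?
    ORDER BY b.bookid, c.position
""", ("XXX", "YYY"))
rows = cur.fetchall()
print(rows)  # [(1, 10, 30), (1, 50, 30)]
```

Always pass values as parameters (the second argument to execute) rather than interpolating them into the SQL string; this avoids SQL injection and quoting bugs.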
It is generally a good idea to only retrieve the data you need from the database server instead of fetching excessive data and filtering in your app. Database engines are optimized for filtering and sorting. You can hardly ever match that in your app to begin with.
Note how I use all-lowercase names; read the manual chapter about identifiers. See also the manual on the SELECT statement, in particular the section about JOINs.
"Thousands of books" are no problem at all for a RDBMS like PostgreSQL that is designed to handle millions. The key to performance with large tables are proper indexes. For the queries here the following indexes will potentially help:
CREATE INDEX books_bookid_idx ON books(bookid); -- a primary key will do, too
CREATE INDEX cn_bookid_idx ON characternames (bookid);
CREATE INDEX cn_name_idx ON characternames (name);
CREATE INDEX locations_bookid_idx ON locations (bookid);
CREATE INDEX locations_locname_idx ON locations (locname);
Multicolumn indexes may perform even better. Test with EXPLAIN ANALYZE; it will show you which indexes get used and how fast the query is. Creating indexes is very fast, so experimenting with them is easy. Just don't keep indexes you don't need - they carry a maintenance cost, too.
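For example, a multicolumn index with the filter column first and the join column second can serve both the WHERE clause and the join in one index scan. The column order and names here are illustrative, not taken from the original answer:

```sql
-- hypothetical multicolumn indexes: filter column first, join column second
CREATE INDEX cn_name_bookid_idx ON characternames (name, bookid);
CREATE INDEX loc_locname_bookid_idx ON locations (locname, bookid);
```

Whether these beat the single-column indexes depends on the data distribution, which is exactly what EXPLAIN ANALYZE will tell you.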
I think I understand now what you are looking for: all positions of locations or names per bookid, but only where name and location show up in the same book, and no further details per book. This query should be optimized for that:
WITH b AS (
SELECT bookid
FROM characternames
WHERE name = 'XXX'
GROUP BY 1
INTERSECT
SELECT bookid
FROM locations
WHERE locname = 'YYY'
GROUP BY 1
)
SELECT bookid, position, 'char' AS what
FROM b
JOIN characternames USING (bookid)
WHERE name = 'XXX'
UNION ALL
SELECT bookid, position, 'loc' AS what
FROM b
JOIN locations USING (bookid)
WHERE locname = 'YYY'
ORDER BY bookid, position;
The WITH query (CTE) makes sure the base query is only executed once.
INTERSECT picks only bookids that feature both location and name.
UNION ALL in the final SELECT returns all found positions. Use UNION instead if you want to trim duplicates with the same position.
bookid, position - guessing that is what's needed.
what tags the source (location or name) of a position.
If search terms appear many times per book, you could considerably speed up the search by creating auxiliary tables with distinct entries for (bookid, term). Create a multicolumn primary key on the two columns and an additional index on just term. Create one such table for locations and another one for names. Keep them up to date with triggers if need be, but I assume the content of books is not changing much. This would simplify and speed up the CTE.
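The auxiliary tables could be set up along these lines; table and index names are my own invention, not part of the original schema:

```sql
-- One distinct row per (bookid, name) / (bookid, locname) pair.
CREATE TABLE book_names AS
SELECT DISTINCT bookid, name FROM characternames;
ALTER TABLE book_names ADD PRIMARY KEY (bookid, name);
CREATE INDEX book_names_name_idx ON book_names (name);

CREATE TABLE book_locations AS
SELECT DISTINCT bookid, locname FROM locations;
ALTER TABLE book_locations ADD PRIMARY KEY (bookid, locname);
CREATE INDEX book_locations_locname_idx ON book_locations (locname);
```

The CTE would then intersect these much smaller tables instead of scanning every occurrence row.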
If that still isn't fast enough, look into Full Text Search.
I get every (name/locname) pair per bookID, so there's also redundant data - EarlGrey 2012-04-10 05:25
Do you want positions from characterNames or from locations? An array of all positions? Try sample data to show what you want, if you have trouble expressing it clearly. But edit your question for that; don't put it in comments - Erwin Brandstetter 2012-04-10 05:34
b. Should be faster. But really, you should look into upgrading. Version 9.1 is so much better than v8.3 - which is reaching end of life soon - Erwin Brandstetter 2012-04-12 01:33
You can use set to see if it speeds up the operation:
>>> xxx = set([(1,'a'), (2,'b')])
>>> xxx
set([(1, 'a'), (2, 'b')])
>>> yyy = set([(1,'a'), (3,'c')])
>>> yyy
set([(1, 'a'), (3, 'c')])
>>> c = xxx.intersection(yyy)
>>> c
set([(1, 'a')]) # common between xxx and yyy
>>> xxx - c
set([(2, 'b')])
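Applied to the lists from the question (charnames and locs as lists of (bookid, position) tuples), the same idea might look like this; the function name and sample data are my own, but the logic is the set intersection shown above. Set membership tests are O(1), versus a full list scan for every `in locsIDs` check:

```python
def keep_shared_books(charnames, locs):
    """Keep only tuples whose bookid appears in both lists.

    charnames and locs are lists of (bookid, position) tuples,
    as fetched by the queries in the question.
    """
    # Build the set of bookids common to both lists once,
    # instead of scanning a list for every tuple.
    shared = {bid for bid, _ in charnames} & {bid for bid, _ in locs}
    charnames = [t for t in charnames if t[0] in shared]
    locs = [t for t in locs if t[0] in shared]
    return charnames, locs

charnames = [(1, 10), (1, 50), (2, 7)]
locs = [(1, 30), (3, 4)]
print(keep_shared_books(charnames, locs))
# ([(1, 10), (1, 50)], [(1, 30)])
```

This also avoids the pitfall of deleting from a list while iterating over it. That said, the JOIN query in the other answer does the same filtering on the database server and avoids transferring the unmatched rows at all.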