Optimizing Active Record queries
Context: Some time ago, I worked on a project where I had to make a lot of reports. We had a lot of data, and most of the reports should be in the application’s dashboard. The data was divided into more than one table, so I had to make queries joining many tables every time I wanted to generate a new report. That made our application slow and even sometimes broke.
In this article, I’ll show you how I used Active Record to fix the app’s problems and if there’s a limit on how to customize Active Record queries.
The sample app I’ll use will be a clone (in a base form) of Spotify or any music app. Let me explain how will be the tables relationship in our database.
--- title: DB Schema --- erDiagram songs ||--|{ ratings : "" songs { integer id string title integer duration integer progress } albums ||--|{ ratings : "" albums { integer id string title } artists ||--|{ ratings : "" artists { integer id string name string age } users ||--|{ ratings : "" users { integer id string name } songs ||--|{ associations : "" albums ||--|{ associations : "" artists ||--|{ associations : "" associations { integer song_id integer artist_id integer album_id } ratings { integer id string votable_type integer votable_id integer user_id integer vote }
All the songs
belong to an artist
and an album
. We also have the users
table who can vote or rate a song or album. The ratings
table has a polymorphic relation to songs
and albums
.
NOTE: A polymorphic relation is how Rails associate with more than one table simultaneously without the need to create a foreign key with the name of the association.
Before Start
¿How will we know if these solutions are fast?
There are a lot of ways to analyze your code. In this article, I’ll make with these 3 ways:
- Elapsed time
- Memory: space in the disk that the code occupies in execution
- Iterations per second
To have these 3 ways to measure our code, we will use the Benchmark
module inside Ruby and the gems benchmark-ips
and benchmark-memory
.
First Problem: Get the highest rating albums
Possible Solutions
First solution:
Using a combination of Active Record
and Ruby
methods:
Album.includes(:ratings).map { |album| album.ratings.length }.max
Second solution:
Using only Active Record
methods
Album.from(
Album.select(
"albums.*", "COUNT(ratings.id) AS rating_count"
).joins(:ratings).group("albums.id")
).maximum("rating_count")
Both return the same result: 20. What is the difference?
Let’s analyze the first solution
The first part of this solution uses an Active Record
method:
Album.includes(:ratings)
This expression executes the following SQL code:
SELECT "albums".* FROM "albums"
SELECT "ratings".* FROM "ratings" WHERE "ratings"."ratingable_type" = "Album"
AND "ratings"."ratingable_id" IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...)
The more albums we have in our database, the longer the ranking list we have.
The second part is
.map { |album| album.ratings.length }.max
The map
method iterates over the result of the includes
method we did before. Inside every album, we call all the ratings associated with that album, and we calculate (with Ruby) how many ratings have every album. Once we have the list with all the totals, with the max
method of the Enumerators
in Ruby, we can calculate the element with the max value inside the array.
Let’s analyze the second solution
Let’s break the solution in two parts, the first is inside the from
method.
Album.select(
"albums.*", "COUNT(ratings.id) AS rating_count"
).joins(:ratings).group("albums.id")
This will execute the following SQL code:
SELECT albums.*, COUNT(ratings.id) AS rating_count
FROM "albums" INNER JOIN "ratings"
ON "ratings"."votable_type" = "Album" AND "ratings"."votable_id" = "albums"."id"
GROUP BY "albums"."id"
Here we add the COUNT(rating.id)
column with the alias rating_count
to our query to count how many associated ratings have every album.
The second part is:
Album.from(...).maximum("rating_count")
This will execute the following SQL code:
SELECT MAX(rating_count) FROM (...) subquery
The first part of the query we explained becomes the subquery on which we will run another query. The query will get the maximum value of the rating_count
column.
Let’s compare the solutions
Elapsed Time This is the code we will execute:
Benchmark.bmbm do |x|
x.report("Active Record + Ruby code") { first_solution }
x.report("Only Active Record") { second_solution }
end
Note: I’m using the
bmbm
method to execute the reports twice, so none of the methods will be affected, only for being the first.
This is the result:
Rehearsal -------------------------------------------------------------
Active Record + Ruby code 5.554467 0.788396 6.342863 ( 7.404660)
Only Active Record 0.003204 0.001894 0.005098 ( 0.741973)
---------------------------------------------------- total: 6.347961sec
user system total real
Active Record + Ruby code 5.042479 0.540590 5.583069 ( 5.776371)
Only Active Record 0.001858 0.000419 0.002277 ( 0.532489)
If we analyze the result, there is a slight time difference between the two solutions, with the second one being faster.
Memory This is the code we will execute:
Benchmark.memory do |x|
x.report("Active Record + Ruby code") { first_solution }
x.report("Only Active Record") { second_solution }
x.compare!
end
This is the result:
Calculating -------------------------------------
Active Record + Ruby code
585.804M memsize ( 0.000 retained)
6.099M objects ( 0.000 retained)
50.000 strings ( 0.000 retained)
Only Active Record 53.270k memsize ( 0.000 retained)
511.000 objects ( 0.000 retained)
50.000 strings ( 0.000 retained)
Comparison:
Only Active Record: 53270 allocated
Active Record + Ruby code: 585803866 allocated - 10996.88x more
If we look at this result, we can say that the first solution uses 1.54 times more memory than the second solution.
Iterations per second This is the code we will execute:
Benchmark.ips do |x|
x.report("Active Record + Ruby code") { first_solution }
x.report("Only Active Record") { second_solution }
x.compare!
end
This is the result:
Warming up --------------------------------------
Active Record + Ruby code
1.000 i/100ms
Only Active Record 1.000 i/100ms
Calculating -------------------------------------
Active Record + Ruby code
0.159 (± 0.0%) i/s - 1.000 in 6.277377s
Only Active Record 12.768 (± 7.8%) i/s - 64.000 in 5.032453s
Comparison:
Only Active Record: 12.8 i/s
Active Record + Ruby code: 0.2 i/s - 80.15x (± 0.00) slower
As we see in the results, the first solution is 1.44 times faster than the second one.
Results of comparison
- The second solution is faster than the first solution at elapsed time.
- In terms of memory, the first solution uses 10996x times more memory than the second solution.
- And on the number of iterations per second, the first solution is 80.15 times slower than the second solution.
So why is the second solution the best? Because we are doing only one call on the database, and the whole operation is done from the DB side.
Second problem: Top 10 songs with their artists that have the highest rating
Possible solutions
This time I have 3 solutions:
First solution
Using Active Record
and Ruby
ratings = Rating
.where(votable_type: "Song")
.group(:votable_id).average(:vote)
.sort_by { |r| -r[1] }.take(n).to_h
songs = Song.includes(:artists).find(ratings.keys)
ratings.map do |song_id, rating|
song = songs.find { |song| song.id == song_id }
{
song: song.title,
artist: song.artists.map(&:name).join(", "),
rating_avg: rating.to_f
}
end
Second solution
Using only one query with a lot of Active Record
methods
Song
.includes(:artists).joins(:ratings)
.select("songs.*, AVG(ratings.vote) as rating_avg")
.group("songs.id").order("rating_avg DESC").limit(n)
.map do |song|
{
song: song.title,
artist: song.artists.map(&:name).join(", "),
rating_avg: song.rating_avg
}
end
Third solution
Using 2 queries with Active Record
methods and Ruby
code
ratings = Rating
.select("ratings.votable_id, AVG(ratings.vote) as rating_avg")
.where(votable_type: "Song")
.group(:votable_id).order("rating_avg DESC")
.limit(n)
Song
.includes(:artists)
.find(ratings.map(&:votable_id))
.map do |song|
{
song: song.title,
artist: song.artists.map(&:name).join(", "),
rating_avg: ratings.find { |rating| rating.votable_id == song.id }.rating_avg
}
end
Let’s analyze the first solution
To calculate the songs with the highest rating, I first search in the ratings
table for the ratings related to the songs table.
Remember that the ratings
table has a polymorphic relationship with songs
, so we must look for those ratings whose votable_type
is Song
, so we will know which ratings are songs. Then we will group them by votable_id
(song.id) to obtain the average of all the ratings.
Rating
.where(votable_type: "Song")
.group(:votable_id)
.average(:vote)
This expression will return an array of arrays with the following structure: [[votable_id, vote_avg], [votable_id, vote_avg]]
. It has all the songs’ ids and the average of their ratings.
It will sort them by the second element of the array in descending order, then take the first ‘n’ elements and convert it into a hash.
.sort_by { |r| -r[1] }.take(n).to_h
The second part will get the songs of the hash we just generated, including the associated artists.
songs = Song.includes(:artists).find(ratings.keys)
Lastly, we iterate over the hash of the ratings to generate the required data. In every iteration, we search for the song which belongs to the rating, and we build the hash to return as a result.
ratings.map do |song_id, rating|
song = songs.find { |song| song.id == song_id }
{
song: song.title,
artist: song.artists.map(&:name).join(", "),
rating_avg: rating.to_f
}
end
Let’s analyze the second solution
I tried to get the same result in this solution by doing only one query to the database.
First, I’ll make a query over the joins
of the tables songs
and ratings
, so I can have the data from both tables. I’ll use the includes(:artists)
method to call all the associated artists to the found songs.
Song.includes(:artists).joins(:ratings)
Over the joins
, we will ask for the average (in SQL) grouped by song.id
, then sort them by rating_avg
and at the end call only the first n
records.
.select("songs.*, AVG(ratings.vote) as rating_avg")
.group("songs.id")
.order("rating_avg DESC")
.limit(n)
Finally, we will iterate over the result to build the hash we should return as a result
.map do |song|
{
song: song.title,
artist: song.artists.map(&:name).join(", "),
rating_avg: song.rating_avg
}
end
Let’s analyze the third solution
We make something similar to the first solution, but we don’t use the average
method from ActiveRecord
. Otherwise, we will do it by SQL using the select
method. We call the group
method to make the calculation for every song, then we filter the search to only songs with the where
method, and lastly, we sort them and limit the result to the n
first records
ratings = Rating
.select("ratings.votable_id, AVG(ratings.vote) as rating_avg")
.where(votable_type: "Song")
.group(:votable_id)
.order("rating_avg DESC")
.limit(n)
Then, we search the songs with those ratings we found, and we format them to result in the expected result:
Song
.includes(:artists)
.find(ratings.map(&:votable_id))
.map do |song|
{
song: song.title,
artist: song.artists.map(&:name).join(", "),
rating_avg: ratings.find { |rating| rating.votable_id == song.id }.rating_avg
}
end
Let’s compare the solutions
Elapsed Time
Rehearsal ----------------------------------------------------------------
Active Record + Ruby code 0.787713 0.196014 0.983727 ( 1.645864)
Only Active Record 0.016700 0.002986 0.019686 ( 0.567646)
Active Record + Ruby code v2 0.024173 0.000953 0.025126 ( 0.403639)
------------------------------------------------------- total: 1.028539sec
user system total real
Active Record + Ruby code 0.595617 0.026087 0.621704 ( 1.016591)
Only Active Record 0.008948 0.000577 0.009525 ( 0.580103)
Active Record + Ruby code v2 0.002977 0.000137 0.003114 ( 0.378619)
Memory
Calculating -------------------------------------
Active Record + Ruby code
175.136M memsize ( 0.000 retained)
2.422M objects ( 0.000 retained)
50.000 strings ( 0.000 retained)
Only Active Record 303.235k memsize ( 0.000 retained)
3.425k objects ( 0.000 retained)
50.000 strings ( 0.000 retained)
Active Record + Ruby code v2
347.840k memsize ( 0.000 retained)
3.615k objects ( 0.000 retained)
50.000 strings ( 0.000 retained)
Comparison:
Only Active Record: 303235 allocated
Active Record + Ruby code v2: 347840 allocated - 1.15x more
Active Record + Ruby code: 175135967 allocated - 577.56x more
Iterations per second
Warming up --------------------------------------
Active Record + Ruby code
1.000 i/100ms
Only Active Record 1.000 i/100ms
Active Record + Ruby code v2
1.000 i/100ms
Calculating -------------------------------------
Active Record + Ruby code
0.947 (± 0.0%) i/s - 5.000 in 5.294872s
Only Active Record 1.867 (± 0.0%) i/s - 10.000 in 5.357158s
Active Record + Ruby code v2
2.646 (± 0.0%) i/s - 14.000 in 5.292725s
Comparison:
Active Record + Ruby code v2: 2.6 i/s
Only Active Record: 1.9 i/s - 1.42x (± 0.00) slower
Active Record + Ruby code: 0.9 i/s - 2.79x (± 0.00) slower
Results of comparison
- The elapsed time of the third solution is faster than the two first solutions.The first solution is slower than the other two.
- The third solution uses 1.15 times more memory than the second solution, and the first solution uses 557 times more memory than the second solution.
- And on the number of iterations per second, the second solution is 1.42 times slower than the third solution and the first solution 2.79 times slower than the third solution.
In conclusion, the third solution is the best, and the first solution is the least optimal. Why is the solution using pure Active Record the least optimal this time? Because the query we are doing makes joins of two tables, this has an optimization cost, as we can observe.
Third Problem: Top “N” songs with the highest rating and pass as parameters if you want their artists and albums
Possible Solutions:
- Build the result by calling different methods based on the parameters.
- Using SQL y Materialized views.
SQL Views
- They are virtual tables created from a
SELECT
query that usually joins multiple tables. - Each time they are called, the query they have been defined will be executed.
Materialized views
- They are also virtual tables created from a
SELECT
query that normally joins multiple tables. - They store the query results in this virtual table. They are read-only.
- They can be updated whenever you need them.
Let’s compare the solutions
I’m using the best solution from the previous problem to compare with the SQL and Materialized views.
Elapsed Time
Rehearsal -------------------------------------------------------------
Active Record + Ruby code 0.096419 0.036263 0.132682 ( 0.598458)
SQL View 0.000640 0.000259 0.000899 ( 0.001701)
Materialized View 0.000671 0.000172 0.000843 ( 0.001408)
---------------------------------------------------- total: 0.134424sec
user system total real
Active Record + Ruby code 0.008866 0.001029 0.009895 ( 0.382414)
SQL View 0.000122 0.000003 0.000125 ( 0.000121)
Materialized View 0.000100 0.000000 0.000100 ( 0.000098)
Memory
Calculating -------------------------------------
Active Record + Ruby code
347.036k memsize ( 0.000 retained)
3.611k objects ( 0.000 retained)
50.000 strings ( 0.000 retained)
SQL View 1.992k memsize ( 0.000 retained)
29.000 objects ( 0.000 retained)
1.000 strings ( 0.000 retained)
Materialized View 1.992k memsize ( 0.000 retained)
29.000 objects ( 0.000 retained)
1.000 strings ( 0.000 retained)
Comparison:
SQL View: 1992 allocated
Materialized View: 1992 allocated - same
Active Record + Ruby code: 347036 allocated - 174.21x more
Iterations per second
Warming up --------------------------------------
Active Record + Ruby code
1.000 i/100ms
SQL View 10.036k i/100ms
Materialized View 10.049k i/100ms
Calculating -------------------------------------
Active Record + Ruby code
2.652 (± 0.0%) i/s - 14.000 in 5.284182s
SQL View 99.936k (± 1.1%) i/s - 501.800k in 5.021873s
Materialized View 100.465k (± 0.7%) i/s - 502.450k in 5.001495s
Comparison:
Materialized View: 100465.2 i/s
SQL View: 99935.8 i/s - same-ish: difference falls within error
Active Record + Ruby code: 2.7 i/s - 37887.16x (± 0.00) slower
Results of comparison
- The elapsed time of the second and third solution are faster than the first one.
- The first solution uses 174.21 times more memory than the second and third solutions.
- And on the number of iterations per second, the first solution is 37887.16 times slowerd than the two first solutions.
- In conclusion, the difference between the optimized solution from the previous problem and SQL and Materialized views is considerable.
Conclusions
- The first problem showed us how to use Active Record to make it more powerful using SQL queries
- The second problem shows us that it is OK to use Ruby to support Active Record.
- The third one shows us that Active Record can also have a limit, and we can use SQL when needed.
Remember
- If you will handle a considerable amount of data, try to get only the needed data. You can use the
select
andpluck
methods from Active Record. - Sometimes, the Ruby methods are better than the Active Record ones.
- Don’t be afraid to use SQL with Active Record.
- Use
size
orlength
instead ofcount
if you don’t want to make an extra query. - Avoid
n + 1
queries withincludes
method.
I hope you enjoyed and learned something from this article. Thanks for reading! ❤️