MongoDB

Zachary Greenberg
5 min readNov 5, 2021
Image Source

There are other databases out there aside from SQL. Believe it or not, some of these are called NoSQL databases. It is important to understand how we can work with them because it will widen our tool belts as Data Scientists. MongoDB is a popular NoSQL. The structure of these databases are actually quite similar to JSON structures. So, if we are familiar with taking information from an API, we already have a leg up on trying to understand MongoDB.

For the purposes of this demonstration, I have created my own database of animated films that I webscraped from IMDB. The code for the webscrape and putting the database together can be found here. In this repository you will also find all of the code you will see below as well.

MongoDB has become pretty easy to interact with because of pymongo. It is a Python package that connects us with a database in a python environment. A database is like how it is in a SQL database. The tables in the database are known as collections. The entries in these tables are known as documents, and from there is trickles down the the values or the raw information stored.

When we have established a connection to the MongoDB server, this variable is typically called the ‘client’, we can view the databases on the server as well as the collections within a selected database with the following code:

client.list_database_names()db = client['movies']db.list_collection_names()

The top line will output a list of the database names. The one we are interested in is the ‘movies’ database. When we create a variable with the ‘movies’ database we can then see a list of collections (again think of them as tables). In this case, the collection we are looking for is called animatedFilms.

Now, let’s say we want to count the entries in the database. We can think of this like counting the number of rows in a SQL table. We can do that with this:

#this will get all documents
db.animatedFilms.count_documents({})
#this will get all documents in the year 2001:
db.animatedFilms.count_documents({'year':'2001'})

The empty dictionary in the code above will allow us to access all documents. If we wanted to specify a specific condition, we can put that as key value pairs of a dictionary. Now, let’s say we wanted to see a specific number of documents, in the database. This would be like looking at the head of a Pandas DataFrame. We can do that with this code:

list(db.animatedFilms.find({}).limit(5))

The empty dictionary allows us to look for all documents. When we cast this code as a list, it allows us to view the structure. Keep in mind that some keys may not be present for all documents. That is not the case for the database I have created here.

Upon looking at the webpage, I was able to see that in this list of films, a director who is involved a lot with many of these films is Hayao Miyazaki. We can find out the amount of films he is involved in as a director as well as the title of these films:

directed_by = {'directors': {'$in':['Hayao Miyazaki']}}db.animatedFilms.count_documents(directed_by)cursor = db.animatedFilms.find(directed_by)for doc in cursor:
print(doc['title'])

The first line of code above gives is the criteria that we are searching for. The ‘$in’ function is from a list of query operators that allow us to search for conditional information, for example whether or not that director’s name is present in the director field. The middle line of code gets us the number of documents (in this case number of films) that Hayao Miyazaki is present in. Additionally, the last lines of code are searching for the documents where that condition is true, and finding the movie ‘title’ for each of the documents.

Those query operators, mentioned as the ‘$in’ above, can prove to be quite useful. A list of them can be found here. Let’s say we want to find out the amount of films that generated a gross above a certain point. We can do that with the ‘$gt’ operator (gt meaning greater than):

high_gross = {'gross': {'$gte': 20000000}}db.animatedFilms.count_documents(high_gross)

We can also search for distinct values, allowing us to see what is being represented in the data. For example, let’s say we want to know the specific years that these movies were made. We can do that by looking at the distinct values of this field within the dataset:

count_years = len(db.animatedFilms.distinct('year'))years = sorted(list(db.animatedFilms.distinct('year')))

The first line gets us the total number of distinct years represented and the bottom line will get us a list of all the distinct values present.

Another thing we look for in the data is missing values. In this data, some of the films do not have a gross amount. They are labeled as np.nan values. We can filter for that as well. And for this, we are interested in what particular films these are:

missing_gross = db.animatedFilms.find({'gross': {'$type': 'double'}}, projection={'title':1, '_id':0})for missing in missing_gross:
print(missing)

We are looking for all of the films with a gross value that is of the datatype ‘double’. This is MongoDB’s version of a float. An np.nan value is a float. Additionally, the projection parameter will give us the specific column(s) we are looking for. It is necessary to specify the _id column as 0 if we do NOT want it to come up in the results.

Finally, it is important to work with aggregations when we have numerical data, so that we can gage averages and other statistics. This is how we do this with MongoDB:

list(db.animatedFilms.aggregate([{'$group': {'_id':None, 'runtime': {'$avg':"$runtime"} } }]))

So, as we know from SQL, getting the average of something is an aggregate, which normally involves a group by. To do this in MongoDB, we need to aggregate over a group, if we want all things to be considered, we would set the group to the id column and use the correct operator to get the aggregate value we are searching for.

To sum up, MongoDB is a type of NoSQL database. It has a similar structure to a JSON file involving lots of nested dictionaries. The ways to go about doing what we need to do are tricky in terms of the syntax, but intuitive once we get the hang of it. It is great to know the basics because you never know when you’ll have to query up a NoSQL database.

References

MongoDB — https://www.mongodb.com/

pymongo — https://pymongo.readthedocs.io/en/stable/

operators — https://docs.mongodb.com/manual/reference/operator/query/

--

--