Introduction

Last updated: 2020-05-13

PyNNDB is a NoSQL database library implemented in Python, utilising the LMDB key-value store for its back-end. Although the immediate perception might be that a database written in Python will be a little "slow", it consists of a thin Python layer backed by a powerful engine written in 'C', much like other popular Python database solutions, and as a result it generally compares favourably.

The idea of database logic coded in Python came about after many years of using both SQL and NoSQL databases with Python to write cloud-based microservices, and coming across issues that were difficult to address with the current mainstream solutions. Notably, it became apparent that the performance bottleneck in these applications often ended up being the Python layer between the application and the back-end database, a layer typically added as a compatibility option to allow Python applications to access the database. This raised the question: "what if we pick a faster back-end, and make this Python layer a first-class component rather than an afterthought?"

Design requirements

Unfortunately the modern decision-making process when it comes to picking a database architecture often revolves around "which database system do I know" or "which database system does my cloud platform provider support". If, however, you're in a position to take a more pragmatic approach, there are currently many options to choose from, and a more in-depth investigation can be very beneficial.

If you are working with very well-defined datasets with infrequently changing schemas (say, for example, a bank) then a schema-based database (often SQL-based) may well be the solution you're looking for. If on the other hand you're dealing with very dynamic data that may originate from, or be delivered to, JavaScript or Web-based solutions, the idea of something schemaless (often NoSQL-based) may be more appealing. Certainly the thought of being able to add a database field without having to write an Alembic script and run a database migration on a live environment is reason enough for many to go down the NoSQL route.

So with this in mind, the main drivers behind the PyNNDB library design were;

  • Must perform well in a Python multi-processing environment
  • Must integrate well with Python based microservices
  • Must be easy to deploy in a cloud environment
  • Must be easily testable in a Python environment
  • Must be schemaless
  • Must offer the same basic functionality as mainstream database engines

Features

To an extent the base features available (and the performance) lean heavily on the LMDB storage engine; the 'personality' however is more Pythonic and may be a little alien to users of modern-day SQL engines. On the one hand you have many of the features found in current SQL and NoSQL databases, yet on the other you have some of the granularity found in older databases like "dBase". Some of the current highlights;

  • Schemaless data model
  • The database interface effectively reads and writes Python objects
  • Every record has a unique timestamped primary key
  • Secondary keys are virtual and can be defined on any product of the data
  • Secondary keys can handle both unique and duplicate options
  • The storage engine supports ACID Transactions
  • Multiple processes can read and write to the database concurrently
  • It's Quick!

What does it look like?

At a very basic level everything is controlled through the Manager class, which is simply a collection of 'known' databases. The creation of new objects is transparent, so all you need to do is reference a database, then reference a table within that database, and you will be in a position to write (and then read) data to / from that table. For example;

from pynndb import Manager, Doc
db = Manager().database('database_name', '.database_path')
people = db.table('people')
people.append(Doc({'name': 'Fred Bloggs', 'age': 21}))
people.append(Doc({'name': 'Joe Smith', 'age': 22}))
people.append(Doc({'name': 'John Doe', 'age': 19}))

This will write three records into a table called people, in a database called database_name, which will be stored in a directory called .database_path. There are a number of ways in which this data can subsequently be recovered, but at a basic level;

>>> for doc in people.find():
...     print(f'key={doc.key} data={doc.doc}')
...
key=5ebbd846a732575766a79fde data={'name': 'Fred Bloggs', 'age': 21}
key=5ebbd846a732575766a79fdf data={'name': 'Joe Smith', 'age': 22}
key=5ebbd846a732575766a79fe0 data={'name': 'John Doe', 'age': 19}

Alternatively there are other ways in which you can reference the data; fields are presented both as attributes (with an '_' prefix) and as dictionary items, or indeed the entire record is available as above.


>>> for doc in people.find():
...     print(f'key={doc.key} name={doc._name} age={doc._age}')
...
key=5ebbd846a732575766a79fde name=Fred Bloggs age=21
key=5ebbd846a732575766a79fdf name=Joe Smith age=22
key=5ebbd846a732575766a79fe0 name=John Doe age=19
>>>
>>> for doc in people.find():
...     print(f'key={doc.key} name={doc["name"]} age={doc["age"]}')
...
key=5ebbd846a732575766a79fde name=Fred Bloggs age=21
key=5ebbd846a732575766a79fdf name=Joe Smith age=22
key=5ebbd846a732575766a79fe0 name=John Doe age=19
>>>
>>> for doc in people.find():
...     print('key={key} name={name} age={age}'.format(key=doc.key, **doc.doc))
...
key=5ebbd846a732575766a79fde name=Fred Bloggs age=21
key=5ebbd846a732575766a79fdf name=Joe Smith age=22
key=5ebbd846a732575766a79fe0 name=John Doe age=19

It's maybe worth noting at this point that there is a slight difference here between versions one and two of the library. In the previous version we were reading and writing a raw 'dict' and the primary key was available as the '_id' field within the record. In version 2.x, we are reading and writing a Doc object which has a 'key' attribute to represent the primary key. Technically this approach is slightly slower, but it gives us many more options when moving on to more advanced topics such as replication. (the Doc object supports change management)

Other Key Features

  • Every database operation must take place within an isolated transaction, which ensures that you have consistency within that transaction. If you do not supply a transaction for any given operation, be aware that the library will attempt to create an ad-hoc transaction for you. In a production environment you should be wrapping database access within explicit read or write transactions (see the sketch after this list), noting that write transactions are global and only one may be active in any given process at any given time.
  • Database and table creation is implicit; when you reference a database or table it will be opened if it exists, or created and opened if it does not. You don't need any special-case handling to create databases on a first run, but it's worth specifying all currently required database attributes when opening a database, just in case it needs to be created.
  • Index creation is similar in nature; use 'ensure' to open an index, and if it does not exist it is created and the current data indexed.
  • We now support compression on a table-by-table basis; currently both the ZSTD and Snappy libraries are supported, although adding new libraries should be a relatively trivial process as the current mechanism is designed to be 'pluggable'. In the case of ZSTD, there is an option to create a compression dictionary either explicitly from sample data, or by sampling directly from a pre-existing table, so you can for example store data in a table, run the training routine on that table, then compress the table based on the results of the training pass.
  • The range option can be used to select data from tables using a number of criteria, including duplicate or unique indexes and partial key matches. It can also be used to scan indexes "only", just returning indexing information, which is much faster and often useful in map-reduce type scenarios.
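
As an illustration of the first point, here is a minimal sketch of wrapping two writes in one explicit write transaction. Note that the 'write_transaction' attribute and the 'txn' keyword argument are assumptions for the purpose of illustration; check the API reference for the exact names in your version.

from pynndb import Manager, Doc

db = Manager().database('database_name', '.database_path')
people = db.table('people')

# 'write_transaction' is an illustrative name for the library's explicit
# write-transaction context manager -- consult the API reference
with db.write_transaction as txn:
    # both writes commit (or abort) together as one atomic unit
    people.append(Doc({'name': 'Fred Bloggs', 'age': 21}), txn=txn)
    people.append(Doc({'name': 'Joe Smith', 'age': 22}), txn=txn)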

Primary Keys

Every record that is written into the database is written against a primary key, which by default is auto-generated using the ObjectId function from the BSON package; the structure of these IDs is described in the MongoDB/BSON documentation. Keys have some interesting properties that it's worth being aware of;

  • Essentially these IDs will always be unique, but it's worth noting they also contain the time of creation, so they will always appear in date/time order and can also double as a timestamp.
  • The key itself is not stored in the data component of the record; however, when a record is read from the database, it is made available as the "key" attribute of the Doc object recovered.
  • When you create (append) a new record, following the write operation the Doc object you wrote will have its key attribute updated to reflect the primary key the data was written against. (much the same as MySQL's last_insert_id())

    >>> doc = Doc({'name': 'fred'})
    >>> people.append(doc)
    <Doc object with id=b'5ebbe230a73257581c79944c'>
    >>> print(doc.key)
    5ebbe230a73257581c79944c
    >>>

    Note that internally the database reads and writes keys as 'byte' strings, but the library should always return a Python 'str' type.
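
Since the default keys are BSON ObjectIds, the creation time embedded in a key can be recovered directly using the bson package, for example;

>>> from bson import ObjectId
>>> print(ObjectId('5ebbe230a73257581c79944c').generation_time)
2020-05-13 12:04:00+00:00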

Secondary Keys

Secondary keys are implemented as functions that operate on the data within any given record. The product of this function is then stored as the key of an entry in an index table, against the primary key of the data record we indexed. When we search on an index, we are effectively searching the index table's keys for our search string; on a match, we read the associated primary key for the data record from the index, and use it to recover the data record from the data table.
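
To illustrate the mechanics (this is a sketch using plain dicts, not the library's actual code), an index is essentially a second table mapping each computed value back to the primary key of the record it was derived from;

# illustrative only -- two plain dicts standing in for the LMDB tables
data_table = {b'5ebbd846a732575766a79fde': {'name': 'Fred Bloggs', 'age': 21}}

def index_func(doc):
    return f'{doc["age"]:>3}|{doc["name"]}'.encode()

# the index maps each computed value to the primary key of its data record
index_table = {index_func(doc): pk for pk, doc in data_table.items()}

# searching the index for ' 21|Fred Bloggs' yields the primary key,
# which is then used to recover the record from the data table
pk = index_table[b' 21|Fred Bloggs']
print(data_table[pk])     # {'name': 'Fred Bloggs', 'age': 21}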

Index functions can take one of two forms;

  • A Python 'f' format string
  • A complete Python function

In many cases format strings will suffice; they are relatively easy to understand and manage, and very efficient. You should only resort to the second mechanism when you can't do what is required via format strings (for example, indexing on multiple sub-attributes). So, based on our previous examples;

>>> for doc in people.find():
...     print(doc.doc)
...
{'name': 'Fred Bloggs', 'age': 21}
{'name': 'Joe Smith', 'age': 22}
{'name': 'John Doe', 'age': 19}
{'name': 'fred'}
>>> people.ensure('by_age_name', '{age:>3}|{name}')
>>> for doc in people.find('by_age_name'):
...     print(doc.doc)
...
{'name': 'John Doe', 'age': 19}
{'name': 'Fred Bloggs', 'age': 21}
{'name': 'Joe Smith', 'age': 22}

Note that in this example we have an entry with no 'age' field; as a result it will not be included in an index that relies on that field being present. The format string we provide is literally used to generate an index entry, so we can use format strings to form many types of compound indexes. Note that when indexing on numerical fields you should be careful to use format strings that right-justify numbers and/or insert leading zeros.
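
The justification matters because index entries are compared as strings, so unpadded numbers sort lexicographically rather than numerically; a quick illustration;

>>> sorted(f'{age}' for age in [9, 21, 100])
['100', '21', '9']
>>> sorted(f'{age:>3}' for age in [9, 21, 100])
['  9', ' 21', '100']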

Full function indexes can do more interesting things. As an example, consider a table containing emails, where each email has a tags field which is a list of tags that apply to that email, say for example ['important', 'business']. You can generate an index such that it can recover results based on which tags are associated with which emails.

from pynndb import Manager, Doc
db = Manager().database('.database', '.database')
email = db.table('email')
email.append(Doc({'from': 'fred', 'Subject': 'Test', 'tags': ['important', 'personal']}))
email.append(Doc({'from': 'tom', 'Subject': 'Test', 'tags': ['important', 'business']}))
email.append(Doc({'from': 'harry', 'Subject': 'Test', 'tags': ['important', 'none']}))
email.append(Doc({'from': 'bill', 'Subject': 'Test', 'tags': ['none']}))
email.ensure('by_tags', 'def func(doc): return [tag.encode() for tag in doc["tags"]]', duplicates=True)
#
#   Show all records using the new index
#
>>> for doc in email.find('by_tags'):
...     print(doc.doc)
...
{'from': 'tom', 'Subject': 'Test', 'tags': ['important', 'business']}
{'from': 'fred', 'Subject': 'Test', 'tags': ['important', 'personal']}
{'from': 'tom', 'Subject': 'Test', 'tags': ['important', 'business']}
{'from': 'harry', 'Subject': 'Test', 'tags': ['important', 'none']}
{'from': 'harry', 'Subject': 'Test', 'tags': ['important', 'none']}
{'from': 'bill', 'Subject': 'Test', 'tags': ['none']}
{'from': 'fred', 'Subject': 'Test', 'tags': ['important', 'personal']}
#
#   Show all records tagged as personal
#
>>> for doc in email.seek('by_tags', Doc({'tags': ['personal']})):
...     print(doc._from, doc._tags)
...
fred ['important', 'personal']
#
#   Show all records tagged as 'personal' or 'none'
#
>>> for doc in email.seek('by_tags', Doc({'tags': ['personal', 'none']})):
...     print(doc._from, doc._tags)
...
fred ['important', 'personal']
harry ['important', 'none']
bill ['none']

If you do go down this route, it's worth making sure your function works correctly in a local Python shell before committing it to "ensure"; specifically, it needs to return a list of zero or more bytestrings, where each bytestring will be treated as a distinct index entry. (so make sure you ".encode()" any strings before you add them to your results list) Note that this harks back to the internals of the library working with bytes vs the API working with text strings; this mechanism is inserting a Python function into the internals, hence it gets to work in 'bytes' ...
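
For example, to sanity-check the tag-indexing function used above before committing it to "ensure", you can run it against a sample record in the shell;

>>> def func(doc):
...     return [tag.encode() for tag in doc['tags']]
...
>>> func({'from': 'fred', 'Subject': 'Test', 'tags': ['important', 'personal']})
[b'important', b'personal']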

Installation

Please note that we only support Linux and Unix-based systems; if you use any other operating system we recommend you work on Linux within a virtual machine. The recommended way to install the pynndb package is either via pip or pipenv, within a virtual environment.


To install from source use;
mkvirtualenv -p python3 myproj
git clone https://gitlab.com/oddjobz/pynndb2.git
cd pynndb2
pipenv install
python setup.py install
python3 -c 'import pynndb'

For pip use;
mkvirtualenv -p python3 myproj
pip install pynndb2
python3 -c 'import pynndb'

For pipenv use;
pipenv shell
pipenv install pynndb2
python3 -c 'import pynndb'

Examples

Some of these examples are very basic; however, some are actually quite complex and potentially useful in their own right. As more code becomes available, more will be added here, so watch this space!

  • bulk-import - this is an example of a bulk import routine that is capable of utilising lots of CPU cores with a view to getting CSV-style data into a database. It's been tested at ~250k records per second on a 12-core workstation
  • people - this is a tiny snippet showing how various basic database functions work, including adding and searching data in a very basic form. It's not useful in its own right beyond some basic examples of how to do things
  • benchmark - this is a code snippet inherited from version #1 which simply dumps a chunk of repeating data into a database table, then reads it back, measuring time taken with and without indexes

Database Shell

This needs to be improved (greatly) for version 2, but for now this is a simple tutorial that takes you through the basics. Essentially it gives you access to databases, their structures, and some rudimentary data recovery, but this is more from a management perspective; it is not designed to be as complete as an SQL shell.

How to use the CLI

First you will need to register your database with the shell; this is done by creating a link to your DB like this.

PyNNDB Command Line Interface v1.0.9
pynndb> register /forum/forums forums
pynndb> use forums
forums> show tables
┌────────────┬────────┬───────┬────────┬─────────────┐
│ Table name │ # Recs │ Depth │ Oflow% │ Index names │
├────────────┼────────┼───────┼────────┼─────────────┤
│ boards     │     24 │     2 │      0 │             │
│ categories │      3 │     1 │      0 │             │
│ members    │   5391 │     3 │      0 │             │
│ messages   │  48193 │     3 │     39 │             │
│ topics     │   4176 │     3 │      0 │             │
└────────────┴────────┴───────┴────────┴─────────────┘

In this case I'm using an import of a MySQL-based SMF forum database, just to get some data to play with. So, next let's have a look at the structure of the data in one of the tables. Because this is JSON there is no schema as such, but typically rows have relatively consistent attributes, so we read a sample of the database and list the most common fields.

forums> explain boards
┌───────────────────────────┬─────────────┬─────────────────────────────────────────────────────────────────────────┐
│ Field name                │ Field Types │ Sample                                                                  │
├───────────────────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────┤
│ id_board                  │ ['int']     │                                                                     204 │
│ id_cat                    │ ['int']     │                                                                      22 │
│ child_level               │ ['int']     │                                                                         │
│ id_parent                 │ ['int']     │                                                                         │
│ board_order               │ ['int']     │                                                                      10 │
│ id_last_msg               │ ['int']     │                                                                  110106 │
│ id_msg_updated            │ ['int']     │                                                                  110106 │
│ member_groups             │ ['str']     │ -1,0,2                                                                  │
│ id_profile                │ ['int']     │                                                                       1 │
│ name                      │ ['str']     │ Android devices                                                         │
│ description               │ ['str']     │ Support queries relating to Android tablets, phones, and media players. │
│ num_topics                │ ['int']     │                                                                      22 │
│ num_posts                 │ ['int']     │                                                                     245 │
│ count_posts               │ ['int']     │                                                                         │
│ id_theme                  │ ['int']     │                                                                         │
│ override_theme            │ ['int']     │                                                                         │
│ unapproved_posts          │ ['int']     │                                                                         │
│ unapproved_topics         │ ['int']     │                                                                         │
│ redirect                  │ ['str']     │                                                                         │
│ is_blog                   │ ['int']     │                                                                       1 │
│ blog_alias                │ ['str']     │ blogs_162                                                               │
│ facebooklike_board_enable │ ['int']     │                                                                       1 │
└───────────────────────────┴─────────────┴─────────────────────────────────────────────────────────────────────────┘

Next, let's list some records from that table. By default we're limited to 10 records, but you can change this with;

set limit <n>

So to recover records, it's "find table {fields}"; auto-complete should be available in most instances.

forums> find boards _id id_board name num_posts
┌─────────────────────────────┬──────────┬──────────────────────────────────────────┬───────────┐
│ _id                         │ id_board │ name                                     │ num_posts │
├─────────────────────────────┼──────────┼──────────────────────────────────────────┼───────────┤
│ 5cbcc8209d1a2a059e6c232e    │        1 │ General Discussion                       │      9905 │
│ 5cbcc8209d1a2a059e6c232f    │       52 │ News and Events                          │      1349 │
│ 5cbcc8209d1a2a059e6c2330    │      102 │ Frequently Asked Questions               │        29 │
│ 5cbcc8209d1a2a059e6c2331    │      162 │ Blogs                                    │       189 │
│ 5cbcc8209d1a2a059e6c2332    │      142 │ Linux Tips & Tricks                      │       723 │
│ 5cbcc8209d1a2a059e6c2333    │      172 │ The Penguins second hand Linux Emporium  │       139 │
│ 5cbcc8209d1a2a059e6c2334    │      182 │ Jobs Offered                             │       149 │
│ 5cbcc8209d1a2a059e6c2335    │      192 │ Jobs Wanted                              │        23 │
│ 5cbcc8209d1a2a059e6c2336    │       22 │ General Help & Advice                    │     19680 │
│ 5cbcc8209d1a2a059e6c2337    │      204 │ Android devices                          │       245 │
└─────────────────────────────┴──────────┴──────────────────────────────────────────┴───────────┘
Displayed 10 records in 0.0003s (Limited view) 28571/sec

If you want the whole story, you can dump records by adding "-d", which will show the expanded JSON rather than a pretty printed table. You can get more information with "help" from the command prompt.

forums> find boards -d -l 1
{
    "_id": "5cbcc8209d1a2a059e6c232e",
    "blog_alias": "",
    "board_order": 1,
    "child_level": 0,
    "count_posts": 0,
    "description": "Feel free to talk about anything and everything in this board.",
    "facebooklike_board_enable": 1,
    "id_board": 1,
    "id_cat": 1,
    "id_last_msg": 110444,
    "id_msg_updated": 110444,
    "id_parent": 0,
    "id_profile": 1,
    "id_theme": 0,
    "is_blog": 0,
    "member_groups": "-1,0,2",
    "name": "General Discussion",
    "num_posts": 9905,
    "num_topics": 1003,
    "override_theme": 0,
    "redirect": "",
    "unapproved_posts": 0,
    "unapproved_topics": 0
}

Other useful commands can be found via "help", in particular;

show databases
show tables
explain <table>
find <table> {fields}
analyse <table>

The analyse command is useful for seeing the distribution of record sizes; it attempts to print a histogram and a count of records falling into different size groupings.

forums> analyse boards

Breakdown of record size distribution
###############################################################################
                                                                         0  29
                                                                         0  59
                                                                         0  88
                                                                         0  118
                                                                         0  148
                                                                         0  177
                                                                         0  207
                                                                         0  237
                                                                         0  266
                                                                         0  296
                                                                         0  326
                                                                         0  355
                                                                         0  385
                                                                         0  415
██████████████████████████████████████                                   6  444
██████████████████████████████████████████████████████████████████████  11  474
█████████████████████████                                                4  504
                                                                         0  533
                                                                         0  563
████████████                                                             2  592
██████                                                                   1  622

Tutorials

Will appear here ...

Replication

Will appear here ...

Resources

Will appear here ...

FAQ

Will appear here ...