[{"title": "Benchmarks for BERT Large Question Answering inference for RedisAI and RedisGears", "url": "https://applied-knowledge.systems/docs/bert-qa-benchmarking/", "body": "
This article will explore the challenges and opportunities of deploying a large BERT Question Answering Transformer model (bert-large-uncased-whole-word-masking-finetuned-squad) from Hugging Face, where RedisGears and RedisAI perform the heavy lifting while leveraging the in-memory datastore Redis.
\nSome numbers for inspiration and why to read this article:
\npython3 transformers_plain_bert_qa.py \nairborne transmission of respiratory infections is the lack of established methods for the detection of airborne respiratory microorganisms\n10.351818372 seconds\n
\ntime curl -i -H "Content-Type: application/json" -X POST -d '{"search":"Who performs viral transmission among adults"}' http://localhost:8080/qasearch\n\nreal\t0m0.747s\nuser\t0m0.004s\nsys\t0m0.000s\n\n
\nBERT Question Answering inference works by having the ML model select an answer from a given text. In other words, BERT QA "thinks" through the following: "What is the answer to the question, assuming the answer exists within the selected paragraph?"
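\nAs a baseline for the timing numbers above, here is a minimal sketch of plain BERT QA inference with the Hugging Face transformers pipeline (an assumption on my side; the repository's transformers_plain_bert_qa.py is the actual baseline script):

from transformers import pipeline

# Plain, un-accelerated baseline: load the large SQuAD-finetuned BERT and
# answer one question against one context paragraph.
qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)
result = qa(
    question="Who performs viral transmission among adults?",
    context="One of the biggest gaps in understanding airborne transmission of "
            "respiratory infections is the lack of established methods for the "
            "detection of airborne respiratory microorganisms.",
)
print(result["answer"], result["score"])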
\nSo it's important to select text potentially containing an answer. A typical pattern is to use Wikipedia data to build Open Domain Question Answering.
\nOur QA system is a medical domain-specific question/answering pipeline, hence we need a first pipeline that turns data into a knowledge graph. This NLP pipeline is available at Redis LaunchPad, is fully open source, and is described in a previous article. Here is a 5 minute video describing it, and below you will find an architectural overview:
\nIn the BERT QA pipeline (or in any other modern NLP inference task), there are two steps:
\nWith Redis, we have the opportunity to pre-compute everything and store it in memory, but how do we do it? Unlike with the summarization ML learning task, the question is not known in advance, so we can't pre-compute all possible answers. However, we can pre-tokenize all potential answers (i.e. all paragraphs in the dataset) using RedisGears:
\ndef parse_sentence(record):\n import redisAI\n import numpy as np\n global tokenizer\n if not tokenizer:\n tokenizer=loadTokeniser()\n hash_tag="{%s}" % hashtag()\n\n for idx, value in sorted(record['value'].items(), key=lambda item: int(item[0])):\n tokens = tokenizer.encode(value, add_special_tokens=False, max_length=511, truncation=True, return_tensors="np")\n tokens = np.append(tokens,tokenizer.sep_token_id).astype(np.int64)\n tensor=redisAI.createTensorFromBlob('INT64', tokens.shape, tokens.tobytes())\n\n key_prefix='sentence:'\n sentence_key=remove_prefix(record['key'],key_prefix)\n token_key = f"tokenized:bert:qa:{sentence_key}:{idx}"\n redisAI.setTensorInKey(token_key, tensor)\n execute('SADD',f'processed_docs_stage3_tokenized{hash_tag}', token_key)\n\n
\nSee the full code on GitHub.
\nThen, for each Redis Cluster shard, we pre-load the BERT QA model by downloading it, exporting it to TorchScript, and then loading it into each shard:
\ndef load_bert():\n model_file = 'traced_bert_qa.pt'\n\n with open(model_file, 'rb') as f:\n model = f.read()\n startup_nodes = [{"host": "127.0.0.1", "port": "30001"}, {"host": "127.0.0.1", "port":"30002"}, {"host":"127.0.0.1", "port":"30003"}]\n cc = ClusterClient(startup_nodes = startup_nodes)\n hash_tags = cc.execute_command("RG.PYEXECUTE", "gb = GB('ShardsIDReader').map(lambda x:hashtag()).run()")[0]\n print(hash_tags)\n for hash_tag in hash_tags:\n print("Loading model bert-qa{%s}" %hash_tag.decode('utf-8'))\n cc.modelset('bert-qa{%s}' %hash_tag.decode('utf-8'), 'TORCH', 'CPU', model)\n print(cc.infoget('bert-qa{%s}' %hash_tag.decode('utf-8')))\n
\nThe full code is available on GitHub.
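\nFor reference, here is a hedged sketch of the export step mentioned above (producing traced_bert_qa.pt); the tracing inputs and sequence length are assumptions, and the real export script lives in the repository:

import torch
from transformers import BertForQuestionAnswering

# torchscript=True makes the model return tuples so it can be traced
model = BertForQuestionAnswering.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad", torchscript=True)
model.eval()

# dummy input_ids, attention_mask and token_type_ids with an assumed length of 384
dummy = tuple(torch.ones(1, 384, dtype=torch.long) for _ in range(3))
traced = torch.jit.trace(model, dummy)
torch.jit.save(traced, "traced_bert_qa.pt")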
\nAnd when a question comes from the user, we tokenize and append the question to the list of potential answers before running the RedisAI model:
\n token_key = f"tokenized:bert:qa:{sentence_key}"\n # encode question\n input_ids_question = tokenizer.encode(question, add_special_tokens=True, truncation=True, return_tensors="np")\n t=redisAI.getTensorFromKey(token_key)\n input_ids_context=to_np(t,np.int64)\n # merge (append) with potential answer, context - is pre-tokenized paragraph\n input_ids = np.append(input_ids_question,input_ids_context)\n attention_mask = np.array([[1]*len(input_ids)])\n input_idss=np.array([input_ids])\n num_seg_a=input_ids_question.shape[1]\n num_seg_b=input_ids_context.shape[0]\n token_type_ids = np.array([0]*num_seg_a + [1]*num_seg_b)\n # create actual model runner for RedisAI\n modelRunner = redisAI.createModelRunner(f'bert-qa{hash_tag}')\n # make sure all types are correct\n input_idss_ts=redisAI.createTensorFromBlob('INT64', input_idss.shape, input_idss.tobytes())\n attention_mask_ts=redisAI.createTensorFromBlob('INT64', attention_mask.shape, attention_mask.tobytes())\n token_type_ids_ts=redisAI.createTensorFromBlob('INT64', token_type_ids.shape, token_type_ids.tobytes())\n redisAI.modelRunnerAddInput(modelRunner, 'input_ids', input_idss_ts)\n redisAI.modelRunnerAddInput(modelRunner, 'attention_mask', attention_mask_ts)\n redisAI.modelRunnerAddInput(modelRunner, 'token_type_ids', token_type_ids_ts)\n redisAI.modelRunnerAddOutput(modelRunner, 'answer_start_scores')\n redisAI.modelRunnerAddOutput(modelRunner, 'answer_end_scores')\n # run RedisAI model runner\n res = await redisAI.modelRunnerRunAsync(modelRunner)\n answer_start_scores=to_np(res[0],np.float32)\n answer_end_scores = to_np(res[1],np.float32)\n answer_start = np.argmax(answer_start_scores)\n answer_end = np.argmax(answer_end_scores) + 1\n answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end],skip_special_tokens = True))\n log("Answer "+str(answer))\n return answer\n\n
\nCheck out the full code, available on GitHub.
\nThe process for making a BERT QA API call looks like this:
\nHere I use two cool features of RedisGears: capturing events on key miss, and using async/await to run RedisAI on each shard without locking the main thread, so that the Redis Cluster can continue to serve other customers. For the benchmarks, caching of responses from RedisAI is disabled. If you are getting response times in nanoseconds on the second call rather than milliseconds, check to make sure the line linked above is commented out.
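\nTo make the event-driven shape concrete, here is a hedged sketch of a key-miss trigger (run_qa_for_key is a hypothetical helper wrapping the redisAI calls shown above; the real trigger code is in the repository):

def qa_on_keymiss(record):
    # record['key'] looks like "bertqa{<hashtag>}_<sentence key>_<question>"
    answer = run_qa_for_key(record['key'])
    # Caching the answer back into Redis is intentionally disabled for the benchmark.
    log("Answer " + str(answer))

GB().foreach(qa_on_keymiss).register(prefix='bertqa*',
                                     eventTypes=['keymiss'],
                                     mode="async_local")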
\nPre-requisites for running the benchmark:
\nAssuming you are running Debian or Ubuntu and have Docker and docker-compose installed (or can create a virtual environment via conda), run the following commands:
\ngit clone --recurse-submodules https://github.com/applied-knowledge-systems/the-pattern.git\ncd the-pattern\n./bootstrap_benchmark.sh\n
\nThe above commands should end with a curl call to the qasearch API, since Redis caching is disabled for the benchmark.
\nNext, invoke curl like this:
\ntime curl -i -H "Content-Type: application/json" -X POST -d '{"search":"Who performs viral transmission among adults"}' http://localhost:8080/qasearch\n
\nExpect the following output, or something similar based on your runtime environment:
\nHTTP/1.1 200 OK\nServer: nginx/1.18.0 (Ubuntu)\nDate: Sun, 29 May 2022 12:05:39 GMT\nContent-Type: application/json\nContent-Length: 2120\nConnection: keep-alive\n\n{"links":[{"created_at":"2002","rank":13,"source":"C0001486","target":"C0152083"}],"results":[{"answer":"adenovirus","sentence":"The medium of 40 T150 flasks of adenovirus transducer dec CAR CHO cells yielded 0 5 1 my of purified msCEACAM1a 1 4 protein","sentencekey":"sentence:PMC125375.xml:{mG}:202","title":"Crystal structure of murine sCEACAM1a[1,4]: a coronavirus receptor in the CEA family"}] OUTPUT_REDUCTED}\n
\nI modified the output of the API for the benchmark to return results from all shards - even if the answer is empty. In the run above five shards return answers. The overall API call response takes less than one second with all additional hops to search in RedisGraph!
\nLet's dig deeper into what's happening under the hood:
\nYou should have a sentence key with shard id, which you get by looking at the "Cache key" from docker logs -f rgcluster
. In my setup the cache key is, "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults". If you think it looks like a function call it's because it is a function call. It is triggered if the key isn't present in the Redis Cluster, which for the benchmark will be every time since if you remember we disabled caching the output.
One more thing to figure out from the logs is the port of the shard corresponding to the hashtag, also known as the shard id. The hashtag is the text found between the curly brackets, such as {6fd} above. The same hashtag appears in the output of the export_load script. In my case the cache key was found in "30012.log", so my port is 30012.
Next, I run the following command:
\nredis-cli -c -p 30012 -h 127.0.0.1 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"\n
\nand then run the benchmark:
\nredis-benchmark -p 30012 -h 127.0.0.1 -n 10 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"\n====== get bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults ======\n 10 requests completed in 0.04 seconds\n 50 parallel clients\n 3 bytes payload\n keep alive: 1\n\n10.00% <= 41 milliseconds\n100.00% <= 41 milliseconds\n238.10 requests per second\n
\nIf you are wondering, -n
is the number of requests; in this case we run the benchmark with 10 requests. You can also add:
--csv
if you want the output in CSV format
--precision 3
if you want more decimal places in the millisecond values
More information about the benchmarking tool can be found on the redis.io Benchmarks page.
\nIf you don't have the Redis utilities (redis-cli and redis-benchmark) installed locally, you can use Docker as follows:
\ndocker exec -it rgcluster /bin/bash\nredis-benchmark -p 30012 -h 127.0.0.1 -n 10 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"\n====== get bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults ======\n 10 requests completed in 1.75 seconds\n 50 parallel clients\n 99 bytes payload\n keep alive: 1\n host configuration "save":\n host configuration "appendonly": no\n multi-thread: no\n\nLatency by percentile distribution:\n0.000% <= 243.711 milliseconds (cumulative count 1)\n50.000% <= 987.135 milliseconds (cumulative count 5)\n75.000% <= 1577.983 milliseconds (cumulative count 8)\n87.500% <= 1662.975 milliseconds (cumulative count 9)\n93.750% <= 1744.895 milliseconds (cumulative count 10)\n100.000% <= 1744.895 milliseconds (cumulative count 10)\n\nCumulative distribution of latencies:\n0.000% <= 0.103 milliseconds (cumulative count 0)\n10.000% <= 244.223 milliseconds (cumulative count 1)\n20.000% <= 409.343 milliseconds (cumulative count 2)\n30.000% <= 575.487 milliseconds (cumulative count 3)\n40.000% <= 821.247 milliseconds (cumulative count 4)\n50.000% <= 987.135 milliseconds (cumulative count 5)\n60.000% <= 1157.119 milliseconds (cumulative count 6)\n70.000% <= 1497.087 milliseconds (cumulative count 7)\n80.000% <= 1577.983 milliseconds (cumulative count 8)\n90.000% <= 1662.975 milliseconds (cumulative count 9)\n100.000% <= 1744.895 milliseconds (cumulative count 10)\n\nSummary:\n throughput summary: 5.73 requests per second\n latency summary (msec):\n avg min p50 p95 p99 max\n 1067.296 243.584 987.135 1744.895 1744.895 1744.895\n
\nThe platform only has 20 articles and 8 Redis nodes (4 masters + 4 slaves), so relevance will not be representative, and it doesn't need a lot of memory.
\nNow let's check how long our RedisAI model runs on the {6fd}
shard:
127.0.0.1:30012> AI.INFO bert-qa{6fd}\n 1) "key"\n 2) "bert-qa{6fd}"\n 3) "type"\n 4) "MODEL"\n 5) "backend"\n 6) "TORCH"\n 7) "device"\n 8) "CPU"\n 9) "tag"\n10) ""\n11) "duration"\n12) (integer) 8928136\n13) "samples"\n14) (integer) 58\n15) "calls"\n16) (integer) 58\n17) "errors"\n18) (integer) 0\n\n
\nbert-qa{6fd}
is the key of the actual (very large) model saved. The AI.INFO
command gives us a cumulative duration of 8928136 microseconds over 58 calls: 8928136 / 58 is roughly 153,900 microseconds, or approximately 154 milliseconds per call.
Let's double-check that by resetting the stats and then re-running the benchmark.
\nFirst, reset the stats:
\n127.0.0.1:30012> AI.INFO bert-qa{6fd} RESETSTAT\nOK\n127.0.0.1:30012> AI.INFO bert-qa{6fd}\n 1) "key"\n 2) "bert-qa{6fd}"\n 3) "type"\n 4) "MODEL"\n 5) "backend"\n 6) "TORCH"\n 7) "device"\n 8) "CPU"\n 9) "tag"\n10) ""\n11) "duration"\n12) (integer) 0\n13) "samples"\n14) (integer) 0\n15) "calls"\n16) (integer) 0\n17) "errors"\n18) (integer) 0\n
\nThen, re-run the benchmark:
\nredis-benchmark -p 30012 -h 127.0.0.1 -n 10 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"\n====== get bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults ======\n 10 requests completed in 1.78 seconds\n 50 parallel clients\n 99 bytes payload\n keep alive: 1\n host configuration "save":\n host configuration "appendonly": no\n multi-thread: no\n\nLatency by percentile distribution:\n0.000% <= 188.927 milliseconds (cumulative count 1)\n50.000% <= 995.839 milliseconds (cumulative count 5)\n75.000% <= 1606.655 milliseconds (cumulative count 8)\n87.500% <= 1692.671 milliseconds (cumulative count 9)\n93.750% <= 1779.711 milliseconds (cumulative count 10)\n100.000% <= 1779.711 milliseconds (cumulative count 10)\n\nCumulative distribution of latencies:\n0.000% <= 0.103 milliseconds (cumulative count 0)\n10.000% <= 189.183 milliseconds (cumulative count 1)\n20.000% <= 392.191 milliseconds (cumulative count 2)\n30.000% <= 540.159 milliseconds (cumulative count 3)\n40.000% <= 896.511 milliseconds (cumulative count 4)\n50.000% <= 996.351 milliseconds (cumulative count 5)\n60.000% <= 1260.543 milliseconds (cumulative count 6)\n70.000% <= 1456.127 milliseconds (cumulative count 7)\n80.000% <= 1606.655 milliseconds (cumulative count 8)\n90.000% <= 1692.671 milliseconds (cumulative count 9)\n100.000% <= 1779.711 milliseconds (cumulative count 10)\n\nSummary:\n throughput summary: 5.62 requests per second\n latency summary (msec):\n avg min p50 p95 p99 max\n 1080.454 188.800 995.839 1779.711 1779.711 1779.711\n
\nNow check the stats again:
\nAI.INFO bert-qa{6fd}\n 1) "key"\n 2) "bert-qa{6fd}"\n 3) "type"\n 4) "MODEL"\n 5) "backend"\n 6) "TORCH"\n 7) "device"\n 8) "CPU"\n 9) "tag"\n10) ""\n11) "duration"\n12) (integer) 1767749\n13) "samples"\n14) (integer) 20\n15) "calls"\n16) (integer) 20\n17) "errors"\n18) (integer) 0\n
\nNow we get 1767749 / 20 = 88,387.45 microseconds per call, or about 0.088 seconds, which is pretty fast! Also, considering we started with 10 seconds per call, I think the benefits of using RedisAI in combination with RedisGears are pretty obvious. However, the trade-off is high memory usage.
\nThere are many ways to optimize this deployment. For example, you can add FP16 quantization and the ONNX runtime. If you would like to try that, this script will be a good starting point.
\nThanks to the contribution of Mikhail Volkov, we can now observe RedisGears and RedisGraph throughput and memory consumption using Grafana. When you cloned the repository, it started a Grafana Docker container with pre-built templates for monitoring the RedisCluster (including RedisGears and RedisAI) and the Graph node, which is Redis with RedisGraph. The "The Pattern" dashboard provides an overview, with all the key benchmark metrics you care about:
\nThis post is in collaboration with Redis.
\n" },{"title": "Building a Pipeline for Natural Language Processing using RedisGears", "url": "https://applied-knowledge.systems/docs/nlp/", "body": "Disclaimer originally published in collaboration with Ajeet Raina on Developer.Redis.Com
\nIn this tutorial, you will learn how to build a pipeline for Natural Language Processing (NLP) using RedisGears. For this demonstration, we will be leveraging the Kaggle CORD-19 dataset. The implementation is designed to avoid running out of memory by leveraging Redis Cluster and RedisGears: RedisGears allows processing data on storage, without the need to move data in and out of the Redis Cluster, using Redis Cluster as a data fabric. Redis Cluster allows horizontal scalability up to 1000 nodes and, together with RedisGears, provides a distributed system where data science/ML engineers can focus on processing steps, without worrying about writing tons of scaffolding for distributed calculations.
\nThis project was built with the aim to make it easier for other people to contribute and build better information and knowledge management products.
\nRedisGears has enormous potential, particularly for text processing: you can process your data "on data" without needing to move it in and out of memory. A summary of the important points:
\nToday, we live in a world of new systems that operate not just on files, folders, or web pages, but on entities with their properties and relationships between them, organized into hierarchies of classes and categories. These systems are used everywhere, from the military-industrial complex to our everyday lives. Palantir, Primer, and other data companies enable massive intelligence and counterintelligence projects in military and security forces; Quid and RecordedFuture enable competitive analytics; Bottlenose and similar enterprises enable online reputation analytics. Microsoft Graph enables new kinds of productivity apps for enterprises, Google Knowledge Graph and Microsoft's Satori enable everyday search queries, and together with Amazon Information Graph they power the corresponding AI assistants by enabling them to answer questions about world facts.
\nAll these (and many other more specialized) systems are used in different domains, but all of them use Knowledge Graphs as their foundation.
\nKnowledge graphs are one of the best ways to connect and make sense out of information from different data sources, following the motto of one of the vendors— “It’s about things not strings”.
\nA Knowledge Graph consists of a thesaurus, a taxonomy, and an ontology. In this pipeline I assume knowledge is captured in the medical metathesaurus UMLS, and that concepts in text are related if they are part of the same sentence; therefore a concept becomes a node and their relationship becomes an edge:
\n\n
Concepts have CUIs (Concept Unique Identifiers), and those will be the primary keys in nodes, linked to the UMLS thesaurus. For example, if you search "How does temperature and humidity affect the transmission of 2019-nCoV?" on the demo website http://thepattern.digital/ and move the slider to 1996, there is an edge connecting transmission (C5190195) and birth (C5195639), and the matched part of the sentence, "the rate of transmission to an infant born to," from the report titled "Afebrile Pneumonia in infants."
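\nAs a hedged illustration, such an edge can be inspected with a RedisGraph query via redis-py (the entity label and graph name cord19medical come from the queries later in this article; the port is the tutorial's entry point):

import redis

r = redis.Redis(host="127.0.0.1", port=9001)
# Fetch a handful of concept-to-concept edges from the graph.
q = "MATCH (a:entity)-[rel]->(b:entity) RETURN a, rel, b LIMIT 5"
print(r.execute_command("GRAPH.QUERY", "cord19medical", q))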
\nThe intake step is very simple: put all JSON records into the Redis Cluster, then the NLP pipeline starts processing all records; the code is here.
\nFor each record: detect the language and discard non-English records (a filter step; a sketch follows after this list)
\nMap paragraphs into sentences (a flatmap step)
\nSpellcheck sentences (a map step)
\nSave sentences into a hash (a foreach/processor step)
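\nA hedged sketch of the filter step (the real filter_language lives in the repository; langdetect here is an assumption about the language-detection library):

def filter_language(record):
    # keep only English records; anything that fails detection is discarded
    from langdetect import detect
    try:
        return detect(record['value']) == 'en'
    except Exception:
        return False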
\nEnsure that virtualenv is installed on your system
\n git clone --recurse-submodules https://github.com/applied-knowledge-systems/the-pattern.git\n cd the-pattern\n
\n docker-compose -f docker-compose.dev.yml up --build -d\n
\nYou can deploy PyTorch and spacy to run on RedisGears.
\n bash post_start_dev.sh\n
\nFor a data-science-focused deployment, the Redis Cluster should be in HA mode with at least one slave for each master. You need to change a few default parameters for rgcluster to accommodate the size of the PyTorch and spacy libraries (each over 1 GB zipped); see the gist with settings.
\n cd ./the-pattern-platform/\n
\nYou can create it via
\n conda create -n pattern_env python=3.8\n
\nAlternatively, you can activate an existing virtual environment and install the requirements:
\n source ~/venv_cord19/bin/activate #or create new venv\n pip install -r requirements.txt\n
\n bash cluster_pipeline.sh\n
\nWait for a bit and then check:
\n redis-cli -p 9001 -h 127.0.0.1 GRAPH.QUERY cord19medical "MATCH (n:entity) RETURN count(n) as entity_count" \n redis-cli -p 9001 -h 127.0.0.1 GRAPH.QUERY cord19medical "MATCH (e:entity)-[r]->(t:entity) RETURN count(r) as edge_count"\n
\n curl -i -H "Content-Type: application/json" -X POST -d '{"search":"How does temperature and humidity affect the transmission of 2019-nCoV"}' \n http://localhost:8080/gsearch\n
\nWhile RedisGears allows you to deploy and run machine learning libraries like spacy and BERT transformers, the solution above uses a simpler approach:
\n gb = GB('KeysReader')\n gb.filter(filter_language)\n gb.flatmap(parse_paragraphs)\n gb.map(spellcheck_sentences)\n gb.foreach(save_sentences)\n gb.count()\n gb.register('paragraphs:*',keyTypes=['string','hash'], mode="async_local")\n
\nThis is the overall pipeline: those 7 lines allow you to run logic in a distributed cluster or on a single machine using all available CPUs, and no changes are required until you need to scale beyond 1000 nodes. I use KeysReader registered for the paragraphs namespace, for all strings or hashes. My pipeline needs to run in async mode. For data scientists, I would recommend using gb.run first to make sure the gears function works; it will run in batch mode, and then you can change it to register to capture new data. By default, functions return their output, hence the need for count(): it prevents fetching the whole dataset back to the command-issuing machine (90 GB for CORD-19).
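\nFor example, a hedged debugging sketch of the same pipeline in batch mode (gb.run over existing keys) before switching back to gb.register:

gb = GB('KeysReader')
gb.filter(filter_language)
gb.flatmap(parse_paragraphs)
gb.map(spellcheck_sentences)
gb.foreach(save_sentences)
gb.count()
gb.run('paragraphs:*')   # batch over existing keys; register(...) captures new ones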
\nOverall, pre-processing is straightforward; the full code is here.
\nThings to keep in mind:
\n execute('SADD','processed_docs_{%s}' % hashtag(),article_id)\n
\n """\n load symspell and relevant dictionaries\n """\n sym_spell=None \n\n def load_symspell():\n import pkg_resources\n from symspellpy import SymSpell, Verbosity\n sym_spell = SymSpell(max_dictionary_edit_distance=1, prefix_length=7)\n dictionary_path = pkg_resources.resource_filename(\n "symspellpy", "frequency_dictionary_en_82_765.txt")\n bigram_path = pkg_resources.resource_filename(\n "symspellpy", "frequency_bigramdictionary_en_243_342.txt")\n # term_index is the column of the term and count_index is the\n # column of the term frequency\n sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)\n sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)\n return sym_spell\n
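\nAnd here is a hedged usage sketch of how the loaded SymSpell object might be applied in the spellcheck step (the real spellcheck_sentences function is in the repository; I assume the record carries the sentence text under 'value'):

def spellcheck_sentences(record):
    global sym_spell
    if not sym_spell:
        sym_spell = load_symspell()
    # lookup_compound corrects the whole sentence using the loaded dictionaries
    suggestions = sym_spell.lookup_compound(record['value'], max_edit_distance=1)
    if suggestions:
        record['value'] = suggestions[0].term
    return record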
\n bg = GearsBuilder('KeysReader')\n bg.foreach(process_item)\n bg.count()\n bg.register('sentence:*', mode="async_local",onRegistered=OnRegisteredAutomata)\n\n
\nOnRegisteredAutomata performs similarly to the symspell example above, except it downloads a pre-built Aho-Corasick automaton (30 MB). Aho-Corasick is a very fast matcher that can process more than 900 MB of text per second even on a commodity laptop, and the RedisGears cluster distributes the data, the ML model, and the matching smoothly across the available CPU and memory. Full matcher code.
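\nTo illustrate the matching itself, a minimal sketch using the pyahocorasick library (an assumption on my side; the project downloads a pre-built automaton instead of building one on the fly):

import ahocorasick

# Build a tiny automaton from a few UMLS-style terms and scan a sentence with it.
A = ahocorasick.Automaton()
for idx, term in enumerate(["transmission", "infant", "pneumonia"]):
    A.add_word(term, (idx, term))
A.make_automaton()

sentence = "the rate of transmission to an infant born to a mother with pneumonia"
for end_pos, (idx, term) in A.iter(sentence):
    print(term, end_pos)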
\nThe output of the matcher (nodes and edges) is a candidate for another RedisGears pattern, rgsync, where you write fast into Redis and RedisGears replicates the data into slower storage using Redis Streams. This demo, however, uses streams directly and populates the RedisGraph database with nodes and edges, calculating the rank of each.
\nWe took OCR scans in JSON format and turned them into a Knowledge Graph, demonstrating how you can apply a traditional Semantic Network/OWL/Metathesaurus technique based on the Unified Medical Language System. The Redis ecosystem offers a lot to the data science community, can take its place at the core of Kaggle notebooks and ML frameworks, and can make deployment and distribution of data more enjoyable. The success of our industry depends on how our tools work together, regardless of whether they are engineering, data science, machine learning, organisational, or architectural.
\nWith the collaboration of RedisLabs and the community, the full pipeline code is available via https://github.com/applied-knowledge-systems/the-pattern-platform. If you want to try it locally, you can find a Docker launch script in the root of the repository along with a short quickstart guide. PRs and suggestions are welcome. The overall goal of the project is to allow others to build more interesting pipelines on top of it.
\nIn the CORD-19 dataset mentioned in Data Acquisition, metadata is stored in a CSV file separate from the source data. Here is a simple script to parse dates/times and attach them to the JSON/XML files.
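\nA hedged sketch of that parsing step (assuming the CORD-19 metadata.csv layout with sha and publish_time columns; the real script is in the repository):

import pandas as pd

# Read only the columns we need and parse publish dates, keeping NaT for bad values.
meta = pd.read_csv("metadata.csv", usecols=["sha", "publish_time"])
meta["publish_time"] = pd.to_datetime(meta["publish_time"], errors="coerce")

# Map article hash -> publish date, ready to attach to the per-article JSON/XML files.
date_by_sha = dict(zip(meta["sha"], meta["publish_time"]))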
\n" },{"title": "Contribution Guidelines", "url": "https://applied-knowledge.systems/docs/contribution/", "body": "General guidelines for contributing to the project.
\nThere should be a path to implementation in the real world: a good prototype or a production deployment.
\nPull requests are welcome on \n\n\n\n. Check out the supported shortcodes in Extended Shortcodes
\nWhen contributing, you agree to share your contribution under
To fight ever-increasing complexity, "The Pattern" projects help find relevant knowledge using Artificial Intelligence and novel UX elements, all powered by Redis - a new generation real-time data fabric turned into knowledge fabric
\nOverall repository for CORD19 medical NLP pipeline, API and UI, design and architecture.
\nDemo Video:
Demo Server (no persistence): https://thepattern.digital/
\nThe medical profession has put a lot of effort into collaboration, from Latin as a common language to industry-wide thesauruses like UMLS. However, it is full of scandals where publications in a prestigious journal would be retracted, and the World Health Organisation would change its policy advice based on the article. I think the "paper claiming that eating a bat-like Pokémon sparked the spread of COVID-19" takes the prize. One could say that editors in those journals don't do their job, and while that may seem true, I would say they had no chance: with the number of publications about COVID (SARS-CoV-2) passing 300+ per day, we need better tools to navigate such a flow of information.\nWhen exploring science or engineering topics, I look at the diversity of opinion, not the variety of the same cluster of words or the same thought. I want to avoid confirmation bias. I want to find articles relevant to the same concept, not necessarily the ones which have similar words. My focus is to build a natural language processing pipeline capable of handling a large number of documents and concepts, incorporating System 1 AI (fast, intuitive reasoning) and System 2 (high-level reasoning), and then presenting knowledge in a modern VR/AR visualisation. Search, or rather information exploration, should be spatial, preferably in VR (a memory palace; see the Theatre of Giulio Camillo). A force-directed graph is a path towards it, where visuals are assisted by text: relevant text pops up on the connection, and people explore the concepts and then dig deeper into the text. The purpose of the pipeline is that knowledge should be reusable and shareable.
\nJoin our community on Discord or post on [GitHub Discussions](https://github.com/applied-knowledge-systems/the-pattern/discussions)
\n" },{"title": "Support project by contributing", "url": "https://applied-knowledge.systems/docs/donate/", "body": "This is a begging of an exciting, incredible new journey; support open source projects by donating or contributing.
\nBecome a sponsor and promote the Reference Architecture for AI.
\nThe ask is
\nFor the Reference Architecture for AI, we used Kaggle Cord19 dataset, "COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 1,000,000 scholarly articles, including over 400,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease."
\nThe example script parses the documents, extracts body_text, and saves it under the paragraphs namespace in the Redis cluster.
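\nA hedged sketch of that intake step (assuming the CORD-19 JSON layout with paper_id and a body_text list of {"text": ...} entries; the file path and entry point port are illustrative):

import json
import redis

# Assumption: the cluster entry point used elsewhere in the tutorials (port 9001).
r = redis.Redis(host="127.0.0.1", port=9001)

with open("document_parses/pdf_json/example.json") as f:   # hypothetical CORD-19 file
    doc = json.load(f)

article_id = doc["paper_id"]
paragraphs = " \n ".join(p["text"] for p in doc["body_text"])
# Store under the paragraphs:* namespace that the RedisGears pipeline is registered on.
r.set(f"paragraphs:{article_id}", paragraphs)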
\n" },{"title": "Extended Shortcodes", "url": "https://applied-knowledge.systems/docs/extended-shortcodes/", "body": "DeepThought theme provides multiple shortcodes on top of built-in ones in Zola.\nPlease, have a look at the Config Options\nthat explains how to enable them.
\nEmbed gist:
\n{{ gist(url="https://gist.github.com/Keats/e5fb6aad409f28721c0ba14161644c57", class="gist") }}\n\n
\nMermaid is a library that helps you generate diagrams and flowcharts from text, in a similar manner to Markdown.
\nTo put a flowchart in your post, use the below snippet
\nCode
\n{% mermaid() %}\ngraph TD;\n A-->B;\n A-->C;\n B-->D;\n C-->D;\n{% end %}\n
\nOutput
\nTo put a sequence diagram in your post, use the below snippet
\nCode
\n{% mermaid() %}\nsequenceDiagram\n participant Alice\n participant Bob\n Alice->>John: Hello John, how are you?\n loop Healthcheck\n John->>John: Fight against hypochondria\n end\n Note right of John: Rational thoughts <br/>prevail!\n John-->>Alice: Great!\n John->>Bob: How about you?\n Bob-->>John: Jolly good!\n{% end %}\n
\nOutput
\nTo put a Gantt diagram in your post, use the below snippet
\nCode
\n{% mermaid() %}\ngantt\n dateFormat YYYY-MM-DD\n title Adding GANTT diagram to mermaid\n excludes weekdays 2014-01-10\n\n section A section\n Completed task :done, des1, 2014-01-06,2014-01-08\n Active task :active, des2, 2014-01-09, 3d\n Future task : des3, after des2, 5d\n Future task2 : des4, after des3, 5d\n{% end %}\n
\nOutput
\nTo put a class diagram in your post, use the below snippet
\nCode
\n{% mermaid() %}\nclassDiagram\n Class01 <|-- AveryLongClass : Cool\n Class03 _-- Class04\n Class05 o-- Class06\n Class07 .. Class08\n Class09 --> C2 : Where am i?\n Class09 --_ C3\n Class09 --|> Class07\n Class07 : equals()\n Class07 : Object[] elementData\n Class01 : size()\n Class01 : int chimp\n Class01 : int gorilla\n Class08 <--> C2: Cool label\n{% end %}\n
\nOutput
\nTo put an ER diagram in your post, use the below snippet\nCode
\n{% mermaid() %}\nerDiagram\n CUSTOMER ||--o{ ORDER : places\n ORDER ||--|{ LINE-ITEM : contains\n CUSTOMER }|..|{ DELIVERY-ADDRESS : uses\n{% end %}\n
\nOutput
\nTo put a user journey diagram in your post, use the below snippet\nCode
\n{% mermaid() %}\n journey\n title My working day\n section Go to work\n Make tea: 5: Me\n Go upstairs: 3: Me\n Do work: 1: Me, Cat\n section Go home\n Go downstairs: 5: Me\n Sit down: 5: Me\n{% end %}\n
\nOutput
\nChart.xkcd is a chart library that plots "sketchy", "cartoony" or "hand-drawn" styled charts.
\nLine chart displays series of data points in the form of lines. It can be used to show trend data, or comparison of different data sets.
\nCode
\n{% chart() %}\n{\n "type": "Line",\n "title": "Monthly income of an indie developer",\n "xLabel": "Month",\n "yLabel": "$ Dollars",\n "data": {\n "labels": ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"],\n "datasets": [\n {\n "label": "Plan",\n "data": [30, 70, 200, 300, 500, 800, 1500, 2900, 5000, 8000]\n },\n {\n "label": "Reality",\n "data": [0, 1, 30, 70, 80, 100, 50, 80, 40, 150]\n }\n ]\n }\n}\n{% end %}\n
\nOutput
\n\nXY chart is used to plot points by specifying their XY coordinates.
\nCode
\n{% chart() %}\n{\n "type": "XY",\n "title": "Pokemon farms",\n "xLabel": "Coordinate",\n "yLabel": "Count",\n "data": {\n "datasets": [\n {\n "label": "Pikachu",\n "data": [\n {\n "x": 3,\n "y": 10\n },\n {\n "x": 4,\n "y": 122\n },\n {\n "x": 10,\n "y": 100\n }\n ]\n },\n {\n "label": "Squirtle",\n "data": [\n {\n "x": 1,\n "y": 1\n },\n {\n "x": 3,\n "y": 3\n }\n ]\n },\n {\n "label": "Charizard",\n "data": [\n {\n "x": 2,\n "y": 5\n },\n {\n "x": 6,\n "y": 8\n }\n ]\n }\n ]\n },\n "options": {\n "xTickCount": 5,\n "yTickCount": 5,\n "showLine": false,\n "dotSize": 1\n }\n}\n{% end %}\n
\nOutput
\n\nA bar chart provides a way of showing data values represented as vertical bars
\nCode
\n{% chart() %}\n{\n "type": "Bar",\n "title": "How to feel powerful ?",\n "xLabel": "Platforms",\n "yLabel": "Count",\n "data": {\n "labels": ["Earn Money", "Get Famous", "Use terminal in front of non-programmers"],\n "datasets": [\n {\n "data": [30, 45, 100]\n }\n ]\n },\n "options": {\n "yTickCount": 2,\n "dataColors": ["Red", "Green", "Blue"]\n }\n}\n{% end %}\n
\nOutput
\n\nA stacked bar chart provides a way of showing data values represented as vertical bars
\nCode
\n{% chart() %}\n{\n "type": "StackedBar",\n "title": "Issues and PR Submissions",\n "xLabel": "Month",\n "yLabel": "Count",\n "data": {\n "labels": ["Jan", "Feb", "Mar", "April", "May"],\n "datasets": [\n {\n "label": "Issues",\n "data": [12, 19, 11, 29, 17]\n },\n {\n "label": "PRs",\n "data": [3, 5, 2, 4, 1]\n },\n {\n "label": "Merges",\n "data": [2, 3, 0, 1, 1]\n }\n ]\n }\n}\n{% end %}\n
\nOutput
\n\nA pie/doughnut chart provides a way of illustrating numerical proportion.
\nCode
\n{% chart() %}\n{\n "type": "Pie",\n "title": "What Tim is made of",\n "data": {\n "labels": ["a", "b", "e", "f", "g"],\n "datasets": [\n {\n "data": [500, 200, 80, 90, 100]\n }\n ]\n }\n}\n{% end %}\n
\nOutput
\n\nA radar chart provides a way of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point.
\nCode
\n{% chart() %}\n{\n "type": "Radar",\n "title": "Letters in random words",\n "data": {\n "labels": ["c", "h", "a", "r", "t"],\n "datasets": [\n {\n "label": "ccharrrt",\n "data": [2, 1, 1, 3, 1]\n },\n {\n "label": "chhaart",\n "data": [1, 2, 2, 1, 1]\n }\n ]\n },\n "options": {\n "showLegend": true,\n "showLabels": true\n }\n}\n{% end %}\n
\nOutput
\n\nGalleria is a framework that simplifies the process of creating beautiful image galleries for the web and mobile devices.
\nCode
\n{% galleria() %}\n{\n "images": [\n {\n "src": "alexandre-dinaut-GHxr3O6yZ1c-unsplash.jpg",\n "title": "Clouds & Mountains",\n "description": "Just hanging out with each other."\n },\n {\n "src": "chandler-cruttenden-YYemke7BfuE-unsplash.jpg",\n "title": "Crop",\n "description": "Waiting for the harvest."\n },\n {\n "src": "jung-ho-park-7aZtpsyaWVM-unsplash.jpg",\n "title": "The Fog",\n "description": "Engulfing everything."\n },\n {\n "src": "kitera-dent-BIj4LObC6es-unsplash.jpg",\n "title": "Just Plants",\n "description": "Backdrop of ocean."\n },\n {\n "src": "koes-nadi-XkUFF1nnbA8-unsplash.jpg",\n "title": "Whoa",\n "description": "Something to look at."\n },\n {\n "src": "lazyartistgallery-HHaIRbgzcGw-unsplash.jpg",\n "title": "Let's Chill",\n "description": "Three birds just chilling !!"\n },\n {\n "src": "saira-nUxdL_19OQw-unsplash.jpg",\n "title": "Canyon",\n "description": "Might of nature."\n },\n {\n "src": "waldemar-brandt-2hAEHCt25eM-unsplash.jpg",\n "title": "Evening",\n "description": "Time to wind down."\n }\n ]\n}\n{% end %}\n
\nOutput
\nKaTeX is a math typesetting library based on TeX.
\nCode
\n{% katex(block=true) %}\n\\KaTeX\n{% end %}\n
\nOutput
\n\nPhoto By:
\nLast year, my reference project, "The Pattern", was the 2021 hackathon winner and got a bit of publicity and, in total, seven forks. But as with many open source projects, it is now stale. It's time to revive "The Pattern" with new features and GitHub sponsors or Patreon patrons to help and inspire developers and creatives. In return, it's common to provide sponsor-only features and articles. Nevertheless, how can we do that with a large Redis-based machine learning pipeline?
\nThis article will introduce a simple first step:\nfor GitHub sponsors, we start by offering persistent storage of preferences: I have a simple Flask POST API which adds nodes into the user's preference storage (a simple Redis set per user). It will be the foundation for other sponsor-only features.\nFor now, let's cover the basics:
\nThere are a number of APIs that GitHub offers to help developers, and the GitHub Authentication API is one of the most popular. This API allows you to log in to GitHub using your username and password, or an OAuth token.
\nA login button with a standard OIDC/OAuth2 dance is one of the most common ways for a user to authenticate to an API.\nBelow is code taken from this gist and is very common for OAuth2 flows:
\nimport os \nclient_id = os.getenv('GITHUB_CLIENT_ID')\nclient_secret = os.getenv('GITHUB_SECRET')\n@app.route('/', methods=['GET', 'POST'])\ndef index():\n url = 'https://github.com/login/oauth/authorize'\n params = {\n 'client_id': client_id,\n\n 'scope': 'read:user,read:email',\n 'state': str(uuid4().hex),\n 'allow_signup': 'true'\n }\n url = furl(url).set(params)\n return redirect(str(url), 302)\n
\nwhere GITHUB_CLIENT_ID and GITHUB_SECRET are the client id and secret of your GitHub OAuth2 app. Register one on GitHub to follow this process.
\norg_name="applied-knowledge-systems"\n@app.route('/oauth2/callback')\ndef oauth2_callback():\n\n code = request.args.get('code')\n access_token_url = 'https://github.com/login/oauth/access_token'\n payload = {\n 'client_id': client_id,\n 'client_secret': client_secret,\n 'code': code,\n # 'redirect_uri':\n 'state': str(uuid4().hex)\n }\n r = requests.post(access_token_url, json=payload, headers={'Accept': 'application/json'})\n access_token = json.loads(r.text).get('access_token')\n print(access_token)\n \n access_user_url = 'https://api.github.com/user'\n response = requests.get(access_user_url, headers={'Authorization': 'token ' + access_token})\n data=response.json()\n user_email=data["email"]\n user_login=data["login"]\n user_id=data["id"]\n # response=redirect(url_for('login',next=redirect_url()))\n # response.set_cookie('user_id', str(user_id))\n # response.set_cookie('user_login', str(user_login))\n # return response\n query = """\n {\n viewer {\n sponsorshipsAsSponsor(first: 100) {\n nodes {\n sponsorable {\n ... on User {\n id\n email\n url\n }\n ... on Organization {\n id\n email\n name\n url\n }\n }\n tier {\n id\n name\n monthlyPriceInDollars\n monthlyPriceInCents\n }\n }\n }\n }\n }\n """\n response_graphql = requests.post('https://api.github.com/graphql', json={'query': query}, headers={'Authorization': 'token ' + access_token})\n response_graphql_data=response_graphql.json()["data"]\n if isinstance (response_graphql_data["viewer"]["sponsorshipsAsSponsor"]["nodes"], list):\n if response_graphql_data["viewer"]["sponsorshipsAsSponsor"]["nodes"][0]["sponsorable"]["name"]==org_name:\n # if user is a sponsor of Applied Knowledge System add them to set of sponsors\n redis_client.sadd(f'sponsors:{org_name}',user_id)\n # if RedisJSON enabled:\n # redis_client.json().set(f"user_details:{user_id}", '$', {\n # 'email': user_email,\n # 'id': user_id,\n # 'user_login': user_login,\n # 'graphql': response_graphql_data,\n # }) \n #if not\n redis_client.hset(f"user_details:{user_id}", mapping={\n 'email': user_email,\n 'id': user_id,\n 'user_login': user_login\n })\n return jsonify({\n 'status': 'success',\n 'email': user_email,\n 'id': user_id,\n 'user_login': user_login\n })\n
\nThe API we are using for our sponsor-only feature is straightforward:
\n@app.route('/exclude', methods=['POST','GET'])\ndef mark_node():\n if request.method == 'POST':\n if 'id' in request.json:\n node_id=request.json['id']\n else:\n if 'id' in request.args:\n node_id=request.args.get('id')\n user_id = session.get('user_id')\n log(f"Got user {user_id} from session")\n if not user_id:\n user_id = request.cookies.get('user_id')\n log(f"Got user {user_id} from cookie")\n redis_client.sadd("user:%s:mnodes" % user_id,node_id)\n response = jsonify(message=f"Finished {node_id} and {user_id}")\n return response\n
\nThe only purpose of this API is to mark nodes as unimportant for the given user by adding them to a Redis set; those nodes will then be excluded from the search API output (see the sketch below). So far, everything has been pretty standard: a basic Flask API and a GitHub social login flow. Now let's add Redis Enterprise and synchronise sponsors' preferences.
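\nA hedged sketch of the exclusion check on the search side (a hypothetical helper; the real search API lives in the repository, and redis_client is assumed to be created with decode_responses=True):

def filter_excluded(user_id, nodes):
    # Nodes the user marked via /exclude live in a per-user Redis set.
    excluded = redis_client.smembers("user:%s:mnodes" % user_id)
    return [n for n in nodes if n["id"] not in excluded]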
\nWhy not use Redis Enterprise directly for everything?\nThe project is memory-heavy, with a lot of data and machine learning inside Redis. This allows us to achieve state-of-the-art performance, but it also takes over 120 GB of RAM (or as much RAM as you can give it), and a 128 GB Redis Enterprise instance would exceed my budget for an open source project. Obviously, if there are enough sponsors, we can move more functionality into Redis Enterprise, but for that we need to finish building the basic blocks. Register on the Redis.com cloud and create a database with a subscription.
\n\nTake note of the host, port, and password for Redis Enterprise and create a Docker environment file:
cat .env.gears \nREDISENT_PWD="123"\nREDISENT_PORT="13444"\nREDISENT_HOST="hostname.cloud.redislabs.com"\n
\nand create a docker-compose service section that passes .env.gears. Mine looks like this:
\n redisgraph:\n image: redislabs/redismod\n container_name: redisgears\n hostname: redisgears\n env_file:\n - ./.env.gears\n ports:\n - 127.0.0.1:9001:6379\n\n
\nFirst flow: we will use RedisGears to synchronize all preferences with Redis Enterprise.
\nIf you are new to RedisGears, there is a pattern, rgsync, that covers exactly this use case, but since I already have RedisGears, I am going to build it step by step:
\n# gears_sync_preferences.py\nrconn=None\n\ndef connecttoRedisEnterise():\n import redis \n import os \n log(str(os.environ))\n # Get environment variables\n\n HOST = os.getenv('REDISENT_HOST')\n PASSWORD = os.getenv('REDISENT_PWD')\n PORT = os.getenv('REDISENT_PORT')\n\n redis_client=redis.Redis(host=HOST,port=PORT,charset="utf-8", password=PASSWORD, decode_responses=True)\n return redis_client\n\ndef sync_users(record):\n global rconn\n if not rconn:\n rconn=connecttoRedisEnterise()\n # Uncomment logs to check \n # log(str(record['key']))\n # log(str(record['value']))\n\n rconn.hset(record['key'],mapping=record['value'])\n\ngb = GB()\ngb.foreach(sync_users)\ngb.count()\ngb.run('user_details:*')\n
\nThis is the "batch" mode for RedisGears, which is easier to debug than streams. Install [gears-cli](https://github.com/RedisGears/gears-cli) with pip install gears-cli
and run the above script:
gears-cli run --host 127.0.0.1 --port 9001 gears_sync_preferences.py --requirements req_sync.txt\n
\nwhere req_sync.txt contains:
\nredis==3.5.3\n
\nThis RedisGears script will copy all user profiles into Redis Enterprise. Now let us add the sponsors:
\n# gears_sync_sponsors.py\nrconn=None\n\ndef remove_prefix(text, prefix):\n return text[text.startswith(prefix) and len(prefix):]\n\ndef connecttoRedisEnterise():\n import redis \n import os \n log(str(os.environ))\n # Get environment variables\n\n HOST = os.getenv('REDISENT_HOST')\n PASSWORD = os.getenv('REDISENT_PWD')\n PORT = os.getenv('REDISENT_PORT')\n log(HOST)\n log(PORT)\n log(PASSWORD)\n redis_client=redis.Redis(host=HOST,port=PORT,charset="utf-8", password=PASSWORD)\n return redis_client\n\n\n\ndef sync_sponsors(record):\n global rconn\n if not rconn:\n rconn=connecttoRedisEnterise()\n\n log(str(record['key']))\n values=execute('SMEMBERS',record['key'])\n log(str(values))\n for each_value in values: \n rconn.sadd(record['key'],each_value)\n\ngb = GB('KeysReader')\ngb.foreach(sync_sponsors)\ngb.count()\ngb.run('user:*')\n
\nBut this one syncs all users' preferences, and we only need the sponsors'. Let us add another RedisGears feature, filter:
\nrconn=None\n\ndef remove_prefix(text, prefix):\n return text[text.startswith(prefix) and len(prefix):]\n\ndef connecttoRedisEnterise():\n import redis \n import os \n log(str(os.environ))\n # Get environment variables\n\n HOST = os.getenv('REDISENT_HOST')\n PASSWORD = os.getenv('REDISENT_PWD')\n PORT = os.getenv('REDISENT_PORT')\n\n redis_client=redis.Redis(host=HOST,port=PORT,charset="utf-8", password=PASSWORD, decode_responses=True)\n return redis_client\n\ndef filter_sponsors(record):\n org_name="applied-knowledge-systems"\n user_id = remove_prefix(record['key'],'user:')\n sponsor=execute('SISMEMBER',f'sponsors:{org_name}',user_id)\n return bool(sponsor==True)\n\ndef sync_sponsors(record):\n global rconn\n if not rconn:\n rconn=connecttoRedisEnterise()\n\n log(str(record['key']))\n values=execute('SMEMBERS',record['key'])\n log(str(values))\n for each_value in values: \n rconn.sadd(record['key'],each_value)\n\ngb = GB()\ngb.filter(filter_sponsors)\ngb.foreach(sync_sponsors)\ngb.count()\ngb.run('user:*')\n
\nThen we are going to use key-miss events from RedisGears to fetch data for all users:
\nIt's very easy, straight from the key-miss example:
\n\ndef fetch_data(r):\n    key = r['key']\n    global rconn\n    if not rconn:\n        rconn=connecttoRedisEnterise()\n    values=rconn.smembers(key)\n    log(str(values))\n    for each_value in values: \n        execute('SADD',key,each_value)\n\nGB().foreach(fetch_data).register(prefix='user:*', commands=['smembers'],eventTypes=['keymiss'], mode="async_local")\n
\nThere is one more option: turning fetch_data into an async call by wrapping it in async/await, but Redis Enterprise is fairly fast, and I don't think it's worth adding an async call in this case. For curiosity, see the example code in The Pattern repository.
\nIn this article, we walked through the steps to create sponsor-specific "nanoservices" using RedisOSS, RedisGears, and Redis Enterprise. This lets us leverage the best of all worlds: open source Redis, high availability and persistence with Redis Enterprise, and RedisGears as the glue that holds everything together.
\nThis post is in collaboration with Redis.
\nCheck out sponsors.go, which queries whether a GitHub user is your sponsor at a given tier (dollar amount).
\nThere are tools for advanced analytics, including free ones from Google and Kaggle.
\nThere are well-known and validated deployment architectures for applications and the cloud.
\nYet the number of practical applications is still tiny, and they remain niche implementations.\nWhile the benefits of AI are clear, there are still many gaps in AI architecture that need to be filled. For example, there is a gap between analytical tools and verified architectures for real-time deployments. This gap often stems from a lack of specific reference architectures and patterns demonstrating the trade-offs between technologies, libraries, and tools.
\nLet's bridge the gap in knowledge and drive a connection between science and engineering to make fast, efficient, and practical AI deployments.\nThree things need to be in place to build an AI product:
\nI will use The Pattern, my [“Build on Redis” Hackathon prize-winning open source](https://github.com/applied-knowledge-systems/the-pattern) project, to illustrate how the capabilities below can be implemented and invite you to contribute or donate.
\nWe launch with two full-featured articles: an NLP ML pipeline for turning unstructured JSON text into a knowledge graph, and, fresh off the press, Benchmarks for BERT Large Question Answering inference for RedisAI and RedisGears, with Grafana dashboards by Mikhail Volkov.
\n" }]