Knowledge Graph Extraction & Visualization with local LLM from Unstructured Text: a History example

Meet the Saxons’ neighbours….

Motivation and context

As usual, the way is the destination. So before getting to the code, I go into the details of WHY I wrote this article. If you are just interested in the result and the code, feel free to skip forward to “The Code / Execution Details” (and the notebook on GitHub). But you’ll miss important context information… which makes a perfect transition to the graph subject:

Why Graphs?

I had already worked with graphs before, in the context of a more flexible visualization of organisational realities beyond pure hierarchy.

This interest was renewed after toying around with RAG (retrieval augmented generation). Graphs as a basis for RAG become interesting in use cases where purely statistical word embeddings are not good enough and are conducive to the LLM producing hallucinations.

Graphs provide safer ground for RAG because they encapsulate the semantics contained in the text and thus provide better context. They are also a good way to foster understanding through (a) potential visualization of semantic relationships and (b) re-use as input for graph embeddings (vs. pure word embeddings).

However, full RAG is a second step. This article is limited to creating the graph from unstructured text. Embedding graphs and using them for semantic-driven RAG is out of scope and reserved for a future article.

Why a local LLM?

I had never run an LLM locally — but clearly saw the interest and need:

  • Privacy and confidentiality, e.g. when treating company information that cannot be transmitted to uncontrolled third parties. The example in this code is not one that would require this level of confidentiality. But it serves as a proof of concept for a solution that is also fit to, e.g., extract knowledge from confidential company documents, as the information does not leave your computer.
  • Using and supporting open source models to foster informational autarchy and independence from “winner-takes-all” companies. If you are interested in why this is important, I recommend reading this article by Daniel Jeffries (and perhaps subscribing to his Substack).
  • Converging hardware and software evolution, making it possible to run high-quality LLMs on (comparably) affordable home setups.

The concrete choice fell on combining Ollama and the Mixtral 8x7B Mixture-of-Experts model on a MacBook M1 Pro. Ollama promised to make running LLMs locally very easy. And I was not disappointed: it is truly, amazingly simple.

I had heard that Mixtral shows VERY good performance, on par with, e.g., ChatGPT 3.5. Calculation and feedback from the internet suggested the model should be able to run on a Mac with 64 GB RAM. The price drop of used M1 hardware after the introduction of the M3 processors made this choice available to me: “sacrificing” only 25%-30% of performance vs. the new M3 benchmark at less than 50% of the price. And the performance (as I found out later) is still largely sufficient for my own use cases, even though the tasks are VERY calculation intensive.

Incidentally, the calculation intensity of the task would also drive up the costs of any 3rd-party API use to much higher levels. So even though I haven’t done the math yet, I assume that the investment in my own hardware is also cost-efficient in the end. This, of course, is under the assumption that you actually use it in the long run. But graph extraction is not the only use case: I also see an ample field for local agents to provide support on daily tasks. I already tried out what this can look like in concrete terms with the “career coach agent”; at the time, though, that code still relied on the OpenAI API.

Next steps:

As mentioned above, both the knowledge extraction and the use of a local LLM lend themselves to more experiments beyond the scope of this article (…but further articles on these possibilities may eventually follow).

  • For the extracted graph, this primarily means using it as the basis for improved, semantic-based RAG.
  • Additional benefits of running an LLM locally are (1) the possibility of fine-tuning open-source models with your own data, helping the LLM provide better answers that are more relevant to the use case for which data is available, and (2) the mentioned possibility of running agent frameworks locally.

Details on the use case: History of the Germans podcast transcripts

As a first use case, I set out to extract a knowledge graph from the unstructured text of my current favorite podcast, “The History of the Germans” by Dirk Hoffmann-Becking: a sincere and true recommendation for any History buff.

History of the Germans Podcast on Spotify

I had already scraped the podcast transcripts from the excellent associated website, with one large text corpus per period (e.g. “Ottonians”, “Salians”, “Hohenstaufen” etc.). However, for reasons explained below, this example only works on the transcript of a single episode.

History texts show very well the shortcomings of “embedding only” RAG, motivating the interest in a more semantic-driven knowledge-graph-based RAG to query the text (see: “Next steps” above).

The proof for this: I already created a GPT based on the transcript corpus. But querying the text afterward showed very mixed results.

Chronology and relationships are obviously very important concepts in a History text, but word embeddings do not capture these concepts well.

A good example: as strange as it may sound nowadays, in the periods covered by the corpus, excommunication of emperors by popes was a powerful political tool that was put to use on a regular basis (…and any self-respecting emperor wouldn’t suffer NOT being excommunicated at least once…). But it matters, of course, whether Pope P1 excommunicates Emperor E1 or Emperor E2. Especially if Emperor E2 happens to be the great-grandson of Emperor E1 and Pope P1 was already pushing up daisies decades before the rule of Emperor E2 started.

Word embeddings do capture the relationship “popes excommunicate emperors” well… but they are too quick to hallucinate the corresponding names (e.g.: if Pope P1 excommunicated Emperor E1, why shouldn’t he have excommunicated Emperor E2?). Precisely because the embeddings cannot expressly capture the chronological or relational aspects linking the words they embed.

Establishing this link literally means establishing a graph. In a knowledge graph representation, there would only be “edges” from Pope P1 to Emperor E1, not to E2 (because they are separated by a lifetime, preventing any co-occurrences).
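
To make this concrete: the toy sketch below (placeholder names only, not actual extraction output) shows what such a graph boils down to. The pipeline described in this article produces exactly this kind of node/edge structure, just extracted automatically by the LLM.

import networkx as nx

# Toy illustration with placeholder names: only relationships that actually
# occur in the text become edges.
G_toy = nx.Graph()
G_toy.add_edge("pope p1", "emperor e1", title="excommunicated")
G_toy.add_node("emperor e2")  # mentioned in the text, but never together with Pope P1

print(G_toy.has_edge("pope p1", "emperor e1"))  # True
print(G_toy.has_edge("pope p1", "emperor e2"))  # False: no edge, nothing to hallucinate from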

Voilà, my example of why I want to test out a knowledge-graph-based RAG.

As a first step, this means being able to establish this graph (and to visualize it, as visualization is a very good way to build understanding).

In the concrete example on GitHub, the code uses the transcript of Episode 96 “Saxony and Eastwards Expansion: Meet the Neighbors”.

BTW, a nice trivia addition to your “encyclopedia of useless knowledge”: in this episode, the Saxons encounter, amongst others, Harald Bluetooth, the namesake of today’s Bluetooth technology (the name was chosen, if I remember correctly, because it was a joint innovation of Nokia and Ericsson… and King Harald Bluetooth happens to be the one who first succeeded in the difficult endeavor of unifying the Danes and Norwegians, or at least their predecessors) :-)

The Hackathon

So that was my intent… what was missing was the opportunity. It came around in the form of a Python hackathon of the Düsseldorf Python User Group PyDDF, organized, chaired, and driven by Marc-André Lemburg and Charly Clark (who is, amongst other things, a maintainer of the openpyxl package). If you are interested in more info on this group, consult their webpage or their YouTube channel.

Prior to the Hackathon weekend, I did some research and found, incidentally, an excellent article here on Medium that had the potential to take me 80%-90% of the way to my objective.

So the task for the hackathon was to understand and modify the code from this article to be able to extract a knowledge graph, encapsulating semantic information, from parts of the “History of the Germans” podcast transcript to serve as future input for a graph-based RAG chat with this content.

The article that inspired it all — and changes to it

As said, the inspirational article I found provides an excellent basis, showing how to achieve precisely the initial intent: extracting and visualizing a knowledge graph from unstructured text.

The changes and modifications done to the code from this example were mainly:

  • Transformation to a single All-in-One notebook
  • Use of a different LLM (now the stronger Mixtral model)
  • Elimination of some seemingly unused function definitions from the code base
  • Making relevant adjustments to the SYS_PROMPT due to the History use case

The last point taught me a lot about prompting: the SYS_PROMPT is the real prompt, whereas the USER_PROMPT is actually less of a prompt and more (comparable to RAG) the context info that the SYS_PROMPT’s instructions operate on.

This SYS_PROMPT needed careful revision for the altered use case: the inspirational article focused on articles about the health system in India, a domain quite different from medieval German history. A first run yielded disappointing results… until I checked every line of the instructions contained in the SYS_PROMPT: e.g. the identification of persons as entities was expressly excluded from the concept extraction prompt, which is quite a limitation for texts covering History. Results improved a lot after adjusting the SYS_PROMPT to the field of History, focusing particularly on persons as agents or entities.

The SYS_PROMPT is also a good entry point into understanding how LLM-based processing differs from “classical” programming: even though the instructions in the SYS_PROMPT are clear, they do NOT invariably produce a correct JSON output format every time. One needs to check the quality of the output manually (i.e. the number of chunks that produce an error when the JSON string returned by the LLM call is loaded into the result list). Skipping a chunk every now and then shouldn’t be too problematic, but if the ratio of successful to unsuccessful transformations from text chunk to JSON becomes too low, one should probably either work on the text input or start to modify and improve the SYS_PROMPT.
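
A minimal sketch of how such a check could look, assuming the chunk dataframe df and the graphPrompt() function defined further below (graphPrompt() returns None for chunks whose response could not be parsed as JSON):

# Apply the extraction to every chunk and track how many responses fail to parse.
results = df.apply(lambda row: graphPrompt(row.text, {"chunk_id": row.chunk_id}), axis=1)

n_total = len(results)
n_failed = int(results.isna().sum())  # graphPrompt() returned None -> JSON could not be loaded
print(f"{n_total - n_failed}/{n_total} chunks parsed successfully "
      f"({(n_total - n_failed) / n_total:.0%})")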

Changing the LLM may come across as overkill. It would need to be tested whether a smaller, more focused model wouldn’t be more efficient. But applying the “Why did the chicken cross the road” logic (answer: because it can!), I opted for the highest-performance model that runs on the hardware described above. And that happened to be Mixtral.

The Code / Execution Details

Setting things up and Imports

Importing the usual suspects, aka packages to get the job done. The packages to establish and visualize the graph are imported later.

The UUID package serves to give each chunk a unique ID — which is important for the later self-join (to create edges between concepts that co-occur in the same chunk).

Ollama is set up as a client. The precise model to be called by Ollama (Mixtral in this case) is defined later.

# ## Setup
import pandas as pd
import numpy as np
import os
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path
import random

# packages used by helper functions
import uuid


# packages for prompting definitions
import sys
sys.path.append("..")
import json

# Setting up Ollama as LLM
from langchain_community.llms import Ollama
import ollama
from ollama import Client
client = Client(host='http://localhost:11434')

The “graphPrompt” function

The most important function in this code. It serves to extract so-called triplets (node, edge, node) from the text chunk passed to the function. These triplets are returned in JSON format and represent the semantic information contained in the text chunk. In the historical context of this example (and as you can see from the result file), this mostly boils down to the description “Actor A did this and that to or with Actor B”, where Actor A is node 1 of the triplet, Actor B is node 2 of the triplet, and “this and that” describes the relationship between the two, thereby forming the edge of the triplet.

In this function, Mixtral is defined as the model to be used by Ollama. Sure thing: the model must be downloaded first to be available to Ollama. A good description of how to do this can be found here: https://medium.com/llamaindex-blog/running-mixtral-8x7-locally-with-llamaindex-e6cebeabe0ab
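
For completeness, a minimal sketch of the download step, assuming the pull() helper of the ollama Python client (the terminal equivalent would be ollama pull mixtral):

import ollama

# Download the model once so that Ollama can serve it locally.
ollama.pull("mixtral:latest")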

As already mentioned: the SYS_PROMPT defined here is VERY important to achieve the main goal: it forces Mixtral to extract the semantic relationships in the text chunk and return them in a precisely defined JSON format (…which does not always work, as mentioned). I felt very lucky that Rahul Nayak, the author of the inspirational article, knew what he was doing. I don’t think I would have been able to come up with this. But, as explained above, it was still necessary to adjust the prompt: other things are more relevant in articles on the Indian health system (the context treated in the inspirational article) than in the transcript of a History podcast.

BTW, the USER_PROMPT is simply the text chunk to be processed.

As you can see from the code, I identified two problems when working with the real text. First, the chunk ID was not always correctly included in the output JSON, so I resorted to a rather unconventional “If it’s stupid but it works, it ain’t stupid” solution: the chunk ID is written directly into the SYS_PROMPT for each call.

Second, I saw from test runs that Mixtral somehow tended to insert escape sequences of varying lengths plus some additional information before it started with the list containing the JSON output. Both problems are fixed before the result list is returned.

Finally, I also print out the JSON string (regardless of whether it is correct or not) to be able to monitor the processing status.

#################################
# Definition of used LLM
#################################
##########################################################################
def graphPrompt(input: str, metadata={}, model="mixtral:latest"):
    if model is None:
        model = "mixtral:latest"

    chunk_id = metadata.get('chunk_id', None)

    # model_info = client.show(model_name=model)
    # print( chalk.blue(model_info))

    SYS_PROMPT = (
        "You are a network graph maker who extracts terms and their relations from a given context. "
        "You are provided with a context chunk (delimited by ```) Your task is to extract the ontology "
        "of terms mentioned in the given context. These terms should represent the key concepts as per the context. \n"
        "Thought 1: While traversing through each sentence, Think about the key terms mentioned in it.\n"
        "\tTerms may include person (agent), location, organization, date, duration, \n"
        "\tcondition, concept, object, entity etc.\n"
        "\tTerms should be as atomistic as possible\n\n"
        "Thought 2: Think about how these terms can have one on one relation with other terms.\n"
        "\tTerms that are mentioned in the same sentence or the same paragraph are typically related to each other.\n"
        "\tTerms can be related to many other terms\n\n"
        "Thought 3: Find out the relation between each such related pair of terms. \n\n"
        "Format your output as a list of json. Each element of the list contains a pair of terms"
        "and the relation between them like the following. NEVER change the value of the chunk_ID as defined in this prompt: \n"
        "[\n"
        "   {\n"
        '       "chunk_id": "CHUNK_ID_GOES_HERE",\n'
        '       "node_1": "A concept from extracted ontology",\n'
        '       "node_2": "A related concept from extracted ontology",\n'
        '       "edge": "relationship between the two concepts, node_1 and node_2 in one or two sentences"\n'
        "   }, {...}\n"
        "]"
    )
    SYS_PROMPT = SYS_PROMPT.replace('CHUNK_ID_GOES_HERE', chunk_id)

    USER_PROMPT = f"context: ```{input}``` \n\n output: "

    response = client.generate(model=model, system=SYS_PROMPT, prompt=USER_PROMPT)

    aux1 = response['response']
    # Find the index of the first open bracket '['
    start_index = aux1.find('[')
    # Slice the string from start_index to extract the JSON part and fix an unexpected problem with inserted escapes (WHY?)
    json_string = aux1[start_index:]
    json_string = json_string.replace('\\\\\_', '_')
    json_string = json_string.replace('\\\\_', '_')
    json_string = json_string.replace('\\\_', '_')
    json_string = json_string.replace('\\_', '_')
    json_string = json_string.replace('\_', '_')
    json_string = json_string.lstrip()  # eliminate eventual leading blank spaces
    #####################################################
    print("json-string:\n" + json_string)
    #####################################################
    try:
        result = json.loads(json_string)
        result = [dict(item) for item in result]
    except Exception:
        print("\n\nERROR ### Here is the buggy response: ", response, "\n\n")
        result = None
    print("§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§")

    return result

Other helper functions

  • documents2Dataframe: creates a dataframe from the chunks of the input text document; here, the UUID (unique identifier) is added to clearly differentiate every chunk “ripped” from the original text — which is important for further processing. Incidentally, I think that using the index of the dataframe would have been sufficient…. but what do they say: “Never touch a running system!”
  • df2Graph: a sort of wrapper function for the more important graphPrompt()-function; this function applies the graphPrompt()-function to every row (and thus text chunk) of the dataframe created with the former documents2Dataframe()-function
  • graph2Df: the “reverse” function. It takes the list of JSON triplets containing the semantic information that Mixtral has extracted from the text chunk(s) and transforms it into a dataframe.
  • contextual_proximity: the inspirational article linked above explains well what this function does. It is essentially a self-join that counts the co-occurrences of concepts within a given chunk. The assumption is that the more co-occurrences there are, the more relevant the relationship (and thus the semantic meaning) between the two linked concepts.

# ## Functions
def documents2Dataframe(documents) -> pd.DataFrame:
    rows = []
    for chunk in documents:
        row = {
            "text": chunk.page_content,
            **chunk.metadata,
            "chunk_id": uuid.uuid4().hex,
        }
        rows = rows + [row]

    df = pd.DataFrame(rows)
    return df


def df2Graph(dataframe: pd.DataFrame, model=None) -> list:
    # dataframe.reset_index(inplace=True)
    results = dataframe.apply(
        lambda row: graphPrompt(row.text, {"chunk_id": row.chunk_id}, model), axis=1
    )
    # invalid json results in NaN
    results = results.dropna()
    results = results.reset_index(drop=True)

    ## Flatten the list of lists to one single list of entities.
    concept_list = np.concatenate(results).ravel().tolist()
    return concept_list


def graph2Df(nodes_list) -> pd.DataFrame:
    ## Remove all NaN entities
    graph_dataframe = pd.DataFrame(nodes_list).replace(" ", np.nan)
    graph_dataframe = graph_dataframe.dropna(subset=["node_1", "node_2"])
    graph_dataframe["node_1"] = graph_dataframe["node_1"].apply(lambda x: x.lower())
    graph_dataframe["node_2"] = graph_dataframe["node_2"].apply(lambda x: x.lower())
    return graph_dataframe


def contextual_proximity(df: pd.DataFrame) -> pd.DataFrame:
    ## Melt the dataframe into a list of nodes
    dfg_long = pd.melt(
        df, id_vars=["chunk_id"], value_vars=["node_1", "node_2"], value_name="node"
    )
    dfg_long.drop(columns=["variable"], inplace=True)
    # Self-join with chunk id as the key will create a link between terms occurring in the same text chunk.
    dfg_wide = pd.merge(dfg_long, dfg_long, on="chunk_id", suffixes=("_1", "_2"))
    # drop self loops
    self_loops_drop = dfg_wide[dfg_wide["node_1"] == dfg_wide["node_2"]].index
    dfg2 = dfg_wide.drop(index=self_loops_drop).reset_index(drop=True)
    ## Group and count edges.
    dfg2 = (
        dfg2.groupby(["node_1", "node_2"])
        .agg({"chunk_id": [",".join, "count"]})
        .reset_index()
    )
    dfg2.columns = ["node_1", "node_2", "chunk_id", "count"]
    dfg2.replace("", np.nan, inplace=True)
    dfg2.dropna(subset=["node_1", "node_2"], inplace=True)
    # Drop edges with 1 count
    dfg2 = dfg2[dfg2["count"] != 1]
    dfg2["edge"] = "contextual proximity"
    return dfg2

The variables for input and output

This part defines the subdirectories and precise file names that the input is taken from and the results are written to. The separate “visualization only” notebook uses the same convention so that it can directly read the results produced by this code.

Obviously, this is the part that needs to be modified to point to the actual data source when applying the code to other use cases.

# ## Variables
## Input data directory
##########################################################
input_file_name = "Saxony_Eastern_Expansion_EP_96.txt"
##########################################################
data_dir = "HotG_Data/"+input_file_name
inputdirectory = Path(f"./{data_dir}")

## This is where the output csv files will be written
outputdirectory = Path(f"./data_output")

output_graph_file_name = f"graph_{input_file_name[:-4]}.csv"
output_graph_file_with_path = outputdirectory/output_graph_file_name

output_chunks_file_name = f"chunks_{input_file_name[:-4]}.csv"
output_chunks_file_with_path = outputdirectory/output_chunks_file_name

output_context_prox_file_name = f"graph_contex_prox_{input_file_name[:-4]}.csv"
output_context_prox_file_with_path = outputdirectory/output_context_prox_file_name

The rest of the code is very much like in the inspirational article:

Loading and chunking the source document

# ## Load Document

#loader = TextLoader("./HotG_Data/Hanse.txt")
loader = TextLoader(inputdirectory)
Document = loader.load()
# clean unnecessary line breaks
Document[0].page_content = Document[0].page_content.replace("\n", " ")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

pages = splitter.split_documents(Document)
print("Number of chunks = ", len(pages))
print(pages[5].page_content)

Create a dataframe from the chunks

# ## Create a dataframe of all the chunks
df = documents2Dataframe(pages)
print(df.shape)
df.head()

Extract Concepts: the core piece of code!

df2Graph is called on the entire dataframe containing the text chunks; as said above, df2Graph applies the graphPrompt()-function to every chunk in the dataframe. And this graphPrompt()-function performs the actual “knowledge extraction” from the text chunk, based on the instructions given in the SYS_PROMPT.

Both the dataframe with the chunked text and especially the graph dataframe containing the retrieved triplets are saved to the previously defined output directory, to avoid having to re-create the info for simple visualization.

# ## Extract Concepts
## To regenerate the graph with LLM, set this to True
##################
regenerate = False  # toggle to True if the time-consuming (re-)generation of the knowledge extraction is required
##################

if regenerate:
    #########################################################
    concepts_list = df2Graph(df, model='mixtral:latest')
    #########################################################
    dfg1 = graph2Df(concepts_list)

    if not os.path.exists(outputdirectory):
        os.makedirs(outputdirectory)

    dfg1.to_csv(output_graph_file_with_path, sep=";", index=False)
    df.to_csv(output_chunks_file_with_path, sep=";", index=False)
else:
    dfg1 = pd.read_csv(output_graph_file_with_path, sep=";")

dfg1.replace("", np.nan, inplace=True)
dfg1.dropna(subset=["node_1", "node_2", 'edge'], inplace=True)
dfg1['count'] = 4
## Increasing the weight of the relation to 4.
## We will assign a weight of 1 when the contextual proximity is calculated later.
print(dfg1.shape)
dfg1.head()

Calculating contextual proximity

As said above, this part of the code counts the co-occurrences of concepts within a given chunk. The assumption is that the more co-occurrences there are, the more relevant the relationship (and thus the semantic meaning) between the two linked concepts.

Note that the contextual proximity dataframe is also saved as a CSV to the defined output directory.

# ## Calculating contextual proximity
dfg2 = contextual_proximity(dfg1)
dfg2.to_csv(output_context_prox_file_with_path, sep=";", index=False)
dfg2.tail()

# ### Merge both the dataframes
dfg = pd.concat([dfg1, dfg2], axis=0)
dfg = (
    dfg.groupby(["node_1", "node_2"])
    .agg({"chunk_id": ",".join, "edge": ','.join, 'count': 'sum'})
    .reset_index()
)

The Graph Visualization Part

This part of the code remained unchanged vs. the inspirational article. Again: I am most happy that Rahul Nayak obviously knew what he was doing!

Changes here can add interesting new aspects. E.g. I assume that there are other community detection algorithms (the code uses Girvan-Newman) that may be better adapted to the use case; a sketch of one alternative follows the community calculation below. So here, again, is an ample field for experimentation.

Instantiate the NetworkX graph object

# ## Calculate the NetworkX Graph
nodes = pd.concat([dfg['node_1'], dfg['node_2']], axis=0).unique()
nodes.shape

import networkx as nx
G = nx.Graph()

## Add nodes to the graph
for node in nodes:
    G.add_node(
        str(node)
    )

## Add edges to the graph
for index, row in dfg.iterrows():
    G.add_edge(
        str(row["node_1"]),
        str(row["node_2"]),
        title=row["edge"],
        weight=row['count']/4
    )

Calculate Communities

# ### Calculate communities for coloring the nodes
communities_generator = nx.community.girvan_newman(G)
top_level_communities = next(communities_generator)
next_level_communities = next(communities_generator)
communities = sorted(map(sorted, next_level_communities))
print("Number of Communities = ", len(communities))
print(communities)
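
As an example of the experimentation mentioned above: a hedged sketch of swapping Girvan-Newman for Louvain community detection, assuming a recent networkx (3.x) that ships nx.community.louvain_communities. Louvain is typically much faster on larger graphs.

# Alternative (sketch only): Louvain communities instead of Girvan-Newman.
communities = sorted(map(sorted, nx.community.louvain_communities(G, seed=42)))
print("Number of Communities = ", len(communities))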

Prepare the community color data and add the color information to the graph

A color code is applied to the graph nodes based on the previously calculated community membership of each node.

# ### Create a dataframe for community colors
import seaborn as sns
palette = "hls"
## Now add these colors to communities and make another dataframe
def colors2Community(communities) -> pd.DataFrame:
    ## Define a color palette
    p = sns.color_palette(palette, len(communities)).as_hex()
    random.shuffle(p)
    rows = []
    group = 0
    for community in communities:
        color = p.pop()
        group += 1
        for node in community:
            rows += [{"node": node, "color": color, "group": group}]
    df_colors = pd.DataFrame(rows)
    return df_colors

colors = colors2Community(communities)
colors

# ### Add colors to the graph
for index, row in colors.iterrows():
    G.nodes[row['node']]['group'] = row['group']
    G.nodes[row['node']]['color'] = row['color']
    G.nodes[row['node']]['size'] = G.degree[row['node']]

Instantiate the pyvis network object and show the resulting graph

from pyvis.network import Network
net = Network(
    notebook=True,
    # bgcolor="#1a1a1a",
    cdn_resources="remote",
    height="800px",
    width="100%",
    select_menu=True,
    # font_color="#cccccc",
    filter_menu=False,
)
net.from_nx(G)
net.force_atlas_2based(central_gravity=0.015, gravity=-31)

net.show_buttons(filter_=['physics'])
net.show("knowledge_graph.html")

Finally, you should have something like this as a result:

Knowledge Graph from Episode 96

The graph is interactive: you can zoom in and out, drag nodes to different places, etc. The perfect entry point for the intended purpose: intuitive knowledge discovery and exploration! Are there relationships I haven’t seen before? Are there “surprise connections”? Are there things missing I considered important before? etc. etc.

But before you get carried away by the graph itself, please also mind the piece of code in the code block above:

net.show_buttons(filter_=['physics'])

It took me a while to notice, but this adds an interactive control field BELOW the graph that allows you to adjust the physics of the graph visualization. One “only” needs to know about it AND scroll down to see it. This adds yet another dimension to the exploration possibilities. And just to make sure you know what to look for:

Button field to control the graph visualization physics

Epilogue:

Ollama Bugs

The entire field is still very new — and new software is often still in the experimental stage. This also applies e.g. to Ollama. I tried to run the code above initially (overnight) on the transcripts covering entire historical periods (aka dynasties like Salian, Hohenstaufen, etc.) — and thus up to 40 episodes in one go. But this wouldn’t work out because Ollama would simply, at one point, stop generating responses to the calls that the code made to Mixtral.

This bug seemed to be related to some memory overflow or leakage, because it happened after a rather constant number of generations (taking a text chunk and generating JSON from it).

This bug was identified and flagged on GitHub… and partially fixed as of the most recent Ollama update at the time of writing (2024-03-29).

After this update, it was possible for the first time to have the code churn through a large text: in the given case, more than 100 chunks with a chunk size of 1,000 characters (and 100 characters of overlap).

Unfortunately, with chunk counts above roughly 120, I still inevitably ran into a stalling of the LLM calls: the code execution would simply stop and not return any results anymore, even though the kernel was still active. This is still good enough, though, to process the transcripts of roughly 3 podcast episodes in one batch (but, as mentioned, the GitHub example only uses the text of a single episode to make sure that it truly works).

This problem is certainly due to the novelty of all the tools used — and may or may not go away completely with further updates.
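
A possible mitigation, which I have not tested at scale (a sketch only, reusing the df, df2Graph and graph2Df objects defined above): run the extraction in smaller batches and persist each batch, so that a stalled run loses at most the current batch.

# Sketch: process the chunk dataframe in batches and save intermediate results.
batch_size = 50  # assumption: stays well below the chunk count at which stalling was observed
partial_results = []
for start in range(0, len(df), batch_size):
    batch = df.iloc[start:start + batch_size]
    concepts = df2Graph(batch, model="mixtral:latest")
    dfg_batch = graph2Df(concepts)
    dfg_batch.to_csv(outputdirectory / f"graph_batch_{start}.csv", sep=";", index=False)
    partial_results.append(dfg_batch)

dfg1 = pd.concat(partial_results, ignore_index=True)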

Performance

In case you believe that local generation is done in a breeze: think again!

The performance of the knowledge extraction process is slow on a local machine (MacBook M1 Pro), which shows how much is going on under the hood. I measured processing times between 30 seconds and a little under a minute per chunk to produce the JSON string, with an average of around 40 seconds. So a text of ca. 100 chunks (i.e. roughly 100,000 characters at a chunk size of 1,000) requires over an hour of processing time to extract the knowledge graph. Plus: you had better not detach the power cord. The otherwise extremely frugal MacBook starts to consume electricity like hell once the script gets going.

Hence, the code also saves the result in several forms as CSV files. The knowledge graph can thus later be reproduced faster, once the extraction has taken place, simply by loading the files containing the results from the extraction process. Or the output can be used as RAG-input in a 2nd step.

As said before: there is a dedicated notebook on GitHub just for reproducing the knowledge graph from the saved files, skipping the time- and energy-intensive extraction part.
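
The essence of that “visualization only” path is simply the reload step, roughly like this (a sketch, assuming the file name variables defined above):

# Sketch: rebuild the graph input from the saved CSVs instead of re-running the extraction.
dfg1 = pd.read_csv(output_graph_file_with_path, sep=";")
dfg1['count'] = 4  # same fixed weight as assigned after extraction above
dfg2 = pd.read_csv(output_context_prox_file_with_path, sep=";")
dfg = pd.concat([dfg1, dfg2], axis=0)
# ...then continue with the grouping, NetworkX and pyvis steps shown above.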

…and if you liked the text and/or found it useful, be sure to leave an encouraging clap :-)
