Routing in a Sparse Graph: a Distributed Q-Learning Approach

By ProfitlyAI, February 3, 2026

You may have heard about the Small-World Experiment, carried out by Stanley Milgram in the 1960s. He devised an experiment in which a letter was given to a volunteer person in the US, with the instruction to forward the letter to the personal contact most likely to know another person, the target, in the same country. In turn, the person receiving the letter would be asked to forward the letter again, until the target person was reached. Although most of the letters never reached the target person, among those that did (survivorship bias at play here!), the average number of hops was around 6. The "six degrees of separation" has become a cultural reference to the tight interconnectivity of society.

I'm still amazed by the idea that people with ~10² contacts manage to connect with a random person in a network of ~10⁸ people, through a few hops.

How is that possible? Heuristics.

Let's assume I'm asked to send a letter to a target person in Finland¹.

Unfortunately, I don't have any contacts in Finland. On the other hand, I know someone who lived in Sweden for a few years. Maybe she knows people in Finland. If not, she probably still has contacts in Sweden, which is a neighboring country. She is my best bet to get closer to the target person. The point is, although I do not know the topology of the social network beyond my own personal contacts, I can use rules of thumb to forward the letter in the right direction.

Hello, Finland! Image from Illia Panasenko, on Unsplash.

If we adopt the perspective of the network's nodes (the individuals involved in the experiment), their available actions are to forward the message (the letter) to one of their outgoing edges (personal contacts). This problem of transmitting the message in the right direction gives us an opportunity to have fun with machine learning!

Nodes don't perceive the whole network topology. We can set up an environment that rewards them for routing the message along the shortest known path, while encouraging them to explore sub-optimal candidate paths. That sounds like a great use case for reinforcement learning, don't you think?

For those interested in running the code, you can reach the repo here.

The Problem

We'll be given a directed graph where edges between nodes are sparse, meaning that the average number of outgoing edges from a node is significantly smaller than the number of nodes. Additionally, the edges will have an associated cost. This extra feature generalizes the case of the Small-World Experiment, where each hop of the letter counted for a cost of 1.

The problem we'll consider is to design a reinforcement learning algorithm that finds a path from an arbitrary start node to an arbitrary target node in a sparse directed graph, with a cost as low as possible, if such a path exists. Deterministic solutions exist for this problem. For example, Dijkstra's algorithm finds the shortest path from a start node to all the other nodes in a directed graph. This will be useful to evaluate the results of our reinforcement learning algorithm, which is not guaranteed to find the optimal solution.
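As a point of reference, here is a minimal sketch of how such a Dijkstra baseline could be computed with networkx. The helper name dijkstra_baseline and the construction of the graph from a cost matrix and per-node neighbor lists are assumptions of mine, not code from the repo.

import networkx as nx

def dijkstra_baseline(cost_arr, neighbor_lists, start_node, target_node):
    # Build a weighted directed graph from the edge costs and the per-node
    # lists of outgoing neighbors, then query the shortest path.
    G = nx.DiGraph()
    for origin, neighbors in enumerate(neighbor_lists):
        for dest in neighbors:
            G.add_edge(origin, dest, weight=float(cost_arr[origin, dest]))
    # Raises nx.NetworkXNoPath if the target is unreachable from the start node
    path = nx.dijkstra_path(G, start_node, target_node, weight="weight")
    cost = nx.dijkstra_path_length(G, start_node, target_node, weight="weight")
    return path, cost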

Q-Learning

Q-Learning is a reinforcement learning technique where an agent maintains a table of state-action pairs, associated with the expected discounted cumulative reward (the quality, hence the Q in Q-Learning). Through iterative experiments, the table is updated until a stopping criterion is met. After training, the agent can choose the action (the column of the Q-matrix) corresponding to the maximal quality, for a given state (the row of the Q-matrix).

The update rule, given a trial action a_j resulting in the transition from state s_i to state s_k, a reward r, and a best estimate of the quality of the next state s_k, is:

\[ Q(i, j) \leftarrow (1 - \alpha)\, Q(i, j) + \alpha \left( r + \gamma \max_{l} Q(k, l) \right) \]

Equation 1: The update rule for Q-Learning.

    In equation 1:

• α is the learning rate, controlling how fast new results overwrite old quality estimates.
• γ is the discount factor, controlling how much weight future rewards carry compared with immediate rewards.
• Q is the quality matrix. The row index i is the index of the origin state, and the column index j is the index of the chosen action.

In short, equation 1 states that the quality of (state, action) should be partly updated with a new quality value, made of the sum of the immediate reward and the discounted estimate of the next state's maximum quality over possible actions.
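As a sanity check, here is what equation 1 looks like as a single in-place update in NumPy. This is a generic tabular sketch with variable names of my own choosing; the distributed version used later stores one Q-matrix per node.

import numpy as np

def q_update(Q, i, j, k, r, alpha=0.1, gamma=0.9):
    # Q is the quality matrix, indexed as Q[state, action].
    # i, j: origin state and chosen action; k: resulting state; r: immediate reward.
    best_next = np.max(Q[k, :])  # max_l Q(k, l)
    Q[i, j] = (1 - alpha) * Q[i, j] + alpha * (r + gamma * best_next)
    return Q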

For our problem statement, a possible formulation of the state would be the pair (current node, target node), and the set of actions would be the set of nodes. The state set would then contain N² values, and the action set would contain N values, where N is the number of nodes. However, because the graph is sparse, a given origin node has only a small subset of nodes as outgoing edges. This formulation would result in a Q-matrix where the large majority of the N³ entries are never visited, consuming memory needlessly.

Distributed agents

A better use of resources is to distribute the agents. Each node can be thought of as an agent. The agent's state is the target node, hence its Q-matrix has N rows and Nout columns (the number of outgoing edges for this particular node). With N agents, the total number of matrix entries is N²·Nout, which is lower than N³ (see the back-of-the-envelope numbers after the summary below).

    To summarize:

• We'll be training N agents, one for each node in the graph.
• Each agent will learn a Q-matrix of dimensions [N x Nout]. The number of outgoing edges (Nout) can vary from one node to another. For a sparsely connected graph, Nout << N.
• The row indices of the Q-matrix correspond to the state of the agent, i.e., the target node.
• The column indices of the Q-matrix correspond to the outgoing edge chosen by an agent to route a message toward the target node.
• Q[i, j] represents a node's estimate of the quality of forwarding the message to its jth outgoing edge, given that the target node is i.
• When a node receives a message, it will include the target node. Since the sender of the previous message is not needed to determine the routing of the next message, it will not be included in the latter.
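To put illustrative numbers on the savings (the figures below are my own, not from the article), consider 1,000 nodes with an average of 5 outgoing edges per node:

# Back-of-the-envelope comparison of the two formulations (illustrative values).
N = 1_000       # number of nodes (assumed)
N_out = 5       # average number of outgoing edges per node (assumed)

centralized_entries = (N * N) * N      # (current, target) states x N actions = N^3
distributed_entries = N * (N * N_out)  # N agents, each with an [N x Nout] Q-matrix

print(centralized_entries)  # 1_000_000_000
print(distributed_entries)  # 5_000_000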

    Code

The core class, the node, will be named QNode.

import random

import numpy as np


class QNode:
    def __init__(self, number_of_nodes=0, connectivity_average=0, connectivity_std_dev=0, Q_arr=None, neighbor_nodes=None,
                 state_dict=None):
        if state_dict is not None:
            self.Q = state_dict['Q']
            self.number_of_nodes = state_dict['number_of_nodes']
            self.neighbor_nodes = state_dict['neighbor_nodes']
        else:  # state_dict is None
            if Q_arr is None:
                self.number_of_nodes = number_of_nodes
                number_of_neighbors = connectivity_average + connectivity_std_dev * np.random.randn()
                number_of_neighbors = round(number_of_neighbors)
                number_of_neighbors = max(number_of_neighbors, 2)  # At least two out-connections
                number_of_neighbors = min(number_of_neighbors, self.number_of_nodes)  # No more than N connections
                self.neighbor_nodes = random.sample(range(self.number_of_nodes), number_of_neighbors)  # [1, 4, 5, ...]
                self.Q = np.zeros((self.number_of_nodes, number_of_neighbors))  # Optimistic initialization: all rewards will be negative
                # q = self.Q[state, action]: state = target node; action = chosen neighbor node (converted to a column index) to route the message to
            else:  # state_dict is None and Q_arr is not None
                self.Q = Q_arr
                self.number_of_nodes = self.Q.shape[0]
                self.neighbor_nodes = neighbor_nodes
The class QNode has three fields: the number of nodes in the graph, the list of outgoing edges, and the Q-matrix. The Q-matrix is initialized with zeros. The rewards obtained from the environment will be the negative of the edge costs. Hence, the quality values will all be negative. For this reason, initializing with zeros is an optimistic initialization.
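For example, a quick way to inspect a freshly created node (a sketch of my own, using the constructor as listed above with illustrative parameter values):

# Create a node for a 12-node graph with roughly 3 outgoing edges.
node = QNode(number_of_nodes=12, connectivity_average=3, connectivity_std_dev=1)
print(node.neighbor_nodes)  # e.g. [7, 3, 10]: the randomly sampled outgoing edges
print(node.Q.shape)         # (12, len(node.neighbor_nodes)), all zeros before training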

When a message reaches a QNode object, it selects one of its outgoing edges through the epsilon-greedy algorithm. If ε is small, the epsilon-greedy algorithm selects, most of the time, the outgoing edge with the highest Q-value. From time to time, it selects an outgoing edge randomly:

def epsilon_greedy(self, target_node, epsilon):
    rdm_nbr = random.random()
    if rdm_nbr < epsilon:  # Random choice
        random_choice = random.choice(self.neighbor_nodes)
        return random_choice
    else:  # Greedy choice
        neighbor_columns = np.where(self.Q[target_node, :] == self.Q[target_node, :].max())[0]  # Columns that tie for the maximum Q-value
        neighbor_column = random.choice(neighbor_columns)
        neighbor_node = self.neighbor_node(neighbor_column)
        return neighbor_node

The other class is the graph, called QGraph.

class QGraph:
    def __init__(self, number_of_nodes=10, connectivity_average=3, connectivity_std_dev=0, cost_range=[0.0, 1.0],
                 maximum_hops=100, maximum_hops_penalty=1.0):
        self.number_of_nodes = number_of_nodes
        self.connectivity_average = connectivity_average
        self.connectivity_std_dev = connectivity_std_dev
        self.cost_range = cost_range
        self.maximum_hops = maximum_hops
        self.maximum_hops_penalty = maximum_hops_penalty
        self.QNodes = []
        for node in range(self.number_of_nodes):
            self.QNodes.append(QNode(self.number_of_nodes, self.connectivity_average, self.connectivity_std_dev))

        self.cost_arr = cost_range[0] + (cost_range[1] - cost_range[0]) * np.random.random((self.number_of_nodes, self.number_of_nodes))

Its main fields are the list of nodes and the array of edge costs. Notice that the actual edges are part of the QNode class, as a list of outgoing nodes.

When we want to generate a path from a start node to a target node, we call the QGraph.trajectory() method, which calls the QNode.epsilon_greedy() method:

    def trajectory(self, start_node, target_node, epsilon):
        visited_nodes = [start_node]
        costs = []
        if start_node == target_node:
            return visited_nodes, costs
        current_node = start_node
        while len(visited_nodes) < self.maximum_hops + 1:
            next_node = self.QNodes[current_node].epsilon_greedy(target_node, epsilon)
            cost = float(self.cost_arr[current_node, next_node])
            visited_nodes.append(next_node)
            costs.append(cost)
            current_node = next_node
            if current_node == target_node:
                return visited_nodes, costs
        # We reached the maximum number of hops
        return visited_nodes, costs

The trajectory() method returns the list of visited nodes along the path and the list of costs associated with the edges that were used.

The last missing piece is the update rule of equation 1, implemented by the QGraph.update_Q() method:

def update_Q(self, start_node, neighbor_node, alpha, gamma, target_node):
    cost = self.cost_arr[start_node, neighbor_node]
    reward = -cost
    # Q_orig(target, dest) <- (1 - alpha) Q_orig(target, dest) + alpha * ( r + gamma * max_neigh' Q_dest(target, neigh') )
    Q_orig_target_dest = self.QNodes[start_node].Q[target_node, self.QNodes[start_node].neighbor_column(neighbor_node)]
    max_neigh_Q_dest_target_neigh = np.max(self.QNodes[neighbor_node].Q[target_node, :])
    updated_Q = (1 - alpha) * Q_orig_target_dest + alpha * (reward + gamma * max_neigh_Q_dest_target_neigh)
    self.QNodes[start_node].Q[target_node, self.QNodes[start_node].neighbor_column(neighbor_node)] = updated_Q

To train our agents, we iteratively loop through the pairs of (start_node, target_node) with an inner loop over the neighbor nodes of start_node, and we call update_Q(), as sketched below.
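The article does not list the training loop itself; the following is a minimal sketch of what it could look like, using the methods shown above. The epoch count and the values of alpha and gamma are assumptions of mine, not the settings used for the results below.

# Hypothetical training loop; the real one lives in the linked repository.
graph = QGraph(number_of_nodes=12, connectivity_average=3)

alpha, gamma = 0.1, 0.9   # learning rate and discount factor (assumed values)
number_of_epochs = 200    # assumed

for epoch in range(number_of_epochs):
    for start_node in range(graph.number_of_nodes):
        for target_node in range(graph.number_of_nodes):
            if start_node == target_node:
                continue
            # Sweep the outgoing edges of start_node, pulling in the neighbors'
            # current best estimates through the update rule of equation 1.
            for neighbor_node in graph.QNodes[start_node].neighbor_nodes:
                graph.update_Q(start_node, neighbor_node, alpha, gamma, target_node)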

Experiments and Results

Let's start with a simple graph of 12 nodes, with directed weighted edges.

Figure 1: A graph of 12 nodes. Image by the author.

We can observe from Figure 1 that the only incoming node to Node-1 is Node-7, and the only incoming node to Node-7 is Node-1. Therefore, no other node besides these two can reach Node-1 and Node-7. When another node is tasked with sending a message to Node-1 or Node-7, the message will bounce around the graph until the maximum number of hops is reached. We expect to see large negative Q-values in these cases.

When we train our graph, we get statistics about the cost and the number of hops as a function of the epoch, as in Figure 2:

Figure 2: Typical evolution of the cost and the path lengths (number of hops) as a function of the epoch. Image by the author.

After training, this is the Q-matrix for Node-4:

Table 1: Q-matrix for Node-4. Image by the author.

The trajectory from Node-4 to, say, Node-11, can be obtained by calling the trajectory() method, setting epsilon=0 for the greedy deterministic solution: [4, 3, 5, 11], for a total undiscounted cost of 0.9 + 0.9 + 0.3 = 2.1. The Dijkstra algorithm returns the same path.

In some rare cases, the optimal path was not found. For example, to go from Node-6 to Node-9, a particular instance of the trained Q-graph returned [6, 11, 0, 4, 10, 2, 9] for a total undiscounted cost of 3.5, while the Dijkstra algorithm returned [6, 0, 4, 10, 2, 9] for a total undiscounted cost of 3.4. As we stated before, this is expected from an iterative algorithm that is not guaranteed to find the optimal solution.
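This kind of comparison can be reproduced by rolling out the trained agents greedily and querying the Dijkstra baseline sketched earlier. Again, this is a sketch of mine; the exact evaluation code is in the repo.

# Greedy (epsilon=0) rollout of the trained agents for one (start, target) pair.
visited_nodes, costs = graph.trajectory(start_node=6, target_node=9, epsilon=0.0)
print(visited_nodes, sum(costs))   # learned path and its undiscounted cost

# Reference answer from the hypothetical dijkstra_baseline() helper defined earlier.
neighbor_lists = [qnode.neighbor_nodes for qnode in graph.QNodes]
path, cost = dijkstra_baseline(graph.cost_arr, neighbor_lists, 6, 9)
print(path, cost)                  # optimal path and its cost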

    Conclusion

We formulated the Small-World Experiment as a problem of finding the path with minimal cost between any pair of nodes in a sparse directed graph with weighted edges. We implemented the nodes as Q-Learning agents, which learn, through the update rule, to take the least costly action, given a target node.

With a simple graph, we showed that the training settled to a solution close to the optimal one.

Thank you for your time, and feel free to experiment with the code. If you have ideas for fun applications of Q-Learning, please let me know!


¹ OK, I'm going beyond the original Small-World Experiment, which should be called the Small-Country Experiment.


