banner
Nanomoa

Nanomoa's Blog

"Bytes and Brackets: My Journey in Go and C++ Exploration."
github
zhihu
telegram
twitter

Jaccard similarity coefficient

Rambling#

Recently, I have been working on a course project for data structures. Our group's topic is the campus tour system. Because we are not very skilled, we couldn't come up with a good idea 😭. While searching for information (snooping on other people's ideas), I came across some introductions to recommendation algorithms, so I added them to our project.

Jaccard Similarity Coefficient#

Definition#

Jaccard(A,B)=ABAB=ABA+BABJaccard(A, B) = \frac{ | A \cap B | }{ | A \cup B | } = \frac{ | A \cap B | }{ | A | + | B | - | A \cap B | }, if A=A = \varnothing and B=B = \varnothing, Jaccard(A,B)=1Jaccard(A, B) = 1

Note: Jaccard(A,B)[0,1]Jaccard(A, B) \in [0, 1]

Jaccard Distance#

Used to describe the dissimilarity between two sets

DistJaccard=1Jaccard(A,B)=AΔBABDist_{Jaccard} = 1 - Jaccard(A, B) = \frac{ A \Delta B }{ | A \cup B | }

Simple Recommendation Algorithm Implementation#

Definition of the original building/spot node:#

struct Location {
    double longitude;
    double latitude;

    [[nodiscard]] std::string toString() const {
        return std::to_string(longitude) + "," + std::to_string(latitude);
    }
};

template<typename IdType>
struct Spot {
    IdType id;
    std::string title;
    std::string description;
    Location location{};
};

Implementation Idea#

The goal is to recommend other similar buildings based on the current building. So, I thought of adding a tags field (set) to the Spot structure and assigning some tags to each building based on its usage, such as "teaching building", "laboratory building", "general course", "professional course", "theory course", "lab course", etc., using the Jaccard similarity coefficient as the measure of similarity.

For each building, recommend buildings with a similarity greater than similarityThreshold (minimum similarity), and prioritize buildings with smaller distances. The value of similarityThreshold needs to be determined considering the overall number of tags.

Definition after modification:#

struct Location {
    double longitude;
    double latitude;

    [[nodiscard]] std::string toString() const {
        return std::to_string(longitude) + "," + std::to_string(latitude);
    }
};

template<typename IdType>
struct Spot {
    IdType id;
    std::string title;
    std::unordered_set<std::string> tags;
    std::string description;
    Location location{};
};

Code Implementation:#

template<typename IdType>
class Recommend {
private:
    // Calculate Jaccard similarity coefficient
    static double jaccardSimilarity(
            const std::unordered_set<std::string>& set1,
            const std::unordered_set<std::string>& set2
    ) {
        std::unordered_set<std::string> intersection;
        std::unordered_set<std::string> union_set;

        std::copy_if(set1.begin(), set1.end(),
                     std::inserter(union_set, union_set.end()),
                     [&](const std::string& s) {
                         return set2.find(s) != set2.end();
                     }
        );

        std::set_intersection(set1.begin(), set1.end(),
                              set2.begin(), set2.end(),
                              std::inserter(
                                      intersection,
                                      intersection.end()
                              )
        );

        std::set_union(set1.begin(), set1.end(),
                       set2.begin(), set2.end(),
                       std::inserter(
                               union_set,
                               union_set.end()
                       )
        );

        return static_cast<double>(intersection.size()) /
               static_cast<double>(union_set.size());
    }

public:
    static std::vector<std::pair<IdType, double>> recommendSimilarSpots(
            Graph<IdType> graph,
            const Spot<IdType>& referenceSpot,
            double similarityThreshold = 0.3
    ) {
        std::vector<std::pair<IdType, double>> similarityScores;

        for (const auto& spotPair : graph.getSpots()) {
            const auto& spot = spotPair.second;
            if (spot.id != referenceSpot.id) {
                double similarity = jaccardSimilarity(
                        referenceSpot.tags,
                        spot.tags
                );
                if (similarity >= similarityThreshold) {
                    similarityScores.push_back(
                            std::make_pair(spot.id, similarity)
                    );
                }
            }
        }

        Sort(similarityScores.begin(), similarityScores.end(),
             [&graph, &referenceSpot](
                     const std::pair<IdType, double>& a,
                     const std::pair<IdType, double>& b
             ) {
                 if (a.second == b.second) {
                     return distanceHaversine(referenceSpot.location, graph.getSpot(a.first).location) <
                            distanceHaversine(referenceSpot.location, graph.getSpot(b.first).location);
                 }
                 return a.second > b.second;
             }
        );

        return similarityScores;
    }
};
Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.