
CONFUSABLE_HOMOGLYPHS(1) confusable_homoglyphs CONFUSABLE_HOMOGLYPHS(1)

NAME

confusable_homoglyphs - confusable_homoglyphs Documentation

Contents:

CONFUSABLE_HOMOGLYPHS [DOC]

This project has been adopted from the original confusable_homoglyphs by Victor Felder.

A homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar (Wikipedia: Homoglyph).

Unicode homoglyphs can be a nuisance on the web. Your most popular client, AlaskaJazz, might be upset to be impersonated by a trickster who deliberately chose the username ΑlaskaJazz.

  • AlaskaJazz is single script: only Latin characters.
  • ΑlaskaJazz is mixed-script: the first character is a Greek letter.
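
The difference between the two usernames is invisible on screen but easy to see programmatically. The following is a minimal sketch using only the Python standard library (not this package); the helper name first_char_name is hypothetical and exists only for illustration.

```python
import unicodedata

latin_name = "AlaskaJazz"
# "\u0391" is GREEK CAPITAL LETTER ALPHA, which renders like a Latin "A".
spoofed_name = "\u0391laskaJazz"


def first_char_name(text):
    """Return the official Unicode name of the first character of text."""
    return unicodedata.name(text[0])


# The two strings look alike but do not compare equal.
print(latin_name == spoofed_name)
print(first_char_name(latin_name))
print(first_char_name(spoofed_name))
```

Comparing strings for equality is therefore not enough to catch impersonation; the visual similarity has to be checked explicitly.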

You might also want to avoid people being tricked into entering their password on www.microsоft.com or www.faϲebook.com instead of www.microsoft.com or www.facebook.com. Here is a utility to play with these confusable homoglyphs.

Not all mixed-script strings have to be ruled out, though. You could exclude only those mixed-script strings that contain characters which might be confused with a character from Unicode script blocks of your choosing.

  • Allo and ρττ are fine: single script.
  • AlloΓ is fine when our preferred script alias is ‘latin’: mixed script, but Γ is not confusable.
  • Alloρ is dangerous: mixed script and ρ could be confused with p.

This library is compatible with Python 3.

API documentation

Is the data up to date?

Yep.

The Unicode script block aliases and names for each character are extracted from a file provided by the Unicode Consortium.

The matrix of which characters can be confused with which other characters is built from a second file provided by the Unicode Consortium.

This data is stored in two JSON files: categories.json and confusables.json. If you delete them, they will both be recreated by downloading and parsing the two above-mentioned files and stored as JSON files again.
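
The "recreate if missing" behaviour described above is a common caching pattern. The sketch below illustrates the idea only; it is not the package's own code, and load_or_rebuild and fake_rebuild are hypothetical names. The real rebuild step downloads and parses Unicode data, which is replaced here by a tiny stand-in.

```python
import json
import os
import tempfile


def load_or_rebuild(path, rebuild):
    """Load JSON from path; if the file is missing, rebuild and store it first."""
    if not os.path.exists(path):
        data = rebuild()
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def fake_rebuild():
    # Stand-in for downloading and parsing the Unicode source file.
    return {"p": ["\u03c1"]}


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "confusables.json")
    first = load_or_rebuild(cache, fake_rebuild)   # file missing: rebuilt
    second = load_or_rebuild(cache, fake_rebuild)  # file present: loaded
```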

INSTALLATION

If available, install an appropriate package from your distribution.

Otherwise you can install from PyPI at the command line:

$ pip install confusable_homoglyphs


or, if you have virtualenvwrapper installed:

$ mkvirtualenv confusable_homoglyphs
$ pip install confusable_homoglyphs


USAGE

To use confusable_homoglyphs in a project:

$ pip install confusable_homoglyphs
>>> import confusable_homoglyphs
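
A typical use is to reject risky usernames at sign-up. The following is a sketch under stated assumptions: check_username and stand_in_checker are hypothetical names, and the checker is passed in as a parameter so the example also runs without the package installed. When no checker is given, it falls back to confusables.is_dangerous, which is documented below.

```python
def check_username(name, is_dangerous=None):
    """Return name unchanged, or raise ValueError if the checker flags it."""
    if is_dangerous is None:
        # Imported lazily so the sketch can be exercised with a stand-in checker.
        from confusable_homoglyphs import confusables
        is_dangerous = confusables.is_dangerous
    if is_dangerous(name):
        raise ValueError("username mixes scripts in a confusable way")
    return name


def stand_in_checker(name):
    # Hypothetical stand-in: flags any non-ASCII name. Far cruder than the library.
    return not name.isascii()


print(check_username("AlaskaJazz", is_dangerous=stand_in_checker))
```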


To update the data files, you first need to install the “cli” bundle, then run the “update” command:

pip install confusable_homoglyphs[cli]
confusable_homoglyphs update


API DOCUMENTATION

confusable_homoglyphs package

Submodules

confusable_homoglyphs.categories module

confusable_homoglyphs.categories.alias(chr)

Retrieves the script block alias for a unicode character.

>>> categories.alias('A')
'LATIN'
>>> categories.alias('τ')
'GREEK'
>>> categories.alias('-')
'COMMON'

Parameters: chr (str) – A unicode character
Returns: The script block alias.
Return type: str


confusable_homoglyphs.categories.aliases_categories(chr)

Retrieves the script block alias and unicode category for a unicode character.

>>> categories.aliases_categories('A')
('LATIN', 'L')
>>> categories.aliases_categories('τ')
('GREEK', 'L')
>>> categories.aliases_categories('-')
('COMMON', 'Pd')

Parameters: chr (str) – A unicode character
Returns: The script block alias and unicode category for a unicode character.
Return type: (str, str)


confusable_homoglyphs.categories.category(chr)

Retrieves the unicode category for a unicode character.

>>> categories.category('A')
'L'
>>> categories.category('τ')
'L'
>>> categories.category('-')
'Pd'

Parameters: chr (str) – A unicode character
Returns: The unicode category for a unicode character.
Return type: str


confusable_homoglyphs.categories.unique_aliases(string)

Retrieves all unique script block aliases used in a unicode string.

>>> categories.unique_aliases('ABC')
{'LATIN'}
>>> categories.unique_aliases('ρAτ-')
{'GREEK', 'LATIN', 'COMMON'}

Parameters: string (str) – A unicode string
Returns: A set of the script block aliases used in a unicode string.
Return type: set


confusable_homoglyphs.cli module

Generates the categories JSON data file from the unicode specification.
Returns: True for success, raises otherwise.
Return type: bool


Generates the confusables JSON data file from the unicode specification.
Returns: True for success, raises otherwise.
Return type: bool


confusable_homoglyphs.confusables module


confusable_homoglyphs.confusables.is_confusable(string, greedy=False, preferred_aliases=[])

Checks if string contains characters which might be confusable with characters from preferred_aliases.

If greedy=False, it only returns the first confusable character found, without looking at the rest of the string; greedy=True returns all of them.

preferred_aliases=[] can take a list of unicode block aliases to be considered as your ‘base’ unicode blocks.

Considering paρa:
  • with preferred_aliases=['latin'], the 3rd character ρ would be returned because this Greek letter can be confused with Latin p.
  • with preferred_aliases=['greek'], the 1st character p would be returned because this Latin letter can be confused with Greek ρ.
  • with preferred_aliases=[] and greedy=True, you’ll discover the 29 characters that can be confused with p, the 23 characters that look like a, and the one that looks like ρ (which is, of course, p aka LATIN SMALL LETTER P).


>>> confusables.is_confusable('paρa', preferred_aliases=['latin'])[0]['character']
'ρ'
>>> confusables.is_confusable('paρa', preferred_aliases=['greek'])[0]['character']
'p'
>>> confusables.is_confusable('Abç', preferred_aliases=['latin'])
False
>>> confusables.is_confusable('AlloΓ', preferred_aliases=['latin'])
False
>>> confusables.is_confusable('ρττ', preferred_aliases=['greek'])
False
>>> confusables.is_confusable('ρτ.τ', preferred_aliases=['greek', 'common'])
False
>>> confusables.is_confusable('ρττp')
[{'homoglyphs': [{'c': 'p', 'n': 'LATIN SMALL LETTER P'}], 'alias': 'GREEK', 'character': 'ρ'}]

Parameters:
  • string (str) – A unicode string
  • greedy (bool) – Don’t stop on finding one confusable character - find all of them.
  • preferred_aliases (list(str)) – Script blocks aliases which we don’t want string’s characters to be confused with.

Returns: False if not confusable; otherwise a list of all confusable characters and the characters they can be confused with.
Return type: bool or list
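
Because the return value is either False or a list of dictionaries, callers usually need to handle both shapes. The sketch below formats a result for display; describe_confusables is a hypothetical helper, not part of the package, and the sample value is copied from the last doctest above.

```python
def describe_confusables(result):
    """Turn an is_confusable()-style result into readable lines."""
    if not result:
        return []  # False means nothing confusable was found
    lines = []
    for entry in result:
        lookalikes = ", ".join(
            "{} ({})".format(h["c"], h["n"]) for h in entry["homoglyphs"]
        )
        lines.append(
            "{} [{}] may be confused with: {}".format(
                entry["character"], entry["alias"], lookalikes
            )
        )
    return lines


# Sample result in the documented shape ("\u03c1" is GREEK SMALL LETTER RHO).
sample = [{'homoglyphs': [{'c': 'p', 'n': 'LATIN SMALL LETTER P'}],
           'alias': 'GREEK', 'character': '\u03c1'}]

for line in describe_confusables(sample):
    print(line)
```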


confusable_homoglyphs.confusables.is_dangerous(string, preferred_aliases=[])

Checks if string can be dangerous, i.e. whether it is not only mixed-script but also contains characters from scripts other than those in preferred_aliases that might be confused with characters from scripts in preferred_aliases.

For preferred_aliases examples, see is_confusable docstring.

>>> bool(confusables.is_dangerous('Allo'))
False
>>> bool(confusables.is_dangerous('AlloΓ', preferred_aliases=['latin']))
False
>>> bool(confusables.is_dangerous('Alloρ'))
True
>>> bool(confusables.is_dangerous('AlaskaJazz'))
False
>>> bool(confusables.is_dangerous('ΑlaskaJazz'))
True
    
Parameters:
  • string (str) – A unicode string
  • preferred_aliases (list(str)) – Script blocks aliases which we don’t want string’s characters to be confused with.

Returns: Whether the string is dangerous.
Return type: bool
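
The definition combines two conditions: the string is mixed-script, and a character outside the preferred scripts resembles a character inside them. The toy model below illustrates that combination with the standard library only. It is not the package's implementation: TOY_CONFUSABLES is a hand-written, hypothetical stand-in for the real data, rough_alias guesses the script from the Unicode character name, and the default preferred_aliases=("LATIN",) is an assumption made for this toy.

```python
import unicodedata

# Hypothetical, hand-written stand-in for the real confusables data.
TOY_CONFUSABLES = {
    "\u03c1": ["p"],   # GREEK SMALL LETTER RHO resembles Latin "p"
    "p": ["\u03c1"],
    "\u0391": ["A"],   # GREEK CAPITAL LETTER ALPHA resembles Latin "A"
    "A": ["\u0391"],
}


def rough_alias(ch):
    """Guess a script alias from the first word of the Unicode name."""
    first = unicodedata.name(ch, "").split(" ")[0]
    return first if first in ("LATIN", "GREEK") else "COMMON"


def toy_is_dangerous(string, preferred_aliases=("LATIN",)):
    preferred = {alias.upper() for alias in preferred_aliases}
    scripts = {rough_alias(ch) for ch in string} - {"COMMON"}
    if len(scripts) <= 1:
        return False  # single script: never flagged
    for ch in string:
        if rough_alias(ch) in preferred:
            continue
        # A character outside the preferred scripts is a problem only
        # if it resembles a character inside them.
        for lookalike in TOY_CONFUSABLES.get(ch, []):
            if rough_alias(lookalike) in preferred:
                return True
    return False


print(toy_is_dangerous("Allo"))
print(toy_is_dangerous("Allo\u0393"))   # Greek capital gamma: no Latin lookalike
print(toy_is_dangerous("Allo\u03c1"))   # Greek rho resembles Latin "p"
```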


confusable_homoglyphs.confusables.is_mixed_script(string, allowed_aliases=['COMMON'])

Checks if string contains mixed-scripts content, excluding script blocks aliases in allowed_aliases.

E.g. B. C is not considered mixed-scripts by default: it contains characters from Latin and Common, but Common is excluded by default.

>>> confusables.is_mixed_script('Abç')
False
>>> confusables.is_mixed_script('ρτ.τ')
False
>>> confusables.is_mixed_script('ρτ.τ', allowed_aliases=[])
True
>>> confusables.is_mixed_script('Alloτ')
True
    
Parameters:
  • string (str) – A unicode string
  • allowed_aliases (list(str)) – Script blocks aliases not to consider.

Returns: Whether string is considered mixed-scripts or not.
Return type: bool
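
The rule can be stated over a set of script aliases: discard the allowed aliases and see whether more than one script remains. The following is a minimal sketch of that rule; looks_mixed_script is a hypothetical name, and it takes the aliases directly (for example the output of categories.unique_aliases) rather than a string.

```python
def looks_mixed_script(aliases, allowed_aliases=("COMMON",)):
    """Return True if more than one script remains after removing allowed ones."""
    allowed = {alias.upper() for alias in allowed_aliases}
    return len(set(aliases) - allowed) > 1


# Greek letters plus a full stop: Common is ignored by default.
print(looks_mixed_script({"GREEK", "COMMON"}))
# Nothing is ignored, so Greek plus Common counts as mixed.
print(looks_mixed_script({"GREEK", "COMMON"}, allowed_aliases=()))
# Latin plus Greek is mixed either way.
print(looks_mixed_script({"LATIN", "GREEK"}))
```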


confusable_homoglyphs.utils module

Deletes a JSON data file if it exists.



Loads a JSON data file.
Returns: A dict.
Return type: dict


Returns a file path relative to the data directory.

This is the package directory by default, or the env variable CONFUSABLE_DATA if set.

Returns: A file path string.
Return type: str
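
The lookup order described above (environment variable first, package directory otherwise) might look like the sketch below. This is an illustration, not the package's code: data_file_path, package_dir and the environ parameter are hypothetical, and environ is injectable only so the behaviour can be shown without changing the real environment.

```python
import os


def data_file_path(filename, package_dir, environ=None):
    """Resolve filename against CONFUSABLE_DATA if set, else package_dir."""
    environ = os.environ if environ is None else environ
    base_dir = environ.get("CONFUSABLE_DATA") or package_dir
    return os.path.join(base_dir, filename)


# Without the variable, the package directory is used.
print(data_file_path("confusables.json", "/pkg", environ={}))
# With the variable set, it takes precedence.
print(data_file_path("confusables.json", "/pkg",
                     environ={"CONFUSABLE_DATA": "/data"}))
```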



Module contents

CONTRIBUTING

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions

Report Bugs

Report bugs at https://todo.sr.ht/~valhalla/confusable_homoglyphs

If you are reporting a bug, please include:

  • Any details about your local setup that might be helpful in troubleshooting.
  • Detailed steps to reproduce the bug.

Fix Bugs

Look through the sourcehut tickets for bugs. Anything tagged with “bug” is open to whoever wants to implement it.

Implement Features

Look through the sourcehut tickets for features. Anything tagged with “feature” is open to whoever wants to implement it.

Write Documentation

confusable_homoglyphs could always use more documentation, whether as part of the official confusable_homoglyphs docs, in docstrings, or even on the web in blog posts, articles, and such.

Submit Feedback

The best way to send feedback is to file an issue at https://todo.sr.ht/~valhalla/confusable_homoglyphs.

If you are proposing a feature:

  • Explain in detail how it would work.
  • Keep the scope as narrow as possible, to make it easier to implement.
  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Get Started!

Ready to contribute? Here’s how to set up confusable_homoglyphs for local development.

1.
Clone the git repository from sourcehut:

2.
Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:

$ mkvirtualenv confusable_homoglyphs
$ cd confusable_homoglyphs/
$ python setup.py develop


3.
Create a branch for local development:

$ git checkout -b name-of-your-bugfix-or-feature


Now you can make your changes locally.

4.
When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:

$ flake8 confusable_homoglyphs tests
$ python setup.py test
$ tox


To get flake8 and tox, just pip install them into your virtualenv.

5.
Commit your changes:

$ git add .
$ git commit -m "Your detailed description of your changes."



6.
Send the patch to ~valhalla/confusable_homoglyphs-devel@lists.sr.ht:

$ git send-email \
    --to="~valhalla/confusable_homoglyphs-devel@lists.sr.ht" \
    HEAD^


See https://git-send-email.io/ for details on how to install and configure git-send-email.


Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

1.
The pull request should include tests.
2.
If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
3.
The pull request should work for all supported Python versions.

CREDITS

Original Author and Former Maintainer

Victor Felder <victorfelder@gmail.com>

Current Maintainer

Elena “of Valhalla” Grandi <valhalla@trueelena.org>

Contributors

Ryan P Kilby <rpkilby@ncsu.edu>

HISTORY

1.0.0

Initial release.

2.0.0

allowed_categories renamed to allowed_aliases

2.0.1


3.0.0

Courtesy of Ryan P Kilby, via https://github.com/vhf/confusable_homoglyphs/pull/6 :

  • Changed file paths to be relative to the confusable_homoglyphs package directory instead of the user’s current working directory.
  • Data files are now distributed with the packaging.
  • Fixes tests so that they use the installed distribution instead of the local files. (Originally, the data files were erroneously showing up during testing, despite not being included in the distribution).
  • Moves the data file generation into a simple CLI. This way, users have a method for controlling when the data files are updated.
  • Since the data files are now included in the distribution, the CLI is made optional. Its dependencies can be installed with the cli bundle, e.g. pip install confusable_homoglyphs[cli].

3.1.0

Update unicode data

3.1.1

Update unicode data (via ftp)

3.2.0

  • Drop support for Python 3.3
  • Fix #11: work as expected when char not found in datafiles

3.3.0

  • Drop support for Python 2
  • Drop support for Python < 3.7, add support for Python up to 3.12
  • Allow using data files from a custom location set with the CONFUSABLE_DATA environment variable.
  • Fix the return value of confusables.is_dangerous() to match the documented API of a boolean value. It used to return either False or the list output of confusables.is_confusable().
  • Added a check command for command line use.

3.3.1

Update unicode data

AUTHOR

Victor Felder

COPYRIGHT

2024, Victor Felder

January 30, 2024 3.3.1