`map` is an ordered structure with O(log n)-time searches.
`unordered_map` has O(1) average-time searches and O(n) in the worst
case, where a bucket contains colliding hashes and has to start chaining.
The unordered version should speed up the general case of looking up
constants.
I've added a trivial order-dependent hash (_(0, 1) and (1, 0) will return
different hashes_) to combine a 128-bit constant into a
64-bit hash that generally will not collide, using a bit-rotate to
preserve entropy.
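For illustration, a minimal sketch of such a hash combine (the names and the rotation amount here are hypothetical, not dynarmic's actual code):
```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Illustrative order-dependent combine of a 128-bit constant (two u64 halves)
// into a 64-bit hash. Rotating one half before XOR keeps (0, 1) and (1, 0)
// from colliding and preserves entropy from both halves.
struct ConstantHash {
    std::size_t operator()(const std::pair<std::uint64_t, std::uint64_t>& constant) const noexcept {
        const std::uint64_t lo = constant.first;
        const std::uint64_t hi = constant.second;
        const std::uint64_t rotated_hi = (hi << 13) | (hi >> (64 - 13));  // rotate left by 13
        return static_cast<std::size_t>(lo ^ rotated_hi);
    }
};

// Usage: std::unordered_map<std::pair<std::uint64_t, std::uint64_t>, Value, ConstantHash>
```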
In MSVC, having files with identical filenames results in massive slowdowns when compiling.
The approach I have taken to resolve this is renaming the identically named files in frontend/(A32, A64) to (a32, a64)_filename.cpp/h.
This makes dynarmic installable, and also adds a CMake package config
file that allows projects to use `find_package(dynarmic)` to import the
library.
I know #636 adds the same thing, but while experimenting with the
different install options in
https://github.com/merryhime/dynarmic/pull/636#discussion_r725656034
I ended up with a working patch, so I'm proposing this as well. This
implements solution 2.
This adds versioning information to the built library.
When building the shared library on Linux systems, a new object will
be created: `libdynarmic.so.5`.
This is really useful when talking about ABI compatibility.
The variables `dynarmic_VERSION` and `dynarmic_VERSION_MAJOR`
are implicitly created when calling `project(dynarmic VERSION x.y.z)`.
Adds all elements of a vector and puts the result into the lowest element.
Accelerates the `addv` instruction with a vectorized implementation
rather than a serial one.
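As a sketch of the idea (SSE2 intrinsics, not the actual emitted code), a pairwise reduction of four 32-bit lanes looks like this:
```cpp
#include <emmintrin.h>

// Sketch: sum four u32 lanes with two pairwise adds instead of extracting
// and adding each lane serially. Lane 0 of the result holds the total.
__m128i HorizontalAddU32(__m128i x) {
    x = _mm_add_epi32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2)));  // [a+c, b+d, c+a, d+b]
    x = _mm_add_epi32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1)));  // total in every lane
    return x;  // addv writes the sum to the lowest element; upper lanes would also need zeroing
}
```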
The lane-splatting variants of `FMUL` and `FMLA` are very
common in instruction streams when implementing things like
matrix multiplication. When they are used, they are used very densely.
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-3-matrix-multiplication
The way this is currently implemented is by grabbing the particular lane
into a general-purpose register and then broadcasting it into a SIMD
register through `VectorGetElement` and `VectorBroadcast`.
```cpp
const IR::U128 operand2 = v.ir.VectorBroadcast(esize, v.ir.VectorGetElement(esize, v.V(idxdsize, Vm), index));
```
What could be done instead is to keep the value within
the vector register and use a permute/shuffle to "splat" the particular
lane across all other lanes, removing the GPR round-trip.
This is implemented as the new IR instruction `VectorBroadcastElement`:
```cpp
const IR::U128 operand2 = v.ir.VectorBroadcastElement(esize, v.V(idxdsize, Vm), index);
```
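For reference, a 32-bit lane splat can stay entirely in the vector domain on x64 with a single shuffle. This is a hedged sketch with SSE intrinsics, not the backend's actual lowering:
```cpp
#include <xmmintrin.h>

// Sketch: broadcast lane `index` of a vector of f32 across all lanes without
// a GPR round-trip. The shuffle immediate must be a compile-time constant,
// hence the switch over literal immediates.
__m128 BroadcastLane32(__m128 v, int index) {
    switch (index) {
    case 0:  return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    case 1:  return _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    case 2:  return _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    default: return _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
    }
}
```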
Recursive calls to `Replicate` beyond the first call might
cause an unintentional up-casting to an `int` type due
to `|` and `<<` operations on types such as `uint8_t` and `uint16_t`.
This makes sure calls such as `Replicate<u8>` stay as the `u8` type
throughout.
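A minimal sketch of the pitfall and the fix (illustrative only; dynarmic's actual `Replicate` has a different signature):
```cpp
#include <cstddef>
#include <cstdint>

// Sketch: replicate an element_bits-wide pattern until it fills total_bits.
// Without the static_cast, `value | (value << element_bits)` promotes u8/u16
// operands to int, so the recursion would silently operate on a widened type.
template<typename T>
constexpr T ReplicateSketch(T value, std::size_t element_bits, std::size_t total_bits) {
    if (element_bits >= total_bits) {
        return value;
    }
    const T doubled = static_cast<T>(value | (value << element_bits));  // stay as T
    return ReplicateSketch<T>(doubled, element_bits * 2, total_bits);
}

static_assert(ReplicateSketch<std::uint8_t>(0x5, 4, 8) == 0x55);
```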
xbyak is intended to be installed in /usr/local/include/xbyak.
Since we prefer not to install xbyak before using it, we copy the headers
to the appropriate directory structure and use that instead.
AVX512 introduces _unsigned_ variants of the float-to-integer conversion
functions via `vcvttps2udq` and `vcvttpd2uqq`. In the case that a value is not
representable as an unsigned integer, it will result in `0xFFFFF...`,
which can be utilized to get "free" saturation when the floating point
value exceeds the unsigned range, after masking away negative values.
https://www.felixcloutier.com/x86/vcvttps2udq
https://www.felixcloutier.com/x86/vcvttpd2uqq
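A sketch of the saturation trick for f32 -> u32 (requires AVX512F+AVX512VL; illustrative, not the emitted code):
```cpp
#include <immintrin.h>

// Sketch: truncating f32 -> u32 conversion with "free" upper saturation.
// Out-of-range inputs convert to 0xFFFFFFFF, which is exactly the saturated
// value; negatives are clamped to zero first so they don't also hit 0xFFFFFFFF.
__m128i ConvertF32ToU32Saturated(__m128 x) {
    const __m128 non_negative = _mm_max_ps(x, _mm_setzero_ps());  // mask away negative values
    return _mm_cvttps_epu32(non_negative);                        // vcvttps2udq
}
```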
This PR also speeds up the _signed_ conversion function for fp64->int64
https://www.felixcloutier.com/x86/vcvttpd2qq
`And(a, Not(b))` is a common enough operation that this can
be fused into a single `AndNot` operation. On x64 this is also
a single `pandn` instruction rather than two.
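For reference, a sketch of the x64 mapping (note that `pandn` negates its first operand, so the operands are swapped relative to the IR):
```cpp
#include <emmintrin.h>

// Sketch: And(a, Not(b)) as a single instruction.
// _mm_andnot_si128(b, a) computes (~b) & a.
__m128i AndNotSketch(__m128i a, __m128i b) {
    return _mm_andnot_si128(b, a);
}
```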
This implementation exists within the unsafe optimization paths and
utilizes the 14-bit-precision `vrsqrt14*` and `vrcp14*`
instructions provided by AVX512F+VL. These are _more_ accurate than
the fallback path and the current `rsqrt`-based unsafe code-path,
but still fall in line with what is expected of the
`Unsafe_ReducedErrorFP` optimization flag.
With AVX512 available, these functions have 14 bits of precision.
Without AVX512, they have 11 bits of precision.
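A sketch of the approximation intrinsics involved (the 14-bit forms require AVX512F+AVX512VL; illustrative only):
```cpp
#include <immintrin.h>

// Sketch: the 14-bit approximations used when AVX512 is available,
// next to the lower-precision legacy approximations used otherwise.
__m128 RecipApprox14(__m128 x) { return _mm_rcp14_ps(x); }    // AVX512: ~14 bits of precision
__m128 RSqrtApprox14(__m128 x) { return _mm_rsqrt14_ps(x); }  // AVX512: ~14 bits of precision
__m128 RecipApprox11(__m128 x) { return _mm_rcp_ps(x); }      // fallback rcpps, lower precision
__m128 RSqrtApprox11(__m128 x) { return _mm_rsqrt_ps(x); }    // fallback rsqrtps, lower precision
```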
Removes dependency on the constants at the top of some files
such as `f16_negative_zero` and `f32_non_sign_mask` in favor
of the `FPInfo` trait-type.
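A minimal sketch of what such a trait-type provides (the names here are illustrative, not dynarmic's actual `FPInfo`):
```cpp
#include <cstdint>

// Illustrative trait: per-format constants that previously lived as
// file-local values such as f16_negative_zero and f32_non_sign_mask.
template<typename FPT>
struct FPInfoSketch;

template<>
struct FPInfoSketch<std::uint16_t> {  // IEEE 754 binary16
    static constexpr std::uint16_t sign_mask = 0x8000;
    static constexpr std::uint16_t non_sign_mask = 0x7FFF;
    static constexpr std::uint16_t negative_zero = sign_mask;
};

template<>
struct FPInfoSketch<std::uint32_t> {  // IEEE 754 binary32
    static constexpr std::uint32_t sign_mask = 0x80000000;
    static constexpr std::uint32_t non_sign_mask = 0x7FFFFFFF;
    static constexpr std::uint32_t negative_zero = sign_mask;
};
```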
Also removes bypass delays by selecting between instructions
such as `pand`, `andps`, or `andpd` depending on the type,
keeping them in their respective uop domains.
See https://www.agner.org/optimize/instruction_tables.pdf for
more info on bypass delays.
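As an illustration of the domain-aware selection (a sketch using xbyak directly; the real emitter plumbing differs):
```cpp
#include <xbyak/xbyak.h>

// Sketch: pick the bitwise-AND encoding that matches the operand's uop domain,
// avoiding a bypass delay when the result feeds integer or FP instructions.
enum class Domain { Integer, Float32, Float64 };

void EmitAnd(Xbyak::CodeGenerator& code, const Xbyak::Xmm& dst, const Xbyak::Xmm& src, Domain domain) {
    switch (domain) {
    case Domain::Integer: code.pand(dst, src);  break;  // integer domain
    case Domain::Float32: code.andps(dst, src); break;  // single-precision FP domain
    case Domain::Float64: code.andpd(dst, src); break;  // double-precision FP domain
    }
}
```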
Currently, every usage of `gf2p8affineqb` is guarded by the
`AVX512F + AVX512VL + GFNI` requirement, when really
we only need `GFNI` on its own.
This will allow `GFNI`-only chips to emit GFNI instructions without
needing to have AVX512 as well.
There _are_ chips in existence currently that strictly ship with GFNI and
have no implementation of AVX1/AVX2/AVX512 (and thus no VEX/EVEX
encoding), such as Tremont (Lakefield) chips.
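For example, the legacy-SSE encoding of `gf2p8affineqb` needs only GFNI. A hedged sketch of a byte-wise bit reversal using it (compiles with just `-mgfni`, no AVX required; illustrative, not dynarmic's code):
```cpp
#include <immintrin.h>

// Sketch: reverse the bits of every byte via a GF(2) affine transform.
// 0x8040201008040201 is the bit-reversal matrix for gf2p8affineqb.
__m128i ReverseBitsPerByte(__m128i x) {
    const __m128i matrix = _mm_set1_epi64x(static_cast<long long>(0x8040201008040201ULL));
    return _mm_gf2p8affine_epi64_epi8(x, matrix, 0);
}
```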
We might want to allocate different sizes for each of them,
e.g. for the unsafe fastmem approach without bounds checking,
or for using the full 48-bit address range (with mirrors) by allocating our real arena as close to 1<<47 as possible.
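A rough sketch of that second idea (Linux-only and hypothetical; the kernel is free to ignore the hint address):
```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Sketch: reserve (not commit) a large arena near the top of the 48-bit
// user address space so mirrored guest views can live below it.
void* ReserveArenaNearTop(std::size_t size) {
    void* const hint = reinterpret_cast<void*>((std::uint64_t{1} << 47) - size);
    // PROT_NONE + MAP_NORESERVE reserves address space without committing memory.
    return mmap(hint, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}
```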