Conversation
Much easier to pick one to debug this way
We can safely use an unbounded `@cache`, because there can only be 16 valid pairs
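To illustrate the point about the unbounded cache: a minimal sketch (all names here are invented for illustration, not Narwhals API) of why `functools.cache` is safe when the argument domain is small and fixed — the cache can never grow beyond the number of distinct valid pairs.

```python
from functools import cache

# Hypothetical sketch: with a small, fixed domain of valid (left, right)
# dtype pairs, an unbounded functools.cache can never grow beyond that
# domain, so no LRU bound is needed.
VALID_PAIRS = {
    ("Int32", "Int64"): "Int64",
    ("Int64", "Int32"): "Int64",
    ("Float32", "Float64"): "Float64",
    ("Float64", "Float32"): "Float64",
}

@cache
def supertype(left: str, right: str) -> "str | None":
    # Cache entries are bounded by the number of distinct argument pairs.
    if left == right:
        return left
    return VALID_PAIRS.get((left, right))
```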
As much as is possible without #3396
Need to decide how many of the others to leave as todos. Main theme is needing `get_supertype` (#3396)
Everything left requires `get_supertype` (#3396)
* refactor: Replace `_same_supertype` with a custom `@singledispatch`. This is more generally useful and a LOT easier to read from the outside
* refactor: Just use a real class
* fix(typing): Satisfy `mypy`
* fix: Oops forgot the first element
* refactor(typing): Use slightly better names
* chore: Rename `default` -> `upper_bound`
* docs: Replace debugging doc
* docs: More cleanup
* refactor: Use `__slots__`, remove a field
* docs: More, more cleanup
* docs: lil bit of `.register` progress
* cov
* test: Get full coverage for `@just_dispatch`
* chore: Give it a simple repr
* test: Oops, forgot that was an override
* revert: Keep only what is required. See #3396 (comment)
* refactor: Simplify `@just_dispatch` signature
* fix(typing): Satisfy mypy
* test: Gotta get that coverage. Resolves #3410 (comment)
* docs: Restore a minimal version of `@just_dispatch` doc. Resolves #3410 (comment)
* revert: Remove `Impl` alias. #3410 (comment)
* refactor: Rename `Passthrough` -> `PassthroughFn`. Suggested in #3410 (review)
* docs: Add note to use only on internal. Suggested in #3410 (review)
MarcoGorelli
left a comment
Are we sure we should be doing this?
I don't think different libraries follow the same supertyping rules, and i'm not sure it's something we should impose
e.g. Datetime('us') vs Datetime('ns'): Polars goes to the former, pandas to the latter
```python
In [16]: df = pl.DataFrame({'a': [datetime(2020,1,1)]})

In [17]: pl.concat([df.with_columns(pl.col('a').cast(pl.Datetime('ns'))), df], how='vertical_relaxed')
Out[17]:
shape: (2, 1)
┌─────────────────────┐
│ a                   │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2020-01-01 00:00:00 │
│ 2020-01-01 00:00:00 │
└─────────────────────┘

In [18]: pd.concat([df.with_columns(pl.col('a').cast(pl.Datetime('ns'))).to_pandas(), df.to_pandas()], axis=0).dtypes
Out[18]:
a    datetime64[ns]
dtype: object
```
But doesn't that inconsistency show an example of how - if we don't address it - there's a knock-on effect to things like selectors? IMO, (#3396 (review)) is the kind of thing that won't be an issue to most use cases - but when it is, it could be a slog to debug. I wanna stress that my goal is a set of rules.
I like the rules we have here, but I'm still open to more fiddling 🙂
i think it could also be a slog to debug when someone switches from pandas (or some other library) to narwhals. for selectors, i think it's fairly common to select by kind (like all temporal columns, or all datetime ones) rather than some exact dtype (like
@MarcoGorelli I find the polars behavior a bit odd. I didn't check what other backends do, nor could I find a polars issue on the topic. What would you propose to do here? I guess one option is that we start by not allowing supertyping for datetime and duration dtypes unless they have the same time_unit. That's probably the safest approach to begin with, as a user can always decide how to do it externally if needed
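The proposed safe starting point could be sketched like this (a toy stand-in, not the PR's implementation; the `Datetime` class and `datetime_supertype` names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Datetime:
    """Toy stand-in for a Datetime dtype; only the time unit matters here."""
    time_unit: str  # "ms", "us", or "ns"

def datetime_supertype(left: Datetime, right: Datetime) -> Datetime:
    # Sketch of the proposed rule: backends disagree on which resolution
    # wins (polars widened to "us" above, pandas to "ns"), so raise
    # instead of silently picking one; the caller can cast explicitly.
    if left.time_unit != right.time_unit:
        msg = f"no supertype for Datetime({left.time_unit!r}) and Datetime({right.time_unit!r})"
        raise TypeError(msg)
    return left
```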
Yup, happy for supertyping datetimes of different resolutions to raise. Is this mostly about int32 vs int64 -> int64 kind of operations? If so, I think those at least should be fairly standardised, ok with doing those. Is there any other kind of supertyping that this PR does? sorry i haven't clicked through everything
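The "int32 vs int64 -> int64" style of widening mentioned here is simple enough to sketch — the supertype of two signed integer dtypes is just the wider one (names below are illustrative, not Narwhals API):

```python
# Hypothetical sketch of integer widening: pick the wider of two signed
# integer dtypes, since widening never loses values.
_INT_WIDTH = {"Int8": 8, "Int16": 16, "Int32": 32, "Int64": 64}

def integer_supertype(left: str, right: str) -> str:
    # Either input can be safely cast up to the result.
    return left if _INT_WIDTH[left] >= _INT_WIDTH[right] else right
```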
Thanks @MarcoGorelli
Alright, I think that can be a starting point. @dangotbanned WDYT?
Yes I feel you, this is definitely a large one with a lot of commits. I guess the quickest way to get a grasp of it is the documentation page we have written, or even quicker a chart. The TL;DR is that:
thanks! 🙏 i've read through, and my initial reaction is that this is too complicated. why are we dealing with String vs Int64, for example? do we have a use-case for that?
To me the buggy part is that in the current implementation of concat every backend has a different behavior (#3191 (comment)). This PR is a prerequisite for consistent behavior:
Coming to:
Casting numeric to string is standardized across backends; I don't see why that would be problematic to support. If you are up for it, let's have a call/chat to better understand what we could land with this PR
Yep, the idea is we can allow more things if we can have consistent semantics by casting first. I made a list of others.
Where we can apply the rules
I think that in a lot of these cases, we wouldn't have coverage (understandably) for what each backend does on its own.
Top-level functions
are you suggesting that

```python
In [55]: df = duckdb.sql("""select * from values (1.5), (2.5) df(a)""")

In [56]: duckdb.sql("""
    ...: from df
    ...: select a + a
    ...: """)
Out[56]:
┌──────────────┐
│   (a + a)    │
│ decimal(3,1) │
├──────────────┤
│     3.0      │
│     5.0      │
└──────────────┘

In [57]: df.pl().select(pl.col('a')+pl.col('a'))
Out[57]:
shape: (2, 1)
┌───────────────┐
│ a             │
│ ---           │
│ decimal[38,1] │
╞═══════════════╡
│ 3.0           │
│ 5.0           │
╘═══════════════╛

In [58]: df.pl()
Out[58]:
shape: (2, 1)
┌──────────────┐
│ a            │
│ ---          │
│ decimal[2,1] │
╞══════════════╡
│ 1.5          │
│ 2.5          │
└──────────────┘
```

and i'm not sure we should be standardising it
All I'm suggesting is that the list in (#3396 (comment)) is where those rules could be applied. I know that
@MarcoGorelli you could pick any combination of types/backends and find examples of incompatibilities. I don't think it is helpful, considering where I started (#3396 (comment)):
I'm gonna try a different angle ... (Provided we can cast our way there) What do you think about using <insert-a-set-of-rules-here> in places where a backend would otherwise error? I think that you're okay with that, and these overlap a lot with (#3396 (comment)):
The main motivator for this PR is supporting another of these cases (#3398), but wanting to do it in a standardised way. I would like it if you could write this and it is reliable:

```python
nw.concat(items, how="vertical_relaxed")
```

I know that we can do this.
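Conceptually, a relaxed vertical concat boils down to: find a common supertype per column, cast both sides to it, then concatenate. A plain-Python sketch under that assumption (`SUPERTYPES` and `concat_relaxed` are illustrative names, not the PR's actual rules or API):

```python
# Sketch (plain Python, not real dataframes) of what a relaxed vertical
# concat implies: look up a common supertype, cast, then concatenate.
SUPERTYPES = {("int", "float"): float, ("float", "int"): float}

def concat_relaxed(left: list, right: list) -> list:
    lt, rt = type(left[0]).__name__, type(right[0]).__name__
    if lt == rt:
        return left + right
    target = SUPERTYPES.get((lt, rt))
    if target is None:
        raise TypeError(f"no supertype for {lt} and {rt}")
    # Cast every value to the common supertype before combining.
    return [target(v) for v in left + right]
```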
Just so i understand
General
The interesting part is that supertyping reduces them all to the same problem.
How often we'd benefit depends on two things:
Examples
I've picked the first example at random.
Note
There are lots of code blocks hidden here!
1 -
This is a big push towards "Narwhals to ensure consistent backend behavior" rather than "Narwhals to let backends do whatever they wish" I am onboard with the reasoning for this PR. However I have 2 concerns:
This comment was marked as outdated.
Sorry for the delay @camriddell! I really do appreciate the time you put into (#3396 (comment)) ❤️ I wanted to circle up with @FBruzzesi first, so we could avoid adding too much to the thread (
Consistency is definitely something I'd like to see more of 1, but understand that everyone has unique expectations on how far that should go.
Thank you 😍
Backwards compatibility
General
I want this feature to preserve backwards compatibility in existing APIs (excluding #3398). Would you like to see something more concrete than the end of my summary here? I have a few ideas.
The main
Yeah, this isn't implemented yet (see usage in #3398). So all together - if we want configuration - I'd be thinking of this as a jumping-off point:
Maintainability
100% on board with the motivation! If there is an appetite for making this configurable (#3396 (comment)), then explaining the tricky bits inline would be my preference (for now). The current impl is under the assumption that valid supertypes are fixed and leans into that pretty heavily.
LOC aside
I feel the need to clear this up, but don't want it to distract from (#3396 (comment)) 🙂
If we go purely by the full diff, okay yes there is a big +1700. However, this covers most of the source LOC changes:
We can reduce the diff by splitting out
IMO, those are pretty good figures for a feature that every backend could use in
camriddell
left a comment
A few minor points & questions on specific code pieces. Nothing high-level.
```python
left_fields, right_fields = left.fields, right.fields
if len(left_fields) != len(right_fields):
    return _struct_fields_union(left_fields, right_fields)
new_fields = deque["Field"]()
```
Why use a deque here? It seems that we're only .appending to the object, so a list should be okay?
Also, is there a reason for the typing syntax on the right-side of the assignment?
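On the second question: since Python 3.9, builtin and `collections` containers support subscription at runtime (PEP 585); calling the resulting `types.GenericAlias` just constructs an ordinary instance. So `deque["Field"]()` is `deque()` carrying an inline annotation, and a `list` would accept the same syntax. A quick demonstration:

```python
from collections import deque

# Subscripting creates a types.GenericAlias; calling it builds a plain
# instance, so the type parameter exists purely for readers and checkers.
d = deque[int]()
d.append(1)

# A list supports the same runtime syntax, if only .append is needed:
xs = list[int]()
xs.append(1)
```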
```python
    left: Collection[Field], right: Collection[Field], /
) -> Struct | None:
    """Adapted from [`union_struct_fields`](https://github.com/pola-rs/polars/blob/c2412600210a21143835c9dfcb0a9182f462b619/crates/polars-core/src/utils/supertype.rs#L559-L586)."""
    longest, shortest = (left, right) if len(left) >= len(right) else (right, left)
```
Minor (feel free to ignore/reject), but perhaps this is a bit more intentful:

```python
shortest, longest = sorted([left, right], key=len)
```
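For what it's worth, the `sorted` one-liner behaves the same for unequal lengths but differs on ties: `sorted` is stable, so equal-length inputs keep `(left, right)` order, whereas the original ternary keeps `left` as `longest` on ties. A quick check:

```python
left, right = [1, 2, 3], [4]
shortest, longest = sorted([left, right], key=len)

# Tie behaviour differs from the ternary: the stable sort keeps input
# order, so on equal lengths `shortest` is `left`, while the original
# assignment keeps `left` as `longest`.
tie_shortest, tie_longest = sorted([["a"], ["b"]], key=len)
```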
```python
for left_f, right_f in zip(left_fields, right_fields):
    if left_f.name != right_f.name:
        return _struct_fields_union(left_fields, right_fields)
    if supertype := get_supertype(left_f.dtype(), right_f.dtype()):
```
does this path do any less work than just always calling `_struct_fields_union`? It feels like the bulk of this entire function could just return `_struct_fields_union`
I had to stare at this for a while again to see it 😂
I'm gonna definitely add some comments, thanks @camriddell
So the other path is optimized for merging both dtype and name differences.
This one bails on the first name mismatch.

> does this path do any less work

If we can avoid the name stuff, simply calling `get_supertype` a bunch can be quite cheap:
- lots of it is `frozenset` ops and `dict` lookups
- complex cases are aggressively cached 😅

Oh, and the other path requires creating and incrementally building up a dict
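As an aside on why those lookups are cheap: a `frozenset` key makes a pair lookup order-insensitive, so one table entry can serve both `(Int32, Int64)` and `(Int64, Int32)`. A minimal sketch (illustrative names, not the PR's tables):

```python
# Illustrative: a frozenset key treats (a, b) and (b, a) identically,
# so a symmetric supertype table needs only one entry per pair.
TABLE = {frozenset({"Int32", "Int64"}): "Int64"}

def lookup(a: str, b: str) -> "str | None":
    # A plain dict .get on a hashable frozenset key: O(1) per call.
    return TABLE.get(frozenset({a, b}))
```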
narwhals/narwhals/dtypes/_supertyping.py (line 206 in dd14fd1)
If we avoid overriding the behavior of a backend and only use this feature in the cases where the backend provides no alternative, then the current plan for limited application (e.g.
My primary concern (this extends beyond this PR, so take with a grain of salt) is that a native backend comes out with a feature that we've already bolted on top in the Narwhals API. What is the future of this feature? Do we continue to route users through our own implementation? Do we adopt the upstream feature with a version check? Will users be surprised if their results differ depending on an interaction between the version of narwhals and the version of their backend?
The above questions don't need to be answered for this PR, but just highlight where my perspective originates :)
Some follow ups from previous discussion
Had some time to take a closer look at the code and I think it's pretty readable (to me)! The main entrypoint is
Point well-taken. The implementation is only a few hundred lines and we do stand to get future re-use out of this.
I think hatches at the call-site would be a bit better than passing messages into/out-of
Description
Important
@FBruzzesi and I have been + are still iterating on this
Core functionality is there, focusing on readability, performance + shrinking the test suite
This PR implements polars' concept of supertyping, which more generally defines which types can be safely promoted/demoted/cast to other types. I really like the DuckDB visualization of their version¹ of these rules, so here's that as an example:
Casting Operations Matrix
This is a preliminary step for implementing relaxed `concat` (#3386). The aim is that we own a consistent set of rules that all/most backends can participate in.
We've already dropped some supertypes that are valid in polars, but may prove challenging in other backends such as #121. Some others are directly mentioned in comments (e.g. `(Struct, DType) -> Struct`).
Additional use-cases
Supertyping in polars is used for much more than just a subset of `concat`. In (#2572), it is one of the larger concepts missing from the intermediate representation (see #3386 (comment)). `polars-plan::plans::conversion::type_coercion` is full of examples of how deeply related the concept is with expressions. My aim is not to reproduce all of that 😅 - but to be able to reason about `DType`s between `LazyFrame` operations without querying the backend for a `Schema` between every step 🤞
Related issues
- `concat(..., how={"vertical_relaxed", "diagonal_relaxed"})` #3386
- `DType.__call__` #3393
- `Decimal` `DType` #3377

Tasks
- `NOTSET` (244537d)
- (443d9ccd)
- `Struct` `DType` fixtures
- `(v1.Datetime, Datetime) -> None`
- `_CACHE_SIZE_TP_HIGH`
- `narwhals.dtypes.classes` -> `narwhals.dtypes._classes` (d2c96fe)
- `narwhals.stable.v1._dtypes` -> `narwhals.dtypes._classes_v1`
- `_has_intersection_first_excluding`
- `Decimal` handling #3377
- `String` downcasts (see thread) 8d9e053
- `(Nested, String)`?
- `{List, Array} -> List` 3ad1639
- `promotion-rules.md` script
- `promotion-rules.md`
- `get_supertype` #3396 (comment)
- `get_supertype` #3396 (comment)
- `_mixed_supertype` comments
- `_numeric_supertype` comments
- `Struct`
- `concat(..., how="*_relaxed")` #3398
- `@lru_cache` on a wrapper for `DType.__eq__`
- `_struct_fields_union` and `get_supertype`
DuckDB also mentions another set of rules called Combination Casting - that is entirely implicit.
The matrix doesn't reflect these and only one cast example is given, but it would apply to `nw.concat`:
"This combination casting occurs for ..., set operations (`UNION`/`EXCEPT`/`INTERSECT`), and ..." ↩